Week 1: Data and KNN Flashcards
What is the machine learning approach?
Programming an algorithm to learn automatically from data or experience, for example to uncover patterns in data or to build autonomous agents
What should be emphasized in machine learning?
- Predictive performance
- Scalability
- Autonomy
Why might you want to use a learning algorithm?
- Hard to code solution by hand (vision, speech)
- System needs to adapt to a changing environment (spam detection)
- Want the system to perform better than human programmers
- Privacy/fairness (ranking search results)
How does machine learning perform compared to humans?
It may perform better or worse than humans
Define artificial intelligence
- A subfield of CS that refers to computer programs that can solve problems humans are good at
- E.g., vision, natural language
Define machine learning
A subfield of AI focused on learning (tuning parameters) from data
Define neural networks
Parametric model used in ML loosely based on biological neurons
What is deep learning?
Neural networks with multiple layers
What is data science?
An emerging field that applies ML techniques to domain-specific problems
What are some machine learning domains?
- Computer vision
- Speech recognition
- Natural Language Processing
- Recommender systems
- Games
Types of machine learning
- Supervised learning
- Semi-supervised learning
- Reinforcement learning
- Unsupervised learning
What is supervised learning?
- The learner is given labeled examples of the correct behavior
- Predict unknown values of the data using other known data
- Classification (is this A or B?)
- Anomaly detection (is this weird?)
- Regression (how much/how many?)
What is semi-supervised learning?
Utilizes both labeled and unlabeled data
What is reinforcement learning?
A learning system that interacts with the world and learns to maximize a scalar reward signal
What is unsupervised learning?
- No labeled examples; instead, the learner looks for interesting patterns in the data
- Find human interpretable and previously unknown patterns that describe the unlabeled data
- Clustering (how is the data organized?)
- Association rule mining (are these related?)
Why is machine learning so powerful nowadays?
- Abundance of data
- Computing power
What are the steps in tackling a machine learning problem?
- Should I use ML on this problem?
- Gather and organize data (pre-processing, cleaning, visualizing)
- Establish a baseline
- Choose a model
- Optimization
- Hyperparameter search
- Analyze performance and mistakes
- Iterate back to step 4 (choose a model) or step 2 (gather and organize data)
What is data?
Collection of objects and their attributes
What does an ML training set consist of?
- Inputs (vectors)
- Labels
Why do we use input vectors in machine learning?
- Algorithms need to handle lots of data
- A common strategy is mapping data to another space that is easy to manipulate (Representation)
- Vectors are a good representation since we can do linear algebra
What are regression and classification in terms of the training set's target t?
- Regression: t is a real number
- Classification: t is an element of a discrete set
What are the classification metrics for evaluation?
- Accuracy = # correct predictions / # test instances
- Error = 1 - accuracy = # incorrect predictions / # test instances
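A minimal Python sketch of these two metrics. The function names and the spam/ham labels are made up for illustration and are not from the course material.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def error_rate(predictions, targets):
    """Fraction of predictions that are wrong, i.e. 1 - accuracy."""
    return 1 - accuracy(predictions, targets)


# Hypothetical test labels and model predictions (illustrative only).
targets = ["spam", "ham", "ham", "spam", "ham"]
predictions = ["spam", "ham", "spam", "spam", "ham"]

print(accuracy(predictions, targets))    # 4 of the 5 predictions are correct
print(error_rate(predictions, targets))  # 1 of the 5 predictions is incorrect
```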
What is similarity?
- The simplest method of learning we know
- Classifying according to similar objects you’ve seen
- aka manohorse
What happens when more data points come in to nearest neighbor?
More complicated boundaries are possible
What is nearest neighbor's relationship with noise?
It is sensitive to noise or mislabeled data (class noise)
What is the solution to noisy data?
Have k-nearest neighbors vote and pick the majority
What are the steps for k-nearest neighbors?
- Calculate the distance between the new data point and every data point in the training set
- Identify the k points with the shortest distance to the new point; these are the k nearest neighbors
- Among the nearest neighbors, count how many points there are for each class type and pick the majority
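The three steps above can be sketched in plain Python. This is one possible implementation, assuming Euclidean distance and points stored as tuples; the function name `knn_predict` and the toy data are hypothetical. Ties in the vote go to whichever tied label appears first among the nearest neighbors, which is a choice of this sketch and not something the cards specify.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k points with the shortest distance.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: count the labels among those neighbors and pick the majority.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set, made up for illustration: two well-separated classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (6.0, 5.0), k=3))
```

The first query sits next to the "A" points and the second next to the "B" points, so the two calls return "A" and "B" respectively.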
What is k?
k determines the tradeoff between overfitting and underfitting the data
What happens when there is a small k?
- Good at capturing fine-grained patterns
- May overfit, sensitive to local variations in training data
What happens when there is a large k?
- Makes stable predictions by averaging over lots of examples
- May underfit because the model is too general and oversimplifies the underlying patterns in the data
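A small illustration of both extremes, using made-up data with one deliberately mislabeled point. It assumes Euclidean distance and a compact, hypothetical `knn_predict` helper; it is a sketch of the effect, not a benchmark.

```python
import math
from collections import Counter


def knn_predict(points, labels, query, k):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Made-up data: four "A" points, five "B" points, and one mislabeled point
# (1.5, 1.5) that sits inside the "A" cluster but carries the label "B".
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),
          (8, 8), (8, 9), (9, 8), (9, 9), (9, 10)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"]

query = (1.6, 1.6)  # clearly inside the "A" cluster
for k in (1, 3, len(points)):
    print(k, knn_predict(points, labels, query, k))
```

With k = 1 the prediction follows the mislabeled neighbor ("B"), with k = 3 the vote recovers "A", and with k equal to the whole dataset the prediction is simply the overall majority label ("B") no matter where the query lies.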
How do you balance k?
- Optimal k depends on the number of datapoints (n)
- As a rule of thumb, choose k=3
- Keep k < sqrt(n)
What is a validation set used for?
Tuning hyperparameters
What is cross validation?
A way to estimate the generalization error of a learning algorithm when the dataset is too small for a simple train/test or train/validation split to give an accurate estimate
What is k-fold cross validation?
- The dataset is partitioned into k non-overlapping subsets
- On trial i, the i-th subset is the test set and the rest is the training set
- The test error is estimated by averaging the test score across the k trials
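A rough sketch of k-fold cross validation in plain Python. All names here (`k_fold_splits`, `cross_validate`, `one_nn`) are hypothetical, the fold count is called `n_folds` to avoid confusion with the k of KNN, and the stand-in model is a 1-nearest-neighbor classifier on 1-D inputs. Folds are built by interleaving indices; in practice the data would usually be shuffled first.

```python
def k_fold_splits(n_examples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds trials."""
    indices = list(range(n_examples))
    # Fold i takes every n_folds-th index starting at i, so folds never overlap.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(inputs, targets, n_folds, predict):
    """Average test accuracy over the folds; `predict` is any classifier."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(inputs), n_folds):
        train_x = [inputs[j] for j in train_idx]
        train_t = [targets[j] for j in train_idx]
        correct = sum(
            1 for j in test_idx
            if predict(train_x, train_t, inputs[j]) == targets[j]
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


def one_nn(train_x, train_t, query):
    """1-nearest-neighbor on 1-D inputs, used only as a stand-in model."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - query))
    return train_t[nearest]


# Made-up 1-D data with two well-separated classes.
inputs = [1.0, 1.2, 1.4, 1.6, 1.8, 5.0, 5.2, 5.4, 5.6, 5.8]
targets = ["A"] * 5 + ["B"] * 5
print(cross_validate(inputs, targets, 5, one_nn))
```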
What are the highlights of k-nearest neighbor?
- Simple
- No training
- Easy to justify classification to customer
- Can easily do multiclass
What are the limitations of KNN? Large datasets
- KNN is a lazy learning technique
- In the training phase KNN does nothing, so training is fast
- Prediction becomes slow as the dataset grows, since the model has to calculate the distance (e.g. Euclidean) from the given point to every point in the dataset
What are the limitations of KNN? Curse of Dimensionality
- The feature space becomes increasingly sparse as the number of dimensions (features) grows
- In high-dimensional spaces, the notion of proximity or similarity becomes less meaningful
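One way to see proximity losing meaning is to compare the nearest and the farthest distance from a query point as the dimension grows. The sketch below assumes points drawn uniformly at random from the unit cube and Euclidean distance; the function name and the chosen sizes are illustrative only. A ratio close to 1 means the nearest neighbor is barely closer than the farthest point.

```python
import math
import random


def nearest_to_farthest_ratio(n_points, n_dims, rng):
    """Ratio of the nearest to the farthest distance from a random query."""
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)


rng = random.Random(0)  # fixed seed so the run is repeatable
for n_dims in (2, 10, 100, 1000):
    ratio = nearest_to_farthest_ratio(200, n_dims, rng)
    print(f"{n_dims:>5} dims: nearest/farthest = {ratio:.3f}")
```

The printed ratio should climb towards 1 as the number of dimensions increases.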
What are the limitations of KNN? Imbalanced dataset
- The majority class typically has significantly more samples than the minority class
- The large number of neighbors from the majority class can overpower the neighbors from the minority class
- These neighbors dominate the decision-making process, biasing predictions towards the majority class
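A toy illustration of this bias, with made-up data: two "rare" points and eight "common" points. The helper `neighbor_votes` is hypothetical and assumes Euclidean distance. The query lies right next to the rare points, yet once k exceeds the number of rare samples the common class wins the vote.

```python
import math
from collections import Counter


def neighbor_votes(points, labels, query, k):
    """Count the labels of the k training points closest to `query`."""
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))
    return Counter(label for _, label in ranked[:k])


# Made-up imbalanced data: 2 minority ("rare") and 8 majority ("common") points.
points = [(1.0, 1.0), (1.2, 1.1),
          (3.0, 3.0), (3.0, 3.5), (3.5, 3.0), (3.5, 3.5),
          (3.0, 4.0), (4.0, 3.0), (4.0, 4.0), (4.5, 4.5)]
labels = ["rare"] * 2 + ["common"] * 8

query = (1.1, 1.0)  # sits right beside the two "rare" points
for k in (1, 5):
    votes = neighbor_votes(points, labels, query, k)
    print(k, dict(votes), "->", votes.most_common(1)[0][0])
```

With k = 1 the nearest neighbor is "rare"; with k = 5 the vote is 3 "common" to 2 "rare", so the majority class wins even though the query is far from every "common" point.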