Lecture 4 - Supervised Machine Learning: KNN and Regression Flashcards
What is K-Nearest Neighbors? What are some of its characteristics?
K-Nearest Neighbors (KNN) is an algorithm that can be used for both classification and regression.
It is non-parametric: It doesn't learn a model (it makes no assumptions about the distribution of the data)
It has heavy memory consumption and computational cost (all training data must be stored and searched at prediction time)
Explain how the KNN algorithm works
Given a training dataset X and a new instance x:
- Find the k points in X that are closest to x, using the selected distance measure
- Predict a label for x:
  - Classification: Majority vote among the k nearest neighbors
  - Regression: Mean of the k nearest neighbors
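The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the toy points and labels are invented for the example.

```python
# A minimal KNN classifier sketch: Euclidean distance, majority vote
# among the k nearest neighbors. Toy data invented for illustration.
from collections import Counter
from math import dist

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors."""
    # Pair each training point with its label, sorted by distance to x_new.
    neighbors = sorted(zip(X_train, y_train), key=lambda p: dist(p[0], x_new))
    # Keep the k closest labels and take the majority vote.
    k_labels = [label for _, label in neighbors[:k]]
    return Counter(k_labels).most_common(1)[0][0]

X = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (1.2, 0.8)]
y = ["red", "red", "blue", "blue", "red"]

print(knn_predict(X, y, (1.1, 1.2), k=3))  # a point inside the "red" cluster
```

Note that there is no training step at all: the "model" is simply the stored training data.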
How can you choose K?
Some approaches include:
- k = sqrt(n)
- Loop over different values of k and compare errors, similarly to the elbow method (Sippo's favorite)
- Use an odd k if the number of classes is even (to avoid ties in the vote)
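The loop-and-compare approach can be sketched with leave-one-out error on a toy dataset (data and candidate k values invented for the example):

```python
# Sketch of picking k by looping over candidates and comparing
# leave-one-out classification error. Toy data invented for illustration.
from collections import Counter
from math import dist

def knn_predict(X, y, x_new, k):
    """Majority vote among the k nearest neighbors of x_new."""
    order = sorted(range(len(X)), key=lambda i: dist(X[i], x_new))
    return Counter(y[i] for i in order[:k]).most_common(1)[0][0]

def loo_error(X, y, k):
    """Fraction of points misclassified when each is held out in turn."""
    wrong = 0
    for i in range(len(X)):
        X_rest, y_rest = X[:i] + X[i+1:], y[:i] + y[i+1:]
        wrong += knn_predict(X_rest, y_rest, X[i], k) != y[i]
    return wrong / len(X)

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["a", "a", "a", "b", "b", "b"]

errors = {k: loo_error(X, y, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(errors, best_k)  # k=5 fails badly here: it always outvotes the held-out class
```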
What are the advantages and disadvantages of KNN?
Advantages:
- Nonparametric
- Easy to interpret
Disadvantages:
- All features have equal importance (unlike e.g. decision trees)
- With very large datasets, computing distances between data points becomes infeasible
Features need to be scaled:
- A feature with a big scale can dominate all the distances
- A feature with a small scale can get neglected
The “curse of dimensionality”
- Problems with high-dimensional spaces (e.g. more than 10 features)
- Volume of space grows exponentially with dimensions
- Need exponentially more points to ‘fill’ a high-dimensional volume or you might not have any training points “near” a test point
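The scaling problem is easy to demonstrate: with made-up income and age features, the large-scale feature completely dominates Euclidean distance until both are standardized (z-scored).

```python
# Sketch of why scaling matters for KNN: a feature measured in the
# thousands dominates Euclidean distance. Numbers invented for illustration.
from math import dist
from statistics import mean, stdev

def zscore(column):
    """Standardize a feature to mean 0, standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

# Feature 0: income (large scale); feature 1: age (small scale).
raw = [(30000.0, 25.0), (32000.0, 60.0), (90000.0, 26.0)]

# Unscaled, income dominates: point 0 looks far closer to point 1,
# even though their ages differ by 35 years.
d01_raw, d02_raw = dist(raw[0], raw[1]), dist(raw[0], raw[2])

# Standardize each feature, then recompute the distances.
scaled = list(zip(*(zscore(c) for c in zip(*raw))))
d01, d02 = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(d01_raw, d02_raw)  # the income gap dwarfs the age gap
print(d01, d02)          # after scaling, both features contribute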
How do you calculate the mean squared error?
The mean squared error (MSE) is calculated by:
MSE = (1/N) * ∑ (prediction_i - actual_i)^2
Where:
- ∑ is the summation operator over the N data points
- actual_i is the actual value for data point i
- prediction_i is the predicted value for data point i
How do you calculate the Root Mean Squared Error (RMSE)?
RMSE = sqrt(MSE)
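Both formulas map directly to code (toy numbers invented for the example):

```python
# MSE and RMSE computed directly from the definitions above.
from math import sqrt

def mse(predictions, actuals):
    """Mean of squared differences between predictions and actual values."""
    n = len(predictions)
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / n

predictions = [2.5, 0.0, 2.0, 8.0]
actuals     = [3.0, -0.5, 2.0, 7.0]

m = mse(predictions, actuals)  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
r = sqrt(m)                    # RMSE is just the square root of MSE
print(m, r)
```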
What are some considerations you must have when working with linear regression?
The relationship between the dependent and independent variable(s) is not always linear
It is possible to transform data so that it will have a linear relationship (e.g. log transform)
Collinearity: Correlation between features
Recommended additional models to study: Ridge and Lasso Regression
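The log-transform idea can be shown on synthetic data: y = 2·exp(0.5x) is not linear in x, but ln(y) = ln(2) + 0.5x is an exact line, so a least-squares fit on the transformed data recovers the parameters. (The data and parameters here are generated purely for illustration.)

```python
# Sketch of linearizing an exponential relationship with a log transform.
# Synthetic data: y = 2 * exp(0.5 x), so ln(y) = ln(2) + 0.5 x exactly.
from math import exp, log

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * exp(0.5 * xi) for xi in x]
log_y = [log(yi) for yi in y]

# Closed-form simple least squares: slope = cov(x, ln y) / var(x).
n = len(x)
mx, my = sum(x) / n, sum(log_y) / n
slope = sum((xi - mx) * (li - my) for xi, li in zip(x, log_y)) / \
        sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx

print(slope, exp(intercept))  # recovers 0.5 and 2.0
```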
What is Logistic Regression?
Logistic Regression is used when the dependent variable is binary (0/1 or "Yes"/"No")
In other words, it solves binary classification problems, e.g. is an email spam or not spam
What does Logistic Regression compute?
The Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly (like in Linear Regression), it outputs the logistic of this result
What is the logistic function?
The logistic function is a sigmoid function, which takes any real input and outputs a value between zero and one
What is the formula for the logistic function?
The logistic function is expressed as the following function:
f(x) = 1/(1 + e^-x)
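The function, and the way Logistic Regression applies it to a weighted sum plus bias, can be sketched as follows. The weights and inputs here are made-up numbers, not a trained model:

```python
# The logistic (sigmoid) function, and a Logistic Regression prediction:
# a weighted sum of features plus a bias, squashed into (0, 1).
from math import exp

def sigmoid(x):
    """Maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-x))

def predict_proba(features, weights, bias):
    """Probability estimate: sigmoid of the weighted sum plus bias."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z)

print(sigmoid(0.0))  # exactly 0.5 at the midpoint
p = predict_proba([1.0, 2.0], [0.4, -0.1], 0.2)  # z = 0.4
print(p)
```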
What are some considerations one must have when doing logistic regression?
- Standardization or scaling of data is not needed
- Very efficient, light computationally
- Works better with datasets that have many observations
- There are very many different alternative ML algorithms for binary classification problems (e.g. SVM, Decision trees, Random Forests)
- There are variants of the logistic regression that support multiclass classification (e.g. Softmax regression / multinomial logistic regression)
TRUE OR FALSE: K-Nearest Neighbors can be used with both regression and classification problems
TRUE
TRUE OR FALSE: Logistic regression is a heavy and not that efficient algorithm for binary classification problems
FALSE
TRUE OR FALSE: Linear Regression works well when there is a linear relationship between the dependent and independent variable(s)
TRUE
What are the distances used in KNN?
Euclidean and Manhattan
Explain the logic behind KNN
Given a training set X and a new instance xnew, find K points in X that are closest to xnew. Using the selected distance measure, predict a new label for xnew by majority vote (classification) or mean (regression)
What are some limitations of KNN?
All features have equal importance (unlike decision trees), computing distances is expensive with big datasets, and features usually need to be scaled
What does regression do and how?
Regression makes predictions of continuous variables.
It does this by teaching the model the relationship between one or more independent variables (x) and a dependent variable (y)
Point out some differences between classification and regression.
1. Classification is the task of predicting a discrete class label | Regression is the task of predicting a continuous quantity
2. In classification, one tries to find the boundary between two classes | In regression, one tries to find the line that explains the relationship between variables
3. (Overlap) A classification algorithm may predict a continuous value in the form of a probability for a class label | A regression algorithm may predict a discrete value in the form of an integer quantity
4. (Evaluation) Classification predictions can be evaluated using accuracy | Regression predictions can be evaluated using root mean squared error
What do you call a relationship between two variables when y increases as x increases?
Positive
What do you call a relationship between two variables when y decreases as x increases?
Negative
What does collinearity mean and why is it bad in linear regression?
Collinearity of features happens when one feature is highly correlated to another feature in a regression model
The problem is that it reduces the precision of the estimated coefficients which weakens the statistical power of the regression model
E.g. if you regress Y against X and Z (which are highly correlated with each other), then the effect of X on Y is hard to distinguish from the effect of Z on Y, because any increase in X tends to be associated with an increase in Z
To fit the best line in a linear regression model you use …
… least squares
Sum of squared differences (or residuals) needs to be minimized to find a line that fits the data as well as possible
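For a single feature, minimizing the sum of squared residuals has a simple closed-form solution: slope = cov(x, y) / var(x). A sketch on toy data lying exactly on a line (numbers invented for the example):

```python
# Closed-form simple least squares: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x). Toy data on the line y = 3x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0 * xi + 1.0 for xi in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx

print(slope, intercept)  # recovers 3.0 and 1.0
```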
To fit (and optimize) the sigmoid curve in a logistic regression you use …
… maximum likelihood estimation (unlike linear regression, where you use least squares)
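Maximum likelihood estimation for logistic regression has no closed form, so it is done iteratively. A minimal sketch, assuming a single feature, plain gradient ascent on the log-likelihood, and a tiny made-up separable dataset (the learning rate and iteration count are arbitrary choices for the toy data, not general recommendations):

```python
# Sketch of maximum likelihood estimation for logistic regression:
# gradient ascent on the log-likelihood for one feature plus a bias.
# Toy separable data invented for illustration.
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    # Gradient of the log-likelihood: sum over points of (y_i - p_i) * input.
    preds = [sigmoid(w * xi + b) for xi in x]
    w += lr * sum((yi - pi) * xi for xi, yi, pi in zip(x, y, preds))
    b += lr * sum(yi - pi for yi, pi in zip(y, preds))

labels = [round(sigmoid(w * xi + b)) for xi in x]
print(labels)  # the fitted model separates the two classes
```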
How is KNN trained and how do you validate the quality of the model?
TRICK QUESTION. For KNN, there is no training step because there is no model to build. In other words, because no model is built, there is nothing to validate. But you can still test, i.e., assess the quality of the predictions using data in which the targets (labels or scores) are concealed from the model