Lecture 4 - Supervised Machine Learning: KNN and Regression Flashcards
What is K-Nearest Neighbors? What are some of its characteristics?
K-Nearest Neighbors (KNN) is an algorithm that can be used for both classification and regression.
It is non-parametric: it does not learn a model (it makes no assumptions about the distribution of the data)
It has heavy memory consumption and computational cost, since it stores the entire training set and computes distances to it at prediction time
Explain how the KNN algorithm works
Given a training dataset X and a new instance x:
Find the k points in X that are closest to x, using the selected distance measure
Then predict the label for x:
- Classification: majority vote among the k nearest neighbors
- Regression: mean of the k nearest neighbors
{Visual examples in Notion}
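A minimal sketch of the procedure in Python, assuming numpy arrays and Euclidean distance as the distance measure (`knn_predict` is a hypothetical helper name, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    # Distance from x_new to every training point (Euclidean, as an assumption)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    if task == "classification":
        # Majority vote among the k nearest labels
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Regression: mean of the k nearest labels
    return y_train[nearest].mean()
```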
How can you choose K?
Some approaches include:
- k = sqrt(n)
- Loop over different values of k and compare the errors, similarly to the elbow method (Sippo’s favorite); see the sketch below
- Use an odd k if the number of classes is even
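A sketch of the loop-over-k approach using scikit-learn; the Iris dataset and 5-fold cross-validation are illustrative stand-ins, not from the lecture:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in range(1, 21):
    # Cross-validated accuracy for each candidate k; pick the elbow/best value
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}: accuracy={score:.3f}")
```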
What are the advantages and disadvantages of KNN?
Advantages:
- Nonparametric
- Easy to interpret
Disadvantages:
- All features have equal importance (unlike in e.g. decision trees)
- With very large datasets, computing distances between data points is infeasible
Features need to be scaled (see the pipeline sketch after this list):
- A feature with a big scale can dominate all the distances
- A feature with a small scale can get neglected
The “curse of dimensionality”
- Problems with high-dimensional spaces (e.g. more than 10 features)
- Volume of space grows exponentially with dimensions
- Need exponentially more points to ‘fill’ a high-dimensional volume, or you might not have any training points “near” a test point
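One common way to handle the scaling issue, sketched with scikit-learn (StandardScaler is an assumed choice of scaler, not the only option):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize features first so no single feature dominates the distance metric
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Usage: model.fit(X_train, y_train); model.predict(X_test)
```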
How do you calculate the mean squared error?
The mean squared error (MSE) is calculated by:
MSE = (1/N) * ∑ (prediction_i - actual_i)^2
Where:
- N is the number of observations
- ∑ is the summation operator (over i = 1, …, N)
- prediction_i is the predicted value for observation i
- actual_i is the actual value for observation i
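The formula translated directly into Python (a sketch, assuming numpy arrays or anything convertible to them):

```python
import numpy as np

def mse(predictions, actuals):
    # Average of the squared differences between predictions and actual values
    predictions, actuals = np.asarray(predictions), np.asarray(actuals)
    return np.mean((predictions - actuals) ** 2)
```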
How do you calculate the Root Mean Squared Error (RMSE)?
RMSE = sqrt(MSE)
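Building on the `mse` sketch above:

```python
import numpy as np

def rmse(predictions, actuals):
    # Root mean squared error: square root of the MSE (uses mse defined above)
    return np.sqrt(mse(predictions, actuals))
```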
What are some considerations you must have when working with linear regression?
The relationship between the dependent and independent variable(s) is not always linear
It is possible to transform the data so that it has a linear relationship (e.g. a log transform); see the sketch after this list
Collinearity: Correlation between features
Recommended additional models to study: Ridge and Lasso Regression
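A small sketch of the log-transform idea: the data below is synthetic (an assumed exponential relationship), used only to show how fitting on log(y) linearizes it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with an exponential (non-linear) relationship: y = e^(2x)
X = np.linspace(0.1, 5, 100).reshape(-1, 1)
y = np.exp(2 * X.ravel())

# Fitting on log(y) turns the relationship into a linear one
model = LinearRegression().fit(X, np.log(y))
print(model.coef_)  # ~[2.0], the slope of the linearized relationship
```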
What is Logistic Regression?
Logistic Regression is used when the dependent variable is binary (0/1 or “Yes”/“No”)
In other words, it solves binary classification problems, e.g. whether an email is spam or not
What does Logistic Regression compute?
The Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly (as Linear Regression does), it outputs the logistic of this result
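In code, this computation looks roughly like the following (a sketch; `predict_proba` is a hypothetical name here, and the weights and bias would come from training):

```python
import numpy as np

def predict_proba(x, weights, bias):
    # Weighted sum of the input features plus the bias term...
    z = np.dot(weights, x) + bias
    # ...passed through the logistic function instead of returned directly
    return 1.0 / (1.0 + np.exp(-z))
```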
What is the logistic function?
The logistic function is a sigmoid function, which takes any real input and outputs a value between zero and one
What is the formula for the logistic function?
The logistic function is expressed as the following function:
f(x) = 1/(1 + e^-x)
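A quick numerical check of the formula (output values are approximate):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(0))    # 0.5: the midpoint of the sigmoid
print(logistic(10))   # ~0.99995: large positive inputs approach 1
print(logistic(-10))  # ~0.00005: large negative inputs approach 0
```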
What are some considerations one must have when doing logistic regression?
- Standardization or scaling of data is not needed
- Very efficient, light computationally
- Works better with datasets that have many observations
- There are many alternative ML algorithms for binary classification problems (e.g. SVMs, decision trees, random forests)
- There are variants of the logistic regression that support multiclass classification (e.g. Softmax regression / multinomial logistic regression)
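A minimal end-to-end sketch with scikit-learn; the breast-cancer dataset and the max_iter value are illustrative choices, not from the lecture:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification: malignant vs. benign tumors
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)  # extra iterations so the solver converges
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # test-set accuracy
```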
TRUE OR FALSE: K-Nearest Neighbors can be used with both regression and classification problems
TRUE
TRUE OR FALSE: Logistic regression is a heavy and not that efficient algorithm for binary classification problems
FALSE
TRUE OR FALSE: Linear Regression works well when there is a linear relationship between the dependent and independent variable(s)
TRUE