Supervised Machine Learning – Regression and Classification Flashcards
What are the 2 main types of machine learning?
Supervised and unsupervised learning
What is supervised learning and what are two main types of it?
Supervised learning is a type of machine learning where the model is trained on input/output pairs (x, y); from these, the model learns to predict the output y for an input x it never saw during training. The two main types are regression and classification.
What is regression?
Regression is a type of supervised learning where the model predicts a specific number (ex. predict a mouse's weight based on its size).
What is classification?
Classification is a type of supervised learning where the model predicts a category for an input, chosen from a small set of options (ex. predict whether an image shows a dog or a cat).
What is unsupervised learning?
Unsupervised learning is a type of machine learning that tries to identify clusters or structure in unlabeled data. Unlike supervised learning, there are no labels marking the output: you only have input data, and the algorithm tries to identify clusters without really knowing what they mean (ex. customer segmentation, or grouping people based on their genome sequences).
What types of unsupervised learning do we have?
Clustering – identify groups of similar data points (see the sketch after this list)
Anomaly detection – ex. used for fraud detection
Dimensionality reduction – reduce a big dataset to a smaller one (compress data)
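As an illustration of clustering, here is a minimal sketch using scikit-learn's KMeans on made-up 2D data; the data and cluster count are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D data: two loose groups of points (made up for illustration)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Ask for 2 clusters; KMeans assigns each point a cluster label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # ex. [0 0 0 1 1 1] – the algorithm never sees any labels
```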
What is the most common supervised algorithm that is used worldwide and how does it work?
That algorithm is linear regression. It fits a straight line through the data, and predictions are read off that line.
How do you mark a specific row in the training dataset?
With a superscript index: the i-th training example is written (x^(i), y^(i)), where x^(i) is the input and y^(i) the output of row i, and m denotes the total number of training examples.
Write down the linear regression function with one variable (univariate)
f(x^(i)) = w * x^(i) + b
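As a minimal sketch, the model is a one-line function (the name predict and the sample values are illustrative):

```python
def predict(x, w, b):
    """Univariate linear regression model: f(x) = w*x + b."""
    return w * x + b

print(predict(2.0, w=3.0, b=1.0))  # 7.0
```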
What is the most common error function that is used to calculate parameters of linear regression ?
The squared error cost function: J(w, b) = (1 / 2m) * Σ (f(x^(i)) − y^(i))², summed over all m training examples.
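A minimal NumPy sketch of that cost (the array names and sample data are illustrative):

```python
import numpy as np

def squared_error_cost(x, y, w, b):
    """J(w, b) = (1 / (2m)) * sum((w*x + b - y)^2) over all m examples."""
    m = x.shape[0]
    errors = w * x + b - y
    return np.sum(errors ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(squared_error_cost(x, y, w=2.0, b=0.0))  # 0.0 – a perfect fit
```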
How do you find values w and b in the linear regression function?
You find them at the minimum of the squared error cost function J(w, b).
What is the shape of a cost function with 2 parameters and how to visualize it in 2D?
It has a bowl shape. To visualize it in 2D, you can use a contour plot, where each ellipse connects (w, b) points with equal cost.
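A sketch of how to produce such a contour plot, assuming matplotlib and a tiny made-up dataset (the grid ranges are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Evaluate the cost J(w, b) on a grid of parameter values
ws = np.linspace(0, 4, 100)
bs = np.linspace(-2, 2, 100)
W, B = np.meshgrid(ws, bs)
J = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        J[i, j] = np.mean((W[i, j] * x + B[i, j] - y) ** 2) / 2

plt.contour(W, B, J, levels=20)  # ellipses of equal cost around the minimum
plt.xlabel("w")
plt.ylabel("b")
plt.show()
```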
What is a gradient descent?
Gradient descent is an algorithm that provides a structured way to minimize a function toward a local minimum; in the case of linear regression, it minimizes the cost J(w1, w2, …, wn, b) of the model f(x) = w1x1 + w2x2 + … + wnxn + b.
It starts with some initial parameter values, computes the cost over all input/output pairs, and repeatedly steps in the direction of steepest descent (adapting the parameters) until it reaches a local minimum.
For some non-linear cost functions, like those of neural networks, there can be multiple local minima, and which one you end up in depends on the initial parameters gradient descent started from.
What is a learning rate?
It is a constant that decides how big each gradient descent step is, i.e., how much the parameters w and b change in each iteration. Bigger steps mean faster convergence but a higher chance that the algorithm overshoots the local minimum. It is marked with the Greek letter α. The closer you are to the local minimum, the smaller the slope (the derivative of J), which automatically leads to smaller steps even with a fixed α.
How to implement gradient descent?
Repeat until convergence:
tmp_w = w − α · ∂J/∂w
tmp_b = b − α · ∂J/∂b
w = tmp_w, b = tmp_b
The important thing to note is that w and b are updated simultaneously (at the same time). The incorrect way would be to update w first and then b, since the update of b would then use the new value of w instead of the old one.
How to implement gradient descent for linear regression?
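A minimal NumPy sketch of one possible implementation, using the derivatives of the squared error cost (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Batch gradient descent for f(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(iterations):
        errors = w * x + b - y           # f(x^(i)) - y^(i) for all i
        dj_dw = np.dot(errors, x) / m    # partial derivative of J w.r.t. w
        dj_db = np.sum(errors) / m       # partial derivative of J w.r.t. b
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])      # generated from y = 2x + 1
print(gradient_descent(x, y))            # approaches (2.0, 1.0)
```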
What is the difference between batch and stochastic gradient descent?
When computing the parameter updates, batch gradient descent uses the whole training set, while stochastic gradient descent uses just one randomly chosen training example per step. SGD is faster and gives a good, though not necessarily optimal, solution; it should be used on large datasets.
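A sketch of the difference in a single update step (the function names and data layout are illustrative, matching the univariate setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_step(x, y, w, b, alpha):
    """One batch step: gradient averaged over ALL m examples."""
    m = x.shape[0]
    errors = w * x + b - y
    return w - alpha * np.dot(errors, x) / m, b - alpha * np.mean(errors)

def sgd_step(x, y, w, b, alpha):
    """One stochastic step: gradient from a SINGLE random example."""
    i = rng.integers(x.shape[0])
    error = w * x[i] + b - y[i]
    return w - alpha * error * x[i], b - alpha * error
```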
What is feature scaling?
In most cases, the values of different input features have very different ranges, like number of bedrooms vs. size in square meters. Because of this, the cost function's contour plot becomes narrow and elongated compared to the case where features have similar ranges, which means it takes much more time until the model converges. To avoid this, you should scale your features to the same range, like 0 to 1 or −1 to 1.
What are different methods to scale features?
- Max scaling – divide each feature by its maximum value, so all features land on a scale from 0 to 1: x / max(x)
- Mean normalization – center feature values around 0 by subtracting the mean and dividing by the range: (x − μ) / (max(x) − min(x))
- Z-score normalization – subtract the mean and divide by the standard deviation: (x − μ) / σ; most values then fall roughly between −3 and 3
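A minimal NumPy sketch of all three methods, assuming a feature matrix X with one feature per column (the sample values are made up):

```python
import numpy as np

X = np.array([[3.0, 120.0],   # ex. bedrooms, size in square meters
              [2.0,  80.0],
              [4.0, 200.0]])

max_scaled = X / X.max(axis=0)                                   # range 0..1
mean_norm  = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
z_scored   = (X - X.mean(axis=0)) / X.std(axis=0)                # mostly -3..3
```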
How do you speed up linear regression calculation for multiple features?
You can do it by introducing vectorization: represent the features and weights as vectors and use a dot product.
f(x) = w · x + b
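A sketch comparing the loop and vectorized versions (the weight and feature values are made up); NumPy's dot product runs in optimized low-level code, which is what makes it fast:

```python
import numpy as np

w = np.array([0.5, 1.2, -3.0])   # one weight per feature
x = np.array([100.0, 3.0, 2.0])  # one training example's features
b = 4.0

# Loop version: one multiply-add per feature
f = sum(w[j] * x[j] for j in range(w.shape[0])) + b

# Vectorized version: a single dot product, much faster for many features
f_vec = np.dot(w, x) + b
assert np.isclose(f, f_vec)
```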
What does the gradient descent algorithm look like in a vectorized implementation?
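A minimal sketch for multiple features, assuming X is an (m, n) matrix with one training example per row (hyperparameters are illustrative):

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent with all n weights updated at once."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        errors = X @ w + b - y           # shape (m,): f(x^(i)) - y^(i)
        w -= alpha * (X.T @ errors) / m  # all partial derivatives at once
        b -= alpha * np.sum(errors) / m
    return w, b
```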
What is the alternative algorithm that can be used instead of gradient decent for linear regression?
Normal equation – used only for linear regression, it solves for the parameters directly, without iterations: w = (XᵀX)⁻¹ Xᵀ y. It works much faster than gradient descent when the number of features is < 10,000, but becomes slow beyond that.
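A sketch of the normal equation in NumPy, assuming the bias b is folded in as an extra all-ones column; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) w = X^T y directly, no iterations needed."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]                   # weights w, bias b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])                      # y = 2x + 1
w, b = normal_equation(X, y)
print(w, b)                                        # ≈ [2.] 1.0
```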
What are the best practices when it comes to checking gradient descent for convergence?
There are 2 options:
- Plot a learning curve: the value of J versus the number of iterations. J should decrease after every iteration; if it ever increases, the learning rate is likely too large.
- Automatic convergence test – define a small threshold ε, like 10⁻³, and programmatically declare convergence when J decreases by less than ε in a single iteration.
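A minimal sketch of the automatic test, assuming a cost_history list of J values collected during training (the threshold value is illustrative):

```python
def has_converged(cost_history, epsilon=1e-3):
    """True once the cost improves by less than epsilon in one iteration."""
    if len(cost_history) < 2:
        return False
    return cost_history[-2] - cost_history[-1] < epsilon
```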
What is a recommended approach to choose learning rate?
Start with a small value like 0.001 and increase it roughly 3× in each following experiment (0.001, 0.003, 0.01, 0.03, 0.1, …). Stop when you see it is too large (J no longer decreases steadily) and pick a value just below that.
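A sketch of that sweep, assuming the gradient_descent and squared_error_cost functions and the x, y arrays from the earlier sketches:

```python
# Roughly 3x increments, as recommended above
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    w, b = gradient_descent(x, y, alpha=alpha, iterations=100)
    print(alpha, squared_error_cost(x, y, w, b))
# Pick the largest alpha whose cost still decreases smoothly;
# for too-large alphas the cost blows up instead of shrinking.
```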