Supervised Machine Learning – Regression and Classification Flashcards
What are the 2 main types of machine learning?
Supervised and unsupervised learning
What is supervised learning and what are two main types of it?
Supervised learning is a type of machine learning where the model is trained on input/output pairs (x, y); from these, the model learns to predict the output y for an input x it never saw during training. The two main types are regression and classification.
What is regression?
Regression is a type of supervised learning where the model predicts a specific number (ex. predict a mouse's weight based on its size).
What is classification?
Classification is a type of supervised learning where the model predicts a category for an input, chosen from a small set of options (ex. predict whether an image shows a dog or a cat).
What is unsupervised learning?
Unsupervised learning is a type of machine learning that tries to identify clusters or structure in unlabeled data. Unlike supervised learning, there are no labels marking the output: you only have input data, and the algorithm tries to identify clusters without really knowing what they mean (ex. customer segmentation, or grouping people based on their genome sequences).
What types of unsupervised learning do we have?
Clustering – identify groups of similar data points (see the sketch after this list)
Anomaly detection – ex. used for fraud detection
Dimensionality reduction – reduce a big dataset to a smaller one (compress data)
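As an illustration of clustering, here is a minimal sketch using scikit-learn's KMeans on made-up 2D data; the data and cluster count are purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2D data: two loose groups of points (made up for illustration)
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Ask for 2 clusters; KMeans assigns each point a cluster label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # ex. [0 0 0 1 1 1] – the algorithm never sees any labels
```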
What is the most common supervised algorithm that is used worldwide and how does it work?
That algorithm is linear regression. It fits a straight line through the data, and predictions are read off that line.
How do you mark a specific row in the training dataset?
With a superscript index: the i-th training example is written (x^(i), y^(i)), where x^(i) is the input and y^(i) the output of row i, and m denotes the total number of training examples.
Write down the linear regression function with one variable (univariate)
f(x^(i)) = w * x^(i) + b
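As a minimal sketch, the model is a one-line function (the name predict and the sample values are illustrative):

```python
def predict(x, w, b):
    """Univariate linear regression model: f(x) = w*x + b."""
    return w * x + b

print(predict(2.0, w=3.0, b=1.0))  # 7.0
```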
What is the most common error function that is used to calculate parameters of linear regression ?
The squared error cost function: J(w, b) = (1 / 2m) * Σ (f(x^(i)) − y^(i))², summed over all m training examples.
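A minimal NumPy sketch of that cost (the array names and sample data are illustrative):

```python
import numpy as np

def squared_error_cost(x, y, w, b):
    """J(w, b) = (1 / (2m)) * sum((w*x + b - y)^2) over all m examples."""
    m = x.shape[0]
    errors = w * x + b - y
    return np.sum(errors ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(squared_error_cost(x, y, w=2.0, b=0.0))  # 0.0 – a perfect fit
```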
How do you find values w and b in the linear regression function?
You find them at the minimum of the squared error cost function J(w, b).
What is the shape of a cost function with 2 parameters and how to visualize it in 2D?
It has a bowl shape. To visualize it in 2D, you can use a contour plot, where each ellipse connects (w, b) points with equal cost.
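A sketch of how to produce such a contour plot, assuming matplotlib and a tiny made-up dataset (the grid ranges are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

# Evaluate the cost J(w, b) on a grid of parameter values
ws = np.linspace(0, 4, 100)
bs = np.linspace(-2, 2, 100)
W, B = np.meshgrid(ws, bs)
J = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        J[i, j] = np.mean((W[i, j] * x + B[i, j] - y) ** 2) / 2

plt.contour(W, B, J, levels=20)  # ellipses of equal cost around the minimum
plt.xlabel("w")
plt.ylabel("b")
plt.show()
```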
What is a gradient descent?
Gradient descent is an algorithm that provides a structured way to minimize a function toward a local minimum; in the case of linear regression, it minimizes the cost J(w1, w2, …, wn, b) of the model f(x) = w1x1 + w2x2 + … + wnxn + b.
It starts with some initial parameter values, computes the cost over all input/output pairs, and repeatedly steps in the direction of steepest descent (adapting the parameters) until it reaches a local minimum.
For some non-linear cost functions, like those of neural networks, there can be multiple local minima, and which one you end up in depends on the initial parameters gradient descent started from.
What is a learning rate?
It is a constant that decides how big each gradient descent step is, i.e., how much the parameters w and b change in each iteration. Bigger steps mean faster convergence but a higher chance that the algorithm overshoots the local minimum. It is marked with the Greek letter α. The closer you are to the local minimum, the smaller the slope (the derivative of J), which automatically leads to smaller steps even with a fixed α.
How to implement gradient descent?
Repeat until convergence:
tmp_w = w − α · ∂J/∂w
tmp_b = b − α · ∂J/∂b
w = tmp_w, b = tmp_b
The important thing to note is that w and b are updated simultaneously (at the same time). The incorrect way would be to update w first and then b, since the update of b would then use the new value of w instead of the old one.
How to implement gradient descent for linear regression?
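A minimal NumPy sketch of one possible implementation, using the derivatives of the squared error cost (the data, learning rate, and iteration count are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, iterations=5000):
    """Batch gradient descent for f(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(iterations):
        errors = w * x + b - y           # f(x^(i)) - y^(i) for all i
        dj_dw = np.dot(errors, x) / m    # partial derivative of J w.r.t. w
        dj_db = np.sum(errors) / m       # partial derivative of J w.r.t. b
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])      # generated from y = 2x + 1
print(gradient_descent(x, y))            # approaches (2.0, 1.0)
```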
What is the difference between batch and stochastic gradient descent?
When computing the parameter updates, batch gradient descent uses the whole training set, while stochastic gradient descent uses just one randomly chosen training example per step. SGD is faster and gives a good, though not necessarily optimal, solution; it should be used on large datasets.
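A sketch of the difference in a single update step (the function names and data layout are illustrative, matching the univariate setup above):

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_step(x, y, w, b, alpha):
    """One batch step: gradient averaged over ALL m examples."""
    m = x.shape[0]
    errors = w * x + b - y
    return w - alpha * np.dot(errors, x) / m, b - alpha * np.mean(errors)

def sgd_step(x, y, w, b, alpha):
    """One stochastic step: gradient from a SINGLE random example."""
    i = rng.integers(x.shape[0])
    error = w * x[i] + b - y[i]
    return w - alpha * error * x[i], b - alpha * error
```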
What is feature scaling?
In most cases, the values of different input features have very different ranges, like number of bedrooms vs. size in square meters. Because of this, the cost function's contour plot becomes narrow and elongated compared to the case where features have similar ranges, which means it takes much more time until the model converges. To avoid this, you should scale your features to the same range, like 0 to 1 or −1 to 1.
What are different methods to scale features?
- Max scaling – divide each feature by its maximum value, so all features land on a scale from 0 to 1: x / max(x)
- Mean normalization – center feature values around 0 by subtracting the mean and dividing by the range: (x − μ) / (max(x) − min(x))
- Z-score normalization – subtract the mean and divide by the standard deviation: (x − μ) / σ; most values then fall roughly between −3 and 3
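A minimal NumPy sketch of all three methods, assuming a feature matrix X with one feature per column (the sample values are made up):

```python
import numpy as np

X = np.array([[3.0, 120.0],   # ex. bedrooms, size in square meters
              [2.0,  80.0],
              [4.0, 200.0]])

max_scaled = X / X.max(axis=0)                                   # range 0..1
mean_norm  = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
z_scored   = (X - X.mean(axis=0)) / X.std(axis=0)                # mostly -3..3
```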
How do you speed up linear regression calculation for multiple features?
You can do it by introducing vectorization: represent the features and weights as vectors and use a dot product.
f(x) = w · x + b
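A sketch comparing the loop and vectorized versions (the weight and feature values are made up); NumPy's dot product runs in optimized low-level code, which is what makes it fast:

```python
import numpy as np

w = np.array([0.5, 1.2, -3.0])   # one weight per feature
x = np.array([100.0, 3.0, 2.0])  # one training example's features
b = 4.0

# Loop version: one multiply-add per feature
f = sum(w[j] * x[j] for j in range(w.shape[0])) + b

# Vectorized version: a single dot product, much faster for many features
f_vec = np.dot(w, x) + b
assert np.isclose(f, f_vec)
```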
What does the gradient descent algorithm look like in a vectorized implementation?
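A minimal sketch for multiple features, assuming X is an (m, n) matrix with one training example per row (hyperparameters are illustrative):

```python
import numpy as np

def gradient_descent_vectorized(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent with all n weights updated at once."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        errors = X @ w + b - y           # shape (m,): f(x^(i)) - y^(i)
        w -= alpha * (X.T @ errors) / m  # all partial derivatives at once
        b -= alpha * np.sum(errors) / m
    return w, b
```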
What is the alternative algorithm that can be used instead of gradient decent for linear regression?
Normal equation – used only for linear regression, it solves for the parameters directly, without iterations: w = (XᵀX)⁻¹ Xᵀ y. It works much faster than gradient descent when the number of features is < 10,000, but becomes slow beyond that.
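A sketch of the normal equation in NumPy, assuming the bias b is folded in as an extra all-ones column; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) w = X^T y directly, no iterations needed."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]                   # weights w, bias b

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])                      # y = 2x + 1
w, b = normal_equation(X, y)
print(w, b)                                        # ≈ [2.] 1.0
```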
What are the best practices when it comes to checking gradient descent for convergence?
There are 2 options:
- Plot a learning curve: the value of J versus the number of iterations. J should decrease after every iteration; if it ever increases, the learning rate is likely too large.
- Automatic convergence test – define a small threshold ε, like 10⁻³, and programmatically declare convergence when J decreases by less than ε in a single iteration.
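A minimal sketch of the automatic test, assuming a cost_history list of J values collected during training (the threshold value is illustrative):

```python
def has_converged(cost_history, epsilon=1e-3):
    """True once the cost improves by less than epsilon in one iteration."""
    if len(cost_history) < 2:
        return False
    return cost_history[-2] - cost_history[-1] < epsilon
```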
What is a recommended approach to choose learning rate?
Start with a small value like 0.001 and increase it roughly 3× in each following experiment (0.001, 0.003, 0.01, 0.03, 0.1, …). Stop when you see it is too large (J no longer decreases steadily) and pick a value just below that.
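A sketch of that sweep, assuming the gradient_descent and squared_error_cost functions and the x, y arrays from the earlier sketches:

```python
# Roughly 3x increments, as recommended above
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    w, b = gradient_descent(x, y, alpha=alpha, iterations=100)
    print(alpha, squared_error_cost(x, y, w, b))
# Pick the largest alpha whose cost still decreases smoothly;
# for too-large alphas the cost blows up instead of shrinking.
```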