Supervised Learning Flashcards
3 main categories of machine learning
Supervised learning, unsupervised learning, and reinforcement learning.

Describe Supervised Learning
Learning by observing and associating patterns in labeled data, then using that training to assign labels to new, unlabeled data.
two categories of supervised learning
Classification and regression.

Linear Regression - What variables can you change to move a line
Slope and y-intercept

Linear Regression - Describe the absolute trick
Adding values to the slope and y-intercept to move the line closer to a point. The value added to the slope should be the point's horizontal distance (p), and the value added to the y-intercept is arbitrary (typically 1). Both added values must then be scaled down by a learning rate so the line doesn't overshoot the point.
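
A minimal sketch of the absolute trick in Python, assuming a line y = w*x + b and a point (p, q); the function name and learning rate are illustrative:

def absolute_trick(w, b, p, q, learning_rate=0.01):
    q_prime = w * p + b              # the line's prediction at x = p
    if q > q_prime:                  # point above the line: add
        w = w + learning_rate * p
        b = b + learning_rate
    else:                            # point below the line: subtract
        w = w - learning_rate * p
        b = b - learning_rate
    return w, b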

Linear Regression - Describe the Square Trick
It's the absolute trick and then some: multiply the point's vertical distance from the line (q − q_prime) into the scaled slope and y-intercept updates. Smarter, because it gives the line a better-sized step toward the point: large when far away, small when close.


Note (absolute trick example): since the point is below the line, the intercept decreases; since the point has a negative x-value, the slope increases. If the point were above the line, you would add alpha and p*alpha instead of subtracting.


Note (square trick): drop the point's x-value into the line equation to determine q_prime (the predicted value).
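
A minimal sketch of the square trick under the same assumptions; the sign of (q − q_prime) handles points above and below the line automatically:

def square_trick(w, b, p, q, learning_rate=0.01):
    q_prime = w * p + b                        # prediction at x = p
    w = w + learning_rate * p * (q - q_prime)  # scaled by the vertical distance
    b = b + learning_rate * (q - q_prime)
    return w, b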

Describe Gradient Descent
Take the derivative (gradient) of an error function and move in the negative direction. The negative gradient is the direction of steepest descent, i.e., the fastest way to decrease the error function.
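
In symbols, each weight is nudged against its own partial derivative of the error, scaled by a learning rate α (the bias updates the same way):

w_i ← w_i − α · ∂E/∂w_i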

Two common error functions in linear regression
Mean Absolute Error - Take the absolute value of each error so the negatives don't cancel the positives out, then average.
Mean Squared Error - Square each error to make it non-negative; this gives you the area of a square around each point. Sum and average, then multiply by 1/2 to facilitate taking the derivative.
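
A minimal sketch of both error functions, assuming NumPy arrays of true and predicted values:

import numpy as np

def mean_absolute_error(y, y_pred):
    return np.mean(np.abs(y - y_pred))      # absolute values can't cancel out

def mean_squared_error(y, y_pred):
    return np.mean((y - y_pred) ** 2) / 2   # the 1/2 cancels when differentiating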

Visualize Mean Squared Error

Visualize Mean Absolute Error

Explain Batch vs Stochastic Gradient Descent
Batch - Calculate error for all points, then update weights
Stochastic - Calculate error for one point, then update weights

What type of gradient descent is used most often
Mini-batching - Split the data into mini-batches of equal size and update the weights after each mini-batch.
Calculating the error for ALL points before each update (batch) or one point at a time (stochastic) is slow; mini-batching is the practical compromise.
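
A minimal sketch of splitting data into mini-batches with NumPy; the batch size, shuffling, and names are illustrative:

import numpy as np

def mini_batches(X, y, batch_size=32, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]              # last batch may be smaller
        yield X[batch], y[batch]                           # one weight update per batch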

Negative Indexing - What is the difference between the following:
X = data[:, :-1]
y = data[:, -1]
X grabs all rows and all columns except the last.
y grabs all rows and just the last column.
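
A tiny worked example (the array values are illustrative):

import numpy as np

data = np.array([[1, 2, 3],
                 [4, 5, 6]])
X = data[:, :-1]   # [[1, 2], [4, 5]] -> feature columns
y = data[:, -1]    # [3, 6]           -> label column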

Steps of one gradient descent update
Make a prediction, calculate the error, then update the weights and bias with the gradient of the error (scaled by the learning rate).
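
A minimal sketch of one such update for linear regression under the halved MSE, assuming NumPy; the names and learning rate are illustrative:

import numpy as np

def gradient_descent_step(X, y, w, b, learning_rate=0.01):
    y_pred = X @ w + b                            # 1. make a prediction
    error = y - y_pred                            # 2. calculate the error
    w = w + learning_rate * X.T @ error / len(X)  # 3. step against the gradient
    b = b + learning_rate * error.mean()
    return w, b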

What is feature scaling, and what are two common scalings?
Transforming your data into a common range of values. Two common scalings:
Standardizing
Normalizing
Scaling allows faster convergence and makes training less sensitive to the scale of the features.
What is standardizing
Take each value in a column, subtract the column's mean, and then divide by the column's standard deviation.
The result is interpreted as the number of standard deviations the original value was from the mean.
What is normalizing?
Data are scaled between 0 and 1:
(value − min) / (max − min)
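
A minimal sketch of both scalings on a single column, assuming NumPy; the values are illustrative:

import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])

standardized = (col - col.mean()) / col.std()             # std devs from the mean
normalized = (col - col.min()) / (col.max() - col.min())  # squeezed into [0, 1]
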
Two specific cases to use feature scaling
- When your algorithm uses a distance-based metric to predict. If you don't scale, predictions will be misleading.
- When you incorporate regularization. If you don't scale, you unfairly punish features with smaller or larger ranges.
Describe Lasso Regularization
Allows for feature selection.
The formula shrinks certain coefficients all the way to zero, while non-zero coefficients indicate relevant features.
Use an alpha (lambda) multiplied by the sum of the absolute values of the coefficients, and add this penalty to the error.
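
A minimal sketch using scikit-learn's Lasso, assuming scikit-learn is installed; the synthetic data and the alpha value are illustrative:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1]      # the third feature is irrelevant

lasso = Lasso(alpha=0.1).fit(X, y)   # alpha is the lambda on the L1 penalty
print(lasso.coef_)                   # the irrelevant coefficient is driven to 0
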
Decision Trees - Describe Entropy
How much freedom the elements have to move around; more freedom (more possible arrangements) means higher entropy.

Decision Trees - Entropy described by probability
How much freedom you have to move around or rearrange the balls. In probability terms, entropy = −Σ p_i · log2(p_i) over the class probabilities p_i; it is highest when every class is equally likely.

Decision Trees - Entropy describe by knowledge
Less entropy = less room to move around = more knowledge you have
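
A minimal sketch computing entropy from a list of class labels, assuming NumPy; the ball colors are illustrative:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()        # probability of each class
    return -np.sum(p * np.log2(p))

print(entropy(["red"] * 4))                     # 0.0 -> no freedom, full knowledge
print(entropy(["red", "red", "blue", "blue"]))  # 1.0 -> most freedom, least knowledge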