Supervised Learning Flashcards
3 main categories of machine learning
Supervised learning, unsupervised learning, and reinforcement learning.
Describe Supervised Learning
Observing and associating patterns in labeled data, then using that training to assign labels to new, unlabeled data.
two categories of supervised learning
Classification and regression.
Linear Regression - What variables can you change to move a line
Slope and Y intercept
Linear Regression - Describe the absolute trick
Add values to the slope and y-intercept to make the line come closer to a point: the value added to the slope is the horizontal distance (p), and the value added to the y-intercept is 1. Both additions must then be scaled down by a learning rate so the line doesn’t overshoot the point.
Linear Regression - Describe the Square Trick
It’s the absolute trick and then some: multiply the scaled slope and y-intercept updates by the vertical distance of the point from the line. This is smarter because it gives the line a better-sized step to get closer to the point.
Since the point is below the line, the intercept decreases; since the point has a negative x-value, the slope increases.
If the point were above the line, you would add the alpha and p*alpha.
Plug the point’s values into the equation to determine q prime (the line’s prediction at that point).
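A rough Python sketch of both tricks (the line, the point, and the 0.01 learning rate are made-up values):

# Sketch of the absolute and square tricks for a line y = w1*x + w2 and a point (p, q).
def absolute_trick(w1, w2, p, q, alpha=0.01):
    q_hat = w1 * p + w2                       # the line's current prediction at x = p
    if q > q_hat:                             # point above the line: add
        return w1 + p * alpha, w2 + alpha
    return w1 - p * alpha, w2 - alpha         # point below the line: subtract

def square_trick(w1, w2, p, q, alpha=0.01):
    q_hat = w1 * p + w2
    return w1 + p * (q - q_hat) * alpha, w2 + (q - q_hat) * alpha

print(absolute_trick(2, 3, 5, 15))   # (2.05, 3.01): line nudged up toward (5, 15)
print(square_trick(2, 3, 5, 15))     # step scaled by the vertical distance 15 - 13 = 2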
Describe Gradient Descent
Take the derivative of an error function and move in the negative direction. The negative direction is the fastest way to decrease the error function.
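A tiny sketch of the idea on a made-up one-variable error function:

# Gradient descent on a toy error function E(w) = (w - 3)**2.
# Its derivative is dE/dw = 2*(w - 3); stepping in the negative
# direction moves w toward the minimum at w = 3.
learning_rate = 0.1
w = 0.0
for _ in range(50):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient    # step opposite to the gradient
print(round(w, 4))                   # close to 3.0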
Two common error functions in linear regression
Mean Absolute Error - Make all errors positive so the negatives don’t cancel each other out.
Mean Squared Error - Take all errors and square them to make them non-negative. This gives you the area of a square around each point. Sum and average, then multiply by 1/2 to facilitate taking the derivative.
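Both errors in a short Python sketch (the sample values are arbitrary):

import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])    # true values
y_hat = np.array([1.5, 1.5, 2.5, 4.5])    # predictions

mae = np.mean(np.abs(y - y_hat))          # absolute values stop errors cancelling out
mse = 0.5 * np.mean((y - y_hat) ** 2)     # the 1/2 makes the derivative cleaner
print(mae, mse)                           # 0.5 0.125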
Visualize Mean Squared Error
Visualize Mean Absolute Error
Explain Batch vs Stochastic Gradient Descent
Batch - Calculate error for all points, then update weights
Stochastic - calculate error for one point, then update weights
What type of gradient descent is used most often
Mini-batching - Split the data into mini-batches of equal size and update the weights based on each mini-batch
Calculating the error for ALL points (either in one batch, or one by one stochastically) is slow
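A sketch of splitting data into mini-batches (the batch size and random data are made up):

import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)
y = np.random.rand(100)

batch_size = 16
indices = np.random.permutation(len(X))          # shuffle before batching
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    X_batch, y_batch = X[batch], y[batch]
    # ...compute the error on this mini-batch and update the weights here...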
Negative Indexing - What is the difference between the following:
X = data[: , :-1]
y = data[: , -1]
X will grab all rows and all columns except the last
y will grab all rows and just the last column
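For example, with a small NumPy array (values are arbitrary):

import numpy as np

data = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns
X = data[:, :-1]                     # all rows, every column except the last
y = data[:, -1]                      # all rows, only the last column
print(X.shape, y.shape)              # (3, 3) (3,)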
Make a prediction, calculate the error, then update the weights and bias with the gradient of the error (scaled by the learning rate)
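One such step for linear regression under mean squared error, sketched in Python (the data, starting weights, and learning rate are made up):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.01

y_hat = X @ weights + bias                            # make a prediction
error = y - y_hat                                     # calculate the error
weights += learning_rate * (X.T @ error) / len(y)     # gradient step for the weights
bias += learning_rate * error.mean()                  # gradient step for the bias
print(weights, bias)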
What is feature scaling, two common scalings?
transforming your data into a common range of values. There are two common scalings:
Standardizing
Normalizing
Allows faster convergence; training is less sensitive to the scale of the features
What is standardizing
Taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column.
The result is interpreted as the number of standard deviations the original value was from the mean.
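In NumPy, roughly (the column values are arbitrary):

import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])
standardized = (col - col.mean()) / col.std()
print(standardized)   # each value expressed in standard deviations from the mean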
What is normalizing?
data are scaled between 0 and 1
(value - min) / (max - min)
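In NumPy, roughly (same kind of arbitrary column):

import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])
normalized = (col - col.min()) / (col.max() - col.min())
print(normalized)   # values scaled into [0, 1]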
Two specific cases to use feature scaling
- When your algorithm uses a distance based metric to predict.
- If you don’t, then predictions will be misleading
- When you incorporate regularization.
- if you don’t, then you unfairly punish features with smaller or larger ranges
Describe Lasso Regularization
Allows for feature selection
The formula squishes certain coefficients to zero, while non-zero coefficients indicate relevancy
Use an alpha (lambda) multiplied by the sum of the absolute values of the coefficients, and add this penalty to the error
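A rough sketch using scikit-learn's Lasso, assuming it's available; the synthetic data and alpha value are made up:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                # only the first two features matter
y = 4 * X[:, 0] + 2 * X[:, 1]

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                  # the irrelevant third coefficient is squished to (or near) zero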
Decision Trees - Describe Entropy
How much freedom do you have to move around
Decision Trees - Entropy described by probability
How much freedom do you have to move around or rearrange the balls
Decision Trees - Entropy describe by knowledge
Less entropy = less room to move around = more knowledge you have
Decision Trees - Entropy - Confirm how to calculate probabilities of recreating ball sequence
Since you grab the ball, and put it back each time, these are independent events and probabilities are multiplied by each other. *blue on first row should be zero
Decision Trees - Entropy - How to calculate the probability of independent events if there are 5,000 of them. What's the downside?
Multiply every event's probability together; this is computationally expensive, and small changes in one value can lead to large changes in the outcome.
We want something more manageable
Decision Tree - Entropy - How to turn a bunch of products into sums? To make the probability calculation more manageable.
Take the log of each item and sum everything together
Decision Trees - Entropy - Why take the negative log of each probability event
Since probabilities are less than 1, the log will be negative. Thus, to turn the values to positive, we take the negative log
Decision Trees - Entropy - Once you have the sum of the negative logs, what is the next step
Take the average
Decision Trees - Entropy - Formula - Describe the formal notation
- find prob of each event
- take negative log
- multiply by the occurrences of the event
- Take average
- Repeat for each probability
- Sum
Decision Tree - Entropy - Simplified Entropy Equation
probability * log of the probability
sum across and take the negative value.
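A Python sketch of the simplified formula (the class counts are made-up examples):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]                               # treat 0 * log(0) as 0
    return max(0.0, -np.sum(p * np.log2(p)))   # clamp -0.0 to 0.0

print(entropy([4, 4]))    # 1.0  -> maximum uncertainty for two classes
print(entropy([8, 0]))    # 0.0  -> no uncertainty, full knowledge
print(entropy([6, 2]))    # ~0.811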
Decision Trees - Information Gain - How to calculate?
- Change in entropy between the parent node and its child nodes
- Compute the parent node's entropy (it equals 1 only for an even two-class split)
- Subtract the weighted average of the children's entropies
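A rough sketch (it reuses the same entropy helper as above, redefined so it runs on its own; the parent/children counts are made up):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = [10, 10]                  # e.g. 10 red and 10 blue balls
children = [[8, 2], [2, 8]]        # class counts in each child after the split

total = sum(sum(c) for c in children)
weighted_children = sum(sum(c) / total * entropy(c) for c in children)
print(entropy(parent) - weighted_children)   # ~0.278 of information gained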
Decision Trees - Hyperparameters - Describe Maximum Depth
The largest length from the root to a leaf. A tree of maximum depth k can have at most 2^k leaves.
Decision Trees - Hyperparameters - Describe minimum number of samples per leaf
Each leaf must contain at least this many samples, specified either as an integer count or as a fraction of the total samples.
Decision Trees - Hyperparameters - Maximum Features and Minimum Number of samples per split
Minimum number of samples per split - a node must contain at least this many samples before it can be split
Maximum Features - the number of features to consider when searching for the best split
Decision Trees - Hyperparameters - Impact on overfitting/underfitting for small/large samples per leaf and small large depth
Large depth very often causes overfitting, since a tree that is too deep can memorize the data. Small depth can result in a very simple model, which may cause underfitting.
Small minimum samples per leaf may result in leaves with very few samples, which results in the model memorizing the data, or in other words, overfitting. Large minimum samples may result in the tree not having enough flexibility to get built, and may result in underfitting.
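A sketch of how these hyperparameters appear in scikit-learn (assuming scikit-learn; the specific values are just illustrative starting points):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,            # cap on the root-to-leaf length, guards against overfitting
    min_samples_leaf=10,    # every leaf must hold at least 10 samples
    min_samples_split=20,   # a node needs at least 20 samples before it can be split
    max_features=None,      # how many features to consider per split (None = all)
)
# model.fit(X_train, y_train) would then grow the tree on (hypothetical) training data.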
Bayes Theorem - High Level Description
Involves a prior and a posterior probability. New information is used to update the prior; the updated result becomes the posterior.
Bayes Theorem - Known versus Inferred?
Known
You know P(A) and you know P(R | A)
Inferred
Once we know the event R has occurred, we infer P(A | R)
Multiply the prior by the conditional probability of R given the event, then divide by the total probability of all the possible events that could have produced R.
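A worked example with hypothetical numbers:

p_a = 0.3                       # prior P(A)
p_r_given_a = 0.8               # known conditional P(R | A)
p_r_given_not_a = 0.2           # known conditional P(R | not A)

p_r = p_r_given_a * p_a + p_r_given_not_a * (1 - p_a)    # total probability that R occurred
p_a_given_r = p_r_given_a * p_a / p_r                    # inferred posterior P(A | R)
print(p_a_given_r)              # ~0.632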
Bayes Theorem - Discuss Naive Bayes
Involves multiple events and assumes independence
For P(A & B), we naively assume the events are independent and multiply their probabilities, even when they are actually dependent.
Think P(being HOT & COLD): these can't both happen, but the naive assumption treats them as if they could.
Just multiply the conditional probabilities of all the events together, multiply by the prior of the "given", and normalize the ratio.
Bayes Theorem - Naive Bayes Flip Step. Use example below
Flip the event and conditional.
P(A | B) becomes proportional to P(B|A) * P(A). Think in terms of a diagram.
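A sketch with hypothetical numbers for two classes A and B and two observed features f1, f2:

p_a, p_b = 0.5, 0.5                      # priors
p_f1_given_a, p_f2_given_a = 0.6, 0.7    # feature likelihoods under A
p_f1_given_b, p_f2_given_b = 0.2, 0.3    # feature likelihoods under B

score_a = p_f1_given_a * p_f2_given_a * p_a   # proportional to P(A | f1, f2)
score_b = p_f1_given_b * p_f2_given_b * p_b   # proportional to P(B | f1, f2)

total = score_a + score_b                     # the normalizing ratio
print(score_a / total, score_b / total)       # 0.875 0.125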