Supervised Learning Flashcards
3 main categories of machine learning
Supervised learning, unsupervised learning, and reinforcement learning.
Describe Supervised Learning
Observing and associating patterns in labeled data, then using that training to assign labels to new, unlabeled data.
two categories of supervised learning
Classification and regression.
Linear Regression - What variables can you change to move a line
Slope and Y intercept
Linear Regression - Describe the absolute trick
Add values to the slope and y-intercept to make the line come closer to a point: the value added to the slope is the horizontal distance (p), and the value added to the y-intercept is 1. Both additions must then be scaled down by a learning rate so the line doesn’t overshoot the point.
Linear Regression - Describe the Square Trick
It’s the absolute trick and then some: multiply the scaled slope and y-intercept updates by the vertical distance of the point from the line. This is smarter because it gives the line a better-sized step to get closer to the point.
Since the point is below the line, the intercept decreases; since the point has a negative x-value, the slope increases.
If the point were above the line, you would add the alpha and p*alpha.
Plug the point’s values into the equation to determine q prime (the line’s prediction at that point).
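A rough Python sketch of both tricks (the line, the point, and the 0.01 learning rate are made-up values):

# Sketch of the absolute and square tricks for a line y = w1*x + w2 and a point (p, q).
def absolute_trick(w1, w2, p, q, alpha=0.01):
    q_hat = w1 * p + w2                       # the line's current prediction at x = p
    if q > q_hat:                             # point above the line: add
        return w1 + p * alpha, w2 + alpha
    return w1 - p * alpha, w2 - alpha         # point below the line: subtract

def square_trick(w1, w2, p, q, alpha=0.01):
    q_hat = w1 * p + w2
    return w1 + p * (q - q_hat) * alpha, w2 + (q - q_hat) * alpha

print(absolute_trick(2, 3, 5, 15))   # (2.05, 3.01): line nudged up toward (5, 15)
print(square_trick(2, 3, 5, 15))     # step scaled by the vertical distance 15 - 13 = 2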
Describe Gradient Descent
Take the derivative of an error function and move in the negative direction. The negative direction is the fastest way to decrease the error function.
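A tiny sketch of the idea on a made-up one-variable error function:

# Gradient descent on a toy error function E(w) = (w - 3)**2.
# Its derivative is dE/dw = 2*(w - 3); stepping in the negative
# direction moves w toward the minimum at w = 3.
learning_rate = 0.1
w = 0.0
for _ in range(50):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient    # step opposite to the gradient
print(round(w, 4))                   # close to 3.0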
Two common error functions in linear regression
Mean Absolute Error - Make all errors positive so the negatives don’t cancel each other out.
Mean Squared Error - Take all errors and square them to make them non-negative. This gives you the area of a square around each point. Sum and average, then multiply by 1/2 to facilitate taking the derivative.
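Both errors in a short Python sketch (the sample values are arbitrary):

import numpy as np

y     = np.array([1.0, 2.0, 3.0, 4.0])    # true values
y_hat = np.array([1.5, 1.5, 2.5, 4.5])    # predictions

mae = np.mean(np.abs(y - y_hat))          # absolute values stop errors cancelling out
mse = 0.5 * np.mean((y - y_hat) ** 2)     # the 1/2 makes the derivative cleaner
print(mae, mse)                           # 0.5 0.125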
Visualize Mean Squared Error
Visualize Mean Absolute Error
Explain Batch vs Stochastic Gradient Descent
Batch - Calculate error for all points, then update weights
Stochastic - calculate error for one point, then update weights
What type of gradient descent is used most often
Mini-batching - Split the data into mini-batches of equal size and update the weights based on each mini-batch
Calculating the error for ALL points (either in one batch, or one by one stochastically) is slow
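A sketch of splitting data into mini-batches (the batch size and random data are made up):

import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)
y = np.random.rand(100)

batch_size = 16
indices = np.random.permutation(len(X))          # shuffle before batching
for start in range(0, len(X), batch_size):
    batch = indices[start:start + batch_size]
    X_batch, y_batch = X[batch], y[batch]
    # ...compute the error on this mini-batch and update the weights here...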
Negative Indexing - What is the difference between the following:
X = data[: , :-1]
y = data[: , -1]
X will grab all rows and all columns except the last
y will grab all rows and just the last column
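For example, with a small NumPy array (values are arbitrary):

import numpy as np

data = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns
X = data[:, :-1]                     # all rows, every column except the last
y = data[:, -1]                      # all rows, only the last column
print(X.shape, y.shape)              # (3, 3) (3,)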
Make a prediction, calculate the error, then update the weights and bias with the gradient of the error (scaled by the learning rate)
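One such step for linear regression under mean squared error, sketched in Python (the data, starting weights, and learning rate are made up):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([5.0, 4.0, 9.0])

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.01

y_hat = X @ weights + bias                            # make a prediction
error = y - y_hat                                     # calculate the error
weights += learning_rate * (X.T @ error) / len(y)     # gradient step for the weights
bias += learning_rate * error.mean()                  # gradient step for the bias
print(weights, bias)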
What is feature scaling, two common scalings?
transforming your data into a common range of values. There are two common scalings:
Standardizing
Normalizing
Allows faster convergence; training is less sensitive to the scale of the features
What is standardizing
Taking each value of your column, subtracting the mean of the column, and then dividing by the standard deviation of the column.
The result is interpreted as the number of standard deviations the original value was from the mean.
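In NumPy, roughly (the column values are arbitrary):

import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])
standardized = (col - col.mean()) / col.std()
print(standardized)   # each value expressed in standard deviations from the mean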
What is normalizing?
data are scaled between 0 and 1
(value - min) / (max - min)
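In NumPy, roughly (same kind of arbitrary column):

import numpy as np

col = np.array([10.0, 20.0, 30.0, 40.0])
normalized = (col - col.min()) / (col.max() - col.min())
print(normalized)   # values scaled into [0, 1]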
Two specific cases to use feature scaling
- When your algorithm uses a distance based metric to predict.
- If you don’t, then predictions will be misleading
- When you incorporate regularization.
- if you don’t, then you unfairly punish features with smaller or larger ranges
Describe Lasso Regularization
Allows for feature selection
The formula squishes certain coefficients to zero, while non-zero coefficients indicate relevancy
Use an alpha (lambda) multiplied by the sum of the absolute values of the coefficients, and add this penalty to the error
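A rough sketch using scikit-learn's Lasso, assuming it's available; the synthetic data and alpha value are made up:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                # only the first two features matter
y = 4 * X[:, 0] + 2 * X[:, 1]

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                  # the irrelevant third coefficient is squished to (or near) zero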
Decision Trees - Describe Entropy
How much freedom do you have to move around
Decision Trees - Entropy described by probability
How much freedom do you have to move around or rearrange the balls
Decision Trees - Entropy describe by knowledge
Less entropy = less room to move around = more knowledge you have
Decision Trees - Entropy - Confirm how to calculate probabilities of recreating ball sequence
Since you grab the ball, and put it back each time, these are independent events and probabilities are multiplied by each other. *blue on first row should be zero
Decision Trees - Entropy - How to calculate the probability of independent events if there are 5,000 of them. What's the downside?
Multiply every event's probability together; this is computationally expensive, and small changes in one value can lead to large changes in the outcome.
We want something more manageable
Decision Tree - Entropy - How to turn a bunch of products into sums? To make the probability calculation more manageable.
Take the log of each item and sum everything together
Decision Trees - Entropy - Why take the negative log of each probability event
Since probabilities are less than 1, the log will be negative. Thus, to turn the values to positive, we take the negative log
Decision Trees - Entropy - Once you have the sum of the negative logs, what is the next step
Take the average
Decision Trees - Entropy - Formula - Describe the formal notation
- find prob of each event
- take negative log
- multiply by the occurrences of the event
- Take average
- Repeat for each probability
- Sum
Decision Tree - Entropy - Simplified Entropy Equation
probability * log of the probability
sum across and take the negative value.
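A Python sketch of the simplified formula (the class counts are made-up examples):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]                               # treat 0 * log(0) as 0
    return max(0.0, -np.sum(p * np.log2(p)))   # clamp -0.0 to 0.0

print(entropy([4, 4]))    # 1.0  -> maximum uncertainty for two classes
print(entropy([8, 0]))    # 0.0  -> no uncertainty, full knowledge
print(entropy([6, 2]))    # ~0.811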
Decision Trees - Information Gain - How to calculate?
- Change in entropy between the parent node and its child nodes
- Compute the parent node's entropy (it equals 1 only for an even two-class split)
- Subtract the weighted average of the children's entropies
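A rough sketch (it reuses the same entropy helper as above, redefined so it runs on its own; the parent/children counts are made up):

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

parent = [10, 10]                  # e.g. 10 red and 10 blue balls
children = [[8, 2], [2, 8]]        # class counts in each child after the split

total = sum(sum(c) for c in children)
weighted_children = sum(sum(c) / total * entropy(c) for c in children)
print(entropy(parent) - weighted_children)   # ~0.278 of information gained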
Decision Trees - Hyperparameters - Describe Maximum Depth
The largest length from the root to a leaf. A tree of maximum depth k can have at most 2^k leaves.
Decision Trees - Hyperparameters - Describe minimum number of samples per leaf
Each leaf must contain at least this many samples, specified either as an integer count or as a fraction of the total samples.
Decision Trees - Hyperparameters - Maximum Features and Minimum Number of samples per split
Minimum number of samples per split - a node must contain at least this many samples before it can be split
Maximum Features - the number of features to consider when searching for the best split
Decision Trees - Hyperparameters - Impact on overfitting/underfitting for small/large samples per leaf and small large depth
Large depth very often causes overfitting, since a tree that is too deep can memorize the data. Small depth can result in a very simple model, which may cause underfitting.
Small minimum samples per leaf may result in leaves with very few samples, which results in the model memorizing the data, or in other words, overfitting. Large minimum samples may result in the tree not having enough flexibility to get built, and may result in underfitting.
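A sketch of how these hyperparameters appear in scikit-learn (assuming scikit-learn; the specific values are just illustrative starting points):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,            # cap on the root-to-leaf length, guards against overfitting
    min_samples_leaf=10,    # every leaf must hold at least 10 samples
    min_samples_split=20,   # a node needs at least 20 samples before it can be split
    max_features=None,      # how many features to consider per split (None = all)
)
# model.fit(X_train, y_train) would then grow the tree on (hypothetical) training data.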
Bayes Theorem - High Level Description
Involves a prior and a posterior probability. New information is used to update the prior; the updated result becomes the posterior.
Bayes Theorem - Known versus Inferred?
Known
You know P(A) and you know P(R | A)
Inferred
Once we know the event R has occurred, we infer P(A | R)
Multiply the prior by the conditional probability of R given the event, then divide by the total probability of all the possible events that could have produced R.
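A worked example with hypothetical numbers:

p_a = 0.3                       # prior P(A)
p_r_given_a = 0.8               # known conditional P(R | A)
p_r_given_not_a = 0.2           # known conditional P(R | not A)

p_r = p_r_given_a * p_a + p_r_given_not_a * (1 - p_a)    # total probability that R occurred
p_a_given_r = p_r_given_a * p_a / p_r                    # inferred posterior P(A | R)
print(p_a_given_r)              # ~0.632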
Bayes Theorem - Discuss Naive Bayes
Involves multiple events and assumes independence
For P(A & B), we naively assume the events are independent and multiply their probabilities, even when they are actually dependent.
Think P(being HOT & COLD): these can't both happen, but the naive assumption treats them as if they could.
Just multiply the conditional probabilities of all the events together, multiply by the prior of the "given", and normalize the ratio.
Bayes Theorem - Naive Bayes Flip Step. Use example below
Flip the event and conditional.
P(A | B) becomes proportional to P(B|A) * P(A). Think in terms of a diagram.
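A sketch with hypothetical numbers for two classes A and B and two observed features f1, f2:

p_a, p_b = 0.5, 0.5                      # priors
p_f1_given_a, p_f2_given_a = 0.6, 0.7    # feature likelihoods under A
p_f1_given_b, p_f2_given_b = 0.2, 0.3    # feature likelihoods under B

score_a = p_f1_given_a * p_f2_given_a * p_a   # proportional to P(A | f1, f2)
score_b = p_f1_given_b * p_f2_given_b * p_b   # proportional to P(B | f1, f2)

total = score_a + score_b                     # the normalizing ratio
print(score_a / total, score_b / total)       # 0.875 0.125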