Midterm 2 Flashcards
Convert odds of 1:8 to a probability.
Odds of 1:8 mean 1 success for every 8 failures, i.e., 1 out of every 9 outcomes. 1/8 = 0.125 –> 0.125/(0.125 + 1) = 0.1111, an 11.11% probability.
Odds-to-probability equation (odds given as for:against)
probability = odds / (odds + 1)
What does a logistic regression model predict?
LogOdds! This will have a range of (-infinity, infinity)
How do you convert logOdds to odds?
e^logOdds = odds
How do you convert logOdds to probability?
e^logOdds/(e^logOdds + 1)
What does logOdds equal in terms of ln(x)?
ln(odds) = logOdds; when it is written as log(odds), the log means the natural log.
What is the range of odds (what are they bound by?)
[0, infinity)
What is the range of logOdds (what are they bound by?)
(-infinity, infinity)
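A minimal Python sketch of these conversions (the function names are my own, not from the course material):

```python
import math

def odds_to_probability(odds):
    # odds of 1:8 are passed in as 1/8 = 0.125
    return odds / (odds + 1)

def log_odds_to_odds(log_odds):
    # odds = e^logOdds
    return math.exp(log_odds)

def log_odds_to_probability(log_odds):
    # p = e^logOdds / (e^logOdds + 1)
    return math.exp(log_odds) / (math.exp(log_odds) + 1)

def odds_to_log_odds(odds):
    # logOdds = ln(odds)
    return math.log(odds)

print(odds_to_probability(1 / 8))       # ~0.1111 (11.11%)
print(odds_to_log_odds(1 / 8))          # ~-2.079, somewhere in (-inf, inf)
print(log_odds_to_probability(-2.079))  # back to ~0.1111
```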
What type of estimation model is logistic regression, and why?
Class probability estimation model. It uses a numeric output to estimate the probability of a categorical (class) variable! Ex. What is the chance Marc goes to class? 0.3
What loss function does support vector machine use?
Hinge loss
Hinge loss (loss function)
An instance on the wrong side of the line does not automatically incur a penalty; the penalty applies only when the instance is on the wrong side of the boundary and outside the margin, and it grows with distance from there.
Zero-one loss
An instance incurs a loss of 0 for a correct decision and 1 for an incorrect decision.
Squared error
Specifies a loss equal to the square of the distance from the boundary. A further instance would have a greater error. Usually used for numeric value prediction rather than classification.
Loss function
Determines how much penalty should be assigned to an instance based on the model’s predictive value
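A quick sketch of the three loss functions for a single instance, using the standard convention that the true label y is -1 or +1 and f(x) is the model's signed output (the convention and the numbers are my own, not from the cards):

```python
def hinge_loss(y, fx):
    # Zero while y*fx >= 1 (at least a margin's distance onto the correct side);
    # the penalty then grows linearly with distance past the margin.
    return max(0.0, 1.0 - y * fx)

def zero_one_loss(y, fx):
    # 0 for a correct decision, 1 for an incorrect one.
    return 0.0 if y * fx > 0 else 1.0

def squared_error(y_true, y_pred):
    # Penalty equal to the squared distance from the target value;
    # typically used for numeric prediction rather than classification.
    return (y_true - y_pred) ** 2

print(hinge_loss(+1, 2.0))      # 0.0 -> well outside the margin, no penalty
print(hinge_loss(-1, 0.5))      # 1.5 -> wrong side, penalty grows with distance
print(zero_one_loss(+1, -0.5))  # 1.0 -> wrong side of the boundary
print(squared_error(3.0, 5.0))  # 4.0
```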
Finish this sentence: accuracy on the training data is sometimes called…
In-sample accuracy (train), as opposed to out-of-sample accuracy (test)
When is logistic regression more accurate vs. decision tree and vice versa?
Logistic regression tends to be more accurate on smaller data sets; decision trees tend to be more accurate on larger ones.
What’s the point of regularization?
It gives a penalty to more complicated models because those are more prone to overfitting.
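A small scikit-learn sketch of the idea, assuming scikit-learn is available; the data is synthetic, and C is the inverse of the regularization strength, so a smaller C means a heavier penalty on complex (large-coefficient) models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data just to illustrate the effect of the penalty.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Strong regularization (small C) shrinks coefficients toward zero;
# weak regularization (large C) lets the model fit the training data more closely.
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    print(C, abs(model.coef_).sum())
```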
In a confusion matrix what are the column headers? Row headers?
Column: Actual y and n
Row: Predicted y and n
False positive
Predicted positive, actual negative
False negative
Predicted negative, actual positive
True negative
Predicted negative, actual negative
True positive
Predicted positive, actual positive
True positive rate
True positives / all actual positives = TP / (TP + FN)
False positive rate
False positives / all actual negatives = FP / (FP + TN)
Positive predictive value (PPV)
True positives / all predicted positives = TP / (TP + FP)
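A small sketch computing the three rates directly from confusion-matrix counts (the counts below are made up for illustration):

```python
def rates(tp, fp, fn, tn):
    tpr = tp / (tp + fn)   # true positive rate: TP / all actual positives
    fpr = fp / (fp + tn)   # false positive rate: FP / all actual negatives
    ppv = tp / (tp + fp)   # positive predictive value: TP / all predicted positives
    return tpr, fpr, ppv

# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN
print(rates(tp=40, fp=10, fn=20, tn=30))  # (0.666..., 0.25, 0.8)
```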
What’s the expected value of a game of roulette? Probability of hitting black = 48%. Bet = $100
EV = (0.48)(100) + (1 - 0.48)(-100) = -$4
What are the two uses for expected value?
- Inform how to use our classifier for individual predictions.
- Compare classifiers.
Class priors
The proportion of positive and negative instances in your data set. Ex. 40 of 100 people would buy a new car next year if they could. p(p) = .4, p(n) = .6
Two critical conditions underlying profit calculations:
- Class priors
- Costs and benefits
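A hedged sketch of how those two conditions combine into an expected profit per targeted customer; the rates, priors, and dollar values below are invented for illustration:

```python
# Class priors
p_pos, p_neg = 0.4, 0.6

# Classifier performance (rates, not counts)
tpr, fpr = 0.8, 0.2             # so the false negative rate is 0.2

# Costs and benefits of each outcome
benefit_tp, cost_fp = 50, -10   # e.g., profit from a real buyer vs. a wasted offer
benefit_tn, cost_fn = 0, 0      # no action taken, no cost or benefit

expected_profit = (
    p_pos * (tpr * benefit_tp + (1 - tpr) * cost_fn)
    + p_neg * (fpr * cost_fp + (1 - fpr) * benefit_tn)
)
print(expected_profit)  # 0.4*(0.8*50) + 0.6*(0.2*-10) = 16 - 1.2 = 14.8
```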
Where is the perfect point on an ROC curve (hint: x axis is FPR, y axis is TPR)
Top left. FPR of 0, TPR of 1
How are ROC curves created?
TPR and FPR are computed at every cutoff (threshold) value. Ex. for the Titanic model, sweep the threshold from 0 to 1 and record a (FPR, TPR) point at every 0.01.
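A sketch of that threshold sweep in plain Python (the labels and scores are made up; each cutoff contributes one (FPR, TPR) point to the curve):

```python
def roc_points(labels, scores, steps=100):
    # labels: 1 = positive, 0 = negative; scores: model's estimated probability
    points = []
    pos = sum(labels)
    neg = len(labels) - pos
    for i in range(steps + 1):
        cutoff = i / steps
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))   # (FPR, TPR) at this cutoff
    return points

labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
print(roc_points(labels, scores, steps=10)[:3])
```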
What is the Area Under the ROC Curve used for (AUC)?
AUC is used when a single number is needed to summarize classifier performance, or when nothing is known about the operating conditions (the class priors and the costs and benefits).
What are two alternatives to the ROC curve?
- Cumulative response curve
- Lift curve
How is a lift curve calculated
From cumulative response curve values: lift = y / x, the cumulative percentage of positives captured divided by the percentage of the population targeted.
How can you calculate cumulative response curve values from lift?
Multiply the x-axis value by the lift: contacting 20% of the population with a lift of 2.5 means 0.2 × 2.5 = 0.5 on the cumulative response curve.
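A tiny sketch of the two conversions, where x is the fraction of the population contacted and y is the cumulative response value:

```python
def lift_from_cumulative_response(x, y):
    # lift = cumulative response (y) divided by fraction targeted (x)
    return y / x

def cumulative_response_from_lift(x, lift):
    # cumulative response = fraction targeted (x) times lift
    return x * lift

print(lift_from_cumulative_response(0.2, 0.5))   # 2.5
print(cumulative_response_from_lift(0.2, 2.5))   # 0.5
```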
Euclidean distance
The straight-line distance formula applied to the attributes of two instances (e.g., two people): the square root of the sum of the squared attribute differences.
Manhattan distance
Distance measured along the axes rather than along the hypotenuse: the sum of the absolute differences in each attribute.
Euclidean vs. Manhattan
Euclidean uses the hypotenuse of the triangle whose two vertices are the instances; Manhattan uses the two legs.
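The two distances side by side for instances stored as lists of numeric attributes (the attribute values are made up):

```python
import math

def euclidean(a, b):
    # Straight-line ("hypotenuse") distance: sqrt of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Distance along the axes: sum of absolute differences per attribute.
    return sum(abs(x - y) for x, y in zip(a, b))

person_a = [35, 60000]   # e.g., age, income
person_b = [32, 58000]
print(euclidean(person_a, person_b))   # ~2000.0
print(manhattan(person_a, person_b))   # 2003
```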
Nearest neighbors
Judge similarity by calculating the distance to the nearest neighbors and using their values to make a prediction. Ex. 3 nearest neighbors –> 2 no's, 1 yes. The instance should be a no!
How do we give weight to closest neighbors
With a similarity weight: the inverse of the distance squared (1/d²) –> contribution = similarity weight / (sum of all similarity weights)
How do we get to a probability from nearest neighbors
Sum of all “no” contributions = p(no)
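A sketch combining the last two cards: weight each neighbor by the inverse of its squared distance, normalize the weights into contributions, and sum the contributions per class to get probabilities (the neighbor distances and labels are invented):

```python
def knn_probabilities(neighbors):
    # neighbors: list of (distance, label) pairs for the k nearest neighbors
    weights = [(1.0 / d ** 2, label) for d, label in neighbors]  # similarity weights
    total = sum(w for w, _ in weights)
    probs = {}
    for w, label in weights:
        probs[label] = probs.get(label, 0.0) + w / total   # contribution per neighbor
    return probs

# 3 nearest neighbors: two "no" votes, one "yes" vote
neighbors = [(1.0, "no"), (2.0, "no"), (3.0, "yes")]
print(knn_probabilities(neighbors))
# weights 1, 0.25, 0.111 -> p(no) ≈ 0.92, p(yes) ≈ 0.08
```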
How do we avoid overfitting the data with nearest neighbors?
By choosing a higher k = # of neighbors!
What are the three issues with nearest neighbors?
- Dimensionality and domain knowledge. Unimportant features might have too much influence over important ones!
- Fast to train, slow to predict. Prediction requires computing the distance to every instance in the dataset.
- Easy to interpret, but no “knowledge” extracted from data.
Hierarchical clustering
Consider individual points and the distances between them, merging the closest groups first. Ex. points with a Euclidean distance smaller than x will be clustered together.
Link function (clustering)
The minimum requirement (e.g., a distance criterion) that must be met before an item is merged into a cluster.
Centroid-based clustering
Decide on k (the number of centroids) and form groups around them. Each point is assigned to the centroid it is closest to, and when a point is added the centroid is repositioned (moved to the center of its group).
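A bare-bones sketch of centroid-based clustering in the style of k-means, on made-up 2D points; a real implementation would add smarter initialization and a convergence check:

```python
import math
import random

def kmeans(points, k, iterations=10):
    centroids = random.sample(points, k)          # pick k starting centroids
    for _ in range(iterations):
        # Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Reposition each centroid at the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```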