Stats & Probability Flashcards
What is a p-value?
When testing a hypothesis, the p-value is the probability of observing results at least as extreme as ours due purely to random chance, assuming the null hypothesis is true.
What does it mean when a p-value is low?
When the p-value is low, results like ours would rarely arise from random variation alone if the null hypothesis were true.
Because of this, we may decide to reject the null hypothesis. If the p-value is below some pre-defined threshold, we say that the result is “statistically significant” and we reject the null hypothesis.
What value is most often used to determine statistical significance?
A value of alpha = 0.05 is most often used as the threshold for statistical significance.
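As a rough illustration of the p-value and threshold described above, here is a minimal sketch of an exact two-sided binomial test. The scenario (60 heads in 100 flips of a coin assumed fair under the null) and the function name are assumptions made for this example, not part of the original cards.

```python
from math import comb

def two_sided_binomial_p_value(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Probability, under the null, of an outcome at least as unlikely as the observed one."""
    def pmf(k: int) -> float:
        return comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)

    observed = pmf(heads)
    # Add up every outcome that is no more likely than what we saw.
    # The small tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

alpha = 0.05  # conventional significance threshold
p_value = two_sided_binomial_p_value(heads=60, flips=100)  # hypothetical data
print(f"p-value = {p_value:.4f}, reject null: {p_value < alpha}")
```

With these made-up numbers the p-value lands slightly above 0.05, so the null hypothesis would not be rejected at that threshold.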
What are the 5 linear regression assumptions and how can you check for them?
Linearity: the target (y) and the features (xi) have a linear relationship. Check: Plot the errors against the predicted y and look for the values to be symmetrically distributed around a horizontal line with constant variance
Independence: the errors are not correlated with one another. Check: plot errors over time and look for non-random patterns (in the case of time series data)
Normality: the errors are normally distributed. Check: histogram of the errors
Homoskedasticity: the variance of the error term is constant across values of the target and features. Check: plot the errors against the predicted y
No Multicollinearity. Check: look for correlations above ~0.8 between features
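The multicollinearity check in the list above can be sketched with a plain Pearson correlation over every pair of features. The feature names and values below are invented for illustration, and the 0.8 cutoff is the rule of thumb from the card rather than a hard limit.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every feature pair with |r| above the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical housing features: sqft and rooms move together, age does not.
features = {
    "sqft": [800, 1200, 1500, 2100, 2600, 3400],
    "rooms": [2, 3, 4, 5, 6, 8],
    "age_years": [30, 5, 42, 12, 25, 18],
}
flagged = correlated_pairs(features)
print(flagged)
```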
What are some pitfalls of using classification accuracy to assess your model?
Classification accuracy can be misleading in the case of imbalanced data sets.
For example, if 95% of my target is “1” and 5% is “0,” I can achieve 95% accuracy by predicting “1” for every observation in my data. Obviously this model isn’t useful despite having an accuracy of 95%.
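The 95%/5% example can be reproduced in a few lines. This sketch also computes recall for the rare class, which is one metric (chosen here as an illustration) that exposes the problem accuracy hides.

```python
# 95 observations of class 1 and 5 of class 0, as in the example above.
labels = [1] * 95 + [0] * 5

# A "model" that ignores its input and always predicts the majority class.
predictions = [1] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for class 0: of the true 0s, how many did the model find?
found_zeros = sum(1 for p, y in zip(predictions, labels) if y == 0 and p == 0)
recall_zero = found_zeros / labels.count(0)

print(f"accuracy = {accuracy:.2f}, recall for class 0 = {recall_zero:.2f}")
```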
What are some ways to deal with imbalanced datasets?
Resampling is a common way to deal with imbalanced datasets. Here are two possible resampling techniques:
- Use all samples from your more frequently occurring event and then randomly sample your less frequently occurring event (with replacement) until you have a balanced data set
- Use all samples from your less frequently occurring event and then randomly sample your more frequently occurring event (with or without replacement) until you have a balanced data set
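The two resampling techniques above can be sketched with the standard `random` module. The row format and function names are placeholders for this example. The second function samples without replacement, which is one of the two options the card allows.

```python
import random

def oversample_minority(majority, minority, rng):
    """Keep every majority row; draw minority rows with replacement to match its size."""
    return majority + rng.choices(minority, k=len(majority))

def undersample_majority(majority, minority, rng):
    """Keep every minority row; draw majority rows without replacement to match its size."""
    return rng.sample(majority, k=len(minority)) + minority

rng = random.Random(0)  # seeded so the sketch is repeatable
majority = [("row", i, 1) for i in range(95)]  # hypothetical rows with label 1
minority = [("row", i, 0) for i in range(5)]   # hypothetical rows with label 0

over = oversample_minority(majority, minority, rng)
under = undersample_majority(majority, minority, rng)
print(len(over), len(under))
```

Oversampling keeps all the information in the majority class at the cost of repeated minority rows, while undersampling discards majority rows to get a smaller, balanced set.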
What is a Type I error?
Type I error is the rejection of a true null hypothesis, or a “false positive.”
What is a Type II error?
Type II error is the non-rejection of a false null hypothesis, or a “false negative.”
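Both error types can be estimated by simulation. The sketch below assumes a two-sided z-test of "population mean = 0" with known standard deviation 1, samples of 25, and a true mean of 0.5 for the false-null case. All of these settings are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def rejects_null(sample, sigma=1.0, alpha=0.05):
    """Two-sided z-test of H0: population mean = 0, with known sigma."""
    sample_mean = sum(sample) / len(sample)
    z = sample_mean / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def rejection_rate(true_mean, rng, trials=2000, n=25):
    """Fraction of simulated experiments in which the null is rejected."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        hits += rejects_null(sample)
    return hits / trials

rng = random.Random(1)
# Null is true here, so every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0, rng=rng)
# Null is false here, so every non-rejection is a Type II error.
type_ii_rate = 1 - rejection_rate(true_mean=0.5, rng=rng)
print(f"Type I rate ~ {type_i_rate:.3f}, Type II rate ~ {type_ii_rate:.3f}")
```

The Type I rate should hover near the chosen alpha, while the Type II rate depends on the effect size and sample size.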
What is bias (of a statistic)?
Bias is the difference between the expected value of an estimator and the true value of the population parameter being estimated.
For example, if we survey homeowners on the value of their homes and only the wealthiest homeowners respond, then our “home value” estimate will be biased since it will be larger than the true value for our population. (That is an example of sampling bias causing us to have a biased statistic).
For machine learning models, bias refers to something slightly different: it is error caused by choosing an algorithm that cannot accurately model the signal in the data. For example, selecting a simple linear regression to model highly non-linear data would result in error due to bias.
What is variance (of a statistic)?
Variance is a measurement of how spread out a set of values is around its mean.
More formally, Var(X) = E[(X - μ)^2], where μ = E[X]
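As a worked instance of the formula, the sketch below computes the variance of one roll of a fair six-sided die using exact fractions. The die is an example chosen here.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)  # each face of a fair die is equally likely

# mu = E[X]
mu = sum(x * prob for x in outcomes)

# Var(X) = E[(X - mu)^2]
variance = sum((x - mu) ** 2 * prob for x in outcomes)

print(mu, variance)
```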
What is the curse of dimensionality?
The curse of dimensionality refers to problems that occur when we try to use statistical methods in high-dimensional space.
As the number of features (dimensionality) increases, the data becomes relatively more sparse and often exponentially more samples are needed to make statistically significant predictions.
Imagine going from a 10x10 grid to a 10x10x10 grid… if we want one sample in each “1x1 square”, then the addition of the third parameter requires us to have 10 times as many samples (1000) as we needed when we had 2 parameters (100).
In short, some models become much less accurate in high-dimensional space and may behave erratically. Examples include: linear models with no feature selection or regularization, kNN, Bayesian models
Models that are less affected by the curse of dimensionality: regularized models, random forest, some neural networks, stochastic models (e.g. Monte Carlo simulations)
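The grid argument above generalizes directly: with a fixed number of bins per feature, the number of cells grows exponentially with the number of features. A minimal sketch, with function names invented for this example:

```python
def cells_needed(bins_per_axis: int, dimensions: int) -> int:
    """Number of grid cells when each feature is split into the same number of bins."""
    return bins_per_axis ** dimensions

def max_coverage(samples: int, bins_per_axis: int, dimensions: int) -> float:
    """Upper bound on the fraction of cells that can contain at least one sample."""
    return min(1.0, samples / cells_needed(bins_per_axis, dimensions))

for d in (2, 3, 6):
    print(d, cells_needed(10, d), max_coverage(1000, 10, d))
```

With 1000 samples, a 2- or 3-feature grid can be fully covered, but at 6 features the same samples can touch at most a tenth of a percent of the cells.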
What is the Central Limit Theorem?
When we draw independent samples from any single distribution with finite variance, the sample mean tends toward the population mean, and the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the distribution from which the samples were drawn. The variance of the sample mean is the population variance divided by the sample size.
For example, let’s say we have a fair and balanced 6-sided die. The result of rolling the die has a uniform distribution on [1, 2, 3, 4, 5, 6]. The average result from a die roll is (1+2+3+4+5+6)/6 = 3.5
If we roll the die 10 times and average the values, then the resulting average will have a distribution that begins to look similar to a normal distribution centered around 3.5.
If we roll the die 100 times and average the values, then the resulting average will have a distribution that looks and behaves even more like a normal distribution, again centered at 3.5, but now with decreased variance.
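The die example can be checked by simulation. This sketch draws 5000 averages for each sample size (an arbitrary choice) and compares their spread with the population variance of a single roll, 35/12, divided by the sample size.

```python
import random
from statistics import fmean, pvariance

def sample_means(rolls_per_sample, n_samples, rng):
    """Average of `rolls_per_sample` fair die rolls, repeated `n_samples` times."""
    means = []
    for _ in range(n_samples):
        rolls = [rng.randint(1, 6) for _ in range(rolls_per_sample)]
        means.append(sum(rolls) / rolls_per_sample)
    return means

rng = random.Random(42)      # seeded so the sketch is repeatable
population_variance = 35 / 12  # variance of a single fair die roll

results = {}
for n in (10, 100):
    means = sample_means(n, 5000, rng)
    results[n] = (fmean(means), pvariance(means))
    print(n, results[n], population_variance / n)
```

Plotting a histogram of `means` for each `n` would show the bell shape tightening around 3.5 as `n` grows.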
What is interpolation?
Interpolation is making predictions on data that lies inside the range of the training set.
Example: let’s say we have a model that predicts the value of homes based on their size. Our model was trained on a data set containing homes between 500 and 5000 sq-ft. Using this model to predict the value of a 4200 sq-ft home is interpolation.
What is extrapolation and why can it be dangerous?
Extrapolation is making predictions on data that lies outside the range of the training set.
Example: let’s say we have a model that predicts the value of homes based on their size. Our model was trained on a data set containing homes between 500 and 5000 sq-ft. Using this model to predict the value of a 6000 sq-ft home is extrapolation.
Extrapolation is dangerous because you usually can’t guarantee the relationship between the target and features beyond what you’ve observed. In the example, the relationship between square footage and home price may be “locally linear” between 500-5000 sq. feet, but exponential after that, resulting in a poor prediction.
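To make the danger concrete, the sketch below invents a price function that is linear up to 5000 sq-ft and exponential beyond it, fits a straight line to training homes between 500 and 5000 sq-ft, and compares the errors at 4200 and 6000 sq-ft. The price function and its constants are purely hypothetical.

```python
from math import exp

def true_price(sqft):
    """Hypothetical ground truth: linear up to 5000 sq-ft, exponential growth beyond."""
    if sqft <= 5000:
        return 100.0 * sqft
    return 500_000.0 * exp((sqft - 5000) / 2000)

def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Training set only covers 500 to 5000 sq-ft.
train_x = list(range(500, 5001, 500))
train_y = [true_price(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)

def predict(sqft):
    return slope * sqft + intercept

interpolation_error = abs(predict(4200) - true_price(4200))
extrapolation_error = abs(predict(6000) - true_price(6000))
print(interpolation_error, extrapolation_error)
```

Inside the training range the line is essentially exact. Outside it, the model keeps extending the same slope and misses the change in behavior entirely.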
Discuss the differences between Bayesian and frequentist statistics.
Both attempt to estimate a population parameter based on a sample of data.
Frequentists treat the data as random and the population parameter as fixed. Inferences are based on long-run infinite sampling, and estimates of the parameter come in the form of point estimates or confidence intervals.
Bayesians treat the population parameter as random and the data as fixed. Bayesian statistics allows/requires you to make informed guesses about the value of the parameter in the form of prior distributions. Estimates of the parameter come in the form of posterior distributions.
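The contrast can be sketched on a single made-up dataset of 7 successes in 10 trials. The frequentist side uses a normal-approximation (Wald) interval, which is rough at this sample size, and the Bayesian side assumes a uniform Beta(1, 1) prior. Both choices are assumptions for illustration.

```python
from statistics import NormalDist

successes, trials = 7, 10  # hypothetical data

# Frequentist: the parameter is fixed; report a point estimate and a confidence interval.
p_hat = successes / trials
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
half_width = z * (p_hat * (1 - p_hat) / trials) ** 0.5
confidence_interval = (p_hat - half_width, p_hat + half_width)

# Bayesian: the parameter has a distribution; a Beta prior updates to a Beta posterior.
prior_a, prior_b = 1, 1  # uniform prior, an assumption
post_a = prior_a + successes
post_b = prior_b + (trials - successes)
posterior_mean = post_a / (post_a + post_b)

print(p_hat, confidence_interval)
print((post_a, post_b), posterior_mean)
```

The frequentist output is a single number plus an interval, while the Bayesian output is a whole distribution, here summarized by its two parameters and its mean.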
What is the multiple testing problem and how can we compensate for it?
Multiple hypothesis testing occurs when we run many hypothesis tests at once. If more than one hypothesis test is used to arrive at the same (or a correlated) conclusion, our chance of getting at least one false positive goes up.
One way to compensate for it is using the Bonferroni Correction. Here, we recalculate each individual alpha to equal overall_alpha/k, where k is the number of tests, so that we don’t artificially increase the chance of false positives.
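A short sketch of the arithmetic: it computes the chance of at least one false positive across k tests, with and without the Bonferroni correction. The formula used for that chance assumes the tests are independent and every null hypothesis is true. The choice of k = 20 is arbitrary.

```python
def bonferroni_alpha(overall_alpha: float, k: int) -> float:
    """Per-test threshold under the Bonferroni correction."""
    return overall_alpha / k

def familywise_error_rate(per_test_alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent tests when every null is true."""
    return 1 - (1 - per_test_alpha) ** k

k = 20
uncorrected = familywise_error_rate(0.05, k)
corrected = familywise_error_rate(bonferroni_alpha(0.05, k), k)
print(f"uncorrected: {uncorrected:.3f}, corrected: {corrected:.3f}")
```

Without the correction, 20 tests at alpha = 0.05 give roughly a 64% chance of at least one false positive. With it, the overall rate stays just under 5%.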
Name 4 discrete distributions and give a brief explanation and example for each one.
Uniform: All outcomes are equally likely to occur. P(each event) = 1/n
Example: Outcome of a fair and balanced die (uniform on [1, 2, 3, 4, 5, 6])
Bernoulli: Only two possible outcomes can occur. The events are complementary. P(event 1) = p, P(event 2) = 1-p
Example: Outcome of a single coin flip
Binomial: Describes the count of successes of n repeated Bernoulli trials, with each trial having a probability of success of p
Example: Outcome of multiple coin flips, e.g. after observing 2 coin flips, we have P(2 heads) = .25, P(2 tails) = .25, P(1 heads, 1 tails) = .50
Poisson: Describes the probability of k events occurring in a fixed period of time, given that events occur at a constant average rate and independently of the time since the last event.
Example: The number of cars that drive past your house in the next hour
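The binomial and Poisson probabilities above can be computed directly from their probability mass functions. The rate of 12 cars per hour is a made-up number for this sketch.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n Bernoulli trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, rate: float) -> float:
    """P(exactly k events in one period when events occur at the given average rate)."""
    return rate**k * exp(-rate) / factorial(k)

# Two fair coin flips: probabilities of 0, 1, and 2 heads.
two_flips = [binomial_pmf(k, 2, 0.5) for k in range(3)]

# Hypothetical average of 12 cars per hour: chance of exactly 10 in the next hour.
ten_cars = poisson_pmf(10, 12)

print(two_flips, ten_cars)
```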
Name 3 continuous distributions and give an example of each one.
Uniform: All outcomes are equally likely to occur. All equal-length intervals have the same probability. Any single outcome, i.e. an interval with length = 0, has a probability of 0
Example: Select a random real number between 0 and 10. P(x in [0,3]) = 3/10, but P(x=1) = 0
Normal: A “bell shaped” symmetric distribution that is described by its average and the degree to which observations deviate from the average (standard deviation).
Example: Height of humans
Beta: A probability distribution over probabilities, i.e. a distribution that represents how plausible each possible value of an unknown probability (between 0 and 1) is
Example: You create a distribution of possible 3-point shooting percentages for your favorite basketball player at the start of the season to estimate his true shooting percentage over the entire season.
What is a long-tailed distribution?
A long-tailed distribution is one where there are many relatively extreme, but unique, outliers.
These distributions happen often in retail. For example, if we looked at customers’ baskets at a grocery store over a 1 month period, we may see that there are many thousands, or even millions, of unique baskets. This is because there are so many different item combinations that a customer can select. And because foods are not consumed at the same rate (and other reasons), it is relatively rare to make repeated identical purchases.
Special techniques must be used, such as doing clustering on the tail, when dealing with long-tail datasets in order to leverage them to train classification or other predictive models.
What is A/B testing and why is it useful?
An A/B test is a controlled experiment where two variants (A and B) are tested against each other on the same response.
For example, a company could test two different email subject lines and then measure which one has the higher open rate. Once the superior variant has been determined (through statistical significance or some preset time period or metric), all future customers will typically only receive the “winning” variant.
A/B testing is useful because it allows practitioners to rapidly test variations and learn about an audience’s preferences.
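One common way to decide whether the difference between two variants is statistically significant is a two-proportion z-test. The sketch below applies it to made-up open counts for two subject lines. The test choice and the numbers are assumptions, not something the card prescribes.

```python
from statistics import NormalDist

def two_proportion_z_test(opens_a, sent_a, opens_b, sent_b):
    """Two-sided z-test for a difference between two open rates; returns (z, p_value)."""
    rate_a, rate_b = opens_a / sent_a, opens_b / sent_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    std_err = (pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b)) ** 0.5
    z = (rate_a - rate_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: 240 of 1000 opened variant A, 200 of 1000 opened variant B.
z, p_value = two_proportion_z_test(opens_a=240, sent_a=1000, opens_b=200, sent_b=1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```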
What is multivariate testing and why is it useful?
Multivariate testing is very similar to A/B testing, but it simultaneously tests more than 2 variants. This can be extremely useful when trying to optimize across a larger parameter space, e.g. 5 possible email subject lines, but it can take many more samples to achieve a statistically significant result. Another potential drawback is that a relatively large audience (>50%) will receive a non-optimal variation during testing.
What is multi-armed bandit testing and why is it useful?
Multi-armed bandit (or simply “bandit”) testing is similar to multivariate and A/B testing, but the sampling distribution for variants changes gradually over time as feedback is received.
For example, with traditional A/B tests, we could test 2 email subject lines: A and B. We would initially send out emails to 200 customers, sending 100 A variations and 100 B variations. After some set period of time, say 24 hours, we would observe which email variant was opened by more customers. We would then send that variant to all customers moving forward.
With bandit testing, we would set some learning rate for the distribution of variants to change over time. Perhaps 60 customers opened the A variant emails and only 50 customers opened the B variant emails. We could then shift the distribution from (50% A, 50% B) to (55% A, 45% B) for the next round of emails.
Using this approach, we can continuously monitor the response from our audience and shift our resources accordingly. This is particularly useful in marketing or any industry where people’s preferences and opinions may change rapidly since it continuously tests and learns preferences and can adapt very quickly.
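The shift from (50% A, 50% B) to (55% A, 45% B) can be reproduced with a simple update rule: move A's share by the learning rate times the difference in open rates. This rule and the learning rate of 0.5 are hypothetical choices that happen to match the numbers above. Production systems more often use strategies such as epsilon-greedy, Thompson sampling, or upper confidence bounds.

```python
def shift_allocation(share_a, opens_a, sent_a, opens_b, sent_b, learning_rate=0.5):
    """Return the next round's (share_a, share_b) after one round of feedback."""
    rate_a, rate_b = opens_a / sent_a, opens_b / sent_b
    # Move traffic toward whichever variant performed better this round.
    new_share_a = share_a + learning_rate * (rate_a - rate_b)
    # Keep the share a valid proportion.
    new_share_a = min(1.0, max(0.0, new_share_a))
    return new_share_a, 1.0 - new_share_a

# 60 of 100 opened variant A and 50 of 100 opened variant B.
share_a, share_b = shift_allocation(0.5, opens_a=60, sent_a=100, opens_b=50, sent_b=100)
print(f"next round: {share_a:.0%} A, {share_b:.0%} B")
```

Calling the function again after each round, with that round's counts, lets the allocation keep drifting toward the better-performing variant.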
What is the bootstrap technique? What is it used for?
The bootstrap technique is a nonparametric method of estimating the sampling distribution of a statistic.
Specifically, bootstrap involves sampling your entire dataset with replacement many times, at each pass calculating the statistic you’re interested in. A distribution is constructed by building a histogram of the statistics generated from each pass.
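The resampling procedure can be sketched as follows. The data values, the choice of the mean as the statistic, the 2000 resamples, and the percentile interval at the end are all assumptions for this example.

```python
import random
from statistics import fmean

def bootstrap_distribution(data, statistic, n_resamples, rng):
    """Recompute `statistic` on many same-size resamples drawn with replacement."""
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

rng = random.Random(7)  # seeded so the sketch is repeatable
data = [12.1, 9.8, 11.4, 10.9, 13.2, 8.7, 10.3, 11.8, 9.5, 12.6]  # hypothetical sample

boot_means = sorted(bootstrap_distribution(data, fmean, 2000, rng))

# A simple 95% percentile interval read off the sorted bootstrap statistics.
lower = boot_means[int(0.025 * len(boot_means))]
upper = boot_means[int(0.975 * len(boot_means)) - 1]
print(f"sample mean = {fmean(data):.2f}, 95% interval ~ ({lower:.2f}, {upper:.2f})")
```

A histogram of `boot_means` is the constructed distribution the card describes. The same function works for medians or other statistics by passing a different callable.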
What is the probability of rolling two 6’s in a row with a fair die (a fair die has numbers 1-6 on it)?
⅙ * ⅙ = 1/36
We roll a fair die 10 times. What is the probability that at least one of them comes up as a 3?
1 - (⅚)^10 ≈ 0.84
In this case, it is easier to calculate 1 - P(the complement of what we want) which is 1 - P(we roll a die 10 times and never observe a 3)
We randomly draw two cards, without replacement, from a standard deck of cards. What is the probability they are both Kings?
P(A and B) = P(A) × P(B | A) = 4/52 × 3/51 = 1/221
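The three probability answers above can be double-checked with exact fractions. This is a minimal sketch, and the variable names are chosen here for readability.

```python
from fractions import Fraction

# Two 6's in a row: independent rolls, so the probabilities multiply.
two_sixes = Fraction(1, 6) * Fraction(1, 6)

# At least one 3 in 10 rolls: 1 minus the chance of never rolling a 3.
at_least_one_three = 1 - Fraction(5, 6) ** 10

# Two Kings without replacement: P(first is a King) * P(second is a King | first was).
two_kings = Fraction(4, 52) * Fraction(3, 51)

print(two_sixes, float(at_least_one_three), two_kings)
```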