Statistics Flashcards
technical interview study
What is a p-value?
When testing a hypothesis, the p-value is the probability that we would observe results at least as extreme as our result due purely to random chance if the null hypothesis were true.
or…
A p-value is the probability that random chance alone generated the observed data, or something equally rare or rarer.
What does it mean when a p-value is low?
When a p-value is low, it is relatively rare for the observed results to be purely from random chance.
Because of this, we may decide to reject the null hypothesis.
If the p-value is below some pre-defined threshold (alpha), we say that the result is “statistically significant” and we reject the null hypothesis.
What value is most often used to determine statistical significance?
A value of alpha=0.05 is most often used as a threshold for statistical significance.
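A minimal sketch of the decision rule. The experiment (100 flips of a supposedly fair coin, 60 heads) is invented for illustration, and the binomial tail probability is computed by hand rather than with a stats library:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p), summed by hand."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

ALPHA = 0.05  # the conventional significance threshold

# Hypothetical experiment: 100 flips of a supposedly fair coin come up
# heads 60 times. How surprising is that under the null (p = 0.5)?
p_value = binom_tail(100, 60)  # one-sided p-value
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value < ALPHA else "fail to reject H0")
```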
What are the five linear regression assumptions and how can you check them?
- Linearity: the target (y) and the features (xi) have a linear relationship.
Check linearity:
Plot the errors against the predicted yhat and look for the values to be symmetrically distributed around a horizontal line with constant variance.
- Independence: the errors are not correlated with one another.
Check independence: Plot errors over time and look for non-random patterns (in the case of time series data).
- Normality: the errors are normally distributed.
Check normality: histogram of the errors.
- Homoskedasticity: The variance of the error terms is constant across all values of the features/predictions.
Check homoskedasticity: Plot the errors against the predicted yhat.
- Non-Multicollinearity: the features are not highly correlated with one another.
Check non-multicollinearity: Look for pairwise correlations > 0.80 (or compute variance inflation factors).
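A rough sketch of a couple of these checks on synthetic data. The feature/target setup is invented for illustration; real diagnostics would also include the residual plots described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two independent features, a genuinely linear target.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# With an intercept, OLS residuals have mean ~0; plot them against the
# fitted values to eyeball linearity and homoskedasticity.
print("residual mean:", residuals.mean())

# Multicollinearity check: pairwise feature correlations (flag |r| > 0.80).
corr = np.corrcoef(X, rowvar=False)
print("feature correlation:", corr[0, 1])
```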
What are the pitfalls of using classification accuracy to assess your model?
Classification accuracy can be misleading in the case of imbalanced datasets.
For example, if 95% of the targets are “1” and 5% are “0”, we can achieve 95% accuracy by simply predicting “1” for every observation in the dataset.
Obviously, this model isn’t useful despite having 95% accuracy.
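A tiny illustration of the 95/5 example above:

```python
# Mirror of the example above: 95% of labels are "1", 5% are "0".
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100  # an "always predict 1" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95, even though the model never identifies a "0"
```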
What are some ways to deal with imbalanced datasets?
Resampling is a common way to deal with imbalanced datasets. Here are two possible sampling techniques:
- Oversampling: use all samples from your most frequently occurring event and then randomly sample (with replacement) your less frequently occurring event until you have a balanced dataset.
- Undersampling: use all samples from your less frequently occurring event and then randomly sample your more frequently occurring event (with or without replacement) until you have a balanced dataset.
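Both techniques can be sketched in a few lines (the class labels and sizes here are hypothetical):

```python
import random

random.seed(0)
majority = ["neg"] * 95  # hypothetical class counts
minority = ["pos"] * 5

# Oversampling: keep every majority sample, draw the minority class
# WITH replacement until both classes are the same size.
oversampled = majority + random.choices(minority, k=len(majority))

# Undersampling: keep every minority sample, draw a same-sized subset
# of the majority class (here without replacement).
undersampled = minority + random.sample(majority, k=len(minority))

print(len(oversampled), len(undersampled))  # 190 and 10
```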
What is a Type I error?
Type I error is the rejection of a true null hypothesis, or a “false positive” classification.
What is a Type II error?
Type II error is the non-rejection of a false null hypothesis, or a “false negative” classification.
What is bias of a statistic?
Bias is the difference between the expected value of a statistic and the true value of the population parameter being estimated.
e.g., if we survey homeowners on the values of the homes and only the wealthiest homeowners respond, then our “home value” estimate will be biased since it will be larger than the true value of the parameter (this is an example of sampling bias causing a biased statistic).
For machine learning models, bias refers to something slightly different: it is error caused by choosing an algorithm that cannot accurately model the signal in the data. e.g., selecting a simple linear regression to model highly non-linear data would result in error due to bias.
What is variance of a statistic?
Variance is the measurement of how spread out a set of values are from their mean.
More formally,
Var(X) = E[(X - μ)^2]
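Worked directly for a fair six-sided die:

```python
# Var(X) = E[(X - mu)^2], computed directly for a fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
mu = sum(outcomes) / len(outcomes)  # 3.5
var = sum((x - mu) ** 2 for x in outcomes) / len(outcomes)
print(mu, var)  # 3.5 and 35/12 ~= 2.917
```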
What is the Central Limit Theorem?
When we draw samples of independent random variables (drawn from a single distribution with a finite variance), their sample mean tends toward the population mean and their distribution approaches a normal distribution as the sample size increases, regardless of the distribution from which the sample was drawn. Their variance will approach the population variance divided by the sample size.
e.g., let’s say we have a fair and balanced 6-sided die. The result of rolling the die has a uniform distribution on [1,2,3,4,5,6]. The average result from rolling the die is (1+2+3+4+5+6)/6 = 3.5.
…if we roll the die 10 times and average the values, then the resulting sample mean will have a distribution that begins to look similar to a normal distribution centered around 3.5.
…if we roll the die 100 times and average the values, then the resulting sample mean will have a distribution that looks/behaves even more similar to a normal distribution, again centered at 3.5, but now with decreased variance, etc.
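The die example can be simulated to watch the variance shrink (the sample counts here are arbitrary):

```python
import random

random.seed(0)

def mean_of_rolls(n):
    """Average of n fair-die rolls: one draw of the sample mean."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# 5000 sample means at each of two sample sizes.
means_10 = [mean_of_rolls(10) for _ in range(5000)]
means_100 = [mean_of_rolls(100) for _ in range(5000)]

def sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

# Both distributions centre near 3.5; the larger sample size is tighter.
print(sd(means_10), sd(means_100))
```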
What is interpolation?
Interpolation is making predictions on data that lies inside the range of the training set.
e.g., Let’s say we have a model that predicts the value of homes based on their size. Our model was trained on a dataset containing homes between 500 and 5000 sq ft. Using this model to predict the value of a 4200 sq ft home is interpolation.
What is extrapolation and why can it be dangerous?
Extrapolation is making predictions on values outside the range of the training set.
e.g., Say we have a model that predicts the value of homes based on their size. Our model was trained on a dataset containing homes between 500 and 5000 sq ft. Using this model to predict the value of a 6000 sq ft home is extrapolation.
Extrapolation is dangerous because we usually can’t guarantee the relationship between the target and features beyond what we’ve observed. In the example, the relationship between the square footage and home price may be “locally linear” between 500-5000 sq ft, but exponential after that, resulting in poor prediction.
Discuss the differences between frequentist and Bayesian statistics.
Both attempt to estimate a population parameter on a sample of data.
Frequentists treat the data as random and the population parameter as fixed. Inferences are based on long-run (hypothetically infinite) sampling, and estimates of the parameter come in the form of point estimates or confidence intervals.
Bayesians treat the observed data as fixed and the population parameter as random. Bayesian statistics allows/requires you to make informed guesses about the value of a parameter in the form of prior distributions. Estimates of the parameter come in the form of posterior distributions.
What is the multiple (hypothesis) testing problem and how can we compensate for it?
Multiple hypothesis testing occurs when we run many hypothesis tests at once. If more than one hypothesis test is used to arrive at the same (or a correlated) conclusion, our chance of making at least one false positive increases.
One way to compensate for this is the Bonferroni correction. Here, we recalculate each individual alpha to equal overall_alpha/k, where k is the number of tests, so that we don’t artificially increase the chance of false positives.
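A sketch of the correction (the five p-values are made up):

```python
def bonferroni(p_values, overall_alpha=0.05):
    """Flag significance against the corrected threshold alpha / k."""
    threshold = overall_alpha / len(p_values)
    return [p < threshold for p in p_values]

# Five hypothetical p-values; the corrected threshold is 0.05 / 5 = 0.01.
print(bonferroni([0.001, 0.04, 0.009, 0.2, 0.011]))
# only 0.001 and 0.009 survive the correction
```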
Name four discrete distributions and briefly provide an example for each one.
- Uniform: all outcomes are equally likely to occur. P(each event) = 1/n.
Example Uniform: outcome of fair die is uniform on [1,2,3,4,5,6].
- Bernoulli: Only two possible outcomes can occur. The events are complementary. P(event 1) = p; P(event 2) = 1-p.
Example Bernoulli: Outcome of a single coin flip.
- Binomial: Describes the count of successes in n repeated Bernoulli trials, with each trial having probability of success p.
Example Binomial: Outcome of multiple coin flips, e.g., after observing 2 coin flips we have P(2 heads) = .25, P(2 tails) = .25, P(1 head, 1 tail) = .50.
- Poisson: Describes the probability of k events occurring in a fixed period of time, given that each event occurs at a constant rate and is independent of the time since the last event.
Example Poisson: The number of cars that will drive past your house in the next hour.
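All four can be sampled with NumPy to sanity-check the examples above (the Poisson rate of 4 cars per hour is invented):

```python
import numpy as np

rng = np.random.default_rng(1)

uniform = rng.integers(1, 7, size=10_000)         # fair die rolls
bernoulli = rng.random(10_000) < 0.5              # single coin flips
binomial = rng.binomial(n=2, p=0.5, size=10_000)  # heads in 2 flips
poisson = rng.poisson(lam=4.0, size=10_000)       # cars per hour (hypothetical rate)

# Empirical check of the two-flip example: P(1 head, 1 tail) ~= 0.50.
print((binomial == 1).mean())
```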
Name 3 continuous distributions and give an example of each one.
Uniform: All outcomes are equally likely to occur. All equal-length intervals have the same probability of occurring. Any single outcome, i.e. an interval with length 0, has probability 0.
Example Uniform: Select a random real number between 0 and 10. P(X in [0,3]) = 3/10, but P(X=1) = 0.
Normal: A “bell-shaped” symmetric distribution that is described by its average and the degree to which observations deviate from the average (standard deviation).
Example Normal: heights of humans.
Beta: A probability distribution of probabilities, i.e. a distribution that represents the likelihood of a range of probabilities being true when the true probability is unknown.
Example Beta: You create a distribution of possible 3-point shooting percentages for your favorite basketball player at the start of the season to estimate his true shooting percentage over the entire season with the knowledge that he will probably have a similar percentage as last year and that a cold or hot streak at the start of the season is not necessarily representative of his “true” underlying shooting percentage for the entire season.
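A sketch of that prior-plus-evidence update. The Beta(35, 65) prior (encoding last year’s roughly 35%) and the 12-for-20 hot start are invented numbers:

```python
# Prior belief from last season: roughly 35% shooting, encoded (as an
# invented choice) as Beta(35, 65). A Beta prior plus binomial evidence
# gives a Beta posterior with simple count updates.
a, b = 35, 65

# Hypothetical hot start to the season: 12 makes in 20 attempts (60%).
makes, misses = 12, 8
posterior = (a + makes, b + misses)

posterior_mean = posterior[0] / (posterior[0] + posterior[1])
print(posterior, posterior_mean)  # the prior pulls the 60% streak back toward 35%
```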
What is a long-tailed distribution?
A long-tailed distribution is one where there are many relatively extreme, but unique, outliers.
These distributions happen often in retail. e.g., if we looked at customers’ baskets at a grocery store over a 1 month period, we may see many thousands, or even millions, of unique baskets. This is because there are so many different combinations a customer can select. And because foods are not consumed at the same rate (and other reasons), it is relatively rare to see repeated identical baskets.
Special techniques, such as clustering on the tail, must be used when dealing with long-tailed datasets in order to leverage them to train classification or other predictive models.
What is an A/B test and why is it useful?
An A/B test is a controlled experiment where two variants are tested against each other on the same response.
e.g., a company could test two different email subject lines and then measure which one has the higher click rate. Once the superior variant has been determined (through statistical significance or some preset time period or metric), all future customers will typically receive the “winning” variant.
A/B testing is useful because it allows practitioners to rapidly test variations and learn about an audience’s preferences.
What is multivariate testing and why is it useful?
Multivariate testing is similar to A/B testing, but it simultaneously tests more than 2 variants.
This can be useful when trying to optimize across a larger parameter space, e.g. 5 possible email subject lines, but it can take many more samples to achieve a statistically significant result.
Another potential drawback is that a relatively large audience (>50%) will receive a non-optimal variation during testing.
What is multi-armed bandit testing and why is it useful?
Multi-armed bandit (or simply “bandit”) testing is similar to multivariate testing and A/B testing, but the sampling distribution over variants changes gradually over time as feedback is received.
e.g., with traditional A/B tests, we could test 2 email subject lines, A and B. We would initially send out emails to 200 customers, sending 100 A variations, 100 B variations. After some set period of time, say 24 hrs, we would observe which email variant was opened by more customers. We would then send that variant to all customers going forward.
With bandit testing, we would get some learning rate for the distribution of variants to change over time. Perhaps 60 customers opened variant A emails and only 50 customers opened variant B emails. We could then shift the distribution from 50/50 to 55% A, 45% B for the next round of emails.
Using this approach, we can continuously monitor the response from our audience and shift our variants accordingly. This is particularly useful in marketing or any industry where people’s preferences and opinions may change rapidly, since it continuously tests, learns new preferences, and can adapt quickly.
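A toy version of that reallocation step. The fixed learning rate and the 60/50 open counts mirror the example above; real bandit algorithms (e.g. Thompson sampling) set the shift in a more principled way:

```python
def update_split(split, opens, learning_rate=0.05):
    """Shift send proportions toward whichever variant performed better.

    The fixed learning_rate is a simplification of the "learning rate"
    mentioned above; this is a sketch, not a production bandit.
    """
    a, b = split
    if opens["A"] > opens["B"]:
        a, b = a + learning_rate, b - learning_rate
    elif opens["B"] > opens["A"]:
        a, b = a - learning_rate, b + learning_rate
    return (round(a, 2), round(b, 2))

# The example above: 60 opens for variant A vs 50 for variant B.
print(update_split((0.50, 0.50), {"A": 60, "B": 50}))  # (0.55, 0.45)
```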
Note from Kyle: “I love bandit testing and prefer it over A/B testing whenever possible!”
What is the bootstrap technique and what is it used for?
The bootstrap technique is a nonparametric method of estimating the SAMPLING DISTRIBUTION of a STATISTIC (a parameter estimate).
Specifically, bootstrapping involves sampling your entire dataset with replacement many times, calculating the statistic you’re interested in on each pass. A distribution is constructed by building a histogram of the statistics generated from each pass.
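A minimal sketch (the data values are made up; the statistic here is the mean):

```python
import random

random.seed(0)

# Hypothetical observed sample; the statistic of interest is the mean.
data = [3.1, 4.7, 2.9, 5.5, 4.2, 3.8, 6.0, 2.5, 4.9, 3.3]

# Resample the dataset with replacement many times, recording the
# statistic on each pass; the collection approximates the sampling
# distribution of the mean.
boot_means = []
for _ in range(10_000):
    resample = random.choices(data, k=len(data))
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
ci = (boot_means[250], boot_means[9750])  # rough percentile 95% interval
print(ci)
```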
What is the probability of rolling two 6s in a row with a fair die?
P(X=6,X=6) = (1/6)(1/6) = 1/36
We roll a fair die 10 times. What is the probability that at least one of them comes up as a 3?
P(at least one 3 in 10 rolls) = 1 - P(no 3s in 10 rolls) = 1 - (5/6)^10 ≈ 0.84
We randomly draw two cards, without replacement, from a standard deck of cards. What is the probability that both cards are kings? (there are 4 kings in a standard deck of cards)
P(A,B) = P(A) * P(B|A) = (4/52)(3/51)
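All three answers can be checked with exact fractions:

```python
from fractions import Fraction

# Two sixes in a row with a fair die:
two_sixes = Fraction(1, 6) * Fraction(1, 6)    # 1/36

# At least one 3 in ten rolls = 1 - P(no 3s in ten rolls):
at_least_one_3 = 1 - Fraction(5, 6) ** 10

# Two kings drawn without replacement:
two_kings = Fraction(4, 52) * Fraction(3, 51)  # 1/221

print(two_sixes, float(at_least_one_3), two_kings)
```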
Explain p-value computation to a five year old.
Simple example: flipping two fair coins.
Recall: probability = (# outcomes of interest) / (total # of outcomes).
What is the probability of getting 2 heads in a row?
.5 * .5 = .25
Build a probability tree and see that:
P(H,H) = .5*.5 = .25
P(H,T) = .5*.5 = .25
P(T,H) = .5*.5 = .25
P(T,T) = .5*.5 = .25
so P(one H, one T) = .25 + .25 = .5
What is p-value of getting two heads in a row?
First, define the p-value as the probability that random CHANCE generated the observed outcome, UNION the probability of any outcome that is EQUALLY likely, UNION the probability of anything RARER.
thus, there are THREE PARTS to a p-value:
part 1: the probability of the observed outcome itself, which equals P(H,H) = .25 here.
part 2: …part 1 UNION the outcome T,T, which is EQUALLY likely as H,H, i.e. P(T,T) = P(H,H) = .25.
part 3: …parts 1 and 2 UNION any other outcome(s) that are RARER (i.e. have probability < P(H,H)); here there are none.
p-value(H,H) = P(H,H) + P(any equally likely outcome) + P(any rarer outcome)
= .25 + .25 + 0 = 0.50
More complicated example: flipping a coin 5 times and getting 5 heads.
Probability = # outcomes of interest / # total outcomes:
P(5H) = 1/32 = .03125
P(4H,1T) = 5/32
P(3H,2T) = 10/32 = 5/16
P(2H,3T) = 10/32 = 5/16
P(1H,4T) = 5/32
P(5T) = 1/32
p-value (5H)
= P(5H) + P(any outcome equally likely as 5H) + P(any outcome rarer than 5H)
= 1/32 + P(5T) + 0
= 2/32 = 1/16 = .0625
Notice that p-value (5H) = 0.0625 > alpha = 0.05, so it is not all that unusual to see 5 heads in a row!
What is p-value (4T, 1H)?
p-value (4T, 1H) = P(4T,1H) + P(any outcome equally likely) + P(any outcome rarer)
= P(4T,1H) + P(1T, 4H) + P(5 H) + P(5 T)
= 5/32 + 5/32 + 1/32 + 1/32
= 12/32 = 3/8 = 0.375
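Both coin-flip p-values can be verified by brute-force enumeration of all 32 outcomes:

```python
from itertools import product
from collections import Counter

# All 32 equally likely outcomes of 5 fair coin flips, keyed by # of heads.
counts = Counter(seq.count("H") for seq in product("HT", repeat=5))

def p_value(num_heads):
    """Sum the probabilities of every outcome as rare or rarer."""
    p_obs = counts[num_heads] / 32
    return sum(c / 32 for c in counts.values() if c / 32 <= p_obs)

print(p_value(5))  # 2/32  = 0.0625
print(p_value(1))  # 12/32 = 0.375
```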