Statistics Flashcards

technical interview study

1
Q

What is a p-value?

A

When testing a hypothesis, the p-value is the probability that we would observe results at least as extreme as our result due purely to random chance if the null hypothesis were true.

or…

A p-value is the probability that random chance generated the data, or something else that is equal or rarer.

2
Q

What does it mean when a p-value is low?

A

When a p-value is low, it is relatively rare for the observed results to be purely from random chance.

Because of this, we may decide to reject the null hypothesis.

If the p-value is below some pre-defined threshold (alpha), we say that the result is “statistically significant” and we reject the null hypothesis.

3
Q

What value is most often used to determine statistical significance?

A

A value of alpha=0.05 is most often used as a threshold for statistical significance.

4
Q

What are the five linear regression assumptions and how can you check them?

A
  1. Linearity: the target (y) and the features (xi) have a linear relationship.

Check linearity:
Plot the errors against the predicted yhat and look for the values to be symmetrically distributed around a horizontal line with constant variance.

  2. Independence: the errors are not correlated with one another.

Check independence: Plot errors over time and look for non-random patterns (in the case of time series data).

  3. Normality: the errors are normally distributed.

Check normality: histogram of the errors.

  4. Homoskedasticity: the variance of the error terms is constant across the range of predicted values.

Check homoskedasticity: Plot the errors against the predicted yhat and look for constant spread.

  5. Non-multicollinearity: the features are not highly correlated with one another.

Check non-multicollinearity: Look for pairwise correlations > 0.80.
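The multicollinearity check is easy to script. A minimal sketch with numpy, using a hypothetical feature matrix where x2 is deliberately built as a near-copy of x1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: x2 is nearly a copy of x1 (collinear), x3 is independent.
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# Check non-multicollinearity: flag pairwise correlations > 0.80.
corr = np.corrcoef(X, rowvar=False)
for i in range(3):
    for j in range(i + 1, 3):
        if abs(corr[i, j]) > 0.80:
            print(f"x{i+1} and x{j+1} are highly correlated: {corr[i, j]:.2f}")
```

Running this flags only the x1/x2 pair; a more thorough check would also look at variance inflation factors, since multicollinearity can exist without any single large pairwise correlation.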
5
Q

What are the pitfalls of using classification accuracy to assess your model?

A

Classification accuracy can be misleading in the case of imbalanced datasets.

For example, if 95% of targets are “1” and 5% are “0”, we can achieve 95% accuracy by simply predicting “1” for every observation in the dataset.

Obviously, this model isn’t useful despite having 95% accuracy.

6
Q

What are some ways to deal with imbalanced datasets?

A

Resampling is a common way to deal with imbalanced datasets. Here are two possible sampling techniques:

  1. Use all samples from your more frequently occurring event and then randomly sample (with replacement) your less frequently occurring event until you have a balanced dataset (oversampling).
  2. Use all samples from your less frequently occurring event and then randomly sample your more frequently occurring event (with or without replacement) until you have a balanced dataset (undersampling).
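Both techniques can be sketched in a few lines of numpy. A toy example with a hypothetical 95/5 class split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced dataset: 95 majority-class rows (0) and 5 minority-class rows (1).
majority = np.zeros(95, dtype=int)
minority = np.ones(5, dtype=int)

# 1) Oversampling: keep all majority rows, sample the minority class
#    WITH replacement until the classes are balanced.
oversampled_minority = rng.choice(minority, size=len(majority), replace=True)
balanced_over = np.concatenate([majority, oversampled_minority])

# 2) Undersampling: keep all minority rows, sample the majority class
#    down to the minority count (here, without replacement).
undersampled_majority = rng.choice(majority, size=len(minority), replace=False)
balanced_under = np.concatenate([undersampled_majority, minority])

print(len(balanced_over), balanced_over.mean())    # 190 rows, 50% minority
print(len(balanced_under), balanced_under.mean())  # 10 rows, 50% minority
```

Note the tradeoff: oversampling duplicates minority rows (risking overfitting to them), while undersampling discards majority information.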
7
Q

What is a Type I error?

A

Type I error is the rejection of a true null hypothesis, or a “false positive” classification.

8
Q

What is a Type II error?

A

Type II error is the non-rejection of a false null hypothesis, or a “false negative” classification.

9
Q

What is bias of a statistic?

A

Bias is the difference between the calculated value of the parameter and the true value of the population parameter being estimated.

e.g., if we survey homeowners on the values of their homes and only the wealthiest homeowners respond, then our “home value” estimate will be biased since it will be larger than the true value of the parameter (this is an example of sampling bias causing a biased statistic).

For machine learning models, bias refers to something slightly different: it is error caused by choosing an algorithm that cannot accurately model the signal in the data. e.g., selecting a simple linear regression to model highly non-linear data would result in error due to bias.

10
Q

What is variance of a statistic?

A

Variance is the measurement of how spread out a set of values are from their mean.

More formally,

Var(X) = E[(X - mu)^2]

11
Q

What is the Central Limit Theorem?

A

When we draw samples of independent random variables (drawn from a single distribution with a finite variance), their sample mean tends toward the population mean and their distribution approaches a normal distribution as the sample size increases, regardless of the distribution from which the sample was drawn. Their variance will approach the population variance divided by the sample size.

e.g., let’s say we have a fair and balanced 6-sided die. The result of rolling the die has a uniform distribution on [1,2,3,4,5,6]. The average result from rolling the die is (1+2+3+4+5+6)/6 = 3.5.

…if we roll the die 10 times and average the values, then the resulting parameter will have a distribution that begins to look similar to a normal distribution centered around 3.5.

…if we roll the die 100 times and average the values, then the resulting parameter will have a distribution that looks/behaves even more similar to a normal distribution, again centered at 3.5, but now with decreased variance, etc.
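The die example above is easy to simulate. A quick numpy sketch (sample sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_means(n_rolls, n_samples=10_000):
    # Roll a fair die n_rolls times, average the result, repeat n_samples times.
    rolls = rng.integers(1, 7, size=(n_samples, n_rolls))
    return rolls.mean(axis=1)

means_10 = sample_means(10)
means_100 = sample_means(100)

# Both distributions of sample means center near 3.5, and the spread of the
# n=100 means is roughly sqrt(10) times smaller than that of the n=10 means.
print(round(means_10.mean(), 2), round(means_10.std(), 3))
print(round(means_100.mean(), 2), round(means_100.std(), 3))
```

Plotting histograms of `means_10` and `means_100` shows both looking bell-shaped around 3.5, with the n=100 histogram visibly narrower.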

12
Q

What is interpolation?

A

Interpolation is making predictions on data that lies inside the range of the training set.

e.g., Let’s say we have a model that predicts the value of homes based on their size. Our model was trained on a data set containing homes between the values 500 and 5000 sq ft. Using this model to predict the value of a 4200 sq ft home is interpolation.

13
Q

What is extrapolation and why can it be dangerous?

A

Extrapolation is making predictions on values outside the range of the training set.

e.g., Say we have a model that predicts the value of homes based on their size. Our model was trained on a dataset containing home prices between 500 and 5000 sq ft. Using this model to predict the value of a 6000 sq ft home is extrapolation.

Extrapolation is dangerous because we usually can’t guarantee the relationship between the target and features beyond what we’ve observed. In the example, the relationship between the square footage and home price may be “locally linear” between 500-5000 sq ft, but exponential after that, resulting in poor prediction.

14
Q

Discuss the differences between frequentist and Bayesian statistics.

A

Both attempt to estimate a population parameter on a sample of data.

Frequentists treat the data as random and the population parameter as fixed. Inferences are based on long-run repeated sampling, and estimates of the parameter come in the form of point estimates or confidence intervals.

Bayesians treat the data as fixed and the population parameter as a random variable. Bayesian statistics allows/requires you to make informed guesses about the value of a parameter in the form of prior distributions. Estimates of the parameter come in the form of posterior distributions.

15
Q

What is the multiple (hypothesis) testing problem and how can we compensate for it?

A

Multiple Hypothesis Testing occurs when we run many hypothesis tests all at once. If more than one hypothesis test is used to arrive at the same (or correlated) conclusion, our chance of making a false positive increases.

One way to compensate for this is the Bonferroni correction. Here, we recalculate each individual alpha to equal overall_alpha/k, where k is the number of tests, so that we don’t artificially increase the chance of false positives.
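The correction is a one-liner. A minimal sketch (the p-values below are made up for illustration):

```python
def bonferroni(p_values, overall_alpha=0.05):
    """Reject H0 for an individual test only if p < overall_alpha / k,
    where k is the number of tests being run together."""
    k = len(p_values)
    per_test_alpha = overall_alpha / k
    return [p < per_test_alpha for p in p_values]

# With 5 tests, each is judged against 0.05 / 5 = 0.01.
print(bonferroni([0.003, 0.02, 0.04, 0.008, 0.20]))
# → [True, False, False, True, False]
```

Note that 0.02 and 0.04 would have been "significant" at alpha = 0.05 on their own; the correction rejects them to keep the family-wise false-positive rate near 5%.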

16
Q

Name four discrete distributions and briefly provide an example for each one.

A
  1. Uniform: all outcomes are equally likely to occur. P(each event) = 1/n.

Example Uniform: the outcome of a fair die is uniform on [1,2,3,4,5,6].

  2. Bernoulli: only two possible outcomes can occur. The events are complementary. P(event 1) = p; P(event 2) = 1-p.

Example Bernoulli: the outcome of a single coin flip.

  3. Binomial: describes the count of successes in n repeated Bernoulli trials, with each trial having probability of success p.

Example Binomial: the outcome of multiple coin flips, e.g., after observing 2 coin flips we have P(2 heads) = .25, P(2 tails) = .25, P(1 tail, 1 head) = .50.

  4. Poisson: describes the probability of k events occurring in a fixed period of time, given that each event occurs at a constant rate and is independent of the time since the last event.

Example Poisson: the number of cars that will drive past your house in the next hour.
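The binomial and Poisson pmfs above can be coded directly from their formulas using only the standard library (the example values are illustrative):

```python
from math import comb, exp, factorial

# Binomial(n, p): P(k successes) = C(n, k) * p^k * (1-p)^(n-k)
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Poisson(lam): P(k events) = lam^k * e^(-lam) / k!
def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

print(binom_pmf(1, 2, 0.5))          # P(1 head in 2 flips) = 0.5
print(round(poisson_pmf(0, 3), 4))   # P(0 cars pass, given a rate of 3/hour)
```

The uniform and Bernoulli cases need no function: P(each face of a die) = 1/6 and P(heads) = p directly.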

17
Q

Name 3 continuous distributions and give an example of each one.

A

Uniform: All outcomes are equally likely to occur. All equal-length intervals are equally likely to occur. Any single outcome, i.e. an interval with length 0, has probability 0.

Example Uniform: Select a random real number between 0 and 10. P(X in [0,3]) = 3/10, but P(X=1) = 0.

Normal: A “bell-shaped” symmetric distribution that is described by its average and the degree to which observations deviate from the average (standard deviation).

Example Normal: heights of humans.

Beta: A probability distribution of probabilities, i.e. a distribution that represents the likeliness of a range of probabilities being true when the true probability is unknown.

Example Beta: You create a distribution of possible 3-point shooting percentages for your favorite basketball player at the start of the season to estimate his true shooting percentage over the entire season with the knowledge that he will probably have a similar percentage as last year and that a cold or hot streak at the start of the season is not necessarily representative of his “true” underlying shooting percentage for the entire season.

18
Q

What is a long-tailed distribution?

A

A long-tailed distribution is one where there are many relatively extreme, but unique, outliers.

These distributions happen often in retail. e.g., if we looked at customers’ baskets at a grocery store over a 1-month period, we may see many thousands, or even millions, of unique baskets. This is because there are so many different combinations that a customer can select. And because foods are not consumed at the same rate (and other reasons), it is relatively rare to see repeated identical purchases.

Special techniques, such as clustering on the tail, must be used when dealing with long-tailed datasets in order to leverage them to train classification or other predictive models.

19
Q

What is an A/B test and why is it useful?

A

An A/B test is a controlled experiment where two variants are tested against each other on the same response.

e.g., a company could test two different email subject lines and then measure which one has the higher click rate. Once the superior variant has been determined (through statistical significance or some preset time period or metric), all future customers will typically receive the “winning” variant.

A/B testing is useful because it allows practitioners to rapidly test variations and learn about an audience’s preferences.

20
Q

What is multivariate testing and why is it useful?

A

Multivariate testing is similar to A/B testing, but it simultaneously tests more than 2 variants.

This can be useful when trying to optimize across a larger parameter space, e.g. 5 possible email subject lines, but it can take many more samples to achieve a statistically significant result.

Another potential drawback is that a relatively large audience (>50%) will receive a non-optimal variation during testing.

21
Q

What is multi-armed bandit testing and why is it useful?

A

Multi-armed bandit (or simply “bandit”) testing is similar to multivariate testing and A/B testing, but the sampling distribution for variants changes gradually over time as feedback is received.

e.g., with traditional A/B tests, we could test 2 email subject lines, A and B. We would initially send out emails to 200 customers, sending 100 A variations, 100 B variations. After some set period of time, say 24 hrs, we would observe which email variant was opened by more customers. We would then send that variant to all customers going forward.

With bandit testing, we would set some learning rate for the distribution of variants to change over time. Perhaps 60 customers opened variant A emails and only 50 customers opened variant B emails. We could then shift the distribution from 50/50 to 55% A, 45% B for the next round of emails.

Using this approach, we can continuously monitor the response from our audience and shift our allocations accordingly. This is particularly useful in marketing or any industry where people’s preferences and opinions may change rapidly, since it continuously tests and learns new preferences and can adapt quickly.

Note from Kyle: “I love bandit testing and prefer it over A/B testing whenever possible!”
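One common bandit strategy is epsilon-greedy: mostly send the current best variant, occasionally explore the other. A minimal sketch, with entirely hypothetical "true" open rates and an arbitrary 10% exploration rate:

```python
import random

random.seed(1)

# Hypothetical true open rates for subject lines A and B; in practice these
# are unknown and must be learned from feedback.
true_rates = {"A": 0.60, "B": 0.40}
opens = {"A": 0, "B": 0}
sends = {"A": 0, "B": 0}

def choose_variant(epsilon=0.10):
    """Epsilon-greedy bandit: usually exploit the variant with the best
    observed open rate, but explore at random 10% of the time."""
    if random.random() < epsilon or 0 in sends.values():
        return random.choice(["A", "B"])
    return max(sends, key=lambda v: opens[v] / sends[v])

for _ in range(2000):
    variant = choose_variant()
    sends[variant] += 1
    opens[variant] += random.random() < true_rates[variant]  # simulate an open

# The allocation drifts toward the better-performing variant over time.
print(sends)
```

More sophisticated schemes (Thompson sampling, UCB) replace the fixed epsilon with allocation proportional to the current uncertainty about each variant.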

22
Q

What is the bootstrap technique and what is it used for?

A

The bootstrap technique is a nonparametric method of learning the SAMPLING DISTRIBUTION of a PARAMETER.

Specifically, bootstrap involves sampling your entire dataset with replacement many times, at each pass calculating the statistic you’re interested in. A distribution is constructed by building a histogram of the statistics generated from each pass.
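The procedure above takes only a few lines. A sketch that bootstraps the sampling distribution of the mean of a hypothetical skewed sample (the data, seed, and number of resamples are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)  # skewed sample; true mean is 2.0

# Bootstrap: resample the dataset with replacement many times,
# computing the statistic of interest on each pass.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5000)
])

# A 95% confidence interval from the percentiles of the bootstrap distribution:
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(data.mean(), 3), round(lo, 3), round(hi, 3))
```

A histogram of `boot_means` is the estimated sampling distribution; no normality assumption was needed, which is what makes the method nonparametric.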

23
Q

What is the probability of rolling two 6s in a row with a fair die?

A

P(X=6,X=6) = (1/6)(1/6) = 1/36

24
Q

We roll a fair die 10 times. What is the probability that at least one of them comes up as a 3?

A

P(at least one 3 in 10 rolls) = 1 - P(no 3 in 10 rolls) = 1 - (5/6)^10 ≈ 0.84

25
Q

We randomly draw two cards, without replacement, from a standard deck of cards. What is the probability that both cards are kings? (there are 4 kings in a standard deck of cards)

A

P(A,B) = P(A) * P(B|A) = (4/52)(3/51) = 1/221

26
Q

Explain p-value computation to a five year old.

A

Simple example: flipping two fair coins.

Recall: Proba = # outcomes of interest / total # outcomes

What is the proba of getting 2 heads in a row?
.5 * .5 = .25

build a proba tree and see that:
P(H,H) = .5*.5 = .25

P(H,T) = .5*.5 = .25
P(T,H) = .5*.5 = .25

P(T,T) = .5*.5 = .25

so P(one H, one T) = .25+.25 = .5

What is p-value of getting two heads in a row?

First, define the p-value as the proba that random CHANCE/inherent random proba generated the data/outcome, UNION the proba of any outcome that is EQUAL or RARER.

thus, there are THREE PARTS to p-value:

part 1: random chance/inherent proba– equals P(H,H)=.25 here.

part 2: …part 1 UNION with outcome T,T, which is an outcome EQUAL in proba to H,H since both outcomes have the SAME proba of occurring, i.e. P(T,T) = P(H,H) = .25

part 3: …part 1, part 2 UNION any other outcome(s) that are more rare (i.e. have inherent proba < P(H,H) ).

p-value (H,H) = P(H,H) + P(any outcome with equal proba) + P(any rarer outcome)
= .25 + .25 + 0 = 0.50

A more complicated example: flipping a coin 5 times and getting 5 H.

Proba = # outcomes of interest / # total outcomes
P(five H) = 1/32 = .03125
P(4H, 1T) = 5/32
P(3H, 2T) = 10/32 = 5/16
P(2H, 3T) = 10/32 = 5/16
P(1H, 4T) = 5/32
P(five T) = 1/32

p-value (five H)

= P(5 H) + P(some event equal # outcomes as 5H) + P(something fewer # outcomes than 5H)

= 1/32 + P(5 T) + 0

= 2/32 = 1/16 = .0625

Notice that p-value (5 H) = 0.0625 > alpha = 0.05, so it is not all that unusual to see 5 heads in a row!

What is p-value (4T, 1H)?

p-value (4T, 1H) = P(4T,1H) + P(event with equal # outcomes) + P(event fewer # outcomes)

= P(4T,1H) + P(1T, 4H) + P(5 H) + P(5 T)
= 5/32 + 5/32 + 1/32 + 1/32
= 12/32 = 3/8 = 0.375
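The "equal or rarer" definition used throughout this card can be computed mechanically: sum the probabilities of every outcome whose probability is no larger than the observed outcome's. A sketch for fair-coin flips:

```python
from math import comb

def coin_p_value(k_heads, n_flips, p=0.5):
    """p-value = total probability of every outcome whose probability is
    equal to or rarer than the observed outcome's probability."""
    observed = comb(n_flips, k_heads) * p**k_heads * (1 - p)**(n_flips - k_heads)
    total = 0.0
    for k in range(n_flips + 1):
        prob = comb(n_flips, k) * p**k * (1 - p)**(n_flips - k)
        if prob <= observed + 1e-12:  # equal or rarer outcomes
            total += prob
    return total

print(coin_p_value(2, 2))  # HH in 2 flips → .25 + .25 = 0.5
print(coin_p_value(5, 5))  # 5 heads in a row → 2/32 = 0.0625
print(coin_p_value(1, 5))  # 4T, 1H → 12/32 = 0.375
```

All three printed values match the hand calculations worked out above.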

27
Q

What is an A/B test?

A

An A/B test is an experiment with two groups to establish which of two treatments, products, procedures, or the like is superior.

Often one of the two treatments is the standard existing treatment, or no treatment (placebo). If a standard (or placebo) treatment is used, it is called the CONTROL. A typical HYPOTHESIS is that treatment is BETTER than control.

A proper A/B test has SUBJECTS that can be assigned to one treatment or another. The key is that the subject is EXPOSED to the TREATMENT. Ideally, subjects are randomized to treatments. In this way, you know that any difference between treatment groups is due to one of two things:

  • the EFFECT of different treatments
  • LUCK of the draw, such that the random assignment may have resulted in the naturally better-performing subjects being concentrated in A or B.

We must define a metric (test statistic) to compare group A to group B. Perhaps the most common metric in data science is a binary variable: click/no-click, buy/no-buy, fraud/no-fraud, etc. Binary metric results may be summed up in a 2x2 outcome table.

28
Q

Explain why a control group is necessary in A/B testing

A

Why not skip the control group and just run an experiment applying the treatment of interest to only one group, and compare the outcome to prior experience?

Without a control group, there is no assurance that “other things are equal” and that any difference is really due to the treatment (or chance).

When we have a control group, it is subject to the same conditions (except for treatment of interest) as the treatment group. If we simply make a comparison to “baseline” or prior experience, other factors, besides treatment, might differ.

Furthermore, in a standard A/B test, we need to decide on one metric ahead of time. Multiple behavior metrics might be collected and be of interest, but if the experiment is expected to lead to a decision between treatment A and treatment B, a SINGLE metric (test statistic) must be established BEFOREHAND, else we risk the potential for researcher BIAS.

29
Q

What is the Binomial Distribution?

A

A binomial experiment, B(n,p), consists of a fixed number of Bernoulli trials.

The binomial dist is the FREQUENCY dist of the NUMBER OF SUCCESSES (x) in a given number of trials (n) with SPECIFIED proba (p) of SUCCESS in each trial.

There is a family of binomial distributions, depending on the values of x, n, p. A binomial dist question would look like:

“if the proba of a click converting to a sale is .02, what is the proba of observing 0 sales in 200 clicks?”

Theorem: The proba of exactly k successes in a binomial experiment B(n,p) is given by

P(k) = P(k successes) = C(n,k) p^k q^(n-k)

The proba of >=1 successes is 1-q^n

where q = 1-p and C(n,k), “n choose k”, is the binomial coef

The mean of a binom dist is np; you can also think of this as the expected number of SUCCESSES in n trials.

The variance of a binom dist is np(1-p) = np*q.

With a large enough number of trials (particularly when p is close to .50), the binom dist is virtually indistinguishable from the normal dist! In fact, calculating a binom dist is computationally demanding, and most stat procedures use the normal dist, with matching mean and variance, as an approximation.
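The clicks-to-sales question quoted above can be answered directly from the theorem, along with the mean and variance formulas:

```python
from math import comb

def binom_pmf(k, n, p):
    # P(k successes) = C(n, k) * p^k * q^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# "If the proba of a click converting to a sale is .02, what is the
#  proba of observing 0 sales in 200 clicks?"
p_zero_sales = binom_pmf(0, 200, 0.02)   # = 0.98 ** 200
p_at_least_one = 1 - 0.98**200           # P(>=1 success) = 1 - q^n

mean = 200 * 0.02          # np: 4 expected sales
var = 200 * 0.02 * 0.98    # np(1-p) = np*q

print(round(p_zero_sales, 4), round(p_at_least_one, 4), mean, round(var, 2))
```

Zero sales in 200 clicks turns out to have probability under 2%, so observing it would be evidence that the true conversion rate is below .02.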

30
Q

What is a Bernoulli trial?

A

A Bernoulli trial is a single run of the following experiment:

We have an experiment “e” with two outcomes, one called success (S) and the other called failure (F). Let p denote the proba of success and let q = (1-p) denote the proba of failure.

Repeated Bernoulli trials of the experiment “e” are independent of one another.

31
Q

The proba that Ann hits a target at any time is p = 1/3 and misses with proba q=2/3.

Ann fires at the target 7 times. What is the proba that she hits the target (a) exactly 3 times; (b) at least 1 time?

A

This is a binomial experiment. By the binomial proba,

P(k) = P(k successes) = C(n,k) p^k q^(n-k)

(a) the proba of k=3 successes is

P(k=3) = C(7,3) (1/3)^3 (2/3)^4 = (560/2187) = 0.26

(b) the proba of one or more (k>=1) successes is

the proba of NEVER hitting a target is

P(k=0) = q^7 = (2/3)^7 = 128/2187 ~= .06

so, P(at least one hit) = 1-q^7

= (2187/2187 - 128/2187) = (2059/2187) = .94

32
Q

What is a normal random variable?

A

A r.v. X is normal if its density function f(x) has a bell-shaped curve and is of the form

f(x) = 1/(sigma * sqrt(2*pi)) * exp[-.5 ((x - mu)/sigma)^2]

The normal dist depends on params mu,sigma and is denoted as N(mu, sigma^2)

33
Q

What is a standard normal distribution?

A

Suppose X is any normal dist N(mu, sigma^2).

The STANDARDIZED r.v. corresponding to X is defined by

Z = (X - mu) / sigma

Z is also normally dist’d with mu=0 and sigma=1, s.t.

Z ~ N(0,1)

The density function for Z, obtained by setting z = (x - mu)/sigma in the density for N(mu, sigma^2), is:

phi(z) = 1/sqrt(2*pi) * exp(-z^2 / 2)

The area under phi(z) between two z-values gives the proba of falling in that range.

The percentages under the std. norm density curve give rise to the 68-95-99.7 rule:

  1. 68.2% for -1 <= z <= 1
  2. 95.4% for -2 <= z <= 2
  3. 99.7% for -3 <= z <= 3

This rule says that, in a norm dist’d population, 68% of the pop. falls within 1 sd of the mean, 95% falls within 2 sd of the mean, and 99.7% falls within 3 sd of the mean.

34
Q

What is recall (sensitivity)?

A

Recall is the proportion of true 1s (y=1) correctly classified.

synonym: Sensitivity

Recall = TP / (TP + FN), i.e. TP divided by the total count of actual y=1

Recall is computed along the first (actual y=1) row of the confusion matrix

35
Q

What is Precision?

A

Precision is the proportion of predicted 1s (yhat=1) that are actually 1s.

Precision = TP / (TP + FP)

36
Q

What is an ROC Curve?

A

A Receiver Operating Characteristic (ROC) curve is a plot of recall (sensitivity) on the y-axis vs. the false positive rate (1 - specificity) on the x-axis.

The ROC curve shows the tradeoff between recall and specificity as one changes the decision threshold for the positive class.

37
Q

What is Specificity?

A

Specificity is the proportion of true 0s (y=0) correctly classified.

Specificity = TN / (TN + FP)

38
Q

What is a confusion matrix?

A

A confusion matrix is a tabular display of the record counts by their predicted and actual classification status.

The PREDICTED outcomes are the columns.

The TRUE outcomes are the rows.

The diagonal elements of the matrix show the CORRECT predictions.

The off-diagonal elements show the number of INCORRECT predictions.
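Recall, precision, and specificity from the preceding cards all fall out of this matrix. A sketch using the card's convention (true outcomes as rows, predicted as columns) and made-up counts:

```python
# Confusion matrix convention from the card: rows are TRUE outcomes,
# columns are PREDICTED outcomes.
#                 pred=1  pred=0
cm = [[80, 20],   # true=1 : TP=80, FN=20
      [10, 90]]   # true=0 : FP=10, TN=90

tp, fn = cm[0]
fp, tn = cm[1]

recall = tp / (tp + fn)        # sensitivity: true 1s correctly classified
precision = tp / (tp + fp)     # predicted 1s that are actually 1s
specificity = tn / (tn + fp)   # true 0s correctly classified
accuracy = (tp + tn) / (tp + tn + fp + fn)  # diagonal / total

print(recall, round(precision, 3), specificity, accuracy)
```

Note how accuracy is the diagonal sum over the total, matching the "diagonal = correct" description above.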

39
Q

What is standard deviation?

A

Standard deviation corresponds to how wide the normal pdf curve is around the mean (center of curve).

i.e. the std tells us how the data are spread around the mean.

40
Q

Explain how to use the normal distribution curve to compute probabilities

A

Consider the normal distribution pdf curve as a (continuous) approximation of a (discrete) histogram.

e.g. say center of curve (mean) is 20. If we want to know the rhs area under the normal curve with x>=30, then obtain the area under the curve for all values x>=30, and divide by total area, 1, under the normal pdf curve.

The rhs upper portion of x>=30 is a fraction of the area under the curve.

The proportional area where x>=30 divided by the total area is then the probability of observing a value with x>=30!

41
Q

How does a sample relate to a population in statistics?

A

Since we rarely have enough time/money to measure everything in an entire population, we almost always ESTIMATE the population PARAMETERS (i.e. population mu, sigma) using a relatively SMALL sample.

We might poll 5 people among NYC 8mm population to ESTIMATE the POPULATION PARAMETERS.

The reason why we want to know the POPULATION PARAMETERS is to ensure that the results drawn from our experiment are REPRODUCIBLE.

i.e. if someone else takes a separate sample of 5 people among NYC 8mm, then they will obtain DIFFERENT sample mu, sigma estimates from the SAME population.

Every time we do a new sample, we get different values for the parameters mu, sigma.

note: the fewer/greater number of obs in the samples results in worse/better estimates of the population parameters. This means the more data we have, the more CONFIDENCE we can have in the accuracy of the estimates.

One of the main goals of statisticians is to quantify how much CONFIDENCE we can have in population ESTIMATES. Confidence intervals and p-values are used to quantify the confidence in estimated params. These metrics tell us that while the pop estimates are different for each sample, they may not be SIGNIFICANTLY different from each other–so that we should be able to REPLICATE the results BETWEEN SAMPLES!

From a machine learning perspective, imagine the 5 sample obs are the TRAINING set, and the normal pdf curve that represents the population is what we want to PREDICT, and generalize well to, with our ML method.

42
Q

How do we calculate an estimate of the POPULATION variance?

A

actual POP var is:

sigma^2 = Sum((X - mu)^2) / n

where mu is the POP mean and X is a vector of xi observations.

The result of pop var, unfortunately, is in squared units, so that its value cannot be directly related to the norm dist curve. We can fix this by just taking the sqrt to get the pop std, which can be plotted on the norm curve plot.

Since we usually never have access to all of the pop data, an ESTIMATE (calculated) of the pop variance is:

s^2 = Sum((X - xbar)^2) / (n-1)

…dividing by n-1 compensates for the fact that we are calculating diffs from the sample mean INSTEAD of the pop mean; otherwise, we would consistently UNDERESTIMATE the var around the pop mean. This is because the diffs between the data and the sample mean tend to be smaller than the diffs between the data and the pop mean, i.e.

Sum((X - xbar)^2) / n < Sum((X - mu)^2) / n

Thus, the diffs around the pop mean will result in a larger average, and the larger average is what we are trying to estimate.
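The underestimate is easy to demonstrate by simulation. A sketch drawing many small samples from a hypothetical N(0, 1) population, whose true variance is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

biased, unbiased = [], []
for _ in range(20_000):
    x = rng.normal(0.0, 1.0, size=5)                 # small sample, n = 5
    biased.append(((x - x.mean())**2).mean())        # divide by n
    unbiased.append(((x - x.mean())**2).sum() / 4)   # divide by n-1

# Dividing by n averages out near (n-1)/n = 0.8, systematically below the
# true variance of 1.0; dividing by n-1 corrects the estimate back up.
print(round(np.mean(biased), 2), round(np.mean(unbiased), 2))
```

The biased average lands near 0.8 rather than 1.0, which is exactly the (n-1)/n shrinkage the n-1 divisor undoes.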

43
Q

Explain what a hypothesis test is

A

Hypothesis tests (significance tests) are ubiquitous in traditional stats analysis of published research. Their purpose is to help us learn WHETHER RANDOM CHANCE MIGHT BE RESPONSIBLE FOR AN OBSERVED EFFECT.

An A/B test is typically constructed with a hypothesis in mind. e.g., the hypothesis might be that price B produces higher profit. Why do we need a hypothesis? Why not just look at the outcome of the experiment and go with whichever does better?

The answer lies in the tendency of the human mind to UNDERESTIMATE THE SCOPE OF NATURAL RANDOM BEHAVIOR.

One manifestation of this is the failure to anticipate extreme events (black swans). Another manifestation is the tendency to MISINTERPRET RANDOM EVENTS AS HAVING PATTERNS OF SOME SIGNIFICANCE.

Statistical hypothesis testing was invented as a way to PROTECT RESEARCHERS FROM BEING FOOLED BY RANDOM CHANCE.

In a properly designed A/B test, you collect data on treatments A and B in such a way that any observed difference between A and B must be due to either:

  • random chance in assignment of subjects
  • a true difference between A and B.

A stat hypothesis test is further analysis of an A/B test, or any randomized experiment, to assess whether random chance is a reasonable explanation for the observed difference between groups A and B.

44
Q

Explain what a null hypothesis is

A

Hypothesis tests use the following logic:

“given the human tendency to react to unusual but RANDOM behavior and interpret it as something meaningful and real, in our experiments we will require PROOF that the difference between groups is more EXTREME than what CHANCE MIGHT PRODUCE.”

This involves a baseline assumption that the treatments are equivalent, and ANY DIFFERENCE BETWEEN THE TWO GROUPS IS DUE TO CHANCE.

This baseline assumption is termed the NULL HYPOTHESIS.

Our hope is then that we can, in fact, prove the null hypothesis WRONG, and show that the outcomes for groups A and B are MORE DIFFERENT THAN WHAT CHANCE MIGHT PRODUCE. One way to do this is via a RESAMPLING PERMUTATION procedure, in which we shuffle together the results from A and B, repeatedly deal out the data in groups of similar sizes, and then observe HOW OFTEN we get a difference AS EXTREME as the observed difference.
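The shuffle-and-redeal procedure described above can be sketched directly (the two groups below are hypothetical data, with an intentionally tiny true difference):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(a, b, n_iter=5000):
    """Shuffle A and B together, repeatedly re-deal groups of the same
    sizes, and count how often the re-dealt difference in means is at
    least as extreme as the observed one."""
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return count / n_iter

a = rng.normal(10.0, 2.0, size=50)
b = rng.normal(10.2, 2.0, size=50)  # tiny true difference, easily drowned by noise
p_null = permutation_test(a, b)
print(p_null)  # fraction of shuffles at least as extreme as the observed diff
```

A large returned fraction means chance alone routinely produces differences this big, i.e. we fail to reject the null; a tiny fraction is the permutation-test p-value for rejecting it.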

45
Q

Explain what an alternative hypothesis is

A

Hypothesis tests by their nature involve not just a null hypothesis, but also an OFFSETTING ALTERNATIVE hypothesis. e.g.,

Null: “no diff xbar_a vs. xbar_b”
Alt: “A is different than B” (could be bigger or smaller)

Null: “A <= B”
Alt: : “A>B”

Null: “B is not x% greater than A”
Alt : “B is x% greater than A”

46
Q

When should we use a one-way hypothesis test?

A

In A/B testing, we test a new option (B) vs. an established option (A) and the presumption is that we will keep with A unless B proves to be significantly better.

In such a case, we want a hyp test to protect against being FOOLED BY CHANCE IN THE DIRECTION OF B. We don’t care about being fooled in the direction of A because we’d be sticking with A unless B proves significantly better. So we want a DIRECTIONAL ALTERNATIVE hypothesis (B is better than A) and we use a ONE-WAY (one tail) hyp test. This means that extreme chance results in only one direction count towards the p-value.

47
Q

When should we use a two-way hypothesis test?

A

If we want a hyp test to protect us from being fooled by chance in either direction, the alt hyp is BIDIRECTIONAL (A is different from B; either bigger or smaller). In such cases, we use a TWO-WAY (2 tail) hypothesis. This means that EXTREME CHANCE RESULTS in either direction count towards the p-value.

48
Q

What does statistical significance mean?

A

Stat significance is how statisticians measure whether an experiment yields a result MORE EXTREME THAN WHAT CHANCE MIGHT PRODUCE.

If the result is BEYOND the realm of CHANCE VARIATION, it is said to be STATISTICALLY SIGNIFICANT.

e.g. say price A converts customers almost 5% better than price B (.8425% vs. .8057%, a diff of .0368 pct pts) and we have ~46k obs.

We can test whether the diff in conversions of A vs. B is within the realm of CHANCE VARIATION, using a resampling procedure to simulate real-world events:

  1. create an urn with all sample results: 382 ones, 45945 zeros; .008246 conversion rate
  2. shuffle and draw a resample of size 23,739 (same n as price A) and record the count of 1s.
  3. Record the number of 1s in the remaining 22,588 (same n as price B).
  4. Record the diff in proportion 1s.
  5. Repeat steps 2 to 4.
  6. How often was the difference >= 0.0368?

We can plot a hist of, say, 1000 resampling outcomes from the steps above and see that the observed diff of .0368 pct pts falls well within the range of chance variation.
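The six resampling steps above translate almost line-for-line into code, using the urn counts given in the card:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: an urn with all sample results — 382 ones, 45,945 zeros.
urn = np.concatenate([np.ones(382), np.zeros(45945)])
n_a, n_b = 23_739, 22_588       # group sizes for prices A and B
observed_diff = 0.0368          # observed difference, in percentage points

diffs = []
for _ in range(1000):           # steps 2-5: shuffle, deal, record, repeat
    rng.shuffle(urn)
    rate_a = urn[:n_a].mean() * 100            # conversion rate (%) in "A"
    rate_b = urn[n_a:n_a + n_b].mean() * 100   # conversion rate (%) in "B"
    diffs.append(rate_a - rate_b)

# Step 6: how often was the chance difference >= 0.0368?
p_value = float(np.mean(np.array(diffs) >= observed_diff))
print(p_value)
```

The resulting fraction is large (roughly a third of shuffles beat the observed difference), which is why the card concludes the observed diff sits well within chance variation.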

49
Q

What is a p-value?

A

p-value is the FREQUENCY with which the CHANCE MODEL produces a result MORE EXTREME than the OBSERVED RESULT.

We can estimate the p-value from a permutation test by taking the PROPORTION OF TIMES that the permutation test produces a difference EQUAL TO OR GREATER than the OBSERVED difference.

A p-value of .308 means that we would EXPECT to achieve a result AS EXTREME, OR MORE EXTREME than this observed outcome BY RANDOM CHANCE 30.8% of the time.

Instead of a permutation test, since a binary outcome experiment is binomially distributed, we can APPROXIMATE the BINOMIAL DISTRIBUTION by the NORMAL distribution.

50
Q

Explain the difference between type I and type II errors

A

In assessing stat significance, two types of errors are possible:

Type I: we MISTAKENLY conclude an effect is REAL, when it is really DUE TO CHANCE

Type II: we MISTAKENLY conclude an effect is NOT REAL, when in fact it IS REAL

Recall the basic function of significance (hypothesis) tests is to protect against BEING FOOLED BY RANDOM CHANCE; thus these tests are typically structured to MINIMIZE Type I errors.

51
Q

What is a t-test?

A

In the 1920s when stat tests were being developed, it was INFEASIBLE to do a resampling test (1000s of shuffled iterations). Statisticians found that a good APPROXIMATION to the shuffled permutation test was the T-TEST.

t stat = (xbar - mu) / sqrt(s^2/n)

xbar is sample mean, mu is pop mean, s^2 is sample var, with n-1 df
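
A minimal sketch of the t statistic from the formula above; the sample values and hypothesized mean are made up for illustration:

```python
import math
import statistics

sample = [5.1, 4.9, 5.4, 5.0, 5.3, 4.8, 5.2]   # illustrative sample
mu0 = 5.0                                       # hypothesized population mean

xbar = statistics.mean(sample)
s2 = statistics.variance(sample)   # sample variance, n - 1 in the denominator
n = len(sample)

# t stat = (xbar - mu) / sqrt(s^2 / n)
t = (xbar - mu0) / math.sqrt(s2 / n)
print(round(t, 3))   # compare against a t distribution with n - 1 = 6 df
```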

52
Q

Why does dividing by n underestimate the sample variance?

A

Since the standard dev is in the same units as the original data, we can draw it on the graph of the sample observations.

Dividing the sum of squared deviations by n-1 compensates for the fact that we are calculating differences (deviations) from the SAMPLE mean instead of the population mean; otherwise, we would systematically UNDERESTIMATE the variance around the population mean.
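
A small simulation (assumed example, not from the card) showing the underestimate: with deviations taken from the sample mean, dividing by n comes out biased low, while n-1 recovers the population variance:

```python
import random

random.seed(0)
n, trials = 5, 20000   # small samples from a normal with sigma = 2 (var = 4)

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    biased_sum += ss / n           # divide by n: systematically too small
    unbiased_sum += ss / (n - 1)   # Bessel's correction

# Average estimates over many trials: ~3.2 vs ~4.0 for true var 4.0.
print(round(biased_sum / trials, 2), round(unbiased_sum / trials, 2))
```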

53
Q

What is a confidence interval and what assumptions can we make with it?

A

Given a sampling distribution, say by bootstrap 100x, we obtain 100 sample means.

A confidence interval is just an interval that COVERS 95% of the SAMPLE MEANS.

That’s IT.

What is the point of CIs?

CIs are statistical tests PERFORMED VISUALLY. Because 95% CI covers 95% of the sampling stat, we know that anything outside of the CI OCCURS < 5% of the time.

i.e. the sample region OUTSIDE the CI must have a probability of occurring < 5%.
i.e. the P-VALUE of any sample OUTSIDE of the CI is < 0.05 (and thus, significantly different than the TRUE MEAN).

Say we take weight samples between male and female mice and find their 95% CIs do NOT overlap. Then we can state that weights of male and female mice are STATISTICALLY different.
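
The bootstrap CI described above can be sketched as follows (the weight data are made up; indices 25 and 974 pick out the middle 95% of 1000 sorted bootstrap means):

```python
import random

random.seed(42)
weights = [17.5, 22.1, 19.8, 21.0, 18.3, 20.6, 23.2, 19.1, 20.0, 21.7]

boot_means = []
for _ in range(1000):
    # Resample with replacement, same size as the original sample.
    resample = [random.choice(weights) for _ in weights]
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
ci_low, ci_high = boot_means[25], boot_means[974]   # interval covering 95% of the means
print(round(ci_low, 2), round(ci_high, 2))
```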

54
Q

What does R-squared quantify?

A

Say we have data:

target y = mice weight
variable x = mouse size

Now compute target mean ybar, and compute:

SST = sum(yi - ybar)^2

And then fit regression line yhat and compute:

SSR = sum(yi - yhati)^2
SSReg = sum(yhati - ybar)^2

R2 = (Var(mean) - Var(fitted line)) / Var(mean)

R2 QUANTIFIES the DIFFERENCE between the regression LINE and the target MEAN ybar:

R2 = 1 - (SSR/SST)

Thus, a PERFECTLY FITTED regression line will MINIMIZE SSR, s.t. R-squared will be MAXIMIZED toward 1.

e.g. Var(mean) = 32, Var(fitted line) = 6
R2 = (32-6) / 32 = (26/32) = 0.81

i.e. there is 81% LESS VARIATION around the fitted line than around the NAIVE mean ybar; equivalently, the x,y RELATIONSHIP in the model ACCOUNTS for 81% of the VARIATION.
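
A minimal sketch of the R2 computation, with a least-squares line fitted by hand and made-up mouse data:

```python
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]        # x: mouse size (made up)
weights = [2.1, 3.9, 6.2, 7.8, 10.1]     # y: mouse weight (made up)

n = len(sizes)
xbar = sum(sizes) / n
ybar = sum(weights) / n

# Least-squares slope and intercept.
num = sum((x - xbar) * (y - ybar) for x, y in zip(sizes, weights))
den = sum((x - xbar) ** 2 for x in sizes)
slope = num / den
intercept = ybar - slope * xbar

sst = sum((y - ybar) ** 2 for y in weights)              # variation around the mean
ssr = sum((y - (intercept + slope * x)) ** 2
          for x, y in zip(sizes, weights))               # variation around the line

r2 = 1 - ssr / sst
print(round(r2, 3))
```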

55
Q

How can we reconcile correlation r vs R2?

A

R2 is easier to interpret than correl r.

e.g. how much better is r = .7 than r = .5?

just convert r to R2:
R2 = .7^2 = 0.49, so 49% of orig variation is explained

R2 = .5^2 = 0.25, so 25% of orig variation is explained

Thus, with R2 it is EASY to see that the first correl explains about TWICE as much variation as the second.

56
Q

What is the p value of flipping a coin 5x and getting 5H in a row?

A

A p-value is the probability that random chance COULD HAVE generated the OUTCOME in QUESTION, OR AN OUTCOME EQUALLY OR MORE EXTREME.

For flipping a coin 5x, there are 32 total outcomes.

There is ONE outcome with 5 heads: HHHHH with proba 1/32

and there is one outcome EQUALLY AS EXTREME as 5 heads, which is 5 tails TTTTT, also with proba 1/32

so P-VALUE of flipping 5H is (1/32 + 1/32) = 1/16 = .0625

Notice: even such an “extreme” event as HHHHH or TTTTT is NOT significant at alpha = .05!
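
The same p-value can be obtained by brute-force enumeration (a small sketch using only the standard library):

```python
from itertools import product

# All 2^5 = 32 equally likely outcomes of five fair flips.
outcomes = list(product("HT", repeat=5))

# Outcomes at least as extreme as 5 heads: HHHHH or TTTTT.
extreme = [o for o in outcomes if len(set(o)) == 1]

p_value = len(extreme) / len(outcomes)
print(p_value)   # 2/32 = 0.0625
```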

57
Q

What is the point of maximum likelihood?

A

Given a data distribution, the goal of max likelihood is to fit the OPTIMAL distribution to the DATA.

Why do we care to fit a dist? Because we may have a hunch that the data is say a normal dist, which would entail that the data is centered around the mean by some std. BUT WHICH is the CENTER of the data?

The normal dist says that most data pts should be NEAR the CENTER of the dist.

Say we place the normal curve so its center sits over the lhs tail of the dataset, s.t. the rhs part of the curve falls over the actual mean of the data pts. Then the norm dist says that the PROBABILITY (i.e. LIKELIHOOD) of observing the pts under its rhs tail is LOW, even though that is ACTUALLY where most of the pts sit.

We can plot the likelihood on y vs. location of dist center. We want the MAXIMUM LIKELIHOOD of that plot.

Thus, the normal dist curve centered over the actual sample mean is the MAXIMUM LIKELIHOOD ESTIMATE for the MEAN.

So a MAX LIKELIHOOD for a particular sample stat is the STATISTIC that MAXIMIZES the LIKELIHOOD (probability) that we observed the stats we observed.

Probability and LIKELIHOOD are the same idea, but in this statistical context, proba is called LIKELIHOOD.

Summary, LIKELIHOOD is how we FIT a DISTRIBUTION to DATA.
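
A sketch of the idea above: slide a normal curve along candidate centers, compute the (log-)likelihood of the data at each, and keep the maximizer, which lands on the sample mean. The data, sigma, and grid are assumptions for illustration:

```python
import math

data = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 5.3, 4.6]   # made-up observations
sigma = 0.5                                        # treat the spread as fixed

def log_likelihood(mu):
    # Sum of log N(x | mu, sigma^2) over the data.
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

# Try many candidate centers and keep the one with maximum likelihood.
grid = [i / 1000 for i in range(3000, 7000)]
mle_mu = max(grid, key=log_likelihood)

print(mle_mu, sum(data) / len(data))   # the MLE lands on the sample mean
```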

58
Q

What does covariance tell us?

A

Cov = Sum[(xi - xbar)(yi - ybar)] / (n-1)

Cov can classify three types of relationships:

  1. negative trends (cov < 0)
  2. positive trends (cov >0)
  3. no trend (cov = 0)

When the Cov is positive, it tells us the SLOPE of the relationship between X,Y is POSITIVE, i.e. we CLASSIFY the TREND as POSITIVE.

The Cov does NOT tell us the STRENGTH of the relationship.

However, cov on its own is not interesting:

Covariance is a computational STEPPING STONE to something more interesting, like CORRELATION and PCA.

59
Q

Why is covariance so difficult to interpret?

A

Covariance is sensitive to the scale of the data, which makes it difficult to interpret.

The sensitivity to scale also prevents the cov value from telling us if the data are close (on) the line that represents the relationship, or scattered far from the line.

e.g. simply rescaling the data (say, grams to milligrams) makes the cov LARGER, even though the strength of the X,Y relationship is unchanged.
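
A small sketch covering both cards: covariance classifies the trend by its sign, but rescaling the data inflates the value without the relationship getting any stronger (data are made up):

```python
def cov(xs, ys):
    # Cov = Sum[(xi - xbar)(yi - ybar)] / (n - 1)
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]          # positive trend, so cov > 0

c1 = cov(x, y)
c2 = cov([10 * xi for xi in x], [10 * yi for yi in y])   # same data, new units

# c2 is 100x larger than c1, yet the trend is no stronger.
print(c1, c2)
```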

60
Q

What is a z-score (standard score)?

A

A (standard) z-score is the signed number of standard deviations by which an obs or data point lies ABOVE (positive z) or BELOW (negative z) the MEAN of what is being measured.

It is calculated by subtracting the population MEAN from an individual raw score, then dividing this difference by the POPULATION standard deviation:

z = (x - mu) / sigma

This conversion process is called STANDARDIZING or NORMALIZING.

Computing a z-score requires knowing the mean and the std of the COMPLETE POPULATION to which that data point belongs; if one only has a SAMPLE of obs from the population, then the analogous computation with SAMPLE mean and std yields the t-statistic.
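
A tiny sketch of the formula, with assumed population parameters:

```python
# z = (x - mu) / sigma, with an illustrative population (mean 100, sd 15).
mu, sigma = 100.0, 15.0
x = 130.0

z = (x - mu) / sigma
print(z)   # 2.0: the score lies two standard deviations above the mean
```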

61
Q

How do you compute a confidence interval for say proportion of web link clicks?

A

variable of interest is web link clicks among visitors to site, so a proportion p.

X = {number of users who clicked on link}
N = number of users

phat = X/N

say phat = 100/1000 = 0.1

A good rule of thumb for assuming a normal dist approximation for our sample is N*phat > 5 (and N*(1-phat) > 5).

Since N*phat = 100 and N*(1-phat) = 900, we can assume normality.

the margin m from a mean, or statistic phat, at a specified alpha/2 level is:

m = z_alpha/2 * SD = z_alpha/2 * (sigma/sqrt(n)) for known population std
m = t_alpha/2 * SE = t_alpha/2 * (s/sqrt(n)) for unknown population std

then a CI at alpha significance is:
(xbar - m, xbar +m)

e.g. an alpha .05 CI with known pop std and phat = 0.1 is:

z_alpha/2 = 1.96

var of a binom proportion is (pq)/n
var = (.10*.90)/1000 = (.09/1000)
var = .00009
std = sqrt(var) = se = .009487

m = 1.96 * se = 1.96 * .009487
m = .0186
m ~= .019
CI = (.1 - .019, .1 + .019) = (.081, .119)
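
The worked example above can be checked with a short script (same numbers as the card):

```python
import math

X, N = 100, 1000
phat = X / N                           # 0.1
z = 1.96                               # z_alpha/2 for alpha = .05

se = math.sqrt(phat * (1 - phat) / N)  # sqrt(pq/n)
m = z * se                             # margin of error

ci = (phat - m, phat + m)
print(round(ci[0], 3), round(ci[1], 3))   # (0.081, 0.119)
```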

62
Q

Compute a confidence interval for:

n=2000
X=300
confidence level = 99%

A

phat = X/N = 300/2000 = 0.15

z score at alpha_.01/2 = alpha_.005 is z = 2.576

(note: the alpha level is like a p value, where
p value = P(Z > z) = 1 - P(Z <= z))

se = sqrt(phat*(1-phat)/n) = sqrt((.15*.85)/2000) = .00798

m = 2.576 * .00798 = .0206

CI = (.15 - .0206, .15 + .0206) = (.129, .171)
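
A minimal sketch of this computation, following the same normal-approximation recipe as the previous card:

```python
import math

X, N = 300, 2000
phat = X / N                           # 0.15
z = 2.576                              # z_alpha/2 for alpha = .01

se = math.sqrt(phat * (1 - phat) / N)  # sqrt(pq/n)
m = z * se                             # margin of error

ci = (phat - m, phat + m)
print(round(ci[0], 3), round(ci[1], 3))   # (0.129, 0.171)
```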

63
Q

What is the point of a hypothesis test?

A

A hypothesis test is a QUANTITATIVE way to establish how LIKELY it is that your results OCCURRED by CHANCE.

First we establish a BASELINE NULL hypothesis that there is NO difference between baseline CONTROL and alternative EXPERIMENT.

Then establish an ALTERNATIVE hypothesis which specifies some DIFFERENCE from the NULL.

Fundamentally, by the central limit theorem, the sampling distributions of the means from the NULL control group and the experiment group will each be approximately normal.

We can then check whether the group MEANS are statistically different, applying a p value to judge whether the OBSERVED difference is MORE extreme than RANDOM CHANCE alone would suggest.

64
Q

What is a standard error?

A

The standard error (SE) of a test statistic (usually an estimate of a parameter) is the STANDARD DEVIATION of its SAMPLING DISTRIBUTION, or an estimate of that std.

A sampling distribution of a population mean is generated by repeated SAMPLING and RECORDING of means observed. This forms a distribution of DIFFERENT means and THIS DISTRIBUTION has its OWN mean, variance.

Mathematically, the var of the sampling dist obtained is equal to the VARIANCE of the population divided by the SAMPLE SIZE. This is because as the sample size increases, sample mean CLUSTERS more CLOSELY around the MEAN.

Thus, the relationship between the STANDARD ERROR and the standard deviation is such that, FOR a GIVEN SAMPLE SIZE, the standard error equals the standard dev DIVIDED by the SQUARE ROOT of the SAMPLE SIZE.

In other words, the STANDARD ERROR is a measure of DISPERSION of SAMPLE mean around the population mean.

The SE of the population mean is:
sigma_xbar = sigma/sqrt(n)

But since the population std is seldom known, the SE of the mean is usually ESTIMATED as the SAMPLE std divided by sqrt(sample size):

sigma_xbar ~= s/sqrt(n)

where s is SAMPLE std.

The std of the SAMPLE mean is equivalent to the std of the ERROR in the SAMPLE mean with respect to the true mean, since the sample mean is an UNBIASED estimator.
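
A simulation (assumed example) checking the claim: the standard deviation of many sample means approaches sigma/sqrt(n):

```python
import math
import random
import statistics

random.seed(7)
sigma, n, trials = 2.0, 25, 5000

means = []
for _ in range(trials):
    # Repeated sampling from the population, recording each sample mean.
    sample = [random.gauss(0, sigma) for _ in range(n)]
    means.append(sum(sample) / n)

empirical_se = statistics.stdev(means)   # spread of the sampling distribution
theoretical_se = sigma / math.sqrt(n)    # 2 / 5 = 0.4
print(round(empirical_se, 3), theoretical_se)
```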

65
Q

What are the error types in hypothesis testing?

A

Type I error is the REJECTION of a TRUE NULL (conclude a FALSE POSITIVE).

Type II error is the FAILURE to REJECT a FALSE NULL (conclude a FALSE NEGATIVE).

FALSE means the test conclusion drawn is INCORRECT.

A type I error leads to the conclusion that a supposed ALTERNATIVE EFFECT EXISTS when it in fact doesn’t (e.g. like CRYING WOLF, conclude patient has cancer when she doesn’t, fire alarm sounds but there is NO FIRE).

A type II error leads to the conclusion that the ALTERNATIVE EFFECT does NOT exist, when in fact IT DOES (e.g. TEST fails to WORK, conclude patient does NOT HAVE cancer when she DOES, alarm does NOT sound but FIRE EXISTS).