Probability and Statistics Basics Flashcards

Question 1

Q

Prob: What are the two equivalent definitions of events A and B being independent?

Answer

A

P(A,B) = P(A)P(B)

OR

P(A) = P(A | B), provided P(B) > 0

(Pretty darn sure second is correct)

Question 2

Q

Prob: What are the two equivalent definitions of random variables Y1 and Y2 being independent?

Answer

A

F(y1,y2) = F1(y1)F2(y2) (The joint dist factors to the marginal dists)

OR

F1(y1) = F(y1 | Y2 = y2) for all values of y2 (The marginal distribution for either variable is the same as the conditional distribution given any value of the other variable)

(Pretty darn sure second is correct)

Question 3

Q

Prob: Conceptually, what does it mean for A and B to be independent, either as variables or as events?

Answer

A

A and B are independent variables if the value of one variable gives you no information about the value of the other.

A and B are independent events if knowing whether one event happened or not gives you no information on whether the other happened.

Question 4

Q

Prob: What are the 2 equivalent definitions of variables X and Y being uncorrelated?

Answer

A

Their linear correlation coefficient is 0.

OR

E[XY] = E[X]E[Y]. (This actually means their covariance is 0, but their covariance is 0 iff they’re uncorrelated)

Question 5

Q

Prob: Does 2 variables being independent imply they are uncorrelated?

Question 6

Q

Prob: Does 2 variables being uncorrelated imply they are independent?

Question 7

Q

Prob: What is an example of a distribution of 2 variables such that they are uncorrelated, but not independent? Why is it true in this case?

Answer

A

X ~ U(-1,1) and Y = X^2

Here, E(XY) = 0 = E(X)E(Y): XY = X^3 is symmetric around 0, so E(XY) = 0, and E(X) = 0 as well.

But the value of X gives you information about Y; in fact, it determines Y exactly.
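A quick simulation can make this concrete. This is only a sketch: the sample size, the seed, and the |X| > 0.5 condition are arbitrary choices for illustration. The covariance estimate should land near 0, while the conditional mean of Y clearly shifts once we know something about X.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

n = 100_000
xs = [random.uniform(-1.0, 1.0) for _ in range(n)]
ys = [x * x for x in xs]  # Y is a deterministic function of X

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample version of Cov(X,Y) = E[XY] - E[X]E[Y]
cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

# Dependence check: the mean of Y changes when we condition on |X| > 0.5
ys_given_big_x = [y for x, y in zip(xs, ys) if abs(x) > 0.5]
mean_y_given_big_x = sum(ys_given_big_x) / len(ys_given_big_x)

print(f"Cov(X,Y) estimate:        {cov_xy:.4f}")
print(f"E[Y] estimate:            {mean_y:.4f}")
print(f"E[Y | |X| > 0.5] estimate: {mean_y_given_big_x:.4f}")
```

The exact values are Cov(X,Y) = 0, E[Y] = 1/3, and E[Y | |X| > 0.5] = 7/12, so the two means differ even though the covariance vanishes.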

Question 8

Q

Prob: What is Bayes’ Theorem?

Question 9

Q

Prob: What is a useful form of E[X^2]?

Answer

A

E[X^2] = V[X] + (E[X])^2

Question 10

Q

Prob: What are DeMorgan’s Laws?

Question 11

Q

Prob: What is an experiment?

Answer

A

An activity with an observable outcome.

Ex. Rolling a die, or rolling 2 dice, or flipping a coin…

Question 12

Q

Prob: What is an outcome?

Answer

A

A unique result of an experiment.

For example, rolling a 6, where the experiment was rolling a die.

Question 13

Q

Prob: What is a sample space?

Answer

A

All of the possible outcomes of an experiment.

For example, the set {1,2,3,4,5,6}, when the experiment is rolling a die.

Question 14

Q

Prob: What is an event?

Answer

A

A collection of outcomes forming a subset of the sample space.

For example, rolling an even number, if the experiment is rolling a die.

Question 15

Q

Prob: What is a formula for P(A union B)?

Answer

A

P(A) + P(B) - P(A and B)
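One way to check this is to enumerate a small sample space exactly. The sketch below uses a fair die; the events A (even roll) and B (roll greater than 3) are just illustrative picks.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                      # event: roll is even
B = {4, 5, 6}                      # event: roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

lhs = prob(A | B)                      # P(A union B), counted directly
rhs = prob(A) + prob(B) - prob(A & B)  # inclusion-exclusion formula

print(lhs, rhs)  # both are 2/3
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted twice.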

Question 16

Q

Prob: What is linearity of expectation?

Answer

A

E[cX + kY] = cE[X] + kE[Y], even if X and Y are dependent

Question 17

Q

Prob: What is one potentially convenient way to find P(A and B) when A and B are dependent?

Answer

A

P(A)*P(B|A), or P(B)*P(A|B)

Question 18

Q

Stat: What proportion of points drawn from a normal distribution will fall within 1 standard deviation of the mean? 2? 3?

Answer

A

About 68% within 1, 95% within 2, 99.7% within 3
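These percentages can be recomputed from the standard normal CDF. A minimal sketch, using the identity P(|Z| < k) = erf(k / sqrt(2)); the helper name prob_within is made up for this example.

```python
import math

def prob_within(k: float) -> float:
    """P(|Z| < k) for a standard normal Z: the mass within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```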

Question 19

Q

Prob: What is the law of total probability?

Answer

A

If you can decompose the sample space S into n disjoint parts B1,…,Bn (a partition of S), then

P(A) = P(A|B1)P(B1) + … + P(A|Bn)P(Bn)

A common form is

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)

Question 20

Q

Prob: What trick is often used in the denominator of a Bayes’ Rule problem?

Answer

A

Law of total probability
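As a sketch of how this looks in a typical problem, take Bayes' Rule in the form P(D|+) = P(+|D)P(D) / P(+), and expand P(+) with the law of total probability. All the numbers below are hypothetical, picked only to make the arithmetic visible.

```python
from fractions import Fraction

# Hypothetical inputs for a diagnostic-test style problem.
p_d = Fraction(1, 100)                 # P(D): prior probability of the condition
p_pos_given_d = Fraction(95, 100)      # P(+ | D)
p_pos_given_not_d = Fraction(5, 100)   # P(+ | D^c)

# Denominator via the law of total probability:
# P(+) = P(+|D)P(D) + P(+|D^c)P(D^c)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' Rule
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(p_pos, p_d_given_pos)  # 59/1000 and 19/118
```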

Question 21

Q

Prob: What is a probability density function, or pdf f(), typically used for?

Answer

A

For a given probability distribution, you can integrate f() over an interval (or area, or n-d region) to find the probability that the random variable will fall in that interval/region.

Question 22

Q

Prob: What is a cumulative distribution function F(), or cdf, typically used for? How is it related to the pdf f()?

Answer

A

For a given probability distribution of RV X, F(x) = P(X <= x).

If you integrate f() from -inf to a, you get F(a)

Question 23

Q

Prob: What is the formula for the expected value of discrete RV X?

Question 24

Q

Prob: What is the formula for E[g(X)], or the expected value of a function g of continuous RV X, with pdf f()?

Question 25

Q

Prob: V[aX+b]?

Question 26

Q

Prob: Technically, what does it mean for a distribution Y to be memoryless?

Answer

A

P(Y > a+b | Y > b) = P(Y > a), for all a, b >= 0
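For an exponential random variable this can be checked directly from its survival function P(Y > t) = exp(-rate * t). A small sketch; the rate and the values of a and b are arbitrary.

```python
import math

def survival(t: float, rate: float) -> float:
    """P(Y > t) for an exponential random variable with the given rate."""
    return math.exp(-rate * t)

rate, a, b = 0.2, 3.0, 5.0  # arbitrary illustrative values

# P(Y > a+b | Y > b) = P(Y > a+b) / P(Y > b), since {Y > a+b} is contained in {Y > b}
conditional = survival(a + b, rate) / survival(b, rate)
unconditional = survival(a, rate)

print(conditional, unconditional)
```

The two numbers agree up to floating-point rounding, because exp(-rate*(a+b)) / exp(-rate*b) = exp(-rate*a).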

Question 27

Q

Prob: Conceptually, what does it mean for a probability distribution Y to be memoryless?

Answer

A

For an experiment, past behavior has no bearing on future behavior. For example, suppose you’re waiting for a bus whose arrival time follows a memoryless distribution (such as an exponential one). If you wait 5 minutes and there’s still no bus, the distribution of the remaining waiting time, measured from now, is the same as the distribution of the waiting time when the experiment began.

Question 28

Q

Prob: What is the phenomenon being observed in a geometric probability distribution?

Answer

A

We have an event such as a coin toss with probability p of succeeding, and we keep performing attempts until we succeed. The random variable counts the number of attempts up to and including the first success.

Question 29

Q

Prob: If we have probability p of succeeding, what is the probability that geometric random variable Y=y?

Answer

A

P(Y = y) = (1-p)^(y-1) p, for y = 1, 2, 3, …
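Read as code: y-1 failures followed by one success. A minimal sketch, assuming Y counts trials up to and including the first success (so Y starts at 1); the function name is made up.

```python
def geometric_pmf(y: int, p: float) -> float:
    """P(Y = y): y-1 failures followed by a success on trial y."""
    if y < 1:
        return 0.0
    return (1.0 - p) ** (y - 1) * p

p = 0.25
print(geometric_pmf(3, p))  # 0.75 * 0.75 * 0.25 = 0.140625

# The probabilities over y = 1, 2, ... should sum to (essentially) 1
total = sum(geometric_pmf(y, p) for y in range(1, 201))
print(total)
```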

Question 30

Q

Prob: what is the expected value of a geometric random variable with probability p of success?

Question 31

Q

Prob: What phenomenon is observed by a binomial probability distribution?

Answer

A

We have an event, such as flipping a coin, with probability p of success, and we look to see how many of our n trials will be successes.

Question 32

Q

Prob: If our binomial distribution has trials with probability p of success, and we conduct n trials, what is the probability that y will be successes (assuming 0 <= y <= n)? And what is the intuition behind this result?

Answer

A

P(Y = y) = (n choose y) p^y (1-p)^(n-y)

Intuition: p^y (1-p)^(n-y) is the probability of one specific result with y successes (so y specific positions being successes, and the other n-y being failures). But we need the probability of any such result; these outcomes are disjoint, so we sum their probabilities, which amounts to multiplying by the number of such outcomes, n choose y.
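The same reasoning as a short function: count the arrangements with math.comb, then multiply by the probability of any one arrangement. The function name and the n = 5, p = 0.5 example are just for illustration.

```python
import math

def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for n independent trials, each succeeding with probability p."""
    one_arrangement = p ** y * (1.0 - p) ** (n - y)  # one specific pattern of y successes
    return math.comb(n, y) * one_arrangement         # times the number of such patterns

print(binomial_pmf(2, 5, 0.5))  # 10 * (1/32) = 0.3125

# Sanity check: the probabilities over y = 0..n sum to 1
total = sum(binomial_pmf(y, 5, 0.5) for y in range(6))
print(total)
```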

Question 33

Q

Prob: What is the expected value of a binomial distribution with n trials and probability of success p?

Question 34

Q

Prob: What is the expected value of Uniform(a,b)?

Question 35

Q

Prob: What is the pdf f(x) of Uniform(a,b)?

Answer

A

f(x) = 1/(b-a) for a <= x <= b, and 0 otherwise

Question 36

Q

Prob: In words, what is the law of large numbers?

Answer

A

When sampling from a distribution, as the number of samples grows, the sample mean will tend towards the expected value of the distribution.
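A simulation sketch of this: roll a fair die many times and watch the running mean settle near the expected value 3.5. The seed and the checkpoints are arbitrary.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable (arbitrary choice)

checkpoints = (10, 1_000, 100_000)
running_means = {}
total = 0

for i in range(1, checkpoints[-1] + 1):
    total += random.randint(1, 6)  # one roll of a fair die
    if i in checkpoints:
        running_means[i] = total / i

for n, mean in running_means.items():
    print(f"n = {n:>7}: sample mean = {mean:.4f}  (expected value is 3.5)")
```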

Question 37

Q

Stat: In normal distribution notation N(a,b), is b sigma, or sigma²?

Question 38

Q

Normal: If Y follows N(µ,σ²), what is the formula for the z-score Z of Y=y?

Answer

A

Z = (y - µ)/σ

Question 39

Q

Normal: If Y follows N(µ,σ²), what (in words) is the z-score of Y=y?

Answer

A

The number of standard deviations σ that y is above or below the mean µ.

Question 40

Q

Stat: What does the standard normal distribution Z follow?

Answer

A

Z follows N(0,1)

Question 41

Q

Stat: In the context of normal distributions, what is the function shown below, what is its input from an arbitrary normal distribution N, and what does it tell us?

Answer

A

It is the CDF of the standard normal distribution Z.

Its input is the z-score of your result.

It tells us the probability of getting a result with a z-score as low or lower than your result.
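Putting the z-score and the standard normal CDF together, here is a small sketch using Python's statistics.NormalDist. The values of µ, σ, and y are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical N(mu, sigma^2)
y = 130.0                # hypothetical observed result

z = (y - mu) / sigma                  # z-score: standard deviations above the mean
phi_z = NormalDist().cdf(z)           # CDF of the standard normal, evaluated at z
direct = NormalDist(mu, sigma).cdf(y) # same probability, computed on the original scale

print(z, phi_z, direct)
```

Both routes give P(Y <= y), which is why standardizing first and then using the standard normal CDF works for any normal distribution.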

Question 42

Q

Prob: What is the formula for Cov(X,Y)?

Answer

A

Cov(X,Y) = E[XY] - E[X]E[Y]

Question 43

Q

Prob: What is the law of total expectation?

Answer

A

We can find E[X] by taking the weighted sum of the conditional expectations of X given all values of a variable Y.

For example, if Y = Y₁ or Y₂, then

E[X] = E[X|Y₁]P(Y₁) + E[X|Y₂]P(Y₂)
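A small exact example of this weighted sum. The setup is hypothetical: Y picks which die gets rolled (a 4-sided die with probability 1/3, a 6-sided die with probability 2/3), and X is the roll.

```python
from fractions import Fraction

# Hypothetical setup: Y chooses the die, X is the value rolled.
p_y = {"d4": Fraction(1, 3), "d6": Fraction(2, 3)}
faces = {"d4": range(1, 5), "d6": range(1, 7)}

def cond_expectation(die: str) -> Fraction:
    """E[X | Y = die]: the average face of that die."""
    values = faces[die]
    return Fraction(sum(values), len(values))

# Law of total expectation: weight each conditional expectation by P(Y = die)
total_expectation = sum(cond_expectation(d) * p_y[d] for d in p_y)

# Direct computation from the joint distribution, for comparison
direct = sum(p_y[d] * Fraction(1, len(faces[d])) * x for d in p_y for x in faces[d])

print(total_expectation, direct)  # both are 19/6
```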

Question 44

Q

Prob: What is the formula for the conditional expectation of X given that Y=y?

Question 45

Q

Prob: How can the law of total expectation be written concisely when calculating E[X] using an additional variable Y?

Answer

A

E[X] = E[E[X|Y]]

Question 46

Q

Stat: What is the covariance matrix of random vector [A, B, C]?

Answer

A

(In brackets):

V(A) Cov(A,B) Cov(A,C)

Cov(B,A) V(B) Cov(B,C)

Cov(C,A) Cov(C,B) V(C)

Question 47

Q

Question 48

Q

Stat: What is our estimator Ø_h (theta-hat) a function of?

Answer

A

The data X1, X2, X3…! (This is important!)

Question 49

Q

Stat: Given what our estimator Ø_h is a function of, what 2 important properties does it have?

Answer

A

It is a random variable

Which means it has its own probability distribution, with E[Ø_h], V[Ø_h], etc

Question 50

Q

Stat: What, in these flashcards, is my notation for Theta and Theta-hat?

Answer

A

Ø and Ø_h respectively.

Question 51

Q

Stat: How do you find the sample variance s² given data X1,…,Xn?

Question 52

Q

Stat: Why is n-1 used in the formula for sample variance and sample standard deviation in place of n?

Answer

A

To correct a bias. Dividing by n tends to slightly underestimate the actual variance of the distribution; dividing by n-1 instead makes the sample variance s² an unbiased estimator. (The sample standard deviation s is still slightly biased even with n-1, but less so than with n.)
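The bias can be seen exactly by enumerating every possible sample. The sketch below assumes i.i.d. samples of size 3 from a fair die, which is just a convenient small example, and averages both versions of the variance estimate over all 216 equally likely samples.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # fair six-sided die
n = 3                # sample size (small, so we can enumerate every sample)

pop_mean = Fraction(sum(faces), 6)
pop_var = sum((x - pop_mean) ** 2 for x in faces) / 6  # true variance, 35/12

samples = list(product(faces, repeat=n))  # all 6^3 equally likely samples

def average_estimate(divisor: int) -> Fraction:
    """Expected value of sum((x - xbar)^2) / divisor over all samples."""
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        squared_deviations = sum((x - xbar) ** 2 for x in sample)
        total += squared_deviations / divisor
    return total / len(samples)

avg_div_n = average_estimate(n)               # divides by n: too small on average
avg_div_n_minus_1 = average_estimate(n - 1)   # divides by n-1: matches pop_var

print(pop_var, avg_div_n, avg_div_n_minus_1)  # 35/12, 35/18, 35/12
```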

Question 53

Q

Stat: What does it mean for an estimator Ø_h to be accurate?

Answer

A

It has a mean close to the true value Ø; in other words, its bias is low.

Question 54

Q

Stat: What does it mean for an estimator Ø_h to be precise?

Answer

A

It tends to produce similar answers each time; in other words, its variance is low.

Question 55

Q

Stat: What is the formula for MSE(Ø_h), or Mean Squared Error?

Answer

A

MSE(Ø_h) = E[(Ø_h - Ø)²]

= V(Ø_h) + bias(Ø_h)²

Question 56

Q

Stat: What is the formula for the bias of Ø_h? What does it mean for Ø_h to be unbiased?

Answer

A

Bias(Ø_h) = E[Ø_h - Ø]

Ø_h is unbiased iff Bias(Ø_h) = 0, i.e. iff the expected value of Ø_h is the correct value Ø.

Question 57

Q

Stat: What is the standard error of Ø_h? And what in general does this quantity represent?

Answer

A

SE(Ø_h) = sqrt(V[Ø_h])

It gives a sense of the typical error of the estimator, or the typical distance it will be from its mean.

Question 58

Q

Question 59

Q

Stat: What is probably the most common measure of the quality of estimator Ø_h?

Answer

A

Mean Squared Error, or MSE(Ø_h)

Question 60

Q

Answer

A

It describes how likely a distribution with those parameters was to produce that dataset.

(I think it is often talked about in the context of a specific family of distributions. So we might say, what is the likelihood of a normal distribution with these parameters, given this dataset?)

Question 61

Q

Question 62

Q

Stat: What does i.i.d. stand for?

Answer

A

Independent and Identically Distributed

Question 63

Q

Stat: What is a MVUE?

Answer

A

It’s a Minimum-Variance Unbiased Estimator. So for some parameter Ø, it’s the unbiased estimator Ø_h with the lowest variance out of all the unbiased estimators.

Question 64

Q

Stat: When we find a Maximum Likelihood Estimator, Min-Var Unbiased Estimator, Method of Moments Estimator, or something similar, do we typically find it in the context of some assumed distribution family (i.e. assume the distribution is normal, exponential, etc), or estimate parameters without a suspected distribution?

Answer

A

While sometimes we estimate parameters without a suspected distribution, such as distribution mean and variance, we more often use an assumed distribution family.

(This is mostly my opinion, and also me wanting to remember that when we for example “find the MLE”, it generally has quite a bit of structure due to an assumed distribution that we can differentiate/optimize.)

Answer 49

A

It is the estimator Ø_h of Ø that maximizes the likelihood of your data.

So, generally for some assumed distribution family such as Exponential Distributions, you try to find an estimator lambda_hat for parameter lambda that leads to the exponential distribution that was most likely to produce this data.
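For the exponential example, the log-likelihood of i.i.d. data is n*log(lambda) - lambda*sum(x_i), and setting its derivative to 0 gives lambda_hat = n / sum(x_i) = 1 / (sample mean). The sketch below checks this numerically on a made-up dataset.

```python
import math

data = [0.5, 1.2, 0.3, 2.0, 1.0]  # hypothetical observations

def log_likelihood(lam: float, xs: list) -> float:
    """Log-likelihood of i.i.d. Exponential(lam) data: n*log(lam) - lam*sum(xs)."""
    return len(xs) * math.log(lam) - lam * sum(xs)

# Closed-form maximizer: lambda_hat = n / sum(x_i) = 1 / sample mean
lam_hat = len(data) / sum(data)

print(f"lambda_hat = {lam_hat:.4f}")
for lam in (0.8 * lam_hat, lam_hat, 1.2 * lam_hat):
    print(f"  log-likelihood at {lam:.3f}: {log_likelihood(lam, data):.4f}")
```

The log-likelihood is highest at lambda_hat and drops off on either side, which is the "structure due to an assumed distribution" that makes the optimization tractable.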

Answer 50

A

For L and U, which are random variables based on your observations X_i, P(L <= Ø <= U) = 95%.

Meaning, when you sample your X_i’s and calculate L and U, the probability that they end up satisfying L <= Ø <= U is 95%.

Answer 51

A

Correct: “I am 95% confident that my calculated confidence interval [L,U] contains Ø.”

Incorrect: “There is a 95% chance that Ø is in the interval [L,U].”

The latter is incorrect because the true population parameter Ø is not a random variable. It is a set value that just exists in the world, and it either is in the interval or it isn’t; there is no chance involved.

Answer 52

A

If I compute a high number of 95% confidence intervals, over time, about 95% of them will contain their respective parameters.
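This long-run reading can be simulated. The sketch below makes several simplifying assumptions: normal data, a known sigma (so the interval is xbar ± z * sigma / sqrt(n)), and arbitrary values for mu, sigma, n, the seed, and the number of trials.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)  # fixed seed so the run is repeatable (arbitrary choice)

mu, sigma, n = 10.0, 2.0, 25     # hypothetical true mean, known sd, sample size
z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
half_width = z * sigma / n ** 0.5

trials = 2_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = fmean(sample)
    lower, upper = xbar - half_width, xbar + half_width
    if lower <= mu <= upper:     # did this interval capture the true mean?
        hits += 1

coverage = hits / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

Each individual interval either contains mu or it doesn't; the 95% describes how often the procedure succeeds across repetitions.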

Answer 53

A

A pivot is an expression that is:

A function of the observable R.V.’s (i.e. the observations X_i)
And of the unknown parameter Ø,
But no other unknowns.
And whose distribution does not depend on the unknown Ø.

This is an important one!

Answer 54

A

It can be used to create a confidence interval for Ø.

Answer 55

A

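The Fisher information measures how much information the data carry about the parameter Ø. For large samples (under regularity conditions), the sampling distribution of the maximum likelihood estimate is approximately normal, with variance given by the inverse of the Fisher information. This approximate distribution can be calculated and used to quantify the uncertainty of your parameter estimate.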

Answer 56

A

Given a large enough sample size, we can find an approximate (normal) distribution for the sample mean of the X_i’s, but we don’t need to know much about the underlying distribution of X_i beyond it having finite variance! It doesn’t need to be of a specific family, and it can be an insane looking distribution, but we can still find an approximate distribution of the sample mean.

Using this, we can also find a confidence interval for the population mean, which is great.

Answer 57

A

We don’t need the CLT when we think the underlying distribution is normal; there are good pivots for estimating both the mean and the variance.

The CLT is more useful when the underlying distribution is arbitrary and/or very strange.

Answer 58

A

If Z is the standard normal N(0,1), z_a is such that

P(Z > z_a) = a

Graphically, or verbally: the probability of a draw from Z landing above z_a is a.
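Since P(Z > z_a) = a means P(Z <= z_a) = 1 - a, z_a can be computed from the inverse CDF. A short sketch using statistics.NormalDist; the helper name z_upper is made up.

```python
from statistics import NormalDist

def z_upper(a: float) -> float:
    """Return z_a such that P(Z > z_a) = a for the standard normal Z."""
    return NormalDist().inv_cdf(1.0 - a)

for a in (0.05, 0.025):
    z_a = z_upper(a)
    tail = 1.0 - NormalDist().cdf(z_a)  # check: the upper-tail area should be a
    print(f"a = {a}: z_a = {z_a:.4f}, upper tail = {tail:.4f}")
```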

Answer 59

A

The following, which can be similarly written for t dist, chi-squared dist, etc. But it’s especially common to use the normal version, due to the CLT and all the great info we have about normals.

Answer 60

A

Null Hypothesis H₀ is the “status quo” or “safe hypothesis”. It is the baseline, and we are looking for significant evidence that it is not true. For example, when testing whether two groups have different performance on a task, the null hypothesis is that their performance is the same.

Answer 61

A

The alternative hypothesis is an idea that breaks from the “status quo” or “baseline assumption”, for which we are looking to see if there is significant evidence. For example, when testing whether two groups have different performance on a task, the alternative hypothesis could be that group A performs better than group B.

Answer 62

A

The test statistic in a hypothesis test is a function of your observable data which you will use to quantitatively examine your null and alternative hypotheses. For example, when testing whether two groups have different performance on a task, the test statistic might be the difference in mean performances of the 2 groups.

Answer 63

A

“Reject the null hypothesis in favor of the alternative,” and “Fail to reject the null hypothesis.”

Answer 64

A

It is the predecided range of (extreme) values of the test statistic in which we will “reject the null hypothesis in favor of the alternative.”

Answer 65

A

It is when we reject the null hypothesis H₀ even though it is true.

Answer 66

A

It is when we fail to reject the null H₀ even though the alternative H₁ is true.

Answer 67

A

Alpha = 0.05

Answer 68

A

A low value of alpha like 0.001 means that we require very compelling evidence (or very extreme values of our test statistic) in order to reject the null hypothesis.

Conversely, a high value like 0.20 means that we have very relaxed and un-stringent requirements for rejecting our null hypothesis.

Answer 69

A

Once you conduct your experiment and calculate the test statistic, the p-value is the probability of getting results that are as extreme or more extreme than your test statistic, under the assumption that the null hypothesis is true.

Answer 70

A

The p-value. (We need to calculate the p-value from the test statistic under the assumption of the null, in order to see how unlikely our result is under the null.)

Answer 71

A

If, say, p-val = 0.10 and alpha = 0.05, then our results are not as extreme as our alpha requires, and so we fail to reject the null hypothesis.

Answer 72

A

If, say, p-val = 0.01 and alpha = 0.05, then our results are more extreme than our alpha requires, and so we reject the null hypothesis in favor of the alternative hypothesis.
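The decision rule is just a comparison of the p-value with alpha. The sketch below assumes a two-sided z-test (test statistic approximately standard normal under the null); the function names and the test statistic value 2.3 are made up for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z_stat: float) -> float:
    """P(|Z| >= |z_stat|) under the null, for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))

def decide(p_value: float, alpha: float) -> str:
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

alpha = 0.05
print(decide(0.10, alpha))  # fail to reject H0
print(decide(0.01, alpha))  # reject H0

p = two_sided_p_value(2.3)  # hypothetical observed z statistic
print(f"p-value = {p:.4f} -> {decide(p, alpha)}")
```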

Answer 73

A

Every marginal distribution of an X_i is univariate normal (and every subset of the X_i’s has its own multivariate normal joint distribution).
Any conditional distribution f(X_j | X_i = x_i) is univariate normal.
A pair of variables X_i, X_j is independent iff they are uncorrelated (iff their covariance is 0).
All linear combinations of the variables are univariate normal (unless all of the coefficients are 0, of course).

Answer 74

A

We reject only if the test statistic is extreme in one prespecified direction. For example, if the null is µ = 0, the alternative is µ > 0.

Answer 75

A

We reject if the test statistic is extreme in either direction. For example, if the null is µ = 0, the alternative is µ ≠ 0, and we reject if the test statistic is extremely high or extremely low.

Answer 76

A

The probability that we correctly reject H₀ when H₁ is true. In other words, our ability to avoid type 2 errors.

Answer 77

A

In classical/frequentist statistics, the parameter Ø is constant. We examine it using estimators Ø_h, we quantify our uncertainty of its value using confidence intervals, and we test theories using hypothesis tests and p-values.

Answer 78

A

The parameter Ø is viewed as a random variable, and we quantify our beliefs about its potential values using a probability distribution π.

Answer 79

A

Corr(X,Y) = Cov(X,Y)/sqrt[V(X)V(Y)]

Answer 80

A

You incorporate using a method looking very similar to Bayes’ law.

Specifically:

Answer 81

A

With enough data, the impact of the prior distribution on the posterior distribution tends towards 0.

Answer 82

A

If you have n>2 groups, you test the null hypothesis that the means of all the groups are equal, against the alternative that there is some difference among the means.

Answer 83

A

Analysis of Variance

Answer 84

A

Using a global F test, which looks at the probability of seeing your observed sample means for all groups under the null assumption that all of the groups’ means are equal.