Probability and Statistics Basics Flashcards

1
Q

Prob: What are the two equivalent definitions of events A and B being independent?

A

P(A,B) = P(A)P(B)

OR

P(A) = P(A | B), assuming P(B) > 0 (conditioning on B doesn't change the probability of A)

(Pretty darn sure second is correct)

2
Q

Prob: What are the two equivalent definitions of random variables Y1 and Y2 being independent?

A

F(y1,y2) = F1(y1)F2(y2) (The joint dist factors to the marginal dists)

OR

F1(y1) = F(y1 | Y2 = y2) for all values of y2 (The marginal distribution for either variable is the same as the conditional distribution given any value of the other variable)

(Pretty darn sure second is correct)

3
Q

Prob: Conceptually, what does it mean for A and B to be independent, either as variables or as events?

A

A and B are independent variables if the value of one variable gives you no information about the value of the other.

A and B are independent events if knowing whether one event happened or not gives you no information on whether the other happened.

4
Q

Prob: What are the 2 equivalent definitions for variables X and Y to be uncorrelated?

A

Their linear correlation coefficient is 0.

OR

E[XY] = E[X]E[Y]. (This actually means their covariance is 0, but their covariance is 0 iff they’re uncorrelated)

5
Q

Prob: Does two variables being independent imply they are uncorrelated?

A

Yes

6
Q

Prob: Does two variables being uncorrelated imply they are independent?

A

No

7
Q

Prob: What is an example of a distribution of 2 variables such that they are uncorrelated, but not independent? Why is it true in this case?

A

X ~ U(-1,1) and Y = X^2

Here, E(XY) = 0 = E(X)E(Y), because the distribution of XY = X^3 is symmetric around 0 (and E(X) = 0)

But the value of X gives you information about Y; in fact, it determines Y exactly.
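
(A quick numpy check of this example, just for illustration: the sample correlation is near 0, yet Y is a deterministic function of X.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100_000)
y = x**2

# Sample correlation is approximately 0 (uncorrelated)...
print(np.corrcoef(x, y)[0, 1])

# ...but Y is a deterministic function of X: knowing X tells you Y exactly.
print(np.allclose(y, x**2))  # True
```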

8
Q

Prob: What is Bayes’ Theorem?

A

P(A|B) = P(B|A)P(A) / P(B)
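
(A worked sketch of Bayes' Theorem in Python; the test-accuracy numbers below are made up for illustration.)

```python
# Hypothetical numbers: a test with 99% sensitivity, a 5% false-positive
# rate, for a condition with 1% prevalence.
p_pos_given_sick = 0.99   # P(B|A)
p_sick = 0.01             # P(A)
p_pos_given_well = 0.05   # P(B|A^c)

# Denominator via the law of total probability (see card 20).
p_pos = p_pos_given_sick * p_sick + p_pos_given_well * (1 - p_sick)

# Bayes' Theorem: P(A|B) = P(B|A)P(A) / P(B)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)  # ~0.167
```
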
9
Q

Prob: What is a useful form of E[X^2]?

A

E[X^2] = V[X] + (E[X])^2

10
Q

Prob: What are DeMorgan’s Laws?

A

(A union B)^c = A^c intersect B^c

(A intersect B)^c = A^c union B^c

(The complement of a union is the intersection of the complements, and vice versa.)

11
Q

Prob: What is an experiment?

A

An activity with an observable outcome.

Ex. Rolling a die, or rolling 2 dice, or flipping a coin…

12
Q

Prob: What is an outcome?

A

A unique result of an experiment.

For example, rolling a 6, where the experiment was rolling a die.

13
Q

Prob: What is a sample space?

A

All of the possible outcomes of an experiment.

For example, {1,2,3,4,5,6}, when the experiment is rolling a die.

14
Q

Prob: What is an event?

A

A collection of outcomes forming a subset of the sample space.

For example, rolling an even number, if the experiment is rolling a die.

15
Q

Prob: What is a formula for P(A union B)?

A

P(A) + P(B) - P(A and B)

16
Q

Prob: What is linearity of expectation?

A

E[cX + kY] = cE[X] + kE[Y], even if X and Y are dependent

17
Q

Prob: What is one potentially convenient way to find P(A and B) when A and B are dependent?

A

P(A)*P(B|A), or P(B)*P(A|B)

18
Q

Stat: What proportion of points drawn from a normal distribution will fall within 1 standard deviation? 2? 3?

A

68% within 1, 95% within 2, 99.7% within 3
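
(A quick check of these proportions with scipy, just for illustration:)

```python
from scipy.stats import norm

# P(-k < Z < k) for k = 1, 2, 3 standard deviations of the standard normal
for k in (1, 2, 3):
    print(k, norm.cdf(k) - norm.cdf(-k))
# 1 ~0.6827, 2 ~0.9545, 3 ~0.9973
```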

19
Q

Prob: What is the law of total probability?

A

If you can partition the sample space S into n disjoint parts B1,…,Bn, then

P(A) = P(A|B1)P(B1) + … + P(A|Bn)P(Bn)

A common form is

P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)

20
Q

Prob: What trick is often used in the denominator of a Bayes’ Rule problem?

A

Law of total probability

21
Q

Prob: What is a probability density function, or pdf f(), typically used for?

A

For a given probability distribution, you can integrate f() over an interval (or area, or n-dimensional region) to find the probability that the outcome of an experiment will fall in that interval/region.

22
Q

Prob: what is a cumulative density function F(), or cdf, typically used for? How is it related to the pdf f()?

A

For a given probability distribution of RV X, F(x) = P(X <= x)

If you integrate f() from -inf to a, you get F(a)

23
Q

Prob: What is the formula for the expected value of discrete RV X?

A

E[X] = sum over all values x of x * p(x), where p(x) = P(X = x) is the pmf of X

24
Q

Prob: What is the formula for E[g(X)], or the expected value of a function g of continuous RV X, with pdf f()?

A

E[g(X)] = integral from -inf to +inf of g(x) f(x) dx

25
Q

Prob: V[aX+b]?

A

a^2 V[X]

26
Q

Prob: Technically, what does it mean for a distribution Y to be memoryless?

A

P(Y > a+b|Y > b) = P(Y > a)

27
Q

Prob: Conceptually, what does it mean for a probability distribution Y to be memoryless?

A

For an experiment, past behavior has no bearing on future behavior. For example, suppose you're waiting for a bus whose arrival time follows a memoryless distribution (such as an exponential one). If you wait 5 minutes and there's still no bus, the probability distribution of when it will arrive, starting now, is the same as it was when the experiment began.

28
Q

Prob: What is the phenomenon being observed in a geometric probability distribution?

A

We have an event such as a coin toss with probability p of succeeding, and we keep performing attempts until we succeed.

29
Q

Prob: If we have probability p of succeeding, what is the probability that geometric random variable Y=y?

A

(1-p)^(y-1) * p

30
Q

Prob: what is the expected value of a geometric random variable with probability p of success?

A

1/p
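
(A simulation sketch of the geometric setup, with a made-up p = 0.3, checking both the pmf from the previous card and the 1/p mean:)

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3

# Simulate the geometric experiment directly: count trials until success.
def trials_until_success():
    count = 1
    while rng.random() >= p:  # failure with probability 1-p
        count += 1
    return count

samples = np.array([trials_until_success() for _ in range(100_000)])
print(samples.mean())            # ~1/p = 3.33
print((1 - p)**(5 - 1) * p)      # P(Y = 5) from the pmf, ~0.072
print(np.mean(samples == 5))     # simulated P(Y = 5), ~0.072
```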

31
Q

Prob: What phenomenon is observed by a binomial probability distribution?

A

We have an event, such as flipping a coin, with probability p of success, and we look to see how many of our n trials will be successes.

32
Q

Prob: If our binomial distribution has events with probability p of success, and we conduct n trials, what is the probability of exactly y successes (assuming 0 <= y <= n)? And what is the intuition behind this result?

A

(n choose y) * p^y * (1-p)^(n-y). The term p^y (1-p)^(n-y) is the probability of one specific outcome with y successes (y specific positions being successes, and the other n-y being failures). But we need the probability of any such outcome; these outcomes are disjoint, so we sum their probabilities, which amounts to multiplying by the number of such outcomes, n choose y.

33
Q

Prob: What is the expected value of a binomial distribution with n trials and probability of success p?

A

np
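
(A quick check, with made-up n, p, and y, that the pmf formula from card 32 matches scipy and that the mean is n*p:)

```python
from math import comb
from scipy.stats import binom

n, p, y = 10, 0.3, 4  # made-up values

# The pmf formula from the previous card...
by_hand = comb(n, y) * p**y * (1 - p)**(n - y)

# ...matches scipy's binomial pmf, and the expected value is n*p.
print(by_hand, binom.pmf(y, n, p))  # both ~0.2001
print(binom.mean(n, p))             # 3.0 = n*p
```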

34
Q

Prob: What is the expected value of Uniform(a,b)

A

(a+b)/2

35
Q

Prob: What is the pdf f(x) of Uniform(a,b)

A

f(x) = 1/(b-a) for a <= x <= b (and 0 elsewhere)

36
Q

Prob: In words, what is the law of large numbers?

A

When sampling from a distribution, as the number of samples grows, the sampling mean will tend towards the expected value of the distribution.
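
(A simulation sketch, using an exponential distribution with expected value 0.5 as a made-up example:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential with scale 0.5 has expected value 0.5; the sample mean closes in.
for n in (10, 1_000, 100_000):
    print(n, rng.exponential(scale=0.5, size=n).mean())
```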

37
Q

Stat: In normal distribution notation N(a,b), is b sigma, or sigma^2?

A

sigma^2

38
Q

Normal: If Y follows N(µ,σ^2), what is the formula for the z-score Z of Y=y?

A

Z = (y - µ)/σ

39
Q

Normal: If Y follows N(µ,σ^2), what (in words) is the z-score of Y=y?

A

The number of standard deviations σ that y is above or below the mean µ.

40
Q

Stat: What does the standard normal distribution Z follow?

A

Z follows N(0,1)

41
Q

Stat: In the context of normal distributions, what is the function Φ, what is its input from an arbitrary normal distribution N, and what does it tell us?

A

It is the CDF of the standard normal distribution Z.

Its input is the z-score of your result.

It tells us the probability of getting a result with a z-score as low or lower than your result.
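
(A small sketch tying the last few cards together, with made-up µ, σ, and y: compute a z-score, then feed it to the standard normal CDF, which scipy exposes as norm.cdf:)

```python
from scipy.stats import norm

mu, sigma = 100, 15   # made-up normal distribution N(100, 15^2)
y = 130

z = (y - mu) / sigma  # z-score: standard deviations above the mean
print(z)              # 2.0
print(norm.cdf(z))    # Phi(z) = P(Z <= 2), ~0.977
```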

42
Q

Prob: What is the formula for Cov(X,Y)?

A

Cov(X,Y) = E[XY] - E[X]E[Y]
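
(A numpy check of this formula on made-up correlated data:)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 2 * x + rng.normal(size=100_000)  # built to covary with x (Cov ~ 2)

# Cov(X,Y) = E[XY] - E[X]E[Y], compared against numpy's estimate
by_hand = np.mean(x * y) - np.mean(x) * np.mean(y)
print(by_hand, np.cov(x, y)[0, 1])  # both ~2
```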

43
Q

Prob: What is the law of total expectation?

A

We can find E[X] by taking the weighted sum of the conditional expectations of X given all values of a variable Y.

For example, if Y takes only the two values y1 and y2, then

E[X] = E[X|Y=y1]P(Y=y1) + E[X|Y=y2]P(Y=y2)

44
Q

Prob: What is the formula for the conditional expectation of X given that Y=y?

A

Discrete case: E[X|Y=y] = sum over all x of x * P(X=x | Y=y).

Continuous case: E[X|Y=y] = integral of x * f(x|y) dx, where f(x|y) is the conditional density.

45
Q

Prob: How can the law of total expectation be written concisely when trying to calculate E[X] based on an additional variable Y?

A

E[X] = E[E[X|Y]]
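
(A simulation sketch of the law of total expectation, with a made-up two-valued Y:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Y is a biased coin flip; X's distribution depends on Y.
y = rng.random(n) < 0.3                           # P(Y=1) = 0.3
x = np.where(y, rng.normal(5, 1, n), rng.normal(1, 1, n))

# E[X] directly vs. the weighted sum E[X|Y=1]P(Y=1) + E[X|Y=0]P(Y=0)
print(x.mean())
print(x[y].mean() * y.mean() + x[~y].mean() * (1 - y.mean()))
# both ~ 0.3*5 + 0.7*1 = 2.2
```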

46
Q

Stat: What is the covariance matrix of random vector [A, B, C]?

A

(In brackets):

V(A)      Cov(A,B)  Cov(A,C)
Cov(B,A)  V(B)      Cov(B,C)
Cov(C,A)  Cov(C,B)  V(C)
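
(For illustration, numpy's np.cov computes exactly this matrix when each row is one variable:)

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10_000)
b = a + rng.normal(size=10_000)  # correlated with a by construction
c = rng.normal(size=10_000)

# With one variable per row, np.cov returns the 3x3 matrix above:
# variances on the diagonal, covariances off-diagonal.
print(np.cov(np.vstack([a, b, c])))
```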

47
Q
A
48
Q

Stat: What is our estimator Ø_h (theta-hat) a function of?

A

The data X1, X2, X3…! (This is important!)

49
Q

Stat: Given what our estimator Ø_h is a function of, what 2 important properties does it have?

A

It is a random variable

Which means it has its own probability distribution, with E[Ø_h], V[Ø_h], etc

50
Q

Stat: What, in these flashcards, is my notation for Theta and Theta-hat?

A

Ø and Ø_h respectively.

51
Q

Stat: How do you find the sample variance s2 given data X1,…,Xn?

A

s^2 = [1/(n-1)] * sum over i of (Xi - X_bar)^2, where X_bar is the sample mean of the Xi's.

52
Q

Stat: Why is n-1 used in the formula for sample variance and sample standard deviation in place of n?

A

To correct for bias in estimating the actual variance of the distribution: with a denominator of n, the sample variance tends to slightly underestimate its target, and using n-1 makes it an unbiased estimator of the variance. (The sample standard deviation remains slightly biased even with n-1, since unbiasedness doesn't survive the square root, but the correction reduces the bias.)
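
(A simulation sketch of the bias: numpy's ddof argument switches between the n and n-1 denominators:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
samples = rng.normal(0, 2, (50_000, n))  # true variance is 4.0

print(np.var(samples, axis=1, ddof=0).mean())  # divides by n:   ~3.2, too low
print(np.var(samples, axis=1, ddof=1).mean())  # divides by n-1: ~4.0, unbiased
```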

53
Q

Stat: What does it mean for an estimator Ø_h to be accurate?

A

It has a mean close to the true value Ø; in other words, its bias is low.

54
Q

Stat: What does it mean for an estimator Ø_h to be precise?

A

It tends to produce similar answers each time; in other words, its variance is low.

55
Q

Stat: What is the formula for MSE(Ø_h), or Mean Squared Error?

A

MSE(Ø_h) = E[(Ø_h - Ø)^2]

= V(Ø_h) + Bias(Ø_h)^2
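
(A simulation check of the decomposition, using a deliberately biased estimator of a mean as a made-up example:)

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.0, 10

# A deliberately biased estimator of theta: the sample mean plus 0.5.
estimates = rng.normal(theta, 1, (200_000, n)).mean(axis=1) + 0.5

mse = np.mean((estimates - theta)**2)
var = estimates.var()
bias_sq = (estimates.mean() - theta)**2
print(mse, var + bias_sq)  # both ~ 0.1 + 0.25 = 0.35
```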

56
Q

Stat: What is the formula for the bias of Ø_h? What does it mean for Ø_h to be unbiased?

A

Bias(Ø_h) = E[Ø_h - Ø]

Ø_h is unbiased iff Bias(Ø_h) = 0, i.e. iff the expected value of Ø_h is the true value Ø.

57
Q

Stat: What is the standard error of Ø_h? And what in general does this quantity represent?

A

SE(Ø_h) = sqrt(V[Ø_h])

It gives an idea of the typical error of the estimator, or the typical distance it will fall from its mean.

58
Q
A
59
Q

Stat: What is probably the most common measure of the quality of estimator Ø_h?

A

Mean Squared Error, or MSE(Ø_h)

60
Q

Stat: Conceptually, what does the likelihood of a distribution's parameters, given a dataset, describe?

A

It describes how likely a distribution with those parameters was to produce that dataset.

(I think it is often talked about in the context of a specific family of distributions. So we might say: what is the likelihood of a normal distribution with these parameters, given this dataset?)

61
Q
A
62
Q

Stat: What does i.i.d. stand for?

A

Independent and Identically Distributed

63
Q

Stat: What is a MVUE?

A

It’s a Minimum-Variance Unbiased Estimator. So for some parameter Ø, it’s the unbiased estimator Ø_h with the lowest variance out of all the unbiased estimators.

64
Q

Stat: When we find a Maximum Likelihood Estimator, Min-Var Unbiased Estimator, Method of Moments Estimator, or something similar, do we typically find it in the context of some assumed distribution family (i.e. assume the distribution is normal, exponential, etc), or estimate parameters without a suspected distribution?

A

While sometimes we estimate parameters without a suspected distribution, such as distribution mean and variance, we generally more often use an assumed distribution family.

(This is mostly my opinion, and also me wanting to remember that when we for example “find the MLE”, it generally has quite a bit of structure due to an assumed distribution that we can differentiate/optimize.)

65
Q

Stat: What is a Maximum Likelihood Estimator?

A

It is the estimator Ø_h of Ø that maximizes the likelihood of your data.

So, generally for some assumed distribution family such as Exponential Distributions, you try to find an estimator lambda_hat for parameter lambda that leads to the exponential distribution that was most likely to produce this data.

66
Q

Stat: If you have an assumed probability distribution family with one parameter, such as an exponential distribution with parameter lambda, how do you find the Maximum Likelihood Estimator lambda_hat for lambda? And what 2 tricks are most commonly used in finding the MLE?

A

Write down the likelihood L(lambda) = f(X1; lambda) * … * f(Xn; lambda), the product of the densities of your i.i.d. observations, and find the lambda_hat that maximizes it (typically by differentiating and setting the derivative to 0).

The 2 most common tricks: (1) maximize the log-likelihood instead, since the log turns the product into a sum and has the same maximizer; (2) drop factors that don't depend on the parameter before optimizing.
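
(A sketch for the exponential case, my own illustration: the log-likelihood is n*log(lambda) - lambda*sum(x), so the closed-form MLE is lambda_hat = 1/sample mean, which a numerical optimizer agrees with:)

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
true_lambda = 2.0
x = rng.exponential(scale=1 / true_lambda, size=10_000)

# Negative log-likelihood of Exponential(lam): -(n*log(lam) - lam*sum(x))
def neg_log_lik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
print(result.x)      # numerical MLE, ~2.0
print(1 / x.mean())  # closed-form MLE, lambda_hat = 1/sample mean
```
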
67
Q

Stat: Given observations X1,…,Xn, what is the maximum likelihood estimator for a population proportion: for example, the proportion of red balls if we’re drawing from red, green or blue?

A

(number of reds drawn)/n

68
Q

Stat: Define a 95% confidence interval [L,U] for parameter Ø.

A

For L and U, which are random variables computed from your observations Xi, P(L <= Ø <= U) = 95%.

Meaning, when you sample your Xi's and calculate L and U, the probability that they end up with L <= Ø <= U is 95%.

69
Q

Stat: What is the correct way to interpret 95% confidence interval [L,U] for parameter Ø?

What is a common incorrect way of interpreting it, and why is this incorrect?

A

Correct: “I am 95% confident that my calculated confidence interval [L,U] contains Ø.”

Incorrect: “There is a 95% chance that Ø is in the interval [L,U].”

The latter is incorrect because the true population parameter Ø is not a random variable. It is a set value that just exists in the world, and it either is in the interval or it isn’t; there is no chance involved.

70
Q

Stat: What is the way of interpreting a 95% confidence interval that involves considering if you computed many 95% confidence intervals?

A

If I compute a high number of 95% confidence intervals, over time, about 95% of them will contain their respective parameters.

71
Q

Stat: Given observations Xi and an unknown parameter Ø, what is a pivot?

A

A pivot is an expression that is:

  • A function of the observable R.V.’s (i.e. the observations Xi)
  • And of the unknown parameter Ø,
  • But of no other unknowns,
  • And whose distribution does not depend on the unknown Ø.

This is an important one!

72
Q

Stat: Given observations Xi and an unknown parameter Ø, if you have a pivot, what can the pivot be used for?

A

It can be used to create a confidence interval for Ø.

73
Q

Stat: At a high level, when finding a maximum likelihood estimator, what is the concept of Fisher Information, and what key use does it have?

A

Fisher Information measures how much information the data carry about the parameter. Its key use: for large samples, the sampling distribution of the maximum likelihood estimate is approximately normal, with variance given by the inverse of the Fisher Information, so it can be calculated and used to quantify the uncertainty of your parameter estimate.

74
Q

Stat: What is the Central Limit Theorem?

A

If X1,…,Xn are i.i.d. with mean µ and variance σ^2, then for large n the sample mean X_bar is approximately N(µ, σ^2/n). Equivalently, (X_bar - µ)/(σ/sqrt(n)) approaches the standard normal N(0,1) as n grows.
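
(A simulation sketch: even for a very non-normal underlying distribution, the sample means behave like the normal the CLT predicts. Numbers here are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # sample size

# The underlying distribution is exponential (mean 1, variance 1), far from normal...
means = rng.exponential(1.0, size=(100_000, n)).mean(axis=1)

# ...yet the sample means are approximately N(1, 1/n), as the CLT predicts.
print(means.mean())  # ~1.0
print(means.std())   # ~1/sqrt(50) ~ 0.141
```
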
75
Q

Stat: Why is the Central Limit Theorem so important and useful?

A

Given a large enough sample size, we can find an approximate distribution for the sample mean of the Xi’s, but we don’t need to know anything about the underlying distribution of the Xi’s! It doesn’t need to be of a specific family, and it can be an insane-looking distribution, but we can still find an approximate distribution of the sample mean.

Using this, we can also find a confidence interval for the sample mean, which is great.

76
Q

If our observations are i.i.d. from N(µ,σ^2), and we don’t know the value of σ^2, what can be used as a pivot to make a confidence interval for µ? What distribution does this pivot follow?

A

T = (X_bar - µ)/(s/sqrt(n)), where s is the sample standard deviation. This pivot follows a t distribution with n-1 degrees of freedom.
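
(A sketch of using this pivot to build a 95% confidence interval for µ, on made-up data:)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(10, 3, size=25)  # pretend mu and sigma are unknown
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

# 95% CI for mu from the t pivot: xbar +/- t(0.025, n-1) * s/sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)
```
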
77
Q

Stat: If we want to make a confidence interval for a parameter of a distribution that we think is normal, do we need to use the central limit theorem, or are there other methods?

A

We don’t need the CLT when we think the underlying distribution is normal; there are good pivots for estimating both the mean and the variance.

The CLT is more useful when the underlying distribution is arbitrary and/or very strange.

78
Q

Stat: This varies from table to table, as some have different definitions. But in your stats class, what was the definition of za?

A

If Z is the standard normal N(0,1), za is such that

P(Z > za) = a

Graphically, or verbally: the probability of a draw from Z landing above za is a.

79
Q

Stat: Given our definition of za (and with similarly defined quantities like ta,n-1 and chi-squareda,n-1), what is the probability expression used in almost all confidence intervals we make?

A

P(-z(a/2) <= Z <= z(a/2)) = 1 - a, which can be similarly written for the t dist, chi-squared dist, etc. But it’s especially common to use the normal version, due to the CLT and all the great info we have about normals.

80
Q

Stat: In hypothesis testing, what is a null hypothesis?

A

Null Hypothesis H0 is the “status quo” or “safe” hypothesis. It is the baseline, and we are looking for significant evidence that it is not true. For example, when testing whether two groups have different performance on a task, the null hypothesis is that their performance is the same.

81
Q

Stat: In hypothesis testing, what is an alternative hypothesis?

A

The alternative hypothesis is an idea that breaks from the “status quo” or “baseline assumption”, for which we are looking to see if there is significant evidence. For example, when testing whether two groups have different performance on a task, the alternative hypothesis could be that group A performs better than group B.

82
Q

Stat: In hypothesis testing, what is a test statistic?

A

The test statistic in a hypothesis test is a function of your observable data which you will use to quantitatively examine your null and alternative hypotheses. For example, when testing whether two groups have different performance on a task, the test statistic might be the difference in mean performances of the 2 groups.

83
Q

Stat: What are the 2 possible conclusions an experimenter can make from a hypothesis test?

A

“Reject the null hypothesis in favor of the alternative,” and “Fail to reject the null hypothesis.”

84
Q

Stat: In hypothesis testing, what is the rejection region?

A

It is the pre-decided range of (extreme) values of the test statistic in which we will “reject the null hypothesis in favor of the alternative.”

85
Q

Stat: In hypothesis testing, what is a Type 1 error?

A

It is when we reject the null hypothesis H0 even though it is true.

86
Q

Stat: In hypothesis testing, what is a Type 2 Error?

A

It is when we fail to reject the null H0, but the alternative H1 is true.

87
Q
A
88
Q

Stat: In hypothesis testing, what does a “level 0.05 test” mean?

A

Alpha = 0.05

89
Q

Stat: In hypothesis testing, what would a low value of alpha such as 0.001 mean? What about a high value like 0.20?

A

A low value of alpha like 0.001 means that we require very compelling evidence (or very extreme values of our test statistic) in order to reject the null hypothesis.

Conversely, a high value like 0.20 means that we have very relaxed and un-stringent requirements for rejecting our null hypothesis.

90
Q

Stat: In hypothesis testing, what is a p-value?

A

Once you conduct your experiment and calculate the test statistic, the p-value is the probability of getting results that are as extreme or more extreme than your test statistic, under the assumption that the null hypothesis is true.
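
(A worked sketch with made-up numbers: a one-sample z test where the test statistic is a z-score and the p-value is two-sided:)

```python
from scipy.stats import norm

# Hypothetical one-sample z test: H0 says mu = 0, known sigma = 1, n = 100,
# and we observed a sample mean of 0.25.
n, xbar, sigma = 100, 0.25, 1.0
z = xbar / (sigma / n**0.5)   # test statistic, z = 2.5

# Two-sided p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - norm.cdf(abs(z)))
print(z, p_value)  # 2.5, ~0.0124
```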

91
Q

Stat: In hypothesis testing, what value do we use to determine whether or not we reject the null?

A

The p-value. (We need to calculate the p-value from the test statistic under the assumption of the null, in order to see how unlikely our result is under the null.)

92
Q

Stat: In hypothesis testing, what do we conclude if the p-value is larger than alpha?

A

If say, p-val = 0.10 and alpha = 0.05, then our results are not as extreme as our alpha requires, and so we fail to reject the null hypothesis.

93
Q

Stat: In hypothesis testing, what do we conclude if the p-value is smaller than alpha?

A

If say, p-val = 0.01 and alpha = 0.05, then our results are more extreme than our alpha requires, and so we reject the null hypothesis in favor of the alternative hypothesis.

94
Q

Stat: What are the 4 key facts about a multivariate normal distribution of (X1, X2, …)?

A
  1. Every marginal distribution Xi is univariate normal (and every subset of the Xi’s has its own multivariate normal joint distribution).
  2. Any conditional distribution f(Xj | Xi = xi) is univariate normal.
  3. A pair of variables Xi, Xj is independent iff they are uncorrelated (iff their covariance is 0).
  4. All linear combinations of the components are univariate normal (unless all of the coefficients are 0, of course).
95
Q

Stat: In hypothesis testing, what is a one-sided hypothesis?

A

We reject only if the test statistic is extreme in one of the two directions. For example, if the null is µ = 0, the alternative is µ > 0.

96
Q

Stat: In hypothesis testing, what is a two-sided hypothesis?

A

We reject if the test statistic is extreme in either direction. For example, if the null is µ = 0, the alternative is µ =/= 0, and we reject if the test statistic is extremely high or extremely low.

97
Q

Stat: At a high level, what is the Power of a hypothesis test?

A

The probability that we correctly reject H0 when H1 is true. In other words, our ability to avoid Type 2 errors.

98
Q

Stat: What is the key feature of classical, or frequentist, statistics? And what are some types of analytical tools used in this statistical philosophy?

A

In classical/frequentist statistics, the parameter Ø is constant. We examine it using estimators Ø_h, we quantify our uncertainty of its value using confidence intervals, and we test theories using hypothesis tests and p-values.

99
Q

Stat: What is the key feature of Bayesian statistics? And what are some types of analytical tools used in this statistical philosophy?

A

The parameter Ø is viewed as variable, and we quantify our opinions around its potential values using a prob dist π.

100
Q

Stat: Given the formula for Cov(X,Y), what is the formula for linear correlation Corr(X,Y)?

A

Corr(X,Y) = Cov(X,Y)/sqrt[V(X)V(Y)]

101
Q

Stats: In Bayesian statistics, how do we update π, our prior distribution of Ø, using data Xi?

A

You incorporate the data using an update that looks very similar to Bayes’ law.

Specifically, the posterior is

π(Ø | X) = f(X | Ø)π(Ø) / ∫ f(X | Ø′)π(Ø′) dØ′, i.e. the posterior is proportional to the likelihood times the prior.
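
(A minimal sketch of this update, my own example: a Beta prior on a coin's heads probability, which is conjugate to binomial data, so the posterior has a closed form:)

```python
# Prior: Beta(2, 2) over a coin's heads probability theta (made-up example).
alpha, beta = 2, 2

# Data: 10 flips, 7 heads. For a Beta prior with binomial data, the Bayes
# update has a closed form: posterior = Beta(alpha + heads, beta + tails).
heads, tails = 7, 3
alpha_post, beta_post = alpha + heads, beta + tails

print(alpha_post / (alpha_post + beta_post))  # posterior mean, ~0.64
```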

102
Q

Stats: In Bayesian statistics, what happens to the prior distribution as we get more and more data?

A

With enough data, the impact of the prior distribution on the posterior distribution tends towards 0.

103
Q

Stats: At a high level, what is the purpose of ANOVA?

A

If you have n>2 groups, you test the null hypothesis that the means of all the groups are equal, against the alternative that there is some difference among the means.

104
Q

Stats: What does ANOVA stand for?

A

Analysis of Variance

105
Q

Stats: At a high level, how is ANOVA performed?

A

Using a global F test, which looks at the probability of seeing your observed sample means for all groups under the null assumption that all of the groups’ means are equal.