4 - In All Probability Flashcards

1
Q

What does probability deal with?

A

Reasoning in the presence of uncertainty

2
Q

What is the Monty Hall dilemma?

A

A probability problem involving three doors, one hiding a car and two hiding goats

3
Q

What is the initial probability of choosing the car behind Door No. 1?

A

One-third

4
Q

What does the host do after you pick a door in the Monty Hall dilemma?

A

Opens another door revealing a goat

5
Q

According to Marilyn vos Savant, should you switch doors?

A

Yes; you should switch

6
Q

What is the probability of winning if you switch doors?

A

Two-thirds

7
Q

What is the probability of winning if you do not switch doors?

A

One-third

8
Q

Who was outraged by vos Savant’s answer to the Monty Hall dilemma?

A

Mathematicians and PhDs from American universities

9
Q

What did Paul Erdős initially believe about switching doors in the Monty Hall dilemma?

A

He believed it made no difference

10
Q

What did Andrew Vázsonyi use to convince Erdős that switching doors was advantageous?

A

A computer program running simulations

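Card 10 says simulations settled the argument. A minimal sketch of such a simulation (the trial count and door encoding are illustrative choices, not from the source):

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    """Play one round of the Monty Hall game; return True if the player wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    host_opens = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the single remaining unopened door.
        pick = next(d for d in doors if d != pick and d != host_opens)
    return pick == car

trials = 100_000
for switch in (False, True):
    wins = sum(monty_hall_trial(switch) for _ in range(trials))
    print(f"switch={switch}: win rate ~ {wins / trials:.3f}")
# Prints roughly 0.333 without switching and 0.667 with switching.
```
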
11
Q

What are the two main approaches to thinking about probability discussed in the text?

A

Frequentist and Bayesian

12
Q

What does the frequentist approach involve?

A

Dividing the number of times an event occurs by the total number of trials

13
Q

What is Bayes’s theorem used for?

A

To draw conclusions with mathematical rigor amid uncertainty

14
Q

What is the prior probability of having a disease if it occurs in 1 in 1,000 people?

A

0.001

15
Q

What does P(H) represent in Bayes’s theorem?

A

The prior probability of a hypothesis being true

16
Q

What does P(E|H) represent in Bayes’s theorem?

A

The probability of the evidence given the hypothesis

17
Q

What is the posterior probability?

A

The prior probability updated given the evidence

18
Q

If a test has a 90% accuracy, what is the probability of having the disease given a positive test result?

A

About 0.89 percent (a posterior probability of 0.0089)

19
Q

What happens to the posterior probability if the test accuracy increases to 99%?

A

It rises to about 0.09, almost a 1-in-10 chance

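The numbers in cards 14-19 and 21 all fall out of one Bayes computation. A sketch, assuming "accuracy" means both sensitivity and specificity equal the stated figure (which is what the 0.89 percent and 0.09 answers imply):

```python
def posterior(prior: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test) = P(E|H) * P(H) / P(E), per Bayes's theorem."""
    # P(E): total probability of a positive test, from the diseased and healthy groups.
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

print(posterior(0.001, 0.90, 0.90))  # ~0.0089, i.e. 0.89 percent
print(posterior(0.001, 0.99, 0.99))  # ~0.09, almost a 1-in-10 chance
print(posterior(0.01, 0.99, 0.99))   # 0.5 when the disease is ten times more common
```
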
20
Q

What is the significance of Thomas Bayes’s contributions?

A

He laid the foundation for Bayesian probability and statistics

21
Q

What happens if the disease becomes more common with the same test accuracy?

A

The probability of having the disease given a positive test rises to 0.5 or 50 percent

22
Q

What is the probability that the car is behind Door No. 1 after the host opens Door No. 3?

A

It must be computed with Bayes's theorem; the cards below work it out to 1/3

23
Q

What is Bayes’s theorem formula?

A

P(H|E) = P(E|H) * P(H) / P(E)

24
Q

What is P(E)?

A

The probability of testing positive

25
Q

How do you calculate P(E)?

A

P(E) = P(E|H) × P(H) + P(E|not H) × P(not H): the total probability of testing positive, whether or not the subject has the disease

26
Q

What does the term ‘sensitivity’ refer to in the context of a medical test?

A

The probability that the test is positive when the subject has the disease

27
Q

What does ‘specificity’ refer to in the context of a medical test?

A

The probability that the test is negative when the subject does not have the disease

28
Q

What is the prior probability that the car is behind Door No. 1?

A

1/3

29
Q

What is the probability that the host opens Door No. 3 if the car is behind Door No. 1?

A

1/2

30
Q

What is P1 in the context of the probability that the host opens Door No. 3?

A

P(C1) × P(H3|C1) = 1/3 × 1/2 = 1/6

31
Q

What is the probability that the host opens Door No. 3 if the car is behind Door No. 2?

A

1

32
Q

What is P2 in the context of the probability that the host opens Door No. 3?

A

P(C2) × P(H3|C2) = 1/3 × 1 = 1/3

33
Q

What is P3 in the context of the probability that the host opens Door No. 3?

A

P(C3) × P(H3|C3) = 1/3 × 0 = 0

34
Q

What is the total probability that the host opens Door No. 3?

A

1/2

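Cards 28-34 assemble into the full posterior; a quick check with exact arithmetic (the dictionary encoding is just for illustration):

```python
from fractions import Fraction

prior = Fraction(1, 3)  # each door is equally likely to hide the car a priori

# P(host opens Door 3 | car behind Door i), from cards 29, 31, and 33.
likelihood = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

p_h3 = sum(prior * lk for lk in likelihood.values())  # total probability = 1/2
for door, lk in likelihood.items():
    # Bayes's theorem: P(Ci | H3) = P(H3 | Ci) * P(Ci) / P(H3)
    print(door, prior * lk / p_h3)  # Door 1: 1/3, Door 2: 2/3, Door 3: 0

# Switching from Door 1 to Door 2 therefore wins with probability 2/3.
```
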
35
Q

What should you do after the host opens Door No. 3, revealing a goat?

A

Switch doors

36
Q

True or False: Most machine learning is inherently deterministic.

A

False

37
Q

What does the perceptron algorithm find?

A

A hyperplane that can divide the data

38
Q

What is a random variable?

A

A number assigned to the outcome of an experiment

39
Q

What type of distribution is a Bernoulli distribution?

A

A discrete distribution: it dictates how the values of a discrete random variable that takes only the values 0 and 1 are distributed

40
Q

In a Bernoulli distribution, what is the probability mass function P(X)?

A

P(X = 1) = p and P(X = 0) = 1 − p

41
Q

What is the expected value of a random variable?

A

The average value the variable takes over a large number of trials

42
Q

How is variance calculated?

A

Var(X) = Σ (x − E(X))² × P(X = x), summed over all values x that X can take

43
Q

What is the standard deviation?

A

The square root of the variance

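For a Bernoulli variable, the definitions in cards 40-43 reduce to E(X) = p and Var(X) = p(1 − p); a sketch that computes them straight from the PMF (p = 0.4 echoes the coin-toss cards further down):

```python
from math import sqrt

def bernoulli_stats(p: float):
    pmf = {1: p, 0: 1 - p}  # P(X=1) = p, P(X=0) = 1 - p
    mean = sum(x * prob for x, prob in pmf.items())  # expected value E(X)
    # Variance: sum of (x - E(X))^2 * P(X = x) over all values of X.
    var = sum((x - mean) ** 2 * prob for x, prob in pmf.items())
    return mean, var, sqrt(var)  # standard deviation = square root of variance

print(bernoulli_stats(0.4))  # (0.4, 0.24, 0.4899...), i.e. E(X) = p, Var(X) = p(1 - p)
```
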
44
Q

What shape does the normal distribution have?

A

A bell-shaped curve

45
Q

What percentage of observed values lie within one standard deviation of the mean in a normal distribution?

A

68 percent

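The 68 percent figure can be checked two ways: exactly, via the error function, and empirically, by sampling (a small cross-check; the sample size is illustrative):

```python
import math
import random

# Exact: P(|X - mu| <= sigma) for a normal distribution is erf(1 / sqrt(2)).
print(math.erf(1 / math.sqrt(2)))  # 0.6826...

# Empirical: draw samples and count how many fall within one standard deviation.
mu, sigma, n = 0.0, 1.0, 100_000
within = sum(abs(random.gauss(mu, sigma) - mu) <= sigma for _ in range(n))
print(within / n)  # ~0.68
```
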
46
Q

What is the variance in relation to the standard deviation?

A

Variance is the square of the standard deviation

47
Q

What does a larger standard deviation indicate about the distribution?

A

A broader, squatter plot

48
Q

What is the mean of the distribution also known as?

A

Expected value

49
Q

What is the probability of X = 0 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?

A

0.6

50
Q

What is the probability of X = 1 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?

A

0.4

51
Q

What does the expected value E(X) represent?

A

The average outcome of the random variable over many trials

52
Q

Fill in the blank: The theoretical probability of getting heads on a single coin toss is ______.

A

1/2

53
Q

What does sampling from an underlying distribution help us understand in machine learning?

A

Whether the data we have is representative of the underlying distribution that generated it

54
Q

What is the relationship between the number of trials and the expected difference in counts of heads and tails?

A

On the order of the square root of the total number of trials
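
A minimal simulation makes this scaling visible (the trial counts are illustrative):

```python
import random
from math import sqrt

for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    diff = abs(heads - (n - heads))  # |#heads - #tails|
    print(f"n={n:>9}: |heads - tails| = {diff:>5}, sqrt(n) ~ {sqrt(n):.0f}")
# The gap typically grows like sqrt(n), a vanishing fraction of n itself.
```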

57
Q

What characterizes a discrete random variable?

A

A discrete random variable is characterized by its probability mass function (PMF).

58
Q

What characterizes a continuous random variable?

A

A continuous random variable is characterized by its probability density function (PDF).

59
Q

Can you determine the probability of a specific value for a continuous random variable?

A

No, the probability of a specific, infinitely precise value is actually zero.

60
Q

How is the probability that a continuous random variable falls within a range determined?

A

It is given by the area under the probability density function (PDF) bounded by the endpoints of that range.

61
Q

What is the total area under a probability density function (PDF)?

A

The total area under the entire PDF equals 1.
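
Concretely, the probability that a continuous variable lands in a range [a, b] is the CDF difference across that range; a sketch assuming SciPy is available (the normal parameters are illustrative):

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.0
a, b = -1.0, 1.0

# Area under the PDF between a and b = CDF(b) - CDF(a).
print(norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma))  # ~0.6827

# The area under the entire PDF is 1.
print(norm.cdf(float("inf"), mu, sigma))  # 1.0
```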

62
Q

What parameters are needed for the Bernoulli distribution?

A

The probability p.

63
Q

What parameters are needed for the normal distribution?

A

The mean and variance.

64
Q

In supervised learning, what does each instance of data represent?

A

Each instance of data is a d-dimensional vector.

65
Q

In the context of supervised learning, what does the label y indicate?

A

y is -1 if the person did not have a heart attack, and 1 if they did.

66
Q

What is the underlying probability distribution denoted as in supervised learning?

A

P(X, y), from which the data D is sampled

67
Q

What is the Bayes optimal classifier?

A

It is a classifier that predicts the category with the higher probability based on the underlying distribution.

68
Q

What is maximum likelihood estimation (MLE)?

A

MLE estimates the best underlying distribution that maximizes the likelihood of observing the data.

69
Q

What is the difference between MLE and MAP?

A

MLE maximizes P(D | θ), while MAP maximizes P(θ | D).

70
Q

What does MAP stand for?

A

Maximum a posteriori estimation.

71
Q

What is a common assumption made in Bayesian statistics?

A

That θ follows a distribution, meaning it is treated as a random variable.

72
Q

What does the term ‘prior distribution’ refer to in Bayesian statistics?

A

It refers to the prior belief about the value of θ before observing the data.

73
Q

What is a concrete example of a distribution characterized by parameters?

A

A Bernoulli distribution characterized by the value p.

74
Q

What is a key feature of the Gaussian distribution?

A

It is characterized by its mean and variance.

75
Q

What approach is often used when there is no closed-form solution to a maximization problem?

A

Gradient descent.

76
Q

How do MLE and MAP behave as the amount of sampled data grows?

A

They begin converging in their estimate of the underlying distribution.

77
Q

Who were the two statisticians that first used Bayesian reasoning for authorship attribution?

A

Frederick Mosteller and David Wallace.

78
Q

What problem did Mosteller and Wallace tackle using Bayesian reasoning?

A

The authorship of the disputed Federalist Papers.

79
Q

What was the primary reason for the dispute over the authorship of the Federalist Papers?

A

Neither Madison nor Hamilton promptly claimed authorship, and the two later became bitter political enemies

80
Q

What was the outcome of Mosteller and Williams’ initial analysis of sentence lengths in the Federalist Papers?

A

The average lengths for Hamilton and Madison were practically identical, providing little discriminatory power.

81
Q

What statistical measure did Mosteller and Williams calculate to analyze sentence lengths?

A

Standard deviation (SD).

82
Q

What were the average sentence lengths for Hamilton and Madison?

A

34.55 and 34.59 words, respectively

83
Q

What were the standard deviations of sentence lengths for Hamilton and Madison?

A

19 for Hamilton and 20 for Madison

84
Q

What did Mosteller use as a teaching moment to educate his students on?

A

The difficulties of applying statistical methods

85
Q

Who collaborated with Mosteller in the mid-1950s to explore Bayesian methods?

A

David Wallace

86
Q

What did Douglass Adair suggest to Mosteller regarding The Federalist Papers?

A

To revisit the issue of authorship

87
Q

What type of words did Mosteller and Wallace focus on for their analysis?

A

Function words

88
Q

How did Mosteller and Wallace initially count the occurrence of function words?

A

By typing each word on a long paper tape

89
Q

What issue did Mosteller encounter with the computer program used for counting?

A

It would malfunction after processing about 3000 words

90
Q

What method did Mosteller and Wallace use to calculate authorship probability?

A

Bayesian analysis

91
Q

What was the outcome of Mosteller and Wallace’s analysis regarding the disputed papers?

A

Overwhelming evidence for Madison’s authorship

92
Q

What were the odds for Madison’s authorship of paper number 55?

93
Q

What was the significance of Mosteller and Wallace’s work according to Patrick Juola?

A

It was a seminal moment for statisticians and was done objectively

94
Q

What species of penguins were studied in the Palmer Archipelago?

A

Adélie, Gentoo, and Chinstrap

95
Q

How many attributes were considered for each penguin in the study?

A

Five attributes

96
Q

What is the function that the ML algorithm needs to learn?

A

A function that maps each feature vector x to its correct label y

97
Q

What is the problem with the assumption of linearly separable data?

A

It may not hold true with more data

98
Q

What does Bayesian decision theory establish?

A

The bounds for the best predictions given the data

99
Q

What does the histogram of Adélie penguins’ bill depth show?

A

The distribution of bill depths

100
Q

What type of probability is calculated for a specific value of bill depth?

A

Class-conditional probability

101
Q

What is Bayes’s theorem used for in the context of the penguin study?

A

To calculate the probabilities for each hypothesis

102
Q

What is the prior probability that a penguin is a Gentoo based on the sample?

A

119/(119+146)

103
Q

What is P(y = Gentoo)?

A

The prior probability that the penguin is a Gentoo, estimated as 119 / (119 + 146) = 0.45.

104
Q

How is P(x | y = Gentoo) determined?

A

It is read off from the distribution depicted in the plot, specifically from the Gentoo part.

105
Q

What does P(x) represent?

A

The probability that the bill has some particular depth, calculated by the law of total probability:
P(x) = P(x | Adélie) × P(Adélie) + P(x | Gentoo) × P(Gentoo)

106
Q

What is P(y = Gentoo | x)?

A

The posterior probability that the penguin is a Gentoo, given some bill depth x.

107
Q

What is the Bayes optimal classifier?

A

A simple classifier using one feature (bill depth) to classify between two types of penguins, Gentoo and Adélie.
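
A sketch of that classifier, assuming (a modeling choice, not stated in the cards) that each class-conditional P(x | y) is a Gaussian fitted to that species’ bill depths; the means, standard deviations, and test values below are hypothetical:

```python
import math

def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical (mean, std) of bill depth per species, plus the priors from card 102.
class_conditionals = {"Gentoo": (15.0, 1.0), "Adelie": (18.3, 1.2)}
priors = {"Gentoo": 119 / 265, "Adelie": 146 / 265}

def bayes_optimal(x: float) -> str:
    # P(y | x) is proportional to P(x | y) * P(y); P(x) is shared, so it can be dropped.
    scores = {y: gaussian_pdf(x, *class_conditionals[y]) * priors[y] for y in priors}
    return max(scores, key=scores.get)

print(bayes_optimal(14.8))  # Gentoo
print(bayes_optimal(18.5))  # Adelie
```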

108
Q

True or False: The Bayes optimal classifier is the best any ML algorithm can do.

A

True

109
Q

What does the term ‘posterior probability’ refer to?

A

The probability of a hypothesis after considering the evidence.

110
Q

What limitations exist when estimating underlying distributions in machine learning?

A

We often do not have access to the true underlying distribution.

111
Q

What are maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation used for?

A

To approximate underlying distributions from a sample of data.

112
Q

What happens when bill depth is used to distinguish Adélie from Chinstrap penguins?

A

They are indistinguishable using only bill depth.

113
Q

What additional feature can improve classification between penguin species?

A

Bill length.

114
Q

What is a probability density function (PDF)?

A

A function whose area over a range gives the probability that a continuous random variable falls within that range; its value at a single point is a density, not a probability

115
Q

How does increasing the number of features affect the complexity of estimating probability distributions?

A

It increases the complexity and data requirements for accurate estimation.

116
Q

Fill in the blank: If we have five features, each penguin can be represented as a vector in _______ space.

A

Five-dimensional

117
Q

What assumption simplifies the problem of estimating probability distributions in machine learning?

A

That all features are sampled independently from their own distributions.

118
Q

What is a naïve Bayes classifier?

A

A classifier that assumes mutually independent features to simplify probability calculations.

119
Q

What is the probability mass function?

A

A function that gives the probability that a discrete random variable is equal to a specific value.

120
Q

What does D ~ P(X, y) signify?

A

The data D is sampled from the underlying distribution P(X, y).

121
Q

What is the parameter θ in the context of probability distributions?

A

The parameters that define the distribution, varying for different types.

122
Q

What is the goal of maximum likelihood estimation (MLE)?

A

To find the parameter θ that maximizes the likelihood of the data.

123
Q

True or False: The more samples we have, the better the histogram will be in representing the true underlying distribution.

A

True

124
Q

What is maximum likelihood estimation (MLE)?

A

MLE tries to find the θ that maximizes the likelihood of the data, meaning it finds the θ that maximizes P_θ(X, y)

MLE is a method used in statistics to estimate parameters of a statistical model.
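
For the coin, the likelihood of h heads in n independent tosses is θ^h (1 − θ)^(n−h), and the maximizing θ is h/n; a numerical sketch confirming the closed form:

```python
def likelihood(theta: float, heads: int, n: int) -> float:
    # P(D | theta) for n independent Bernoulli(theta) tosses with `heads` successes.
    return theta**heads * (1 - theta) ** (n - heads)

heads, n = 6, 10
# Grid search over candidate values of theta; the closed-form answer is heads / n.
best = max((i / 1000 for i in range(1001)), key=lambda t: likelihood(t, heads, n))
print(best)  # 0.6, i.e. heads / n
```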

125
Q

What does maximum a posteriori (MAP) estimation assume about θ?

A

MAP assumes that θ is a random variable and allows for specifying a probability distribution for it

MAP incorporates prior beliefs about θ, which is known as the prior.

126
Q

What is the prior in the context of MAP estimation?

A

The prior is the initial assumption about how θ is distributed

For example, assuming a coin is fair or biased before observing any data.

127
Q

What is the relationship between MAP estimation and the posterior probability distribution?

A

MAP finds the θ that maximizes the posterior probability distribution P(θ | D), which combines the prior over θ with the data

The posterior represents updated beliefs about θ after observing the data.
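
As a concrete sketch, take the prior over the coin’s bias θ to be a Beta(α, β) distribution (an assumed choice of prior, not specified in the cards); the posterior is then Beta(α + h, β + n − h), whose mode is the MAP estimate:

```python
def map_estimate(heads: int, n: int, alpha: float, beta: float) -> float:
    """Mode of the Beta(alpha + heads, beta + n - heads) posterior over theta."""
    return (heads + alpha - 1) / (n + alpha + beta - 2)

heads, n = 6, 10
print(map_estimate(heads, n, 1, 1))    # 0.6: a flat prior makes MAP coincide with MLE
print(map_estimate(heads, n, 20, 20))  # ~0.52: a strong fair-coin prior pulls theta toward 0.5
```

As more tosses accumulate, the data term dominates the prior and MAP drifts toward the MLE, matching card 76.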

128
Q

What does learning the entire joint probability distribution P_θ(X, y) enable?

A

It enables generating new data that resemble the training data, leading to generative AI

This process involves sampling from the learned distribution.

129
Q

What is the naïve Bayes classifier?

A

It is an algorithm that learns the joint probability distribution with simplifying assumptions and uses Bayes’s theorem

The naïve Bayes classifier is often used for classification tasks.
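
A sketch of the simplifying assumption in action: with independent features, P(x | y) factors into a product of per-feature terms. The Gaussian per-feature models and all numbers below are illustrative assumptions:

```python
import math

def gauss(x: float, mu: float, sigma: float) -> float:
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Per-class, per-feature (mean, std) pairs, fitted independently per the naive assumption.
params = {
    "A": [(18.3, 1.2), (38.8, 2.7)],  # hypothetical (bill depth, bill length) stats
    "B": [(15.0, 1.0), (47.5, 3.1)],
}
priors = {"A": 0.55, "B": 0.45}

def naive_bayes(x):
    scores = {}
    for label, feats in params.items():
        score = priors[label]
        # Independence assumption: P(x | y) = product over features of P(x_i | y).
        for xi, (mu, sigma) in zip(x, feats):
            score *= gauss(xi, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

print(naive_bayes([18.0, 39.0]))  # A
print(naive_bayes([15.2, 48.0]))  # B
```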

130
Q

What is discriminative learning?

A

Discriminative learning focuses on calculating conditional probabilities of the data belonging to one class or another

It contrasts with generative learning, which models the entire data distribution.

131
Q

What does P_θ(y | x) represent?

A

P_θ(y | x) is the conditional probability that a data point with feature vector x belongs to class y, under the optimal θ; the prediction is the class that maximizes it

This is used in discriminative learning to make predictions.

132
Q

What is an example of an algorithm that uses discriminative learning?

A

An example is the nearest neighbor (NN) algorithm

The NN algorithm does not make assumptions about the underlying distribution of the data.

133
Q

What kind of boundary does discriminative learning identify?

A

Discriminative learning identifies a boundary that separates clusters of data points

It can be a linear hyperplane or a nonlinear surface.

134
Q

What is the significance of the nearest neighbor (NN) algorithm?

A

The NN algorithm achieved results nearly as good as the Bayes optimal classifier without underlying distribution assumptions

It was developed at Stanford in the 1960s.
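
A minimal 1-NN sketch to close the idea out; the toy points and the Euclidean distance metric are illustrative choices:

```python
import math

def nearest_neighbor(query, train):
    """Return the label of the training point closest to the query (1-NN)."""
    point, label = min(train, key=lambda pl: math.dist(query, pl[0]))
    return label

# Toy training set: (feature vector, class label) pairs with labels in {-1, 1}.
train = [((1.0, 1.0), -1), ((1.2, 0.8), -1), ((4.0, 4.2), 1), ((3.8, 4.0), 1)]
print(nearest_neighbor((1.1, 1.0), train))  # -1
print(nearest_neighbor((4.1, 4.1), train))  # 1
```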