4 - In All Probability Flashcards
What does probability deal with?
Reasoning in the presence of uncertainty
What is the Monty Hall dilemma?
A probability problem involving three doors, one hiding a car and two hiding goats
What is the initial probability of choosing the car behind Door No. 1?
One-third
What does the host do after you pick a door in the Monty Hall dilemma?
Opens another door revealing a goat
What should you do according to Marilyn vos Savant regarding switching doors?
Yes; you should switch
What is the probability of winning if you switch doors?
Two-thirds
What is the probability of winning if you do not switch doors?
One-third
Who was outraged by vos Savant’s answer to the Monty Hall dilemma?
Mathematicians and PhDs from American universities
What did Paul Erdős initially believe about switching doors in the Monty Hall dilemma?
He believed it made no difference
What did Andrew Vázsonyi use to convince Erdős that switching doors was advantageous?
A computer program running simulations
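Vázsonyi's simulation is easy to re-create. Below is a minimal sketch in Python (not his original program; all names are illustrative) that plays many rounds and compares the win rates of staying versus switching:

```python
import random

def play(switch: bool) -> bool:
    """Play one round of Monty Hall; return True if the player wins the car."""
    car = random.randrange(3)    # door hiding the car
    pick = random.randrange(3)   # contestant's initial choice
    # The host opens a door that is neither the pick nor the car.
    host = next(d for d in range(3) if d != pick and d != car)
    if switch:
        pick = next(d for d in range(3) if d != pick and d != host)
    return pick == car

trials = 100_000
stay = sum(play(False) for _ in range(trials)) / trials
swap = sum(play(True) for _ in range(trials)) / trials
print(f"stay: {stay:.3f}  switch: {swap:.3f}")  # roughly 0.333 vs 0.667
```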
What are the two main approaches to thinking about probability discussed in the text?
Frequentist and Bayesian
What does the frequentist approach involve?
Dividing the number of times an event occurs by the total number of trials
What is Bayes’s theorem used for?
To draw conclusions with mathematical rigor amid uncertainty
What is the prior probability of having a disease if it occurs in 1 in 1,000 people?
0.001
What does P(H) represent in Bayes’s theorem?
The prior probability of a hypothesis being true
What does P(E|H) represent in Bayes’s theorem?
The probability of the evidence given the hypothesis
What is the posterior probability?
The prior probability updated given the evidence
If a test has a 90% accuracy, what is the probability of having the disease given a positive test result?
0.89 percent
What happens to the posterior probability if the test accuracy increases to 99%?
It rises to 0.09, or almost a 1-in-10 chance
What is the significance of Thomas Bayes’s contributions?
He laid the foundation for Bayesian probability and statistics
What happens if the disease becomes more common with the same test accuracy?
If the disease occurs in 1 in 100 people (with the same 99 percent accuracy), the probability of having the disease given a positive test rises to 0.5, or 50 percent
What is the probability that the car is behind Door No. 1 after the host opens Door No. 3?
1/3; it is calculated using Bayes's theorem, as worked through in the cards below
What is Bayes’s theorem formula?
P(H|E) = P(E|H) × P(H) / P(E)
What is P(E)?
The probability of testing positive
How do you calculate P(E)?
Sum of probabilities of testing positive from both having and not having the disease
What does the term ‘sensitivity’ refer to in the context of a medical test?
The probability that the test is positive when the subject has the disease
What does ‘specificity’ refer to in the context of a medical test?
The probability that the test is negative when the subject does not have the disease
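Putting the last several cards together, here is a minimal sketch of the test calculation, assuming (as the cards do) that "accuracy" means sensitivity and specificity both equal the stated figure:

```python
def posterior(prevalence: float, sensitivity: float, specificity: float) -> float:
    """P(disease | positive test), via Bayes's theorem."""
    true_pos = sensitivity * prevalence               # P(E|H) * P(H)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(E|not-H) * P(not-H)
    return true_pos / (true_pos + false_pos)          # divide by P(E)

print(posterior(0.001, 0.90, 0.90))  # ~0.0089: 90% accuracy, 1-in-1,000 disease
print(posterior(0.001, 0.99, 0.99))  # ~0.09:   accuracy raised to 99%
print(posterior(0.01,  0.99, 0.99))  # 0.5:     same 99% test, disease now 1 in 100
```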
What is the prior probability that the car is behind Door No. 1?
1/3
What is the probability that the host opens Door No. 3 if the car is behind Door No. 1?
1/2
What is P1 in the context of the probability that the host opens Door No. 3?
P(C1) × P(H3|C1) = 1/3 × 1/2 = 1/6
What is the probability that the host opens Door No. 3 if the car is behind Door No. 2?
1
What is P2 in the context of the probability that the host opens Door No. 3?
P(C2) × P(H3|C2) = 1/3 × 1 = 1/3
What is P3 in the context of the probability that the host opens Door No. 3?
P(C3) × P(H3|C3) = 1/3 × 0 = 0
What is the total probability that the host opens Door No. 3?
1/2
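As a quick arithmetic check of the cards above, the posteriors follow by dividing each path by the total probability that the host opens Door No. 3:

```python
from fractions import Fraction

p_h3 = Fraction(1, 6) + Fraction(1, 3) + 0  # P(H3) = P1 + P2 + P3 = 1/2
print(Fraction(1, 6) / p_h3)  # P(C1|H3) = 1/3: staying wins one time in three
print(Fraction(1, 3) / p_h3)  # P(C2|H3) = 2/3: switching doubles your chances
```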
What should you do after the host opens Door No. 3, revealing a goat?
Switch doors
True or False: Most machine learning is inherently deterministic.
False
What does the perceptron algorithm find?
A hyperplane that can divide the data
What is a random variable?
A number assigned to the outcome of an experiment
What type of distribution is a Bernoulli distribution?
A discrete distribution over two outcomes: the random variable takes the value 1 with probability p and 0 with probability 1 - p
In a Bernoulli distribution, what is the probability mass function P(X)?
P(X=1) = p and P(X=0) = 1 - p
What is the expected value of a random variable?
The probability-weighted average of its possible values; the value you would see on average over a large number of trials
How is variance calculated?
The sum, over all values x of X, of (x - E(X))² × P(X = x)
What is the standard deviation?
The square root of the variance
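As a sketch of these three definitions for a discrete random variable (the Bernoulli values and p = 0.6 below are illustrative):

```python
values = [0, 1]     # outcomes of a Bernoulli random variable
probs = [0.4, 0.6]  # PMF with p = 0.6, chosen for illustration

mean = sum(x * p for x, p in zip(values, probs))               # E(X) = 0.6
var = sum((x - mean) ** 2 * p for x, p in zip(values, probs))  # sum of (x - E(X))^2 * P(x)
std = var ** 0.5                                               # square root of variance
print(mean, var, std)  # 0.6 0.24 ~0.49
```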
What shape does the normal distribution have?
A bell-shaped curve
What percentage of observed values lie within one standard deviation of the mean in a normal distribution?
68 percent
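That 68 percent figure is easy to verify empirically; a minimal sketch using NumPy, with an arbitrary mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0  # arbitrary mean and standard deviation
samples = rng.normal(mu, sigma, 1_000_000)
print(np.mean(np.abs(samples - mu) <= sigma))  # ~0.683
```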
What is the variance in relation to the standard deviation?
Variance is the square of the standard deviation
What does a larger standard deviation indicate about the distribution?
A broader, squatter plot
What is the mean of the distribution also known as?
Expected value
What is the probability of X = 0 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?
0.6 (the empirical estimate, with heads encoded as X = 0: 6 heads in 10 tosses)
What is the probability of X = 1 in the coin toss experiment with 10 trials if 6 heads and 4 tails were observed?
0.4 (the empirical estimate, with tails encoded as X = 1: 4 tails in 10 tosses)
What does the expected value E(X) represent?
The average outcome of the random variable over many trials
Fill in the blank: The theoretical probability of getting heads on a single coin toss is ______.
1/2
What does sampling from an underlying distribution help us understand in machine learning?
That the data we have is only a sample, which we hope is representative of the underlying distribution
What is the relationship between the number of trials and the expected difference in counts of heads and tails?
On the order of the square root of the total number of trials
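A small simulation illustrates that square-root growth (the trial counts below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (100, 10_000, 1_000_000):
    heads = rng.binomial(n, 0.5, size=200)  # 200 repeats of n fair tosses
    gap = np.abs(2 * heads - n).mean()      # |heads - tails|, since tails = n - heads
    print(n, round(gap, 1), round(n ** 0.5, 1))  # the gap tracks sqrt(n)
```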
What characterizes a discrete random variable?
A discrete random variable is characterized by its probability mass function (PMF).
What characterizes a continuous random variable?
A continuous random variable is characterized by its probability density function (PDF).
Can you determine the probability of a specific value for a continuous random variable?
No, the probability of a specific, infinitely precise value is actually zero.
How is the probability that a continuous random variable falls within a range determined?
It is given by the area under the probability density function (PDF) bounded by the endpoints of that range.
What is the total area under a probability density function (PDF)?
The total area under the entire PDF equals 1.
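As a sketch of these last few cards, the probability that a standard normal variable lands in a range, computed from areas under its PDF (this assumes SciPy is available):

```python
from scipy.stats import norm

# P(-1 <= X <= 1) is the area under the standard normal PDF over that range.
print(norm.cdf(1) - norm.cdf(-1))  # ~0.683, the one-standard-deviation figure
print(norm.cdf(float("inf")))      # 1.0: the total area under the PDF
```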
What parameters are needed for the Bernoulli distribution?
The probability p.
What parameters are needed for the normal distribution?
The mean and variance.
In supervised learning, what does each instance of data represent?
Each instance of data is a d-dimensional vector.
In the context of supervised learning, what does the label y indicate?
y is -1 if the person did not have a heart attack, and 1 if they did.
What is the underlying probability distribution denoted as in supervised learning?
P(X, y).
What is the Bayes optimal classifier?
It is a classifier that predicts the category with the higher probability based on the underlying distribution.
What is maximum likelihood estimation (MLE)?
MLE estimates the best underlying distribution that maximizes the likelihood of observing the data.
What is the difference between MLE and MAP?
MLE maximizes P(D | θ), while MAP maximizes P(θ | D).
What does MAP stand for?
Maximum a posteriori estimation.
What is a common assumption made in Bayesian statistics?
That θ follows a distribution, meaning it is treated as a random variable.
What does the term ‘prior distribution’ refer to in Bayesian statistics?
It refers to the prior belief about the value of θ before observing the data.
What is a concrete example of a distribution characterized by parameters?
A Bernoulli distribution characterized by the value p.
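For that Bernoulli coin, both estimators have simple closed forms. A sketch, where the Beta(alpha, beta) prior on p is an added assumption of mine (any prior could be used):

```python
def mle_p(heads: int, n: int) -> float:
    """MLE: the observed fraction of heads maximizes P(D | theta)."""
    return heads / n

def map_p(heads: int, n: int, alpha: float = 2.0, beta: float = 2.0) -> float:
    """MAP under an assumed Beta(alpha, beta) prior: the mode of P(theta | D)."""
    return (heads + alpha - 1) / (n + alpha + beta - 2)

print(mle_p(6, 10))  # 0.6
print(map_p(6, 10))  # ~0.583, pulled toward the prior's belief in a fair coin
```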
What is a key feature of the Gaussian distribution?
It is characterized by its mean and variance.
What approach is often used when there is no closed-form solution to a maximization problem?
Gradient descent.
How do MLE and MAP behave as the amount of sampled data grows?
They begin converging in their estimate of the underlying distribution.
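Reusing the coin sketch above (with the same assumed Beta(2, 2) prior), a quick check of that convergence:

```python
import random

random.seed(0)
true_p, heads = 0.7, 0
for n in range(1, 10_001):
    heads += random.random() < true_p     # one more toss of a biased coin
    if n in (10, 100, 10_000):
        mle = heads / n                   # maximizes P(D | theta)
        map_ = (heads + 1) / (n + 2)      # mode under the Beta(2, 2) prior
        print(n, round(mle, 3), round(map_, 3))  # the two estimates converge
```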
Who were the two statisticians that first used Bayesian reasoning for authorship attribution?
Frederick Mosteller and David Wallace.
What problem did Mosteller and Wallace tackle using Bayesian reasoning?
The authorship of the disputed Federalist Papers.
What was the primary reason for the dispute over the authorship of the Federalist Papers?
Madison and Hamilton were in no hurry to enter their claims, and by the time they did they had become bitter political enemies
What was the outcome of Mosteller and Williams’ initial analysis of sentence lengths in the Federalist Papers?
The average lengths for Hamilton and Madison were practically identical, providing little discriminatory power.
What statistical measure did Mosteller and Williams calculate to analyze sentence lengths?
Standard deviation (SD).
What were the average sentence lengths for Hamilton and Madison?
34.55 and 34.59 words, respectively
What were the standard deviations of sentence lengths for Hamilton and Madison?
19 for Hamilton and 20 for Madison
What did Mosteller use as a teaching moment to educate his students on?
The difficulties of applying statistical methods
Who collaborated with Mosteller in the mid-1950s to explore Bayesian methods?
David Wallace
What did Douglass Adair suggest to Mosteller regarding The Federalist Papers?
To revisit the issue of authorship
What type of words did Mosteller and Wallace focus on for their analysis?
Function words
How did Mosteller and Wallace initially count the occurrence of function words?
By typing each word on a long paper tape
What issue did Mosteller encounter with the computer program used for counting?
It would malfunction after processing about 3,000 words
What method did Mosteller and Wallace use to calculate authorship probability?
Bayesian analysis
What was the outcome of Mosteller and Wallace’s analysis regarding the disputed papers?
Overwhelming evidence for Madison’s authorship
What were the odds for Madison’s authorship of paper number 55?
80 to 1
What was the significance of Mosteller and Wallace’s work according to Patrick Juola?
It was a seminal moment for statisticians and was done objectively
What species of penguins were studied in the Palmer Archipelago?
Adélie, Gentoo, and Chinstrap
How many attributes were considered for each penguin in the study?
Five attributes
What is the function that the ML algorithm needs to learn?
f(x) = y
What is the problem with the assumption of linearly separable data?
It may not hold true with more data
What does Bayesian decision theory establish?
The bounds for the best predictions given the data
What does the histogram of Adélie penguins’ bill depth show?
The distribution of bill depths
What type of probability is calculated for a specific value of bill depth?
Class-conditional probability
What is Bayes’s theorem used for in the context of the penguin study?
To calculate the probabilities for each hypothesis
What is the prior probability that a penguin is a Gentoo based on the sample?
119/(119+146)
What is P(y = Gentoo)?
The prior probability that the penguin is a Gentoo, estimated as 119 / (119 + 146) = 0.45.
How is P(x | y = Gentoo) determined?
It is read off from the class-conditional distribution of bill depths for Gentoo penguins, i.e., the Gentoo part of the plotted distribution.
What does P(x) represent?
The probability that the bill has some particular depth, summed over both species: P(x) = P(x | Adélie) × P(Adélie) + P(x | Gentoo) × P(Gentoo)
What is P(y = Gentoo | x)?
The posterior probability that the penguin is a Gentoo, given some bill depth x.
What is the Bayes optimal classifier?
In this example, a classifier that uses a single feature (bill depth) and predicts whichever of the two species, Gentoo or Adélie, has the higher posterior probability.
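A sketch of that classifier, modeling each species' bill depths with a normal distribution; the means, standard deviations, and test values below are illustrative placeholders, not the study's fitted numbers:

```python
from scipy.stats import norm

# Placeholder class-conditional normals for bill depth (mm).
models = {"Gentoo": norm(15.0, 1.0), "Adelie": norm(18.3, 1.2)}
priors = {"Gentoo": 119 / 265, "Adelie": 146 / 265}  # priors from the sample counts

def classify(x: float) -> str:
    """Pick the species with the larger P(x | y) * P(y); P(x) is common
    to both species, so it drops out of the comparison."""
    return max(models, key=lambda s: models[s].pdf(x) * priors[s])

print(classify(14.5))  # Gentoo
print(classify(18.5))  # Adelie
```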
True or False: The Bayes optimal classifier is the best any ML algorithm can do.
True.
What does the term ‘posterior probability’ refer to?
The probability of a hypothesis after considering the evidence.
What limitations exist when estimating underlying distributions in machine learning?
We often do not have access to the true underlying distribution.
What are maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation used for?
To approximate underlying distributions from a sample of data.
What happens when bill depth is used to distinguish Adélie from Chinstrap penguins?
They are indistinguishable using only bill depth.
What additional feature can improve classification between penguin species?
Bill length.
What is a probability density function (PDF)?
A function describing the relative likelihood of a continuous random variable taking values near a given point; probabilities come from areas under the curve, not from its value at a single point.
How does increasing the number of features affect the complexity of estimating probability distributions?
It increases the complexity and data requirements for accurate estimation.
Fill in the blank: If we have five features, each penguin can be represented as a vector in _______ space.
5D
What assumption simplifies the problem of estimating probability distributions in machine learning?
That all features are sampled independently from their own distributions.
What is a naïve Bayes classifier?
A classifier that assumes mutually independent features to simplify probability calculations.
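A minimal naïve Bayes sketch with two features (bill depth and bill length), multiplying independent per-feature likelihoods; all distribution parameters here are illustrative placeholders:

```python
from scipy.stats import norm

# Naive assumption: per class, each feature has its own independent distribution.
features = {
    "Gentoo": [norm(15.0, 1.0), norm(47.5, 3.1)],
    "Adelie": [norm(18.3, 1.2), norm(38.8, 2.7)],
}
priors = {"Gentoo": 0.45, "Adelie": 0.55}

def naive_bayes(x: list[float]) -> str:
    """Weight the product of per-feature likelihoods by the class prior."""
    def score(species: str) -> float:
        p = priors[species]
        for dist, value in zip(features[species], x):
            p *= dist.pdf(value)
        return p
    return max(features, key=score)

print(naive_bayes([17.0, 40.0]))  # Adelie, using both features at once
```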
What is the probability mass function?
A function that gives the probability that a discrete random variable is equal to a specific value.
What does D ~ P(X, y) signify?
The data D is sampled from the underlying distribution P(X, y).
What is the parameter θ in the context of probability distributions?
The parameters that define the distribution; which parameters these are varies with the type of distribution (p for a Bernoulli, mean and variance for a Gaussian).
What is the goal of maximum likelihood estimation (MLE)?
To find the parameter θ that maximizes the likelihood of the data.
True or False: The more samples we have, the better the histogram will be in representing the true underlying distribution.
True.
What is maximum likelihood estimation (MLE)?
MLE tries to find the θ that maximizes the likelihood of the data, meaning it finds the θ that maximizes P_θ(X, y)
MLE is a method used in statistics to estimate parameters of a statistical model.
What does maximum a posteriori (MAP) estimation assume about θ?
MAP assumes that θ is a random variable and allows for specifying a probability distribution for it
MAP incorporates prior beliefs about θ, which is known as the prior.
What is the prior in the context of MAP estimation?
The prior is the initial assumption about how θ is distributed
For example, assuming a coin is fair or biased before observing any data.
What is the relationship between MAP estimation and the posterior probability distribution?
MAP finds the θ that maximizes the posterior probability of θ given the prior and the data
The posterior represents updated beliefs about θ after observing the data.
What does learning the entire joint probability distribution P_θ(X, y) enable?
It enables generating new data that resemble the training data, leading to generative AI
This process involves sampling from the learned distribution.
What is the naïve Bayes classifier?
It is an algorithm that learns the joint probability distribution with simplifying assumptions and uses Bayes’s theorem
The naïve Bayes classifier is often used for classification tasks.
What is discriminative learning?
Discriminative learning focuses on calculating conditional probabilities of the data belonging to one class or another
It contrasts with generative learning, which models the entire data distribution.
What does P_θ(y | x) represent?
P_θ(y | x) is the conditional probability that a data point with feature vector x belongs to class y, given the learned parameters θ; the prediction is the class that maximizes it
This is used in discriminative learning to make predictions.
What is an example of an algorithm that uses discriminative learning?
An example is the nearest neighbor (NN) algorithm
The NN algorithm does not make assumptions about the underlying distribution of the data.
What kind of boundary does discriminative learning identify?
Discriminative learning identifies a boundary that separates clusters of data points
It can be a linear hyperplane or a nonlinear surface.
What is the significance of the nearest neighbor (NN) algorithm?
The NN algorithm achieved results nearly as good as the Bayes optimal classifier without underlying distribution assumptions
It was developed at Stanford in the 1960s.
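A minimal 1-nearest-neighbor sketch (not the original Stanford implementation): copy the label of the closest training point, with no distributional assumptions at all. The toy data are made up for illustration:

```python
import math

def nearest_neighbor(train: list[tuple[list[float], str]], x: list[float]) -> str:
    """Predict the label of the training point closest to x (Euclidean distance)."""
    _, label = min(train, key=lambda item: math.dist(item[0], x))
    return label

# Toy (features, label) pairs: bill depth and bill length.
train = [([15.0, 47.0], "Gentoo"), ([18.4, 39.0], "Adelie"), ([14.8, 48.2], "Gentoo")]
print(nearest_neighbor(train, [15.2, 46.5]))  # Gentoo
```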