Sampling Flashcards
Which real life events can be modelled as a coin flip
Any real-life binomial situation, such as fraud/non-fraud, buy/don't buy, click/don't click.
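A minimal sketch of this idea: each event is a biased coin flip (a Bernoulli trial), and the count over many events is binomial. The click-through rate `p_click` below is an illustrative number, not from the cards.

```python
import random

random.seed(0)

# Model a click/don't-click event as a biased coin flip (Bernoulli trial).
# p_click is an illustrative click-through rate, purely hypothetical.
p_click = 0.03

def flip(p):
    """Return 1 (success) with probability p, else 0."""
    return 1 if random.random() < p else 0

# A binomial count is just the sum of n independent flips.
n = 10_000
clicks = sum(flip(p_click) for _ in range(n))
rate = clicks / n  # should land near p_click for large n
```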
What is a sample
A subset from a larger dataset
What is random sampling
Each member of the population has an equal chance of being picked during the sampling procedure.
What is stratified random sampling
You divide the population into strata and randomly sample from each stratum. A stratum is a homogeneous subgroup of a population with common characteristics, e.g., political pollsters might seek to learn the electoral preferences of whites, blacks, and Hispanics separately.
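One way to sketch stratified sampling in plain Python, assuming a hypothetical population already grouped into strata (the group names and sizes below are illustrative):

```python
import random

random.seed(1)

# Hypothetical population keyed by stratum (names and sizes are made up).
population = {
    "white":    [f"w{i}" for i in range(500)],
    "black":    [f"b{i}" for i in range(300)],
    "hispanic": [f"h{i}" for i in range(200)],
}

def stratified_sample(pop, per_stratum):
    """Draw a simple random sample of fixed size from each stratum,
    so every subgroup is represented regardless of its population share."""
    return {name: random.sample(members, per_stratum)
            for name, members in pop.items()}

sample = stratified_sample(population, per_stratum=50)
```

Fixing the per-stratum size guarantees small subgroups are not drowned out, which a single simple random sample over the pooled population does not.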
Sampling Bias
The sample differs from the population it is supposed to represent in some meaningful, non-random way.
Errors due to chance vs errors due to sampling
Picture shots at a target disk. An unbiased process will still produce error, but the error is random and does not tend strongly in any direction: shots scatter around the bullseye. A biased process also has random error in both the x and y directions, but in addition there is a systematic bias: shots tend to fall, say, in the upper-right quadrant.
When do you need large amount of data
When the data is not only big but also sparse, e.g., Google search queries.
Data snooping
An extensive hunt through data for something interesting. "If you torture the data long enough, sooner or later it will confess."
Regression to Mean
Successive measurements on a given variable: extreme observations tend to be followed by more central ones. Attaching special focus and meaning to the extreme value can lead to a form of selection bias. In nearly all major sports, at least those played with a ball or puck, two elements play a role in overall performance:
* Skill
* Luck
Regression to the mean is a consequence of a particular form of selection bias. When we select the rookie with the best performance, skill and good luck are probably both contributing. In his next season, the skill will still be there, but very often the luck will not be, so his performance will decline, i.e., it will regress. The same holds for genetic tendencies; for example, the children of extremely tall men tend not to be as tall as their fathers.
Sample Statistic and sample variablility
A metric calculated on a sample. The metric might differ had we drawn a different sample; this is sampling variability. The larger the sample, the narrower the distribution of the sample statistic. And the distribution of the sample statistic (such as the mean) is more bell-shaped than the data itself.
Central Limit Theorem
Means drawn from multiple samples will resemble the familiar normal curve even if the source population is not normally distributed, provided the sample size is large enough and the departure of the data from normality is not too great.
If you draw sufficiently many random samples of size n from a population with mean μ and standard deviation σ, then the distribution of the sample means will be approximately normal with mean μ and standard deviation σ/sqrt{n}.
The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference, that is, confidence intervals and hypothesis tests.
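A quick simulation of the theorem, using an exponential source population (decidedly non-normal) as an illustrative choice; the sample size and repetition count are arbitrary:

```python
import random
import statistics

random.seed(2)

# Source population: exponential with mean 1 and standard deviation 1.
mu, sigma = 1.0, 1.0
n = 50        # sample size
reps = 2000   # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# CLT prediction: means cluster near mu with spread near sigma / sqrt(n).
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
```

Plotting `sample_means` as a histogram would show the bell shape even though the underlying exponential data is heavily right-skewed.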
Standard Error
Sums up the variability in the sampling distribution of a statistic. It is estimated as the sample standard deviation s divided by the square root of the sample size n: SE = s/sqrt{n}. As the sample size increases, the standard error decreases.
Square root of n rule
The relationship between standard error and sample size. To reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
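The rule can be checked empirically. The sketch below estimates the standard error at two sample sizes by brute-force simulation from a standard normal (an illustrative choice); quadrupling n from 100 to 400 should roughly halve the standard error:

```python
import random
import statistics

random.seed(3)

def empirical_se(n, sigma=1.0, reps=2000):
    """Empirical standard error: the std dev of the sample mean
    across many samples of size n drawn from N(0, sigma)."""
    means = [
        statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

se_100 = empirical_se(100)   # theory: sigma / sqrt(100) = 0.10
se_400 = empirical_se(400)   # theory: sigma / sqrt(400) = 0.05
ratio = se_100 / se_400      # square-root-of-n rule predicts ~2
```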
Bootstrapping
An effective way to estimate the sampling distribution of a statistic, or of model parameters: draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample.
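A minimal bootstrap sketch for the mean, using made-up Gaussian data as the observed sample; the resample count is arbitrary:

```python
import random
import statistics

random.seed(4)

# One observed sample (illustrative data, N(10, 2)).
data = [random.gauss(10.0, 2.0) for _ in range(80)]

# Bootstrap: resample the data WITH replacement many times,
# recomputing the statistic (here, the mean) on each resample.
boot_means = [
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(1000)
]

# The spread of the bootstrap statistics estimates the standard error.
boot_se = statistics.stdev(boot_means)
```

The same loop works for any statistic (median, trimmed mean, model coefficients): swap `statistics.fmean` for the computation of interest.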
Confidence Intervals
A 95% confidence interval is a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. A rough 95% confidence interval for a parameter is estimated_parameter +/- 2*standard_error(parameter). Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty).
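The +/- 2 standard errors recipe in code, on made-up Gaussian data:

```python
import random
import statistics

random.seed(5)

# Illustrative observed sample from N(10, 2).
data = [random.gauss(10.0, 2.0) for _ in range(100)]

mean = statistics.fmean(data)
se = statistics.stdev(data) / len(data) ** 0.5  # s / sqrt(n)

# Rough 95% CI: estimate +/- 2 * standard error, as on the card.
ci_low, ci_high = mean - 2 * se, mean + 2 * se
```

With a smaller sample, `se` grows and the interval widens, matching the last point on the card.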