Sampling Flashcards

1
Q

Which real-life events can be modelled as a coin flip?

A

Any real-life binary (two-outcome) situation, such as fraud/non-fraud, buy/don’t buy, click/don’t click.

2
Q

What is a sample?

A

A subset from a larger dataset

3
Q

What is random sampling?

A

Each member of the population has an equal chance of being picked during the sampling procedure.

4
Q

What is stratified random sampling?

A

You divide the population into strata and randomly sample from each stratum. A stratum is a homogeneous subgroup of a population with common characteristics, e.g., political pollsters might seek to learn the electoral preferences of whites, blacks, and Hispanics.
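
The procedure can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `stratified_sample`, the `key` and `fraction` parameters, and the toy population are all made up for this card, and each stratum is sampled in proportion to its size.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction at random from every stratum (hypothetical helper)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)          # group records by stratum label
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))       # simple random sample within the stratum
    return sample

# Toy population: 80 records in group "A", 20 in group "B".
population = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]
sample = stratified_sample(population, key=lambda r: r["group"], fraction=0.1)
print(len(sample), sum(r["group"] == "B" for r in sample))
```

Sampling within each stratum guarantees that the small group "B" is represented, which a plain random sample of the same size would not.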

5
Q

Sampling Bias

A

The sample differs from the population it is supposed to represent in some meaningful, nonrandom way.

6
Q

Errors due to chance vs. errors due to bias

A

Picture shots fired at a target. The shots will not all hit the center. An unbiased process still produces error, but the error is random and does not tend strongly in any direction. A biased process also has random error in both the x and y directions, but on top of that there is a bias: for example, shots tend to fall in the upper-right quadrant.

7
Q

When do you need large amounts of data?

A

When the data is not only big but also sparse, e.g., Google search queries.

8
Q

Data snooping

A

Extensive hunting through data in search of something interesting. If you torture the data long enough, sooner or later it will confess.

9
Q

Regression to the Mean

A

In successive measurements of a given variable, extreme observations tend to be followed by more central ones. Attaching special focus and meaning to the extreme value can lead to a form of selection bias.
In nearly all major sports, at least those played with a ball or puck, two elements play a role in overall performance:
* Skill
* Luck
Regression to the mean is a consequence of a particular form of selection bias. When we select the rookie with the best performance, skill and good luck are probably both contributing. In his next season, the skill will still be there, but very often the luck will not be, so his performance will decline: it will regress. The same holds for genetic tendencies; for example, the children of extremely tall men tend not to be as tall as their father.

10
Q

Sample statistic and sampling variability

A

A metric calculated on a sample. This metric might be different had we drawn a different sample, hence there is sampling variability. The larger the sample, the narrower the distribution of the sample statistic. And the distribution of the sample statistic (such as the mean) is more bell-shaped than the data itself.

11
Q

Central Limit Theorem

A

Means drawn from multiple samples will resemble the familiar normal curve even if the source population is not normally distributed, provided that the sample size is large enough and the departure of the data from normality is not too great.

If you draw sufficiently large random samples of size n from a population with mean μ and standard deviation σ, then the sample means will be approximately normally distributed with mean μ and standard deviation σ/sqrt(n).

The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference, that is, confidence intervals and hypothesis tests.
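
A small simulation can make the statement concrete. The sketch below assumes an exponential population with μ = 1 and σ = 1 (chosen only because it is clearly non-normal); the helper name `sample_means` and the sample counts are illustrative choices, not part of the original card.

```python
import math
import random
import statistics

def sample_means(n, n_samples=2000, seed=0):
    """Means of n_samples random samples of size n from an Exponential(1) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

means_5 = sample_means(5)
means_50 = sample_means(50)

# The spread of the sample means should be close to sigma / sqrt(n), with sigma = 1 here.
sd_5 = statistics.stdev(means_5)
sd_50 = statistics.stdev(means_50)
print(sd_5, 1 / math.sqrt(5))
print(sd_50, 1 / math.sqrt(50))
```

Plotting a histogram of `means_50` would show a shape much closer to a bell curve than the skewed exponential data itself.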

12
Q

Standard Error

A

Sums up the variability in the sampling distribution of a statistic. For the sample mean, it is estimated as the sample standard deviation s divided by the square root of the sample size n: SE = s/sqrt(n). As the sample size increases, the standard error decreases.

13
Q

Square root of n rule

A

The relationship between standard error and sample size. To reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
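
The rule follows from the standard error of the mean being the standard deviation divided by sqrt(n). A minimal numeric check, assuming a standard deviation of 10 picked purely for illustration:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean for a given standard deviation and sample size."""
    return sd / math.sqrt(n)

se_100 = standard_error(10.0, 100)  # baseline sample size
se_400 = standard_error(10.0, 400)  # four times as many observations
print(se_100, se_400, se_100 / se_400)
```

Quadrupling n multiplies sqrt(n) by 2, so the standard error is halved.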

14
Q

Bootstrapping

A

An effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample.
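
A minimal sketch of the resampling loop, using only the standard library. The function name `bootstrap`, the toy data, and the choice of 1000 resamples are assumptions made for illustration.

```python
import random
import statistics

def bootstrap(sample, statistic, n_resamples=1000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # rng.choices draws with replacement, so each resample has the same size as the data.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

data = [12, 15, 9, 20, 14, 11, 17, 13, 16, 10]   # toy sample
boot_means = bootstrap(data, statistics.mean)

# The spread of the bootstrap statistics estimates the standard error of the mean.
boot_se = statistics.stdev(boot_means)
print(statistics.mean(boot_means), boot_se)
```

The same function works for other statistics, for example `bootstrap(data, statistics.median)`.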

15
Q

Confidence Intervals

A

A 95% confidence interval is a range of values constructed so that, with 95% probability, the range will contain the true unknown value of the parameter; that is, about 95% of intervals built this way from repeated samples would contain it. A 95% confidence interval for a parameter is approximately estimated_parameter +/- 2*standard_error(parameter). Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty).
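
The rule of thumb above can be sketched for the mean of a small sample. The helper name `approx_ci95` and the data are illustrative; the multiplier 2 is the rounded value from the card (the normal-theory value is about 1.96, and small samples would normally use a t multiplier instead).

```python
import math
import statistics

def approx_ci95(sample):
    """Approximate 95% interval for the mean: estimate +/- 2 * standard error."""
    estimate = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # s / sqrt(n)
    return estimate - 2 * se, estimate + 2 * se

data = [2, 4, 4, 4, 5, 5, 7, 9]   # toy sample with mean 5
low, high = approx_ci95(data)
print(low, high)
```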

16
Q

Level of confidence

A

The percentage associated with the confidence interval

17
Q

Binomial Distribution

A

The frequency distribution of the number of successes (x) in a given number of independent trials (n) with a specified probability (p) of success in each trial. The mean is given by n*p.
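
As a quick check of the definition, the probability mass function can be written directly with `math.comb`. The values n = 10 and p = 0.3 are arbitrary illustrative choices.

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                # probabilities should sum to 1
mean = sum(x * px for x, px in enumerate(pmf))  # should match n * p
print(total, mean)
```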

18
Q

A drunkard takes either a step forward or a step backward. The probability that he takes a forward step is 0.4. Find the probability that at the end of 11 steps he is 1 step away from the starting point.

A

p_forward = 0.4, p_backward = 0.6
Each step has 2 outcomes: forward or backward.

Two situations fulfill the condition: the person takes 6 steps forward and 5 back, or 6 steps back and 5 forward.

Hence, with Bin(n,k) denoting the binomial coefficient "n choose k", we write:
Bin(11,6)(0.4)^6(0.6)^5 + Bin(11,5)(0.4)^5(0.6)^6 = 462(0.24)^5 ≈ 0.368
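
The expression can be evaluated numerically as a sanity check; the variable names below are illustrative.

```python
import math

p_forward, p_backward = 0.4, 0.6

# 6 forward + 5 backward ends one step ahead; 5 forward + 6 backward ends one step behind.
one_ahead = math.comb(11, 6) * p_forward**6 * p_backward**5
one_behind = math.comb(11, 5) * p_forward**5 * p_backward**6

prob_one_step_away = one_ahead + one_behind
print(prob_one_step_away)
```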

19
Q

A coin is twice as likely to land heads as tails in a series of independent tosses. Find the probability that the 3rd head occurs on the 5th toss.

A

p_head = 2*p_tail
p_head + p_tail = 1
Hence, p_tail = 1/3 and p_head = 2/3

Prob of exactly 2 heads in the first 4 tosses:
Bin(4,2)(2/3)^2(1/3)^2

Prob of a head on the 5th toss:
(2/3)

So the final result is:
Bin(4,2)(2/3)^2(1/3)^2 * (2/3) = 16/81 ≈ 0.198
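
Exact fractions make it easy to confirm the arithmetic. This is just a check of the expression above; the variable names are made up.

```python
import math
from fractions import Fraction

p_head = Fraction(2, 3)
p_tail = Fraction(1, 3)

# Exactly 2 heads somewhere in the first 4 tosses, then a head on the 5th toss.
two_heads_in_four = math.comb(4, 2) * p_head**2 * p_tail**2
third_head_on_fifth = two_heads_in_four * p_head
print(third_head_on_fifth, float(third_head_on_fifth))
```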

20
Q

Difference between the Bernoulli and Binomial distribution

A

The Bernoulli distribution represents the success or failure of a single Bernoulli trial. The Binomial distribution represents the number of successes in n independent Bernoulli trials. Both distributions are discrete!

21
Q

Difference between the Binomial, Geometric, and Negative Binomial distributions?

A

Binomial: has a FIXED number of trials before the experiment begins, and X counts the number of successes obtained in that fixed number.

Geometric: also for repeated Bernoulli trials. It has a fixed number of successes (ONE…the FIRST) and counts the number of trials needed to obtain that first success, so it gives the probability that the first k − 1 trials are failures while the kth trial is the first success. It is theoretically possible to proceed indefinitely without ever obtaining a success.

Sequence of independent 0-1 trials =⇒ NEGATIVE BINOMIAL DISTRIBUTION
- count number of trials until the nth 1 is observed.

Sequence of independent 0-1 trials =⇒ GEOMETRIC DISTRIBUTION
- count number of trials until first 1.

22
Q

If you have three draws from a uniformly distributed random variable between 0 and 2, what is the probability that the median of three numbers is greater than 1.5?

A

Let x1, x2, x3 be the draws sorted so that x1 < x2 < x3. The median is then x2. To have a median GREATER than 1.5, we need both x2 and x3 to be greater than 1.5, i.e., at least 2 of the 3 numbers must be greater than 1.5.

Since we have a uniform distribution, using the cdf formula, we obtain the probability of one number > 1.5 as 1-((1.5-0)/(2-0)) = 0.25

The problem now becomes 3 independent trials in which we want two or more successes, where the success probability on each trial is 0.25:

Bin(3,2)(0.25)^2(0.75) + Bin(3,3)(0.25)^3 = 0.140625 + 0.015625 = 0.15625
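
The binomial argument can be cross-checked with a quick Monte Carlo simulation. The trial count and seed are arbitrary choices made for this sketch.

```python
import math
import random

p = 1 - 1.5 / 2   # P(one draw > 1.5) for Uniform(0, 2)
exact = math.comb(3, 2) * p**2 * (1 - p) + math.comb(3, 3) * p**3

rng = random.Random(0)
trials = 100_000
hits = 0
for _ in range(trials):
    draws = sorted(rng.uniform(0, 2) for _ in range(3))
    if draws[1] > 1.5:        # draws[1] is the median of three values
        hits += 1
simulated = hits / trials
print(exact, simulated)
```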

23
Q

A big Boeing jumbo jet has 400 seats for a trans–Atlantic flight from JFK Airport in New York. The probability that any particular passenger will not show up for this flight is 0.03, independent of other passengers. The average fare paid by a passenger who succeeds in boarding is $400. A passenger who shows up but cannot board is given $800 and a free flight later on.

Given that the airline can sell 400 + m tickets, calculate the m that maximizes expected revenue.

A

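Let N be the number of ticketed passengers NOT showing up.

From the task we know that N ~ Bin(400+m, 0.03).
N >= m: 400+m-N <= 400; everyone who shows up has a seat.
N < m: 400+m-N > 400; not everyone has a seat, and m-N passengers cannot board.

The revenue is defined as:
revenue(m,N) = 400*(400+m-N) if N >= m
revenue(m,N) = 400*(400+m-N) - 1200*(m-N) if N < m
Here 1200 = 800 paid out + 400 for the free later flight, per passenger who cannot board.

Expected revenue is thus:
E[revenue(m)] = sum_{n=0}^{400+m} Pr(N=n) * revenue(m,n)

The answer is the m that maximizes this sum.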
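
The maximization itself can be done by brute force. The sketch below follows the revenue definition on this card; the constant names, the helper functions, and the search range of 0 to 40 extra tickets are assumptions made for illustration.

```python
import math

SEATS = 400
FARE = 400
BUMP_COST = 1200     # per passenger who shows up but cannot board, as in the formula above
P_NO_SHOW = 0.03

def binom_pmf(k, n, p):
    """P(N = k) for N ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def revenue(m, no_shows):
    """Revenue when 400 + m tickets are sold and `no_shows` passengers stay away."""
    shows = SEATS + m - no_shows
    if no_shows >= m:                      # everyone who shows up gets a seat
        return FARE * shows
    return FARE * shows - BUMP_COST * (m - no_shows)

def expected_revenue(m):
    tickets = SEATS + m
    return sum(
        binom_pmf(n, tickets, P_NO_SHOW) * revenue(m, n)
        for n in range(tickets + 1)
    )

# Assumed search range: 0 to 40 extra tickets.
best_m = max(range(0, 41), key=expected_revenue)
print(best_m, expected_revenue(best_m))
```

Overbooking by a few seats pays off because about 3% of ticket holders are expected not to show up, while the penalty for each passenger who cannot board limits how far it is worth going.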

24
Q

An ordinary fair die is thrown repeatedly until the number 5 is thrown. Let X be the number of throws. Find the expected number of throws.
Find the probability that the first 5 is obtained between the 3rd and 8th throw.
Find the probability that the first 5 will occur within the first 10 throws.

A

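X := number of trials until a 5 is obtained (geometric distribution)
p_success = 1/6
E[X] = 1/p = 6, the mean number of throws it takes

Reading "between the 3rd and 8th throw" as strictly between:
P(3<X<8) = P(X=4)+P(X=5)+P(X=6)+P(X=7)
= sum_{k=4}^{7} (5/6)^(k-1)(1/6) = (5/6)^3 - (5/6)^7 ≈ 0.300

"Within the first 10 throws" means X <= 10:
P(X<=10) = 1-P(X>10)
where X>10 means that we had 10 successive failures.
P(X<=10) = 1-(5/6)^10 ≈ 0.838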
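
These geometric probabilities are easy to evaluate numerically. The sketch below reads "between the 3rd and 8th throw" as strictly between, and reports the probability of a first 5 within the first 9 throws as well as within the first 10, since the two readings differ only by the last throw. The helper name `geom_pmf` is illustrative.

```python
def geom_pmf(k, p):
    """P(X = k): first success on trial k, after k - 1 failures."""
    return (1 - p) ** (k - 1) * p

p = 1 / 6
expected_throws = 1 / p

# First 5 on throw 4, 5, 6, or 7.
p_between = sum(geom_pmf(k, p) for k in range(4, 8))

# First 5 within the first n throws = 1 - P(n failures in a row).
p_within_9 = 1 - (1 - p) ** 9
p_within_10 = 1 - (1 - p) ** 10
print(expected_throws, p_between, p_within_9, p_within_10)
```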