Sampling Flashcards
Which real life events can be modelled as a coin flip
Any real-life binomial situation, such as fraud/non-fraud, buy/don't buy, click/don't click.
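A minimal sketch of this idea: each event is a biased coin flip (a Bernoulli trial), and the count over many events is binomial. The click-through rate `p_click` below is an illustrative number, not from the cards.

```python
import random

random.seed(0)

# Model a click/don't-click event as a biased coin flip (Bernoulli trial).
# p_click is an illustrative click-through rate, purely hypothetical.
p_click = 0.03

def flip(p):
    """Return 1 (success) with probability p, else 0."""
    return 1 if random.random() < p else 0

# A binomial count is just the sum of n independent flips.
n = 10_000
clicks = sum(flip(p_click) for _ in range(n))
rate = clicks / n  # should land near p_click for large n
```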
What is a sample
A subset from a larger dataset
What is random sampling
Each member of the population has an equal chance of being picked during the sampling procedure.
What is stratified random sampling
You divide the population into strata and randomly sample from each stratum. A stratum is a homogeneous subgroup of a population with common characteristics, e.g., political pollsters might seek to learn the electoral preferences of whites, blacks, and Hispanics separately.
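One way to sketch stratified sampling in plain Python, assuming a hypothetical population already grouped into strata (the group names and sizes below are illustrative):

```python
import random

random.seed(1)

# Hypothetical population keyed by stratum (names and sizes are made up).
population = {
    "white":    [f"w{i}" for i in range(500)],
    "black":    [f"b{i}" for i in range(300)],
    "hispanic": [f"h{i}" for i in range(200)],
}

def stratified_sample(pop, per_stratum):
    """Draw a simple random sample of fixed size from each stratum,
    so every subgroup is represented regardless of its population share."""
    return {name: random.sample(members, per_stratum)
            for name, members in pop.items()}

sample = stratified_sample(population, per_stratum=50)
```

Fixing the per-stratum size guarantees small subgroups are not drowned out, which a single simple random sample over the pooled population does not.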
Sampling Bias
The sample differs from the population it is supposed to represent in some meaningful, non-random way.
Errors due to chance vs errors due to sampling
Picture shots at a target disk. An unbiased process will still produce error, but the error is random and does not tend strongly in any direction: shots scatter around the bullseye. A biased process also has random error in both the x and y directions, but in addition there is a systematic bias: shots tend to fall, say, in the upper-right quadrant.
When do you need large amount of data
When the data is not only big but also sparse, e.g., Google search queries.
Data snooping
An extensive hunt through data for something interesting. "If you torture the data long enough, sooner or later it will confess."
Regression to Mean
Successive measurements on a given variable: extreme observations tend to be followed by more central ones. Attaching special focus and meaning to the extreme value can lead to a form of selection bias. In nearly all major sports, at least those played with a ball or puck, two elements play a role in overall performance:
* Skill
* Luck
Regression to the mean is a consequence of a particular form of selection bias. When we select the rookie with the best performance, skill and good luck are probably both contributing. In his next season, the skill will still be there, but very often the luck will not be, so his performance will decline, i.e., it will regress. The same holds for genetic tendencies; for example, the children of extremely tall men tend not to be as tall as their fathers.
Sample Statistic and sample variablility
A metric calculated on a sample. The metric might differ had we drawn a different sample; this is sampling variability. The larger the sample, the narrower the distribution of the sample statistic. And the distribution of the sample statistic (such as the mean) is more bell-shaped than the data itself.
Central Limit Theorem
Means drawn from multiple samples will resemble the familiar normal curve even if the source population is not normally distributed, provided the sample size is large enough and the departure of the data from normality is not too great.
If you draw sufficiently many random samples of size n from a population with mean μ and standard deviation σ, then the distribution of the sample means will be approximately normal with mean μ and standard deviation σ/sqrt{n}.
The central limit theorem allows normal-approximation formulas like the t-distribution to be used in calculating sampling distributions for inference, that is, confidence intervals and hypothesis tests.
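A quick simulation of the theorem, using an exponential source population (decidedly non-normal) as an illustrative choice; the sample size and repetition count are arbitrary:

```python
import random
import statistics

random.seed(2)

# Source population: exponential with mean 1 and standard deviation 1.
mu, sigma = 1.0, 1.0
n = 50        # sample size
reps = 2000   # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

# CLT prediction: means cluster near mu with spread near sigma / sqrt(n).
mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
```

Plotting `sample_means` as a histogram would show the bell shape even though the underlying exponential data is heavily right-skewed.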
Standard Error
Sums up the variability in the sampling distribution of a statistic. It is estimated as the sample standard deviation s divided by the square root of the sample size n: SE = s/sqrt{n}. As the sample size increases, the standard error decreases.
Square root of n rule
The relationship between standard error and sample size. To reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
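The rule can be checked empirically. The sketch below estimates the standard error at two sample sizes by brute-force simulation from a standard normal (an illustrative choice); quadrupling n from 100 to 400 should roughly halve the standard error:

```python
import random
import statistics

random.seed(3)

def empirical_se(n, sigma=1.0, reps=2000):
    """Empirical standard error: the std dev of the sample mean
    across many samples of size n drawn from N(0, sigma)."""
    means = [
        statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)

se_100 = empirical_se(100)   # theory: sigma / sqrt(100) = 0.10
se_400 = empirical_se(400)   # theory: sigma / sqrt(400) = 0.05
ratio = se_100 / se_400      # square-root-of-n rule predicts ~2
```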
Bootstrapping
An effective way to estimate the sampling distribution of a statistic, or of model parameters: draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample.
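A minimal bootstrap sketch for the mean, using made-up Gaussian data as the observed sample; the resample count is arbitrary:

```python
import random
import statistics

random.seed(4)

# One observed sample (illustrative data, N(10, 2)).
data = [random.gauss(10.0, 2.0) for _ in range(80)]

# Bootstrap: resample the data WITH replacement many times,
# recomputing the statistic (here, the mean) on each resample.
boot_means = [
    statistics.fmean(random.choices(data, k=len(data)))
    for _ in range(1000)
]

# The spread of the bootstrap statistics estimates the standard error.
boot_se = statistics.stdev(boot_means)
```

The same loop works for any statistic (median, trimmed mean, model coefficients): swap `statistics.fmean` for the computation of interest.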
Confidence Intervals
A 95% confidence interval is a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. A rough 95% confidence interval for a parameter is estimated_parameter +/- 2*standard_error(parameter). Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty).
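The +/- 2 standard errors recipe in code, on made-up Gaussian data:

```python
import random
import statistics

random.seed(5)

# Illustrative observed sample from N(10, 2).
data = [random.gauss(10.0, 2.0) for _ in range(100)]

mean = statistics.fmean(data)
se = statistics.stdev(data) / len(data) ** 0.5  # s / sqrt(n)

# Rough 95% CI: estimate +/- 2 * standard error, as on the card.
ci_low, ci_high = mean - 2 * se, mean + 2 * se
```

With a smaller sample, `se` grows and the interval widens, matching the last point on the card.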