Statistics Flashcards

1
Q

What is the Central Limit Theorem?

A

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution (provided the population has a finite variance). As a rule of thumb, the approximation is usually good for sample sizes over 30.

I.e. as you take more samples, especially large ones, your graph of the sample means will look more like a normal distribution.
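
A minimal simulation sketch of this idea, assuming a made-up right-skewed population (exponential with mean 1) and a hypothetical helper `sample_means`. If the sample means are roughly normal, about 68% of them should sit within 1 st dev of their centre.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, rng):
    """Return the means of n_samples samples, each made of sample_size draws."""
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def draw_exponential(rng):
    # Assumed population: exponential with mean 1 and st dev 1 (strongly skewed).
    return rng.expovariate(1.0)


rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(draw_exponential, sample_size=50, n_samples=5000, rng=rng)

mean_of_means = statistics.fmean(means)  # should be close to the population mean, 1
sd_of_means = statistics.pstdev(means)   # should be close to 1 / sqrt(50)

# Share of sample means within 1 st dev of their centre.
within_1_sd = sum(abs(m - mean_of_means) <= sd_of_means for m in means) / len(means)

print(mean_of_means, sd_of_means, within_1_sd)
```

Plotting a histogram of `means` would show the bell shape even though the population itself is far from normal.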

2
Q

In statistics, what is Differentiable?

A

Differentiable means that a function has a derivative. In simple terms, it means there is a slope (one that you can calculate). This slope will tell you something about the rate of change: how fast or slow an event (like acceleration) is happening.

The derivative must exist for all points in the domain, otherwise the function is not differentiable. This might happen when you have a hole in the graph: if there’s a hole, there’s no slope (there’s a drop off!).

3
Q

In statistics, what is Non-Differentiable?

A

In general, a function is not differentiable at a point (you cannot calculate the slope there) for one of four reasons:

  1. Corners,
  2. Cusps,
  3. Vertical tangents,
  4. Jump discontinuities.

You’ll be able to see these different types of scenarios by graphing the function on a graphing calculator.

4
Q

In statistics, what does ‘differentiate’ mean?

A

When you differentiate (or “take the derivative”), you’re finding the slope of a function at a particular point. It tells you the rate of change (i.e. how fast or how slow something is changing).

For example, if you know the position of a car, you can use differentiation to tell you how fast the car is going at that point.

The major difference between differentiation and using the slope formula is that differentiation can give you all of the slopes at all of the points, while the slope formula can only give you one at a time.

5
Q

What is a Deterministic model?

A

Deterministic models are based on precise relationships between the input variables and the model’s outputs.

The inputs to a deterministic model are fixed and known with certainty.
Given the same set of inputs, a deterministic model will always produce the same output.

Deterministic models do not incorporate randomness or uncertainty in their calculations.
These models are suitable when the system being modeled is well-defined and the inputs are known precisely.

Examples of deterministic models include mathematical equations, linear regression models, and optimization models.

6
Q

What is a Stochastic model?

A

Stochastic models take into account randomness and uncertainty in the input variables or parameters, and produce probabilistic outcomes.

The inputs to a stochastic model are not fixed but rather described by probability distributions.

Given the same set of inputs, a stochastic model may produce different outputs due to the inherent randomness.

Stochastic models are appropriate when the system being modeled involves uncertainty or when the inputs are subject to variation.

Examples of stochastic models include Monte Carlo simulations, Markov chains, and certain types of machine learning models like Gaussian processes.

7
Q

What is standard deviation?

A

The standard deviation measures the amount of variation of a set of values.

It’s calculated by taking the square root of the variance, i.e. the square root of (the sum of squared differences from the mean divided by the size of the data).

In a normal distribution, about 95% of all values fall within two standard deviations of the mean.
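
The calculation can be sketched step by step as below. The data values are made up for illustration, and the population form (divide by n) is used to match the description above.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

mean = sum(data) / len(data)
sum_of_squares = sum((x - mean) ** 2 for x in data)  # squared differences from the mean
variance = sum_of_squares / len(data)                # population variance (divide by n)
std_dev = math.sqrt(variance)                        # population standard deviation

print(mean, sum_of_squares, variance, std_dev)

# Cross-check against the standard library's population standard deviation.
print(statistics.pstdev(data))
```

Dividing by n - 1 instead of n gives the sample standard deviation (`statistics.stdev`), which is the usual choice when the data is a sample rather than the whole population.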

8
Q

What is univariate analysis?

A

Univariate analysis explores each variable in a data set, separately.

It looks at the range of values, as well as the central tendency of the values. It describes the pattern of response to the variable. It describes each variable on its own.

Descriptive statistics describe and summarize data.

9
Q

What is bivariate analysis?

A

Bivariate analysis refers to analyzing two variables to determine relationships between them.

If there’s a linear relationship between two variables then there will be a correlation; a non-linear relationship can exist even when the correlation is close to zero.

10
Q

What is a population in statistics?

A

In statistics, the population comprises all observations (data points) about the subject under study.

An example of a population is all the eligible voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.

11
Q

What are measures of central tendency?

A

Measures of central tendency are the measures that are used to describe the distribution of data using a single value.

Mean, Median and Mode are the three measures of central tendency.

12
Q

In statistics, what is variance?

A

Variance is used to measure the variability in the data from the mean.

13
Q

In statistics, what is skewness?

A

Skewness measures the asymmetry of a distribution.

A distribution is symmetrical when the proportion of data at an equal distance on either side of the mean (or median) is equal. If the tail of values extends further to the right, it is right-skewed, and if the tail extends further to the left, it is left-skewed.

14
Q

In statistics, what is kurtosis?

A

Kurtosis in statistics is used to check whether the tails of a given distribution have extreme values. It also represents the shape of a probability distribution.

15
Q

What is a Gaussian distribution?

A

In statistics and probability, the Gaussian (normal) distribution is a popular continuous probability distribution for a real-valued random variable.

It is characterized by 2 parameters (mean μ and standard deviation σ).

Properties of Gaussian Distribution:

  • Mean, median, and mode are the same
  • Symmetrical bell shape
  • About 68% of the data lie within 1 st dev of the mean
  • About 95% of the data lie within 2 st devs of the mean
  • About 99.7% of the data lie within 3 st devs of the mean
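
The 68 / 95 / 99.7 figures can be checked with the standard library's `statistics.NormalDist`. A small sketch, where `share_within` is a hypothetical helper name:

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution lying within k st devs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: share_within(k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} st dev(s): {share:.4f}")
```

Because every normal distribution is a shifted and rescaled standard normal, the same shares hold for any μ and σ.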
16
Q

What is the p value?

A

The P stands for probability. The p-value measures how likely it is to see a difference between groups at least as large as the one observed if the null hypothesis is true, i.e. if only chance were at work.

If the p-value > 0.05 - Fail to reject the null hypothesis.

If the p-value < 0.05 - Reject the null hypothesis.

Some popular hypothesis tests are:

Chi-square test
T-test
Z-test
Analysis of Variance (ANOVA)

17
Q

When is a t-test used and what are its assumptions?

A

A t-test is a type of statistical analysis used to compare the averages of two groups and determine whether the difference between them is likely to have arisen by chance.

Assumptions: (CRINE)
- Continuous data
- Random sample
- Independent observations
- Normal distribution of data in each group
- Equal variance (i.e., the variability of the data in each group is similar - if this is not the case, use Welch’s t-test).

Can be one-sample, two-sample or paired.

18
Q

What is correlation?

A

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).

It’s a common tool for describing simple relationships without making a statement about cause and effect.

We describe correlations with a unit-free measure called the correlation coefficient which ranges from -1 to +1 and is denoted by r (closer to zero = less correlation, minus = negative correlation).

Statistical significance is indicated with a p-value. Therefore, correlations are typically written with two key numbers: r = and p = .
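
A minimal sketch of how r can be computed by hand, assuming made-up data (hours studied vs test score) and a hypothetical helper `pearson_r`. It only computes r; the p-value would need a separate significance test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the st devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (sd_x * sd_y)


hours = [1, 2, 3, 4, 5]        # made-up values
scores = [52, 55, 61, 64, 70]  # made-up values

r = pearson_r(hours, scores)
r_flipped = pearson_r(hours, [-s for s in scores])  # same strength, opposite sign

print(round(r, 4), round(r_flipped, 4))
```

Dividing the covariance by the two standard deviations is what makes r unit-free and keeps it between -1 and +1.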

19
Q

In statistics, what is simple linear regression?

A

Simple linear regression is used to model the relationship between two continuous variables.

Often, the objective is to predict the value of an output variable (or response) based on the value of an input (or predictor) variable.
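
A minimal least-squares sketch, assuming made-up data and a hypothetical helper `fit_line`. It fits response = intercept + slope × predictor and then predicts the response for a new input.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


hours = [1, 2, 3, 4, 5]        # predictor (made-up values)
scores = [52, 55, 61, 64, 70]  # response (made-up values)

slope, intercept = fit_line(hours, scores)
predicted_score = intercept + slope * 6  # prediction for a new input of 6 hours

# Residuals: differences between actual and predicted values.
residuals = [y - (intercept + slope * x) for x, y in zip(hours, scores)]

print(slope, intercept, predicted_score)
```

With an intercept in the model, the least-squares residuals always sum to zero, which makes a quick sanity check on the fit.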

20
Q

What is multiple linear regression?

A

Multiple linear regression is used to model the relationship between a continuous response variable and continuous or categorical explanatory variables.

21
Q

Why should you do residual analysis?

A

To verify that the assumptions made by the regression model are valid.

Check that residuals:
1. Have constant variance (no non-random pattern)
2. Are approximately normally distributed
3. Are independent from one another

22
Q

What are externally studentized residuals?

A

A studentized residual is calculated by dividing the residual by an estimate of its standard deviation.

The standard deviation for each residual is computed with the observation excluded.

For this reason, studentized residuals are sometimes referred to as externally studentized residuals.

23
Q

What is Multicollinearity?

A

When two or more predictors are highly correlated with one another.

In regression, multicollinearity can make it difficult to determine the effect of each predictor on the response, and can make it challenging to determine which variables to include in the model.

Multicollinearity can also cause other problems:

  • Coefficients might be poorly estimated, or inflated.
  • Coefficients might have signs that don’t make sense.
  • Standard errors for these coefficients might be inflated.

To resolve, try:
- removing a redundant term from the model
- principal component analysis (PCA)
- partial least squares (PLS)
- tree-based methods
- penalized regression

24
Q

What is One Way ANOVA?

A

One-way analysis of variance (ANOVA) is a statistical method for testing for differences in the means of three or more groups.

One-way ANOVA can only be used when investigating a single factor and a single dependent variable.

When comparing the means of three or more groups, it can tell us if at least one pair of means is significantly different, but it can’t tell us which pair.

Also, it requires that the dependent variable be normally distributed in each of the groups and that the variability within groups is similar across groups.

25
Q

What is the Sum of Squares in statistics?

A

The sum of squares gives us a way to quantify variability in a data set by focusing on the difference between each data point and the mean of all data points in that data set.

A higher sum of squares indicates higher variability while a lower result indicates low variability from the mean.

To calculate the sum of squares, subtract the mean from the data points, square the differences, and add them together.

26
Q

What are Degrees of Freedom in statistics?

A
  • Number of pieces of information that can freely vary without violating any restrictions
  • Independent pieces of information available to estimate other pieces of information (variable features available)
27
Q

What is the Chi-squared test?

A

A hypothesis testing method involving checking if observed frequencies in one or more categories match expected frequencies.

If you have a single categorical variable, you use a Chi-square goodness of fit test.

If you have two categorical variables, you use a Chi-square test of independence. There are other Chi-square tests, but these two are the most common.

28
Q

What are Hidden Markov Models?

A

Statistical models used to describe and analyze sequential data, i.e. data that exhibits a temporal or other sequential dependence.

Particularly useful for modeling sequential data where the underlying states are not directly observable but have an impact on the observed data.

HMM assumes that the system transitions from one state to another according to a probabilistic process. State transitions are governed by transition probabilities, which determine the likelihood of moving from one state to another.

The underlying assumption of an HMM is the Markov property, which states that the probability of being in a particular state depends only on the immediately preceding state. In other words, the current state is assumed to be conditionally independent of all previous states given the most recent state.

29
Q

What is a Monte Carlo Simulation?

A

A computational technique used to model and analyze systems or processes that involve uncertainty.

Relies on generating random samples or scenarios to estimate the behavior or outcomes of complex systems.

  1. Assign random values, or values based on probability distributions, to the uncertain parts of the model
  2. Run the model many times, each time with a fresh set of randomly drawn values, to show the range of potential outcomes (these are plotted on a probability distribution curve)
  3. Frequencies of the different outcomes often form an approximately normal distribution, in which case:
    - the mean is the most likely outcome, with equal chance of it going either side
    - 68% chance true outcome will fall within 1 st dev of the mean
    - 95% chance true outcome will fall within 2 st dev of the mean
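
The steps above can be sketched with a toy model. Everything here is an assumption for illustration: a hypothetical two-task project whose task durations follow made-up uniform and triangular distributions.

```python
import random
import statistics


def simulate_total_days(rng):
    """One run of the model: draw each uncertain input, return the outcome."""
    design_days = rng.uniform(2.0, 4.0)         # assumed: anywhere from 2 to 4 days
    build_days = rng.triangular(3.0, 9.0, 5.0)  # assumed: low 3, high 9, most likely 5
    return design_days + build_days


rng = random.Random(42)  # fixed seed so the run is repeatable
totals = [simulate_total_days(rng) for _ in range(20000)]

mean_total = statistics.fmean(totals)
sd_total = statistics.pstdev(totals)
prob_over_10_days = sum(t > 10 for t in totals) / len(totals)

print(mean_total, sd_total, prob_over_10_days)
```

The list `totals` is the simulated outcome distribution; any summary (mean, st dev, chance of exceeding a deadline) is read off it rather than derived analytically.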

Limitations:
- Solely based on input data (random values, features etc)
- Computationally expensive

30
Q

What is Homoscedasticity?

A

Homoscedasticity describes a situation in which the variance of the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.

31
Q

What are Eigenvectors and Eigenvalues?

A

In linear algebra, an eigenvector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it.

The corresponding eigenvalue, often denoted by lambda, is the factor by which the eigenvector is scaled.

e.g. when you stretch or shear an image, the eigenvectors are the ‘axes’ along which all other points in the image slide; the eigenvectors themselves do not change direction, they are only scaled by their eigenvalue
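
A small check of the definition, assuming a made-up 2x2 matrix and hypothetical helpers `mat_vec` and `is_eigenpair`. A vector v is an eigenvector with eigenvalue lambda exactly when A·v equals lambda·v.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def is_eigenpair(matrix, vector, eigenvalue, tol=1e-12):
    """True if applying the matrix only scales the vector by eigenvalue."""
    transformed = mat_vec(matrix, vector)
    return all(abs(t - eigenvalue * v) <= tol for t, v in zip(transformed, vector))


A = [[2.0, 1.0],
     [1.0, 2.0]]  # made-up symmetric matrix

print(mat_vec(A, [1.0, 1.0]))    # [3.0, 3.0]: same direction, scaled by 3
print(mat_vec(A, [1.0, -1.0]))   # [1.0, -1.0]: unchanged, eigenvalue 1
print(mat_vec(A, [1.0, 0.0]))    # [2.0, 1.0]: direction changed, not an eigenvector
```

In practice a linear algebra library would be used to find the eigenvectors; this sketch only verifies candidates.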

32
Q

What is Heteroskedasticity?

A

Heteroskedasticity (or heteroscedasticity) happens when the standard deviations of a predicted variable, monitored over different values of an independent variable or as related to prior time periods, are non-constant.

i.e. the variance of the residuals is unequal over a range of measured values.

If heteroskedasticity exists, the population used in the regression contains unequal variance, and the analysis results may be invalid

33
Q

What is a Z-score?

A

The Z-score (also called the standard score) indicates how many standard deviations a certain point is from the mean. By applying the Z-transformation we shift and rescale the distribution so that it has 0 mean and unit standard deviation.

Z-score(i) = (x(i) - mean) / standard deviation

If the data is normally distributed, the % of data points that lie between -/+1 stdev. is ~68%, -/+2 stdev. is ~95% and -/+3 stdev. is ~99.7%.

Hence, if the absolute value of the Z-score is >3 we can reasonably flag that point as a potential outlier.
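
A minimal sketch of the Z-score outlier rule, assuming made-up readings and a hypothetical helper `z_scores`. The population standard deviation is used here; that choice is an assumption.

```python
import statistics


def z_scores(values):
    """Z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(x - mean) / sd for x in values]


# Made-up readings: nineteen values between 9 and 12, plus one extreme value.
readings = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
            10, 12, 11, 10, 9, 11, 10, 12, 11, 50]

scores = z_scores(readings)
outliers = [x for x, z in zip(readings, scores) if abs(z) > 3]

print(outliers)
```

Note that an extreme point also inflates the mean and standard deviation used to score it, so in very small samples no point can reach a Z-score above 3.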

34
Q

How can you handle outliers?

A
  1. Remove outlier values
  2. Try a different model. Data detected as outliers by linear models can be fit by nonlinear models. Therefore, be sure you are choosing the correct model.
  3. Try normalizing the data. This way, the extreme data points are pulled to a similar range.
  4. Use algorithms that are less affected by outliers, e.g. random forests.
35
Q

When can a time series be described as stationary?

A

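When the mean and variance of the series are constant over time (and, for weak stationarity, the covariance between two points depends only on the lag between them).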

36
Q

What are confounding variables?

A

In causal inference, a confounder is a variable that influences both the dependent variable and independent variable, causing a spurious association.

It is often an unmeasured variable that influences both the supposed cause and the supposed effect.

37
Q

What is selection bias?

A

Selection bias, in general, is a problematic situation in which error is introduced due to a non-random population sample.

38
Q

What are the types of biases that can occur during sampling?

A
  1. Selection bias
  2. Undercoverage bias
  3. Survivorship bias
39
Q

What is survivorship bias?

A

Survivorship bias is the logical error of focusing on the people or things that survived a selection process and overlooking those that did not, typically because of their lack of visibility. This can lead to wrong conclusions in numerous ways.

40
Q

If you have two children and one is a boy, what is the probability that the other is a girl?

A
  1. Create space of events matrix: possibilities = FF, FM, MF, MM (25% chance of each)
  2. We know at least one child is a male, so FF is impossible
  3. That leaves 3 equally likely possibilities: MM, MF, FM. Two of them include a girl, therefore there is a 2/3 (about 67%) chance that the other child is a girl
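
The enumeration above can be sketched by listing the four equally likely families and conditioning on what we know. The variable names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Each family is (older child, younger child); all four are equally likely.
families = list(product("MF", repeat=2))

# Condition: at least one child is a boy.
at_least_one_boy = [f for f in families if "M" in f]
boy_and_girl = [f for f in at_least_one_boy if "F" in f]
p_other_is_girl = Fraction(len(boy_and_girl), len(at_least_one_boy))

# A different condition for comparison: the OLDER child is a boy.
older_is_boy = [f for f in families if f[0] == "M"]
p_younger_is_girl = Fraction(
    sum(1 for f in older_is_boy if f[1] == "F"), len(older_is_boy)
)

print(p_other_is_girl, p_younger_is_girl)  # 2/3 1/2
```

The answer depends on exactly what is known: "at least one is a boy" gives 2/3, while "the older one is a boy" gives 1/2.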
41
Q

If you throw a coin 1000x and 550x you get a head, how can you tell if the coin is fair?

A
  1. Central Limit Theorem dictates that the number of heads in a large number of Bernoulli trials will be approximately normally distributed; for a fair coin the highest probabilities are centred around 500 heads
  2. Therefore normal distribution rules apply; 68% of data within 1SD of the mean, 95% data within 2SD etc
  3. Select:
    - p value threshold; 0.05 means 5% chance of false positive
    - alpha: Significance level (how often falsely reject null hypothesis - Type 1 error)
    - Minimal Detectable Effect: threshold for how different probability should be from 0.5 to consider coin biased
    - Power: probability of detecting a bias of that size if it exists (together with alpha and the Minimal Detectable Effect, this determines the sample size needed)
  4. Calculate st dev of the normal distribution (sq rt of sum of variances: sqrt(n × p × (1 - p)) = sqrt(1000 × 0.5 × 0.5) ≈ 15.8)
  5. Calculate z score (z = (x-μ)/σ, where x is the raw score, μ is the population mean, and σ is the population standard deviation); here z = (550 - 500)/15.8 ≈ 3.16
  6. Use Standard Normal Curve Areas table to look up probability of this occurring; a z of about 3.16 corresponds to a two-sided p value of roughly 0.002, well below 0.05, so the coin is unlikely to be fair
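
The calculation can be sketched as below, using the normal approximation to the number of heads and a two-sided test. The two-sided choice and the 0.05 threshold are assumptions; `math.erfc` replaces the table lookup.

```python
import math

n_tosses = 1000
n_heads = 550
p_fair = 0.5   # null hypothesis: the coin is fair
alpha = 0.05   # assumed significance level

expected_heads = n_tosses * p_fair                       # mean under the null
sd_heads = math.sqrt(n_tosses * p_fair * (1 - p_fair))   # st dev under the null
z = (n_heads - expected_heads) / sd_heads

# Two-sided p-value for a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

reject_fair_coin = p_value < alpha

print(round(sd_heads, 2), round(z, 2), p_value, reject_fair_coin)
```

An exact binomial test would give a very similar answer here; the normal approximation is used only because it matches the steps above.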
42
Q

What is a Bernoulli trial?

A

A Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes (“success” and “failure”).

43
Q

Define probability

A

Represents how likely an event is to occur, generally measured by the ratio of favorable outcomes to the total number of possible (equally likely) outcomes.

The study and interpretation of the chance of outcomes in the sample space of statistical experiments.

Always lies between 0 and 1.

44
Q

What is Sample Space?

A

The set of all possible outcomes, generally represented by the letter S.

For a sequence of repeated events, its size is calculated as the number of possible outcomes per event to the power of the number of events

e.g. 10 coin tosses:
possible outcomes = 2 (H/T)
number of events = 10
2^10 = 1024

45
Q

What is a Complement of event?

A

The Complement of Event A is generally represented by A’ and is the set of all elements or events of S that are not in A. Its probability can be calculated from
P(A’) = 1 - P(A)

e.g. if the event is at least 2 heads thrown in 10 coin tosses, the complement is when 1 or zero heads are thrown

46
Q

In statistics, what is an Independent Event?

A

An event A is independent of an event B if the occurrence of A does not affect the probability of B, where A and B are two events of sample space S.

In other words, the product of the probabilities of events A and B equals the probability of the intersection of events A and B.
P(A ∩ B) = P(A) P(B)

Successive coin tosses or dice throws are examples of independent events.

47
Q

In statistics, what are Mutual Events?

A

Two events A and B are said to be mutual if they have at least one element of the sample space in common, i.e. they can both occur at the same time.

48
Q

In statistics, what are Mutually Exclusive Events?

A

Two events A and B are said to be disjoint or mutually exclusive if they cannot both occur at the same time, i.e. they do not have any elements in common.

For example, it is not possible to get both HEAD & TAIL when a coin is tossed.
A ∩ B = ∅
P(A ∩ B) = 0

49
Q

In statistics, What is conditional probability?

A

Conditional probability is the probability of an event A given that event B has already occurred.

In other words, it’s the probability of an event occurring given the occurrence of a previous event.
P(A|B) = P(A ∩ B)/P(B)
similarly,
P(B|A) = P(A ∩ B)/P(A)

50
Q

In statistics, what is variance?

A

Variance measures variability from the average or mean.

It is calculated by taking the differences between each number in the data set and the mean, then squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set.

51
Q

What are permutations in statistics?

A

Sequences of events where the order matters, e.g. a PIN

Permutations are important in probability calculations to find the sample space.

To calculate the number of permutations with repetition, take the number of possibilities for each event and raise it to the power X, where X equals the number of events in the sequence.

For example, with four-digit PINs, each digit can range from 0 to 9, giving us 10 possibilities for each digit. We have four digits. Consequently, the number of permutations with repetition for these PINs = 10 * 10 * 10 * 10 = 10,000.
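
The PIN count can be checked by brute force. The "no repeated digits" case is added here only for comparison and is not part of the example above.

```python
import math
from itertools import product

digits = "0123456789"
pin_length = 4

# With repetition: each of the 4 positions can be any of the 10 digits.
pins_with_repetition = len(digits) ** pin_length

# Brute-force check: enumerate every possible PIN and count them.
enumerated_pins = sum(1 for _ in product(digits, repeat=pin_length))

# Without repetition (each digit used at most once): 10 * 9 * 8 * 7.
pins_without_repetition = math.perm(len(digits), pin_length)

print(pins_with_repetition, enumerated_pins, pins_without_repetition)
```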

52
Q

What are combinations in statistics?

A

A selection of outcomes where the order does not matter.

For example, when you’re ordering a pizza, it doesn’t matter whether you order it with ham, mushrooms, and olives or olives, mushrooms, and ham.

53
Q

What is a Bernoulli distribution?

A

Probability distribution for a random variable with:

P(1) = p
P(0) = 1 - p

A model for the set of possible outcomes of any single experiment that asks a yes–no question.

(a special case of binomial distribution where a single trial is conducted)

e.g.
- click through probability, where click through rate = p
- conversion rate

54
Q

What is a binomial distribution?

A

A discrete probability distribution that calculates the likelihood that an event will occur a specific number of times in a set number of opportunities

e.g. 2 heads thrown in 10 coin tosses
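
The coin example can be worked through with the binomial formula P(k) = C(n, k) × p^k × (1 - p)^(n - k). A minimal sketch, assuming a fair coin; `binomial_pmf` is a hypothetical helper name.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


n_tosses = 10
p_heads = 0.5  # assumed fair coin

p_exactly_two = binomial_pmf(2, n_tosses, p_heads)  # 45 / 1024

# "At least 2 heads" via its complement (0 or 1 heads).
p_at_least_two = 1 - binomial_pmf(0, n_tosses, p_heads) - binomial_pmf(1, n_tosses, p_heads)

print(p_exactly_two, p_at_least_two)
```

With n = 1 the same formula reduces to the Bernoulli distribution.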

55
Q

What is a small sample size and what special treatment does this need?

A

A small sample is conventionally one with fewer than 30 observations

Small samples require testing for normal distribution

This is not usually required for large samples due to the Central Limit Theorem

56
Q

When would you use a z-test rather than a t-test?

A

A z-test is used to test a Null Hypothesis if the population variance is known, or if the sample size is larger than 30, for an unknown population variance.

A t-test is used when the sample size is less than 30 and the population variance is unknown.

57
Q

What are the differences between a student’s (t) distribution and a normal (z) distribution?

A
  1. The standard normal or z-distribution assumes that you know the population standard deviation. The t-distribution is based on the sample standard deviation.
  2. The t-distribution is more spread out (it has heavier tails)
  3. With the t-distribution, the population st dev is unknown and has to be estimated from the sample
  4. The t-distribution provides a wider confidence interval than the z-distribution (because we don’t know the st dev so are less certain about the estimate)
  5. As n increases, the t-distribution starts to approximate the normal distribution
  6. As n increases, a t-test result becomes almost identical to a z-test result
58
Q

What is a t-test?

A

When you perform a t-test, you check if your test statistic is a more extreme value than expected from the t-distribution.

For a two-tailed test, you look at both tails of the distribution. Figure 3 below shows the decision process for a two-tailed test. The curve is a t-distribution with 21 degrees of freedom. The value from the t-distribution with α = 0.05/2 = 0.025 is 2.080. For a two-tailed test, you reject the null hypothesis if the absolute value of the test statistic is larger than the reference value. In other words, if the test statistic value is either in the lower tail or in the upper tail, you reject the null hypothesis. If the test statistic is within the two reference lines, then you fail to reject the null hypothesis.

59
Q

In statistics, what is Power?

A
  • Statistical power is used in a binary hypothesis test
  • It is the probability that a test correctly rejects the null hypothesis when the alternative hypothesis is true (likelihood that a statistical test will identify an effect when the effect is present)
  • The higher the statistical power, the better the test is
  • It is commonly used in experimental design to calculate the minimum sample size required
60
Q

Explain Type I error

A
  • Also known as false positive
  • Used to categorise errors in a binary hypothesis test
  • Occurs when we mistakenly reject a true null hypothesis (i.e. we find statistical significance when in fact the results occurred by chance)
  • The larger the Type I error rate (alpha), the less reliable a test is (want to minimize it)
  • Commonly used in A/B testing
61
Q

Explain Type II Error

A
  • Also known as false negatives
  • Used to categorise errors in a binary hypothesis test
  • Occurs when we fail to reject a false null hypothesis (i.e. we conclude there is no significant result when there actually is one)
  • The larger the Type II error rate (beta = 1 - power), the less reliable a test is (want to minimize it)
  • Commonly used in A/B testing
62
Q

Explain power, type I and Type II errors to a non-technical audience

A

In the context of covid testing

Power: if a person really has covid, how likely is the test to come back positive

Type 1: Person does not have covid but test is positive

Type 2: Person has covid but test is negative

63
Q

Explain confidence interval

A
  • Used when we want to understand how variable a sample result might be (estimates the true value of the population, although we will never know what this is)
  • CI is a range of numbers which should cover the true population value
    The probability that the procedure produces an interval containing the true value is the confidence LEVEL - often 95%
  • The wider the interval, the more uncertain we are about the sample result; the higher the confidence LEVEL the wider the confidence INTERVAL
64
Q

Explain confidence interval to a non-technical audience

A

CI measures the level of uncertainty when estimating a value

Let’s say we want to know the average height of men in the US

We can measure a sample of 30 men

If we have a CI of 168-195cm with a confidence level of 95%, we can be fairly confident that this CI covers the TRUE average height of all men in the US

But how confident?

If we repeat the experiment multiple times, we expect the CI to cover the true value 95% of the time
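
The "repeat the experiment" idea can be simulated. This is a sketch under stated assumptions: the true mean and st dev of the population are made-up numbers, `confidence_interval` is a hypothetical helper, and the interval uses the normal (z) approximation rather than the t-distribution, so coverage for n = 30 comes out slightly under 95%.

```python
import math
import random
import statistics
from statistics import NormalDist


def confidence_interval(sample, level=0.95):
    """Approximate CI for the mean: mean +/- z * (sample st dev / sqrt(n))."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    return mean - z * standard_error, mean + z * standard_error


TRUE_MEAN_CM = 175.0  # hypothetical population mean height
TRUE_SD_CM = 7.0      # hypothetical population st dev

rng = random.Random(1)  # fixed seed so the run is repeatable
n_experiments = 2000
covered = 0
for _ in range(n_experiments):
    sample = [rng.gauss(TRUE_MEAN_CM, TRUE_SD_CM) for _ in range(30)]
    low, high = confidence_interval(sample)
    covered += low <= TRUE_MEAN_CM <= high

coverage = covered / n_experiments
print(coverage)
```

Each individual interval either covers the true mean or it does not; the 95% describes how often the procedure succeeds over many repetitions.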

65
Q

Explain p-value

A
  • p value is used in hypothesis testing to connect the dots between observation and conclusion
  • It is a conditional probability: the probability of getting test results at least as extreme as the observed results if the null hypothesis is true
  • Low p value = less support for null hypothesis; often use 0.05 as cut off value - <0.05 reject null hypothesis
  • Commonly used in A/B testing when we want to see if there is a true difference between two groups
66
Q

Explain p-value to a non-technical audience

A

p value is a measure of how likely our measurements are if there’s really no difference between two groups

Say we want to compare IQs of people from North and South UK

Get 30 people from each and measure IQs

Test difference of averages between groups:

p-value lets you connect dots between the data observed from sample and the true data

If we get a p value <0.05, a difference this large would be unlikely if the groups were really the same, so we conclude there probably is a difference; if it is more than 0.05, we don’t have enough evidence to say there is a difference

67
Q

What is central tendency?

A

Central tendency describes where most of the data lies in a distribution.

Use mean, median and mode to describe central tendency

68
Q

What are pros and cons of using the different central tendency measures?

A

Mean
Pro - uses all values
Con - sensitive to outliers

Median
Pro - robust to outliers
Con - only uses one value

Mode
Pro - useful for categorical variables
Con - only uses one value

69
Q

In statistics, what is dispersion?

A

Dispersion is the spread of data around the centre of the distribution

e.g. variance, standard deviation, range

70
Q

What can you do to calculate correlation in the presence of extreme outliers?

A
  • Use interquartile range to remove outliers
  • Scaling (logarithmic etc)
71
Q

What are the assumptions of linear regression?

A

Can remember this with LINE:

  • L: Linear relationship exists between the predictor and response variables
  • I: Independence of residuals, where one residual value does not influence another
  • N: Normal distribution of residuals (esp. important for small sample sizes)
  • E: Equal variance of residuals across different independent variable values
  • Residuals are differences between actual and predicted values
72
Q

What is Welch’s t-test and when is it used?

A

Welch’s t-test is an alternative to Student’s t-test which is used when there is a large difference in variance between the two groups (i.e. the equal-variance assumption does not hold)

73
Q

What is the difference between correlation and covariance?

A

Correlation:
- Measures the strength and direction of the linear relationship between two variables
- Unitless
- Range: -1 to 1

Covariance:
- Measures the direction of the relationship
- Unit = product of the units of the two variables
- Range: from -(product of st devs of the two variables) to +(product of st devs of the two variables)

74
Q

In statistics, what is resampling and why do we use it?

A

Resampling is a non-parametric method

Consists of taking a sample from a sample; the original sample is then treated as the population

It is used when we want to test an experiment but the data is not normally distributed, parametric tests cannot be used, or more data cannot be collected

Assumes that the original sample is representative of the population

Two approaches: bootstrapping and permutation

75
Q

What are the two key methods of data resampling and how do they compare?

A

Bootstrapping:
- Take multiple samples from the original sample
- Used to estimate precision of sample statistic
- No distributional assumptions (the sample is assumed to be representative of the population)
- With replacement
- Repeat many times (e.g. 10,000)

Permutation:
- Mix up two original sample groups, take random sample from mix and use this as new group 1, remainder as group 2
- Used for non-parametric hypothesis testing
- Assumes exchangeability of groups under null hypothesis
- Without replacement
- Repeat at least 1000 times
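
Both methods can be sketched in a few lines. The two groups are made-up data, the helper names are hypothetical, and the percentile interval shown is only one of several ways to turn bootstrap results into a confidence interval.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples, rng):
    """Means of resamples drawn WITH replacement, each the size of the sample."""
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def permutation_p_value(group_1, group_2, n_resamples, rng):
    """Share of shuffled splits whose mean difference is at least the observed one."""
    observed = abs(statistics.fmean(group_1) - statistics.fmean(group_2))
    pooled = list(group_1) + list(group_2)
    n_1 = len(group_1)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # mix the groups, then split WITHOUT replacement
        diff = abs(statistics.fmean(pooled[:n_1]) - statistics.fmean(pooled[n_1:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples


rng = random.Random(7)  # fixed seed so the run is repeatable
group_a = [12.1, 11.8, 12.6, 13.0, 12.4, 11.9, 12.8, 12.2]  # made-up values
group_b = [13.4, 13.9, 13.1, 14.2, 13.6, 13.8, 13.3, 14.0]  # made-up values

# Bootstrap: approximate 95% percentile interval for the mean of group_a.
boot = sorted(bootstrap_means(group_a, 10000, rng))
ci_low, ci_high = boot[249], boot[9749]

# Permutation test: is the difference in means bigger than chance would give?
p_value = permutation_p_value(group_a, group_b, 1000, rng)

print(ci_low, ci_high, p_value)
```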

76
Q

What is the definition of data sampling?

A

Taking a subset of a population which is representative of that population

Used when we do not have access to the entire population (due to cost, efficiency etc)

77
Q

What is sampling with replacement vs sampling without replacement?

A

With replacement: Take element of sample, perform measures, replace element in sample before drawing next element
- possible to draw same element multiple times
- probability of elements getting drawn is constant

Without replacement: Elements are not returned to sample after being measured
- Cannot draw same element twice
- Probability of each remaining element getting drawn increases as elements are taken out

78
Q

What is Bayes’ Theorem?

A

Statistical approach to dealing with conditional probability: it relates P(A|B) to P(B|A)
P(A|B) = P(B|A) P(A) / P(B)
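
A worked sketch of Bayes’ Theorem, P(A|B) = P(B|A) P(A) / P(B), using a diagnostic test. All the numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are made up for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # P(positive) = P(positive | condition) P(condition)
    #             + P(positive | no condition) P(no condition)
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


p_condition_given_positive = posterior(
    prior=0.01,                # assumed: 1% of people have the condition
    sensitivity=0.95,          # assumed: P(positive | condition)
    false_positive_rate=0.05,  # assumed: P(positive | no condition)
)

print(round(p_condition_given_positive, 3))  # 0.161
```

Even with a fairly accurate test, a positive result here means only about a 16% chance of having the condition, because the condition is rare to begin with.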