Statistics Flashcards
What is the Central Limit Theorem?
The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution. A common rule of thumb is that sample sizes of about 30 or more are large enough for this to hold.
In other words, as you take more samples, especially large ones, the distribution of the sample means will look more and more like a normal distribution.
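A minimal sketch of this (assuming NumPy is available, with a simulated exponential population): even though the population is heavily skewed, the means of repeated samples of size 30 cluster in a roughly bell-shaped curve around the population mean.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean, n_samples, n = 1.0, 10_000, 30

# Draw 10,000 samples of size 30 from a skewed (exponential) population
# and compute the mean of each sample.
sample_means = rng.exponential(scale=population_mean, size=(n_samples, n)).mean(axis=1)

# The sample means cluster around the population mean with spread sigma / sqrt(n).
print(sample_means.mean())         # ~1.0 (the population mean)
print(sample_means.std(ddof=1))    # ~1.0 / sqrt(30), i.e. about 0.18
```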
In statistics, what is differentiable?
Differentiable means that a function has a derivative. In simple terms, it means there is a slope (one that you can calculate). This slope will tell you something about the rate of change: how fast or slow an event (like acceleration) is happening.
The derivative must exist at every point in the domain; otherwise, the function is not differentiable. This can happen when there is a hole in the graph: at a hole there is no slope (there's a drop-off!).
In statistics, what is non-differentiable?
In general, a function is not differentiable (you cannot calculate the slope) for four reasons:
- Corners,
- Cusps,
- Vertical tangents,
- Jump discontinuities.
You’ll be able to see these different types of scenarios by graphing the function on a graphing calculator.
In statistics, what does ‘differentiate’ mean?
When you differentiate (or “take the derivative”), you’re finding the slope of a function at a particular point. It tells you the rate of change (i.e. how fast or how slow something is changing).
For example, if you know the position of a car, you can use differentiation to tell you how fast the car is going at that point.
The major difference between differentiation and using the slope formula is that differentiation can give you all of the slopes at all of the points, while the slope formula can only give you one at a time.
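A small numerical sketch of the car example (assuming NumPy; the position function is made up): differentiating position with respect to time gives the speed at every point, not just one.

```python
import numpy as np

# Position of a car: s(t) = 3t^2 (metres), so the speed should be ds/dt = 6t.
t = np.linspace(0, 10, 1001)   # time in seconds
s = 3 * t**2                   # position in metres

# Numerical differentiation: the slope of s with respect to t at every point.
speed = np.gradient(s, t)

print(speed[100])   # ~6.0  (speed at t = 1 s)
print(speed[500])   # ~30.0 (speed at t = 5 s)
```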
What is a Deterministic model?
Deterministic models are based on precise relationships between the input variables and the model’s outputs.
The inputs to a deterministic model are fixed and known with certainty.
Given the same set of inputs, a deterministic model will always produce the same output.
Deterministic models do not incorporate randomness or uncertainty in their calculations.
These models are suitable when the system being modeled is well-defined and the inputs are known precisely.
Examples of deterministic models include mathematical equations, linear regression models, and optimization models.
What is a Stochastic model?
Stochastic models take into account randomness and uncertainty in the input variables or parameters, and produce probabilistic outcomes.
The inputs to a stochastic model are not fixed but rather described by probability distributions.
Given the same set of inputs, a stochastic model may produce different outputs due to the inherent randomness.
Stochastic models are appropriate when the system being modeled involves uncertainty or when the inputs are subject to variation.
Examples of stochastic models include Monte Carlo simulations, Markov chains, and certain types of machine learning models like Gaussian processes.
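A toy contrast between the two kinds of model (a sketch assuming NumPy; the linear relationship and noise level are invented for illustration): the deterministic function always returns the same output for the same input, while the stochastic version adds randomness, so repeated calls differ.

```python
import numpy as np

rng = np.random.default_rng()

def deterministic_model(x):
    # Fixed relationship: the same input always gives the same output.
    return 2.0 * x + 1.0

def stochastic_model(x, noise_sd=0.5):
    # Same relationship plus random noise: repeated calls give different outputs.
    return 2.0 * x + 1.0 + rng.normal(0.0, noise_sd)

print(deterministic_model(3.0), deterministic_model(3.0))  # identical
print(stochastic_model(3.0), stochastic_model(3.0))        # differ from run to run
```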
What is standard deviation?
The standard deviation measures the amount of variation of a set of values.
It’s calculated by taking the square root of the sum of squared differences from the mean divided by the number of data points; in other words, it is the square root of the variance.
In a normal distribution, about 95% of all values fall within two standard deviations of the mean.
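The same calculation, step by step (a sketch assuming NumPy; note that np.std divides by n by default, and ddof=1 gives the sample version):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sum of squared differences from the mean, divided by n, then square-rooted.
variance = ((data - data.mean()) ** 2).sum() / len(data)
std_dev = np.sqrt(variance)

print(std_dev)        # 2.0
print(np.std(data))   # 2.0, the same result from NumPy's built-in
```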
What is univariate analysis?
Univariate analysis explores each variable in a data set, separately.
It looks at the range of values as well as their central tendency, and describes the pattern of response to each variable on its own.
It is a form of descriptive statistics, which describe and summarize data.
What is bivariate analysis?
Bivariate analysis refers to analyzing two variables to determine relationships between them.
If there is a relationship between the two variables, it will show up as a correlation (though correlation only captures how strongly they are linearly related).
What is a population in statistics?
In statistics, the population comprises all observations (data points) about the subject under study.
An example of a population is studying the voters in an election. In the 2019 Lok Sabha elections, nearly 900 million voters were eligible to vote in 543 constituencies.
What are measures of central tendency?
Measures of central tendency are the measures that are used to describe the distribution of data using a single value.
Mean, median, and mode are the three main measures of central tendency.
In statistics, what is variance?
Variance measures the variability of the data around the mean; it is the average of the squared deviations from the mean.
In statistics, what is skewness?
Skewness measures the asymmetry of a distribution’s shape.
A distribution is symmetrical when the proportion of data at an equal distance from the mean (or median) is equal. If the tail extends to the right, the distribution is right-skewed; if the tail extends to the left, it is left-skewed.
In statistics, what is kurtosis?
Kurtosis in statistics is used to check whether the tails of a given distribution have extreme values. It also represents the shape of a probability distribution.
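A quick sketch computing both measures on simulated data (assuming SciPy is available; scipy.stats.kurtosis reports excess kurtosis by default, i.e. 0 for a normal distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=10_000)       # symmetric, light tails
skewed_data = rng.exponential(size=10_000)  # asymmetric, heavier right tail

print(stats.skew(normal_data), stats.kurtosis(normal_data))   # both ~0
print(stats.skew(skewed_data), stats.kurtosis(skewed_data))   # ~2 and ~6 for an exponential
```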
What is Gaussian distribution?
In statistics and probability, the Gaussian (normal) distribution is a widely used continuous probability distribution.
It is characterized by two parameters: the mean μ and the standard deviation σ.
Properties of Gaussian Distribution:
- Mean, median, and mode are the same
- Symmetrical bell shape
- 68% of the data lies within 1 standard deviation of the mean
- 95% of the data lies within 2 standard deviations of the mean
- 99.7% of the data lies within 3 standard deviations of the mean
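A quick check of those three figures using the normal CDF (a sketch assuming SciPy is available):

```python
from scipy import stats

mu, sigma = 0.0, 1.0
dist = stats.norm(mu, sigma)

# Probability mass within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    p = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} sd: {p:.4f}")   # ~0.6827, 0.9545, 0.9973
```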
What is the p value?
The p stands for probability: the p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true, i.e. how likely the observed difference between groups is to arise by chance alone.
If the p-value > 0.05, fail to reject the null hypothesis.
If the p-value < 0.05, reject the null hypothesis (0.05 is the conventional significance threshold).
Some popular hypothesis tests are:
- Chi-square test
- T-test
- Z-test
- Analysis of Variance (ANOVA)
When is a t-test used and what are its assumptions?
A t-test is a type of statistical analysis used to compare the averages of two groups and determine whether the differences between them are likely to have arisen by chance.
Assumptions: (CRINE)
- Continuous data
- Random sample
- Independent observations
- Normal distribution of data in each group
- Equal variance (i.e., the variability of the data in each group is similar; if this is not the case, use Welch’s t-test).
A t-test can be one-sample, two-sample, or paired.
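A minimal two-sample sketch on simulated data (assuming SciPy is available): scipy.stats.ttest_ind runs the standard test, and equal_var=False switches to Welch’s t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=40)

# Standard two-sample t-test (assumes equal variances) ...
t_stat, p_value = stats.ttest_ind(group_a, group_b)
# ... and Welch's t-test when the equal-variance assumption is doubtful.
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_stat, p_value)
print(t_welch, p_welch)
```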
What is correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate).
It’s a common tool for describing simple relationships without making a statement about cause and effect.
We describe correlations with a unit-free measure called the correlation coefficient, which ranges from -1 to +1 and is denoted by r (values closer to zero mean weaker correlation; negative values mean negative correlation).
Statistical significance is indicated with a p-value. Therefore, correlations are typically written with two key numbers: r = and p = .
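A minimal sketch (assuming SciPy, with simulated data) showing how both numbers are obtained at once:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)   # roughly linearly related to x

# Pearson correlation coefficient and its p-value.
r, p = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p:.3g}")   # the two numbers a correlation is reported with
```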
In statistics, what is simple linear regression?
Simple linear regression is used to model the relationship between two continuous variables.
Often, the objective is to predict the value of an output variable (or response) based on the value of an input (or predictor) variable.
What is multiple linear regression?
Multiple linear regression is used to model the relationship between a continuous response variable and continuous or categorical explanatory variables.
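A minimal sketch of fitting such a model by ordinary least squares (assuming NumPy and simulated continuous predictors; a categorical predictor would first need to be encoded, e.g. as dummy variables):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coefs)   # ~[3.0, 2.0, -1.5]: the intercept and the two slopes
```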
Why should you do residual analysis?
To verify that the assumptions underlying the regression model are valid.
Check that residuals:
1. Have constant variance (no non-random pattern)
2. Are approximately normally distributed
3. Are independent from one another
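A minimal residual-check sketch (assuming NumPy and Matplotlib, with simulated data): the left panel should show a patternless cloud with roughly constant spread, and the right panel should look roughly normal.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 * x + 2.0 + rng.normal(scale=1.0, size=200)

# Fit a simple linear model and compute residuals = observed - fitted.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.scatter(fitted, residuals, s=10)   # should look like a patternless cloud
ax1.axhline(0, color="red")
ax1.set(xlabel="fitted values", ylabel="residuals")
ax2.hist(residuals, bins=20)           # should look roughly normal
ax2.set(xlabel="residuals")
plt.show()
```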
What are externally studentized residuals?
A studentized residual is calculated by dividing the residual by an estimate of its standard deviation.
The standard deviation for each residual is computed with the observation excluded.
For this reason, studentized residuals are sometimes referred to as externally studentized residuals.
What is Multicollinearity?
When two or more predictors are highly correlated with one another.
In regression, multicollinearity can make it difficult to determine the effect of each predictor on the response, and can make it challenging to determine which variables to include in the model.
Multicollinearity can also cause other problems:
- Coefficients might be poorly estimated, or inflated.
- Coefficients might have signs that don’t make sense.
- Standard errors for these coefficients might be inflated.
To resolve, try:
- removing a redundant term from the model
- principal component analysis (PCA)
- partial least squares (PLS)
- tree-based methods
- penalized regression
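One quick way to spot the problem (a sketch assuming NumPy, with simulated predictors) is to inspect the correlation matrix of the predictors; variance inflation factors (VIF) are a more formal diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1: multicollinearity
x3 = rng.normal(size=500)

# Pairwise correlations between predictors; values near +/-1 flag trouble.
print(np.round(np.corrcoef([x1, x2, x3]), 2))
```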
What is One Way ANOVA?
One-way analysis of variance (ANOVA) is a statistical method for testing for differences in the means of three or more groups.
One-way ANOVA can only be used when investigating a single factor and a single dependent variable.
When comparing the means of three or more groups, it can tell us if at least one pair of means is significantly different, but it can’t tell us which pair.
Also, it requires that the dependent variable be normally distributed in each of the groups and that the variability within groups is similar across groups.
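A minimal sketch with three simulated groups (assuming SciPy is available); a post-hoc test such as Tukey’s HSD would then be needed to identify which pair of means differs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(10.5, 2.0, size=30)
group_c = rng.normal(13.0, 2.0, size=30)

# Tests H0: all group means are equal, against "at least one differs".
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value says at least one pair of means differs
```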
What is the Sum of Squares in statistics?
The sum of squares gives us a way to quantify variability in a data set by focusing on the difference between each data point and the mean of all data points in that data set.
A higher sum of squares indicates higher variability while a lower result indicates low variability from the mean.
To calculate the sum of squares, subtract the mean from the data points, square the differences, and add them together.
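The same calculation on a small made-up data set (a sketch assuming NumPy):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, square the differences, add them up.
sum_of_squares = ((data - data.mean()) ** 2).sum()
print(sum_of_squares)   # 32.0
```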
What is Degrees of Freedom in statistics?
- Number of pieces of information that can freely vary
- Without violating restrictions
- Independent pieces of information available to estimate other pieces of information (variable features available)
What is the Chi-squared test?
A hypothesis testing method involving checking if observed frequencies in one or more categories match expected frequencies.
If you have a single categorical variable, you use a Chi-square goodness of fit test.
If you have two categorical variables, you use a Chi-square test of independence. There are other Chi-square tests, but these two are the most common.
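Minimal sketches of both tests (assuming SciPy is available; the counts are made up):

```python
import numpy as np
from scipy import stats

# Goodness of fit: do observed counts in one categorical variable match the
# expected counts (here, a fair six-sided die)?
observed = np.array([18, 22, 16, 25, 20, 19])
chi2, p = stats.chisquare(observed, f_exp=np.full(6, observed.sum() / 6))
print(chi2, p)

# Test of independence: are two categorical variables related?
table = np.array([[30, 10],
                  [20, 40]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)
```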
What are Hidden Markov Models?
Statistical models used to describe and analyze sequential data, particularly data that exhibits temporal dependence.
Particularly useful for modeling sequential data where the underlying states are not directly observable but have an impact on the observed data.
HMM assumes that the system transitions from one state to another according to a probabilistic process. State transitions are governed by transition probabilities, which determine the likelihood of moving from one state to another.
The underlying assumption of an HMM is the Markov property, which states that the probability of being in a particular state depends only on the immediately preceding state. In other words, the current state is assumed to be conditionally independent of all previous states given the most recent state.
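A minimal sketch of the generative process behind an HMM (assuming NumPy; the transition matrix and emission means are invented): hidden states evolve according to the Markov property, and only noisy emissions are observed. Recovering the hidden states from the observations (e.g. with the Viterbi algorithm) is what HMM libraries then do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Transition probabilities: row i gives P(next state | current state i).
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission_means = np.array([0.0, 5.0])  # each hidden state emits around a different mean

state = 0
states, observations = [], []
for _ in range(100):
    states.append(state)
    # We never see the state itself, only a noisy emission from it.
    observations.append(rng.normal(emission_means[state], 1.0))
    # Markov property: the next state depends only on the current state.
    state = rng.choice(2, p=transition[state])

print(states[:10])
print(np.round(observations[:10], 1))
```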
What is a Monte Carlo Simulation?
A computational technique used to model and analyze systems or processes that involve uncertainty.
Relies on generating random samples or scenarios to estimate the behavior or outcomes of complex systems.
- Assign random values, or values based on probability distributions to uncertain parts of the model
- Run model on all combinations of values to show the range of potential outcomes (these are plotted on a probability distribution curve)
- The frequencies of the different outcomes often form an approximately normal distribution:
- the mean is the most likely outcome, with an equal chance of falling on either side
- there is a 68% chance the true outcome falls within 1 standard deviation of the mean
- and a 95% chance it falls within 2 standard deviations of the mean
Limitations:
- Results are only as good as the input data and assumed distributions (random values, features, etc.)
- Computationally expensive
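A minimal sketch of the procedure described above (assuming NumPy; the cost figures and distributions are invented for illustration): uncertain inputs are drawn from probability distributions, the model is run for every draw, and the spread of outcomes is summarized.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100_000

# Uncertain inputs, each described by a probability distribution (made-up figures):
labour    = rng.normal(50_000, 5_000, n_runs)              # roughly 50k +/- 5k
materials = rng.triangular(20_000, 25_000, 40_000, n_runs)  # min, most likely, max
overrun   = rng.exponential(3_000, n_runs)                  # occasional large overruns

total_cost = labour + materials + overrun   # run the model for every scenario

print(total_cost.mean())                        # centre of the simulated outcomes
print(np.percentile(total_cost, [2.5, 97.5]))   # range covering ~95% of outcomes
```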
What is Homoscedasticity?
Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
What are Eigenvectors and Eigenvalues?
In linear algebra, an eigenvector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it.
The corresponding eigenvalue, often denoted by lambda, is the factor by which the eigenvector is scaled.
e.g. when you stretch or shear an image, the eigenvectors are the axes along which all other points in the image slide, while not changing direction themselves (they are only rescaled).
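A quick numerical check (assuming NumPy): applying the matrix to one of its eigenvectors only rescales it by the corresponding eigenvalue.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `vectors` are the eigenvectors; `values` are the matching eigenvalues.
values, vectors = np.linalg.eig(A)

v = vectors[:, 0]
# Applying A only rescales an eigenvector: A @ v equals lambda * v.
print(A @ v)
print(values[0] * v)
```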