Basics Flashcards

1
Q

Summation / Sigma Notation

A

This is the sigma symbol: ∑
It tells us that we are summing something.

n is the summation index; when evaluating the expression we substitute different values for the index.

2
Q

Frequency tables - lists - dot plots

A

Used to represent a single variable.

A list is just a list of variable values

A Frequency Table is a table showing each value and how often it occurs

A dot plot is a visual frequency table, with the variable value on the x and the frequency on the y.

These are all ways of representing the same info.
Once the data is organized we can start to analyze it with summary stats etc.

3
Q

Histogram

A

Used to represent a single variable

Like a bar chart, but both the x and y axes are numerical.
x-axis = intervals
y-axis = absolute frequency of each interval.

The bars will be touching to show that one interval begins where the other ends.

Instead of just plotting out the frequency of each discrete value, like a frequency table or dot plot, a histogram arranges the data in categories and then shows how many values fall within each category. The categories are often called buckets or bins.

Bins should not overlap
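
A minimal numpy sketch (the data values are invented): np.histogram builds the bins and absolute frequencies described above.

import numpy as np

data = np.array([1, 2, 2, 3, 5, 6, 6, 7, 9, 10])  # hypothetical values
counts, bin_edges = np.histogram(data, bins=3)     # 3 equal-width bins

print(bin_edges)  # [1, 4, 7, 10]: one bin begins where the other ends
print(counts)     # absolute frequency within each bin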

4
Q

Descriptive statistics

A

Ways of describing data without just providing the raw data. It’s about describing the data with a smaller set of numbers.

This would include things like summary statistics.

5
Q

Inferential Statistics

A

Ways of gaining insight from the data set and figuring out what the data means. How can we use the data to understand what the population value might be?

The key to inferential statistics is understanding that samples do not always accurately reflect the population they came from.

A large part of inferential statistics is quantifying our uncertainty about a population by looking at a smaller sample.

6
Q

Average/ Central Tendency

A

Average = Typical or middle value of a data set. The “central tendency” of the data

Common types:
Mean
Median
Mode

The ‘best’ measure of central tendency will depend on which measure best represents the actual data and how it is skewed (or not).

All measures should be used in combination to understand the data set.

7
Q

Median

A

The middle number of the data set when the set is placed in numerical order. If there is an even number of values, you take the mean of the two middle numbers.

The median is useful if there are outliers that will skew the mean and make it misleading.

8
Q

Mode

A

The most common number in the data set. If there is no most common number then there is no mode.

Typically the least used measure of central tendency

The mode is useful if there are outliers that skew the mean and if there is a single number that shows up a lot.

9
Q

Location of central tendency and skewness

A

In symmetrical distributions the mean, median and mode are identical or very close.

In left-skewed distributions the mean is typically to the left of the median, which is to the left of the mode.

In right-skewed distributions the mean is typically to the right of the median, which is to the right of the mode.

10
Q

Left and right skew

A

A left skew means the tail/ outliers are to the left
A right skew means the tail/ outliers are on the right.

11
Q

Interquartile Range (IQR)

A

The IQR is a measure of how spread out the data is.

It is the distance between the first and third quartiles (the 25th to 75th percentile).

The IQR is a measurement of the variability about the median.

IQR tells us the range of the middle half of the data.

To find the IQR:
1. Find the median of the data set
2. Find the median of each set of numbers on either side of the median number. The IQR is the difference between these two numbers.
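
As a sketch in numpy (data set invented): note that np.percentile interpolates, so its quartiles can differ slightly from the median-of-each-half method described above.

import numpy as np

data = np.array([1, 3, 4, 5, 5, 6, 7, 11])   # hypothetical data set
q1, q3 = np.percentile(data, [25, 75])       # first and third quartiles
iqr = q3 - q1                                # range of the middle half
print(q1, q3, iqr)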

12
Q

Outliers

A

The definition of what is reasonably an outlier is subject to some interpretation based on the specific qualities of the data set.

Common definition:
An outlier is any number that is more than 1.5x the interquartile range below Q1 or above Q3
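
A small numpy sketch of the 1.5x rule (data invented):

import numpy as np

data = np.array([1, 3, 4, 5, 5, 6, 7, 30])   # hypothetical; 30 looks suspect
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(data[(data < lower_fence) | (data > upper_fence)])  # [30]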

13
Q

Sample Mean

A

Calculated the same way as the population mean

14
Q

Measures of Variability - Univariate

A

Variance
Standard Deviation
Coefficient of Variation

15
Q

Sample Variance

A

Variance is a measure of the spread or dispersion of a set of data points around their mean. It quantifies how much the individual data points deviate from the average.

Sample variance is generally a pretty good statistic in terms of approximating the true variance of the population.

A better approximation of the population parameter can usually be gained by dividing by n-1.

This approximation is AKA ‘The unbiased sample variance’.

Dividing by just n will tend to underestimate the population variance.

Dividing by n is fine if you just want the variance/SD of the sample itself.

Denoted s² (s squared):

s² = ∑(x - x̄)² / (n - 1)

Using n-1 instead of n is AKA Bessel's correction
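
A quick numpy illustration of Bessel's correction (sample values invented): the ddof argument switches the divisor from n to n-1.

import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical

biased = sample.var()          # divides by n
unbiased = sample.var(ddof=1)  # divides by n - 1 (Bessel's correction)
print(biased, unbiased)        # the unbiased value is slightly larger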

16
Q

Standard Deviation

A

Measure of the dispersion or spread of a set of data points around their mean. It is closely related to variance but is expressed in the same units as the original data, making it easier to interpret and compare.

The standard deviation is the square root of:
The population variance
OR
The unbiased sample variance (s², computed with the n-1 divisor)

The square root of the sample variance (AKA the sample standard deviation) will not be an unbiased approximation of the population standard deviation.
This is because the square root function is non-linear.

SD is written as:
s
std(x), the SD of a random variable x
lowercase sigma (σ)

17
Q

Variance - Interpretation

A

Variance is always non-negative

Interpreting variance involves considering the magnitude of the variance value and its relationship to the data set.

Consider:
Magnitude - Higher variance means more dispersion from the mean

Units - Variance is expressed in squared units of the original data. Restore the original units by converting to standard deviation

Comparison - If one data set has a significantly higher variance than another, it implies that the observations in the first data set are more widely scattered.

Outliers - Variance is sensitive to outliers; they can inflate the variance, making it a less reliable measure of dispersion

Limitations:
Variance does not tell us the direction of variations from the mean.

It treats positive and negative differences equally.

Not robust for non-normal, heavily skewed data sets.

18
Q

Coefficient of Variation(CV)

A

AKA relative standard deviation.

Calculated as the standard deviation divided by the mean. It’s just the standard deviation relative to the mean.

There are separate formulas for population and sample data for this measurement as well.

CV is used to compare the variation of two different data sets.

It will return a number that is not in units and is directly comparable across data sets.
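
A short sketch (both arrays invented): because the CV is unitless, the two data sets can be compared directly even though their scales and units differ.

import numpy as np

prices = np.array([10.0, 12.0, 11.0, 14.0, 13.0])   # hypothetical, in dollars
weights = np.array([1.1, 0.9, 1.0, 1.2])            # hypothetical, in kg

cv_prices = prices.std(ddof=1) / prices.mean()
cv_weights = weights.std(ddof=1) / weights.mean()
print(cv_prices, cv_weights)   # unitless, directly comparable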

19
Q

Standard Deviation - Interpretation

A

Better for comparisons between datasets because the units are normalized.

Quantifies the typical amount of variation or “typical distance” of data points from the average.

Consider:

Magnitude - Higher SD means data points are spread farther from the mean.

Units - SD is expressed in the same units as the original data. This makes interpretation and comparison easier.

Range - SD provides a useful range around the mean (68-95-99.7). It helps us visualize where the data is falling using a single number.

Comparison - Comparing the standard deviations of different data sets allows you to assess their relative spread.

Outliers - SD is sensitive to outliers. Outliers, which are extreme values, can have a significant impact on the standard deviation.

Limitations - SD assumes a normal or symmetrical distribution. If the data has a heavy skew, other measures might be more appropriate

20
Q

Mean vs. Median as Central Tendency

A

The measures work in pairs:

More symmetrical Data:
Mean = central tendency
Standard Deviation = Spread

More Skewed Data:
Median = central tendency
IQR = Spread

Outlier values will move the mean quite a lot but they don’t affect the median, which depends on the position of the middle values, not on the magnitudes of the extreme data points.

21
Q

Z-scores

A

One of the most common measures in statistics.

A Z-score tells you how many standard deviations away from the population mean a given data point is.

This helps you tell how usual or unusual a data point is.

This can be useful for comparing data points from different distributions. The scales are different but the relative position to the mean can still be compared.

To calculate the Z-score for a data point x:
z = (x - µ) / σ
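
A one-liner in Python (µ and σ here are assumed, known population values):

mu, sigma = 100.0, 15.0   # assumed population mean and standard deviation
x = 130.0                 # a data point of interest

z = (x - mu) / sigma
print(z)                  # 2.0: two standard deviations above the mean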

22
Q

Z-scores - Interpretation

A

The Z-score of a data point tells you how many standard deviations from the mean the point is.

A negative Z-score indicates that the data point is below the mean, a positive Z-score indicates that the data point is above the mean

A z-score of 0 means the data point is equal to the mean.
A z-score of ±1 means the data point is one standard deviation above (+1) or below (-1) the mean.
A z-score of 2 indicates it is two standard deviations away, and so on.

Typically, data points with z-scores greater than 3 or less than -3 are considered extreme outliers

You can use a table to find the percentile of a data point given its z-score

23
Q

Empirical Rule and Normal Distributions

A

The Empirical Rule is AKA the 68-95-99.7 Rule: in a normal distribution, roughly 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.

24
Q

Marginal Distribution

A

The distribution formed by the totals of a single variable in a two way table. This data can be represented as numbers or as percentages.

Look at the margins of the table.

25
Q

Conditional Distribution

A

In a 2 way table

The distribution of one variable given that some condition with the other variable is met.

Conditional distributions are generally represented as percentages.
As in “What percent of men prefer basketball?”

26
Q

Scatter Plots

A

Scatter plots are used to plot bivariate relationships.

Being able to fit a line to the data is a good way to determine the strength of the relationship between the two variables.

The closer the line matches the data, the stronger the relationship.

The relationship can be linear or non-linear.
In a linear relationship the variables change at a roughly constant rate.
If it’s non-linear, the rate of change varies in different parts of the distribution.

A line with a negative slope indicates a negative relationship between the two variables.

A positive slope indicates a positive relationship.

27
Q

Correlation Coefficient

A

Used when there are 2 or more variables

Denoted by the variable r
AKA the Pearson correlation coefficient

Range from -1 to 1

r = 1 is a perfect positive correlation
r = 0 indicates no correlation
r = -1 is a perfect negative correlation

The value that is considered a meaningful correlation varies based on the field.
Social sciences = |0.3|+
Hard Sciences = |0.7|+

r values don’t quantify statistical significance - only correlation

Calculated as the covariance divided by the product of the standard deviations of the two variables
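
Both routes to r in numpy (data invented): the covariance formula from this card, and the built-in np.corrcoef.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical

# covariance divided by the product of the standard deviations
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)   # identical up to floating point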

28
Q

Linear Regression & Least Squares

A

Linear regression - The process of finding the line that best fits a set of data

The most common method is to try to fit a line that minimizes the sum of the squared distances to each point in the data set. This is a “least-squares” regression.

The equation for a linear regression is written as ŷ = mx + b. The hat over the y indicates that the y value is an estimated value. It can’t be an exact value because the data points will not all sit directly on the line.

29
Q

Residuals

A

A residual is the difference between the actual value of a data point and the estimated value provided by the linear regression.

For a given x value, the residual is the actual value (y) minus the estimated or predicted value (ŷ)

A negative residual means the actual value is below the estimated value.

A positive residual means the actual value is above the estimated value.

The process of finding a line of best fit is about minimizing the sum of the squared residuals
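
A minimal least-squares sketch with numpy (points invented): np.polyfit finds m and b, and the residuals are just y - ŷ.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical
y = np.array([2.2, 3.8, 6.1, 8.3, 9.6])   # hypothetical

m, b = np.polyfit(x, y, 1)   # degree-1 (linear) least-squares fit
y_hat = m * x + b            # predicted values
print(y - y_hat)             # residuals: positive = point above the line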

30
Q

Residual Plots

A

A plot of the residuals in a data set. The x values stay the same as the data set but the y values become the residuals of the data set values.

Residual plots are used to gauge whether a line is a good fit for a data set or not.

A good fit will be indicated by the residual points being clustered above and below a y value of 0.

You don’t want to see trends in the residual data. If there is a trend, you might need a better linear regression line or you might need a non-linear regression.

31
Q

Experiment

A

Involves dependent and independent variables with control and experimental/treatment groups.

You look for statistically significant differences between the treatment and control groups.

The independent variable (x) is AKA the explanatory variable.

The dependent variable (y) is AKA the response variable.

32
Q

Observational Study

A

Involves collecting data and looking for existing patterns and correlations.

Observational studies can identify correlation but not causation between variables.

There are different types of observational studies

Data can be backward looking, forward looking, or based on information gathered right now.

33
Q

Retrospective study

A

Samples past data to gain insights

34
Q

Prospective study

A

Pick a sample and track the data from that sample over time. You can analyze the data at the end of some time period or as it is collected.

35
Q

Sample Survey

A

Involves taking a sample of data from a given population and gathering information on the state of things right now.

Voter preferences is a good example of this.

36
Q

Longitudinal study

A

Can be prospective or retrospective

Involves collecting data from the same group of individuals or subjects over an extended period.

The primary goal of a longitudinal study is to observe and analyze changes and trends that occur over time within the same individuals or groups.

Researchers typically make repeated measurements or observations at multiple time points. This allows them to examine the long-term effects, developmental patterns, or causal relationships between variables

37
Q

Cross-sectional study

A

Data is collected from different individuals or groups at a specific point in time.

Unlike longitudinal studies, cross-sectional studies focus on a single time point and aim to gather information about the characteristics, behaviors, or opinions of different individuals or groups at that particular moment.

38
Q

Survey Bias Types

A

Response Bias
Under coverage
Voluntary response sampling
Convenience Sampling
Non-response

39
Q

Response Bias

A

Question phrasing or the question itself makes it unlikely that people will answer truthfully.

Ex. Have you lied to your parents in the past week? Have you ever cheated on your spouse?

40
Q

Under coverage

A

Responses don’t take into account a key constituency. Calling 100 random people in the phone book when cell phones are not included in the phone book. There might be something different about people who only own cell phones or who have chosen to be unlisted in the phone book.

Under coverage will typically underestimate the % of the population with a given response.

41
Q

Voluntary response sampling

A

Non-random sampling caused by respondents self selecting to complete a survey.

Voluntary response bias will typically overestimate the % of the population with a given response.

42
Q

Convenience Sampling

A

Using a non-random sample because it’s available to you. Typically will overestimate the percent of the population with a given response.

43
Q

Non-response

A

Lack of data can be a source of bias if the non-response is big enough

44
Q

Simple random sampling

A

Throw the population in a bowl and have a blindfolded person pick a sample of the total population

Use a random number generator to pick members of the population. Put the population data set in alphabetical order and assign each entry a number. Then randomly generate numbers for your sample size and match them with the population data points.

Use a random digit table to pick out random numbers.

You can’t just think up numbers; humans are not capable of being truly random.

Simple random samples can inadvertently introduce bias by randomly selecting a non-representative sample.
You can avoid this with Stratified Sampling and Clustered Sampling

45
Q

Stratified sample

A

Type of random sampling

Take the entire population, break it into strata (different groups), and randomly sample each stratum.

In a high school this might mean breaking the student population into freshmen, sophomores, juniors and seniors and then taking a random sample of 25% of your total desired sample size from each group.

46
Q

Clustered Sample

A

Type of random sampling

Divide the population into groups that are broadly representative and then randomly sample the groups.

An example of this would be randomly sampling classrooms that have a generally representative mix of men and women.
You randomly pick the classroom and survey everyone in it.

47
Q

Voluntary sampling

A

Type of non-random sample

Bias is introduced because people are self selecting to fill out the survey

48
Q

Convenience Sampling

A

Type of non-random sample

Bias is introduced when the most convenient sample does not happen to be representative. The first 100 people in the door are convenient, but may not represent the population.

49
Q

Systematic Random Sampling

A

Can be used when simple random sampling isn’t logistically feasible.
Consists of randomly sampling a sub-set of the population.

Given a desired sample size of 100:
You pick the first subject at random and then sample every 100th person after that initial, randomly picked person.

Systematic random sampling is not foolproof. There can still be bias if you’re not careful. You need to be sure that the interval doesn’t line up with some pattern in the population, so that the person chosen by the interval is effectively random.

50
Q

Experiment Design

A

An experiment has an explanatory variable and a response variable.
The explanatory variable (x) causes the change in the response variable (y).

Experiments use randomly selected samples to infer the characteristics of the population as a whole
The random sample will then be split into the control/ treatment group in some way.

51
Q

Control group

A

Portion of the sample that will not be manipulated.

52
Q

Treatment group

A

Portion of the sample that will be manipulated, will undergo treatment etc.

Allocation to the control and treatment groups can be random but there are other methods as well depending on the experiment.

53
Q

Block design

A

Experiment Design Type

Used to ensure that the control and treatment groups have the correct proportions of a specific characteristic.

The specific characteristic is selected randomly but the number of random selections is controlled to make sure the proportions are correct in both treatment and control groups.

54
Q

Matched Pairs Design

A

Experiment Design Type

Participants having the same characteristics get grouped into pairs, then within each pair, 1 participant gets randomly assigned to either the treatment or the control group and the other is automatically assigned to the other group.

You assign people to the treatment/ control groups and run the experiment once.

Then you switch the groups and run the experiment again.

Doing this helps mitigate unplanned-for bias resulting from how the original group assignments were made.

Once the groups are assigned, you measure the baseline in each group of the variable you’re interested in before you start the experiment.

Once the baseline is established, you give the treatment to the treatment group and a placebo to the control group.

55
Q

Blind experiments

A

The subjects don’t know if they are in the treatment or control group.

56
Q

Double Blind experiments

A

Subjects and experimenters are both ignorant of which group a particular participant is in.

57
Q

Triple Blind experiments

A

Subjects and experimenters and post experiment analysts are all ignorant of which group a particular participant is in.

58
Q

Post Experiment steps

A

After the experiment you remeasure the baseline and compare it to the pre-experiment baseline.

If there is a change in baseline, you then determine the strength of the change and the correlation between the explanatory variable and the response variable. If there is a strong relationship you can start to talk about causality.

Replication of the experiment is vital to reinforcing the idea that there is a relationship between the variables. An individual experiment is seldom perfect.

59
Q

Statistical Significance

A

There is always a chance that the result of an experiment occurred by accident, and not as a result of the experiment itself. You always need to ask if the result you’re seeing is occurring just by chance.

One way to account for this is to re-randomize the results repeatedly and look for how many times the result of the experiment appears in the re-randomized data. If it appears often, that indicates that it could happen at random. If it’s very rare, it indicates that the chances of the experiment results happening by chance are low.

In most experiments, it’s considered significant if the odds of the result occurring by chance are less than 5%.

If a result is statistically significant, it indicates that there may be a causal relationship

https://youtu.be/jLFeqQxGtOc

60
Q

Random Samples and Random Assignments

A
61
Q

Re-randomization

A

Given a treatment group and a control group
and a group of 100 subjects each with a value for the dependent variable.

You would take the 100 subjects and randomly assign them to a new group. You do this many, many times and record the result of the random assignments in terms of the dependent variable.

Compare the plotted randomized results to your actual experimental result and see how often your experimental result occurs in the distribution of the randomized chance results

If the actual result does not occur often by chance - it may be statistically significant.

62
Q

Addition Rule for Probability

A

Event A is rolling an odd number on a six-sided die
Event B is rolling a number greater than two.

P(A or B) = P(A) + P(B) - P(A and B)

You can’t simply add the probabilities because you’ll be double counting objects that fall into both sets A and B.

We subtract the intersection of events A and B because it is included twice in the addition of P(A) and P(B).

If there is no overlap in the sets then P(A and B) is 0 because the events are mutually exclusive. So in this case

P(A or B) = P(A) + P(B)
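
A brute-force check of the die example in plain Python (counting outcomes avoids floating-point surprises):

# Event A: odd roll; Event B: roll greater than two
outcomes = range(1, 7)
A = {o for o in outcomes if o % 2 == 1}   # {1, 3, 5}
B = {o for o in outcomes if o > 2}        # {3, 4, 5, 6}

# addition rule in counts: |A or B| = |A| + |B| - |A and B|
assert len(A | B) == len(A) + len(B) - len(A & B)   # 5 = 3 + 4 - 2
print(len(A | B) / 6)                               # P(A or B) = 5/6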

63
Q

Multiplication Rule - Probability of independent events

A

Independent events do not affect one another and do not increase or decrease the probability of another event happening.

We say two events are independent if knowing one event occurred doesn’t change the probability of the other event.

The probability of two independent events occurring is P(Event #1) * P(Event #2)

Two events, A and B, are independent if:
P(A|B) = P(A)
and
P(B|A) = P(B)

The general formula for two events to occur simultaneously is:
P(A and B) = P(A) * P(B|A)
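
A brute-force check of independence with two fair dice (plain Python; the events are invented for illustration):

# A = first die is odd, B = second die is greater than two
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

p_a = sum(d1 % 2 == 1 for d1, d2 in outcomes) / 36
p_b = sum(d2 > 2 for d1, d2 in outcomes) / 36
p_ab = sum(d1 % 2 == 1 and d2 > 2 for d1, d2 in outcomes) / 36

print(p_ab, p_a * p_b)   # equal: P(A and B) = P(A) * P(B) for independent events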

64
Q

Experimental vs Theoretical Probability

A

Experimental probability should always be viewed with skepticism.

More experiments means more data and a closer approximation to theoretical probability, but experiments are never perfect.

65
Q

Probability of dependent events

A

Dependent events influence the probability of other events – or their probability of occurring is affected by other events.

The probability of two dependent events occurring is:
P(B|A) = P(A and B) / P(A)
OR
P(A and B) = P(B|A) * P(A)

To determine if two events are dependent, you need to ask if one event occurring makes the other event more or less likely to occur.

In practice, we often assume that events are independent and test that assumption on sample data. If the probabilities are significantly different, then we conclude the events are not independent.

66
Q

Conditional Probability - Bayes Theorem

A

Conditional probability refers to the probability of an event A occurring given that another event B has already occurred.

Calculated by:
P(A|B) = P(B|A) * P(A) / P(B)
AKA
P(A|B) = P(A and B) / P(B)

The conditional probability takes into account the information provided by event B and adjusts the probability of event A accordingly.

All conditional probability problems can be solved by growing trees

If events are independent then
P(A|B) = P(A)
P(B|A) = P(B)
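
A small numeric sketch of Bayes’ theorem (all probabilities invented for illustration):

p_a = 0.01            # P(A): prior probability of a condition
p_b_given_a = 0.95    # P(B|A): probability of a positive test given the condition
p_b_given_not_a = 0.05

# total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes: P(A|B) = P(B|A) * P(A) / P(B)
print(p_a_given_b)                      # about 0.16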

67
Q

Permutations

A

In permutations order matters.
The permutation 1234 is not the same as the permutation 4321
The number of possible permutations P when trying to put n objects in r slots is n! / (n-r)! This assumes that none of the values repeat.
This would be written nPr

You may have to reason through the situation; the formula is only a guide. It won’t work in all cases.

0! is defined as 1
This is so that nPr makes sense if r = n

68
Q

Combinations

A

In combinations, order doesn’t matter.
The combination 1234 is the same as 4321.

For the same set of items, there are far more permutations than combinations

nCr is the number of combinations for n items in r slots
nCr = nPr / r!

This is AKA ‘n choose k’ or the “binomial coefficient”.
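
Python’s standard library covers both counts (math.perm and math.comb exist in Python 3.8+):

import math

n, r = 5, 3
print(math.perm(n, r))   # nPr = n! / (n - r)! = 60
print(math.comb(n, r))   # nCr = nPr / r!      = 10

assert math.comb(n, r) == math.perm(n, r) // math.factorial(r)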

69
Q

Probability and Combinations

A

Combinations are important for probability because the probability of an event occurring is the number of possible occurrences divided by the number of total outcomes.

The number of occurrences is calculated by finding how many combinations of the possible outcomes will result in the event you want the probability for.

So given an event E and the total number of outcomes Z, the number of ways the event can occur would be ZCE. This assumes that all the outcomes are equally likely, as with a coin flip.

70
Q

Why we need statistics

A

Populations are hard to define and hard to determine in real life. Very difficult to work with.

Statistics exist because we almost never have population data. Even when we have it, it can be too much info to work with effectively.

Samples are much easier to work with, and most data you will work with will be sample data

71
Q

Population

A

Collection of all items of interest.

Denoted as N.

Numbers obtained when using a population are parameters

72
Q

Sample

A

A subset of the population.

Denoted as n.

Numbers obtained when using a sample are statistics

Samples must be both random and representative for an insight to be precise.

Random means that each member of the sample is chosen strictly by chance

Representative means that the sample accurately reflects the members of the entire population.

73
Q

Data types and classifications

A

Data can be classified in two main ways, the data type and the data measurement level.

Types = Categorical or Numerical

Numerical data can be discrete or continuous
*Time on a clock is discrete, but time in general is continuous

Measurement levels:
Qualitative = Nominal or Ordinal
Quantitative = Interval or Ratio

74
Q

Qualitative Data

A

Can be nominal or ordinal

Nominal = Categories (seasons, car companies) these are not numbers and can’t be ordered

Ordinal = groups and categories that follow a strict order. Ratings from negative to positive for example.

75
Q

Quantitative

A

Can be interval or ratio.
Both represent numbers.
The same numbers can be either interval or ratio depending on the context.

Interval : don’t have a true 0 and can’t be a ratio. Not as common.

Interval Ex. Temperature - Fahrenheit and Celsius both measure temperature. The measurements differ based on the scale: day one might be 5°C (41°F) and day two might be 10°C (50°F). Day two is twice as warm in Celsius but not in Fahrenheit. 0°C and 0°F are not true zeros; Kelvin does have a true zero.

Ratio: have a true 0. Most things in the real world are ratios.

Ratio Ex. I have 2 apples and you have 6. You have 3 times as many apples because the ratio of 6/2 is 3. Other examples would be counts of objects in general, distance and time.

76
Q

Visualization - Categorical Variables

A

Typical visualizations = frequency distribution tables, bar charts, pie charts and Pareto diagrams

Frequency distribution tables - useful but not very visual. A good starting point for other visualizations and work

Bar charts - built from frequency tables and much more visually intuitive

Pie charts - built using the relative frequency (how much of the total each category represents). Very visual and intuitive. They are especially useful in showing the relationship between the variables and the share each variable has of the total. Market share is almost always represented by pie charts.

Pareto diagram - just a special bar chart with categories shown in descending order of frequency. There is also a curve on the same graph showing the cumulative frequency above each category as you move from high to low. This combines the strengths of the pie chart and the bar chart. It shows how subtotals change with each category.

77
Q

Visualization - Numerical Data

A

Typical visualizations = frequency distribution tables, histograms

Frequency distribution tables - useful but not very visual. A good starting point for other visualizations and work. When making a frequency distribution table for numerical data, it makes sense to group the data into intervals and find the frequency of each interval rather than each individual number. This creates a summary of the data that allows for a meaningful visualization.

For many analyses it’s useful to calculate the relative frequency of each interval. Relative frequency = frequency/ total frequency.

Once numerical data has been entered into a frequency distribution table and divided into intervals with relative frequency it can be plotted.

The most common plotting method/ visualization for numerical data is a histogram.

78
Q

Frequency Tables and choosing intervals

A

Generally statisticians prefer working with between 5 and 20 intervals but this depends on the amount of data you’re working with.

Intervals are generally of equal width.

The formula for finding the size/ width of intervals for a data set is ((largest number - smallest number) / number of desired intervals).

You will generally round up or down to get a clean interval break. It’s the width of the interval that’s most important, so an interval from 1-21 seems odd but it has a width of 20.

A number is included in a particular interval if that number is:
1. Greater than the lower bound
2. Less than or equal to the upper bound
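
A worked example of the width formula above (numbers invented):

data_min, data_max = 1, 96   # hypothetical smallest and largest values
n_intervals = 5              # desired number of intervals

width = (data_max - data_min) / n_intervals
print(width)                 # 19.0; round up to 20 for a clean interval break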

79
Q

Visualization - Categorical Data - 2 variables

A

Cross tables (AKA contingency tables) - similar to a frequency distribution table but there is a second variable along the top row. This forms a basic row and column structure. The data is placed at the intersection of the two variables. Best practice is to calculate the subtotals of each row and column as these can be useful for further analysis later on.

The data in a cross table can also be visualized in a side by side bar chart.

80
Q

Visualization - Numerical Data - 2 variables

A

Scatterplots

Used to represent two numerical variables. Ex. reading and writing scores on the SAT. The x-axis shows one variable and the y-axis shows the other. Each data point is plotted as a dot at the appropriate intersection on the graph.

Scatter plots are useful for working with lots of observations. The point is not the individual data points but the general idea, the pattern of the distribution. Look for the general direction, outliers and clusters of data in a particular area.

81
Q

Side by Side Bar Chart

A

This is a variation on the regular bar chart.

It breaks up the variables so that you can see the relationship between one value for one variable and all the values of the other variable. There will be multiple bars with different colors for each.

82
Q

Variability in population and sample data

A

Typically, different formulas are used for population data vs. sample data.

This is because each point is known in population data.
A statistic is an approximation.

The sample formulas have been changed to include a slightly higher level of uncertainty.

83
Q

Measures of variability - Multivariate

A

Used when there are 2 or more variables.

Covariance
Correlation Coefficient

84
Q

Covariance

A

Used when there are 2 or more variables
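
A minimal numpy sketch (data invented): np.cov returns a 2x2 covariance matrix, with the covariance of x and y in the off-diagonal entries.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical
y = np.array([2.0, 4.0, 5.0, 9.0])   # hypothetical

cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance
print(cov_xy)                        # positive: the variables move together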

85
Q

Inferential Statistics

A

Inference = An interpretation that goes beyond the literal information given.

Inferential stats rely on distributions and probability theory to predict population values based on sample data.

We’ve taken one sample of size n and made some claims about the general population…
What if we were to take another sample of size n?
Would we get the same result?
Would we make the same claims about the general population?

The fundamental question:
When we make some inference about a population based on our sample, how confident can we be that we’ve got it ‘right’ ?

86
Q

Correlation and Causation

A

When calculating the Correlation Coefficient

The correlation between x and y = the correlation between y and x. The formula is symmetrical.

Causality: It’s important to understand the direction of causal relationships because the correlation formula is symmetrical. The formula itself only establishes the strength of the relationship, not the causality. In housing, the size causes the price. The price doesn’t cause the size.
Relationship direction: Some relationships only move in one direction. Ie. House size is correlated with price but a house can later increase in price without increasing in size.

Correlation does not imply causation.

87
Q

Probability Distribution

A

A distribution is a function that shows the possible values for a variable and how often they occur.

It covers not only the values observed in the data but all possible values.

The sum of all probabilities must equal 1 or 100%.
The probability of an impossible event is 0.

Distributions can be shown as either graphs or tables. The graph is just a more visual representation of the underlying probabilities.

88
Q

Common Distribution Types

A

Normal Distribution
Binomial Distribution
Uniform Distribution

Poisson Distribution
Bernoulli distribution

89
Q

Normal Distribution

A

Most common. AKA the bell curve, AKA the Gaussian distribution

It is symmetrical and all central tendency measures are equal. It has no skew.
Notation for a normal distribution is as follows: X ~ N(µ, σ²)

N = normal
~ = “is distributed as”
µ = mean
σ² = sigma squared, aka the variance

Add the 0 point to any graph to give perspective to the distribution

90
Q

Binomial Distribution

A

Distribution created by n independent trials, each with the same probability of success. IE. repeated coin flips.

The binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 - p).

A single success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process.

For a single trial, i.e. n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.
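
A quick simulation sketch with numpy’s random generator (the parameters are illustrative):

import numpy as np

rng = np.random.default_rng(42)

# number of successes in n = 10 fair coin flips, simulated 100,000 times
flips = rng.binomial(n=10, p=0.5, size=100_000)
print(flips.mean())   # close to the theoretical mean n * p = 5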

91
Q

Uniform Distribution

A

A distribution where all outcomes have an equal probability is a discrete uniform distribution. Ex. Rolling one die.

A discrete variable will yield a graph of bars, a continuous distribution will be a line.

92
Q

Poisson Distribution

A

Used to show how many times an event is likely to occur over a specified period

A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

The Poisson distribution is defined by the rate parameter, symbolized by the Greek letter lambda, λ.

lambda (λ) is the mean number of events within a given interval of time or space.
The variance will also equal λ

Let’s say we are salespeople, and after many weeks of work, we calculate our average to be 10 sales per week. If we take this value to be our λ, the distribution will be roughly bell-shaped and centered on 10, with tails representing possible outliers
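
Simulating that example with numpy (λ = 10 sales per week, as above):

import numpy as np

rng = np.random.default_rng(0)
weekly_sales = rng.poisson(lam=10, size=100_000)   # simulated weeks

print(weekly_sales.mean(), weekly_sales.var())     # both close to lambda = 10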

93
Q

Bernoulli distribution

A

Bernoulli distribution is a discrete probability distribution. It describes the probability of achieving a “success” or “failure” from a Bernoulli trial. A Bernoulli trial is an event that has only two possible outcomes

Ex. a single trial (n = 1) where success has probability p = 0.6 and failure has probability 1 - p = 0.4

94
Q

Standardization - Calculation

A

Standardization is the process of transforming the distribution to one with a mean of 0 and a standard deviation of 1.

You’re turning all the datapoints into z-scores and then comparing those. You’re not comparing the actual variable values.

The standardized variable is called a z-score and is equal to the original variable minus its mean, divided by its standard deviation.

The key here is that the formula must be applied to each data point in the set. This will create a new data set that will have a mean of 0 and a standard deviation of 1

Remember that adding or subtracting values from all data points does not change the standard deviation
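
A sketch of the calculation (values invented), applying z = (x - x̄) / s to every point:

import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])   # hypothetical
z_scores = (data - data.mean()) / data.std(ddof=1)

print(round(z_scores.mean(), 10))   # 0.0 (up to floating point)
print(z_scores.std(ddof=1))         # 1.0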

95
Q

Standardization - Purpose/ interpretation

A

Every distribution can be standardized.

Normal distributions result in a “standard normal distribution” when standardized

It makes inference and working with the data easier.

Useful in situations where variables are measured on different scales or have different units of measurement.

Allows for meaningful comparisons between variables by putting them on a common scale.

Helps in data analysis and modeling techniques that assume the variables to be normally distributed or have equal variances.

Can be beneficial when dealing with outlier detection or in algorithms that utilize distance-based calculations, such as clustering or certain machine learning algorithms. By standardizing the variables, outliers are less likely to heavily influence the analysis or the results of these algorithms.

96
Q

Standardization Scenarios

A

Standardization itself is not a requirement for any specific operation or analysis. It is primarily used as a data preprocessing step to facilitate certain statistical analyses, modeling techniques, or algorithms. While it may not be necessary for all situations, standardization can offer several advantages in various contexts.

Here are a few scenarios where standardization can be particularly beneficial:

Comparing Variables: When you want to compare variables that have different scales or units of measurement, standardization allows for meaningful comparisons by putting them on a common scale.

Multivariate Analysis: In multivariate analysis techniques like principal component analysis (PCA) or factor analysis, standardization is often employed to ensure that variables contribute proportionally and are not dominated by differences in scale.

Distance-based Algorithms: In clustering algorithms, such as k-means clustering, or distance-based algorithms like nearest neighbors, standardization can prevent features with larger scales from dominating the calculation of distances or similarities.

Interpretability and Coefficient Comparisons: In linear regression or logistic regression models, standardizing variables can help in the interpretation of coefficients, making them comparable and allowing for a meaningful comparison of the magnitude and direction of the effects.

Outlier Detection: Standardization can assist in identifying outliers, as extreme values can be detected more easily when the data is standardized.

97
Q

Sampling distribution of the mean

A

Means of samples can vary between different samples. Taking a single value is suboptimal because you don’t know which sample mean is the most correct.

To get around this issue, we draw many samples and create a new data set comprised of sample means. This is a sampling distribution of the mean.

If the sample means are what you’re looking at, it would be a “sampling distribution of the mean”.

The sample means will be different but should gather around a certain value.

Each sample mean is an approximation of the population mean, so the value they all revolve around will be the population mean itself.

Taking the mean of a sampling distribution should give a very accurate idea of the population mean.

The standard deviation of the sampling distribution of the mean is given by σ / √n, AKA the standard error.

98
Q

Standard Error

A

The standard deviation of the distribution formed by the sample means. Standard error is computed as sigma / the square root of n.
Standard error shows the variability of the means of the samples in the sampling distribution.
Standard error decreases as sample size increases.

Standard error shows how well you approximated the true mean and is used for almost all statistical tests because of this.
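
A simulation sketch against a synthetic population: the spread of the sample means should match σ / √n.

import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=170, scale=10, size=1_000_000)   # synthetic

n = 100
sample_means = [rng.choice(population, n).mean() for _ in range(2_000)]

print(np.std(sample_means))            # empirical standard error
print(population.std() / np.sqrt(n))   # sigma / sqrt(n), about 1.0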

99
Q

Estimators

A

An approximation based solely on sample information.

A specific value is called an estimate.

There are two types of estimates:

Point estimates
Confidence interval estimates

100
Q

Point estimates

A

A single number located exactly in the middle of the confidence interval. Ex. The sample mean X-bar is a point estimate of the population mean µ.

The estimator is x-bar, the parameter is µ and the estimate is the specific value of x-bar.

Sample variance is an estimate of sigma squared in the same way.

Point estimators are not very reliable because they are approximations based on samples, and can yield values quite different from the actual value of the population parameter.

There can be many estimators for the same variable and they all have two properties, bias and efficiency.

More efficient and less biased estimators are preferred.

101
Q

Point estimate - Bias

A

An unbiased estimator has an expected value equal to the population parameter. X-bar + some constant c would have a bias of c when it comes to being an estimator of µ.

102
Q

Point Estimate - Efficiency

A

The most efficient estimators are the ones with the least variability of outcome. I.e. the unbiased estimator with the smallest variance.

103
Q

Estimators vs. Statistics

A

Statistic is the broader term.

A point estimate is a statistic.

104
Q

Confidence interval estimates

A

The range within which you expect the population parameter to be. Estimated based on the data in your sample.

An interval that contains the point estimate. Confidence intervals are preferred as they provide more info than point estimates.

Ex. You visit 5% of the restaurants in London and calculate the mean price of a meal for all restaurants as £22.50. This point estimate might be close to the population parameter but chances are that the true value isn’t exactly £22.50. Rather, the true value is likely something close to £22.50. So £22.50 plus or minus x. It’s safer to say that the average meal in London is somewhere between £20 and £25. This is a confidence interval around your point estimate. This is more accurate but there is still uncertainty.

This uncertainty is measured using levels of confidence.

105
Q

Levels of Confidence

A

Used to determine the confidence interval

You can never be 100% confident unless you know the real population parameter.
You might say that you’re 95% confident that the parameter is inside your confidence interval.

A 95% confidence interval means that in 95% of cases, an interval constructed this way would contain the population parameter.

This leaves a 5% chance that you’re wrong and the µ is outside the interval. This can happen if your sample is not representative of the population.

Level of confidence for an interval is denoted by 1-α (alpha). Alpha is a value between 0 and 1. If the confidence level is 95% then alpha is 5%. If 99%, alpha is 1% etc.

The formula for the confidence interval is:

[point estimate - reliability factor x standard error, point estimate + reliability factor x standard error].

Point estimate is the value you’re working with, for example x-bar
Standard error is σ / √n
Reliability factor is the Z or T Stat of alpha/2.

Alpha is divided by two because it accounts for the two tails of a normal distribution.
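
A sketch of the known-variance case with scipy (all numbers invented): the reliability factor is the z value that leaves α/2 in each tail.

import numpy as np
from scipy import stats

x_bar, sigma, n = 22.5, 5.0, 40    # hypothetical sample mean, known sigma
alpha = 0.05                       # for a 95% confidence level

z = stats.norm.ppf(1 - alpha / 2)  # reliability factor, about 1.96
se = sigma / np.sqrt(n)            # standard error
print(x_bar - z * se, x_bar + z * se)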

106
Q

Calculating Confidence intervals for a population

A

Population variance can be known and unknown.

Different calculation methods are used for each situation

A 100% confidence interval is generally completely useless; it tells us nothing specific.

95% is much more useful because it strikes a balance between the level of certainty and the size of the confidence interval.

107
Q

Confidence interval for population with known variance

A

An important assumption in this calculation is that the population is normally distributed.

Even if it isn’t, you should use a large sample and let the Central Limit Theorem do the normalization.

There will be a trade off between estimation precision and the level of confidence.

As level of confidence goes up, the confidence interval will get bigger and vice versa.

A wider confidence interval means more uncertainty; a narrower one means more precision.

108
Q

Confidence interval for population with unknown variance

A

In practice, the population variance is rarely known. The Student’s t is the more common method.

If population variance is unknown and the sample size is small. Use the Students-T distribution.

Whether you’re using the Z-table or the T-table, the logic of calculating the confidence interval is the same. The only changes are to two inputs:

Instead of using the z-stat you use the t-stat, looked up with n-1 degrees of freedom.
Instead of using population standard deviation, you use sample standard deviation.

Knowing the population variance provides a narrower confidence interval at the same level of confidence. It’s a more accurate way to go. Having a smaller sample size makes the confidence interval wider. So prediction is still possible based on a small sample size and unknown population variance but it’s less accurate.

109
Q

Z-stat vs. T-stat

A

The z-statistic and t-statistic are both statistical measures used in hypothesis testing and estimating population parameters.

They differ in their underlying assumptions and applications.

z-statistic is suitable for large sample sizes or known population standard deviation,

t-statistic is appropriate for small sample sizes or unknown population standard deviation.

The choice between the two depends on the specific characteristics of the data and the objectives of the statistical analysis.

110
Q

Z statistic

A

Used when the sample size is large (typically greater than 30) or when the population standard deviation is known.

Calculated by subtracting the population mean from the sample mean and dividing it by the population standard deviation.

The z-statistic follows a standard normal distribution (mean of 0 and standard deviation of 1) under the null hypothesis.

It is commonly used in situations where the population parameters are known, or the sample size is sufficiently large to approximate the population parameters.

111
Q

T statistic

A

The t-statistic is used when the sample size is small (typically less than 30) or when the population standard deviation is unknown.

It is calculated by subtracting the population mean from the sample mean and dividing it by the sample standard deviation.

The t-statistic follows a t-distribution, which is similar to the normal distribution but has thicker tails to account for the additional uncertainty associated with estimating the population standard deviation from a small sample.

The shape of the t-distribution depends on the degrees of freedom (sample size minus 1), and as the sample size increases, the t-distribution approaches the shape of the standard normal distribution.

The t-statistic is commonly used when the population parameters are unknown or when working with small sample sizes.

After 30 degrees of freedom, the T and Z statistic tables become almost identical. Rule of thumb is to use the Z-table for samples with over 50 observations

112
Q

Margin of Error Vs. Confidence Interval

A

Margin of error is a measure of the potential error associated with a sample estimate.

The margin of error represents the precision of the estimate.

Confidence interval is a range of values that provides an estimate of the likely range for the population parameter.

Confidence interval indicates the range within which the true population value is expected to fall with a given level of confidence.

113
Q

Margin of Error

A

The margin of error quantifies the uncertainty or potential error associated with estimating a population parameter based on a sample. It represents the maximum expected difference between the sample estimate and the true population value.

The margin of error is typically expressed as a single value or a range around the sample estimate, denoting the precision of the estimate.

It is calculated using formulas that take into account the sample size, the level of confidence desired, and the variability of the data.

114
Q

Confidence Interval

A

The confidence interval is a range of values within which the true population parameter is estimated to fall with a certain level of confidence.

It provides a measure of the precision of the estimate by specifying a range of plausible values for the population parameter.

The confidence interval is calculated using the sample data and takes into account the sample size, the variability of the data, and the desired level of confidence.

It is typically reported as a range around the sample estimate, representing the lower and upper bounds of the interval.

115
Q

Calculating Margin of error

A

Margin of Error for a Population Mean (σ Known):
Margin of Error = Z * (σ / √n)

Where:
Z is the critical value associated with the desired level of confidence.
σ is the known standard deviation of the population.
n is the sample size.

Margin of Error for a Population Mean (σ Unknown)
Margin of Error = t * (s / √n)

Where:
t is the critical value associated with the desired level of confidence, determined based on the degrees of freedom (n-1).
s is the sample standard deviation.
n is the sample size.

Smaller reliability factors (z or t) and smaller standard deviations will reduce the margin of error
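
Both formulas in scipy form (all inputs invented):

import numpy as np
from scipy import stats

n, conf = 25, 0.95

# sigma known: z-based margin of error
sigma = 8.0
moe_z = stats.norm.ppf((1 + conf) / 2) * sigma / np.sqrt(n)

# sigma unknown: t-based, with n - 1 degrees of freedom
s = 8.0
moe_t = stats.t.ppf((1 + conf) / 2, df=n - 1) * s / np.sqrt(n)

print(moe_z, moe_t)   # the t-based margin is slightly wider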

116
Q

PPDAC cycle

A

Problem - Plan - Data - Analysis - Conclusion

117
Q

Negative and Positive Framing

A

Given data reporting deaths from a surgical procedure

Negative framing would be reporting mortality
Positive framing would be reporting survival

Reporting actual numbers as well as percentages can strengthen the impression on the audience because it helps them picture a crowd of people.

99% of youth in London are not violent - this sounds good
1% of youth in London are violent, and that 1% is approximately 10,000 people based on the population - this is the same info but seems worse

Ideally both positive and negative frames should be used to seem impartial

The order in which info is displayed has a large impact as well.

Where you start the x-axis is important - start at 0 vs start at 95 - same info will look very different in terms of readability etc.

118
Q

Pie charts and comparisons

A

Pie charts are useful for showing how much of the whole each category makes up - but they are confusing to look at if there are many categories

Multiple Pie charts are generally a bad idea - it’s hard to judge the relative sizes of the pie slices on the multiple charts.

Comparisons are better based on height or length alone.

Use a bar chart for comparisons instead of a pie chart.

119
Q

Relative Risk

A

Relative risk compares the risk of an outcome between exposed and unexposed groups.

18% increase in cancer for people eating processed meat daily compared to the baseline population, which already has some risk of getting cancer.

Baseline risk is 6 out of 100 people

6/100 × 1.18 ≈ 7/100, so about 7 out of 100 people eating processed meat every day

120
Q

Absolute Risk

A

Absolute risk refers to the actual probability of an outcome occurring in a specific group regardless of any other factors.

In this context - 18% would be the change in the actual proportion of people getting cancer.

To avoid confusion - use expected frequency rather than percentages or probabilities.
Ask - what does this mean for 100 people.
Just stating an 18% increase as above is manipulative.

1 in x is a common way to show risk - but it’s hard to grasp, because a larger number (1 in 1000) means a lower risk.

121
Q

Odds ratios

A

The odds of an event happening are the ratio of the chance of the event occurring to the chance of it not occurring

If six people out of 100 get cancer the odds are 6/94.

If after a treatment, four people out of 100 get cancer the odds of getting cancer are 4/96

The change in the risk of getting cancer due to the treatment is also an odds ratio

(6/94) / (4/96) ≈ 1.53, i.e. the odds of getting cancer without the treatment are about 53% higher than with it.
#Odds for cancer without treatment / odds for cancer with treatment
That 53% change in the odds corresponds to only a 2 percentage point decrease in absolute risk (from 6% to 4%).

Odds ratios are confusing
If events are rare, the odds ratio will appear close to the relative risk
If the events are common, the odds ratio can be very different from the relative risk

122
Q

Logarithmic Scale

A

A scale in which the space between 100 and 1000 is the same as the space between 1000 and 10,000.

On a linear scale every unit of distance corresponds to adding the same amount,

On a logarithmic scale, every unit of length corresponds to multiplying the previous value by the same amount.

Hence, such a scale is nonlinear: the numbers 1, 2, 3, 4, 5, and so on, are not equally spaced. Rather, the numbers 10, 100, 1000, 10000, and 100000 would be equally spaced.

123
Q

Conditional Probability Tree

A

Each branch represents a specific set of events.

The probabilities of the terminal branches (all possible sets of outcomes) sum to one.

We multiply across branches (using the multiplication rule!) to calculate the probability that each branch (set of outcomes) will occur.

124
Q

Computing Variance

A

Variance is represented by sigma^2

It tells us how spread out a data set is.
A larger number = more spread
The more spread the data, the larger the variance is in relation to the mean.

Find the difference between every data point and the mean of the data set. Square each difference to ensure that it’s positive.
Find the mean of those squared differences - this is the variance, the average (squared) distance of the data points from the mean.
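
The same two steps in plain Python (toy data):

data = [2, 4, 4, 4, 5, 5, 7, 9]   # toy data set
mean = sum(data) / len(data)

# squared difference of each point from the mean, averaged
squared_diffs = [(x - mean) ** 2 for x in data]
variance = sum(squared_diffs) / len(data)   # population variance
print(mean, variance)                       # 5.0 4.0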

125
Q

Metaphorical Population

A

You have all the available data (for example murder rates). You have the population.

You act as if the population data you have is actually a sample drawn from some larger imaginary space of probabilities.

The population you have is just one sample of all the possible populations.

This allows you to use the data you have to learn about the other possible scenarios that might have occurred.

This mindset allows you to look at the real world and ask how likely it is that you’ll see something similar happen in the future, with the future being another sample drawn from the imaginary population of possibilities.

126
Q

CDF

A

The cumulative distribution of a value is the fraction of data points that fall at or below that value

import numpy as np

# pop_heights is assumed to be an existing 1-D array of heights
pop_heights_sorted = np.sort(pop_heights)
n = len(pop_heights_sorted)

# count the data points at or below 177 to get the CDF at 177
total_cdf_test = 0
for i in range(n):
    if pop_heights_sorted[i] <= 177:
        total_cdf_test += 1

print(total_cdf_test / n)

127
Q

Parametric vs. NonParametric

A

Two types of statistical tests that are used to compare two groups of data

Parametric tests are based on the assumption that the data is normally distributed. This means that the data is bell-shaped and symmetrical. Parametric tests are more powerful than nonparametric tests, but they are also more sensitive to violations of the assumptions.

Nonparametric tests do not make any assumptions about the data. This means that they can be used with data that is not normally distributed. Nonparametric tests are less powerful than parametric tests, but they are also less sensitive to violations of the assumptions.