Basics Flashcards
Summation / Sigma Notation
This is the sigma symbol: ∑
It tells us that we are summing something.
n is the summation index - when evaluating the expression, we substitute each value of the index in turn and add up the results.
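A minimal Python sketch of what the notation means, using a hypothetical expression i² summed from i = 1 to 4:

```python
# Evaluate sum of i^2 for i = 1..4: substitute each index value and add the results
total = 0
for i in range(1, 5):          # the summation index i takes the values 1, 2, 3, 4
    total += i ** 2
print(total)                   # 1 + 4 + 9 + 16 = 30

# The same sum with Python's built-in sum()
print(sum(i ** 2 for i in range(1, 5)))  # 30
```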
Frequency tables - lists - dot plots
Used to represent a single variable.
A list is just a list of variable values
A Frequency Table is a table showing each value and how often it occurs
A dot plot is a visual frequency table, with the variable value on the x and the frequency on the y.
These are all ways of representing the same info.
Once the data is organized we can start to analyze it with summary stats etc.
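A small Python sketch (with an invented list of values) showing that a list and a frequency table carry the same information:

```python
from collections import Counter

values = [1, 2, 2, 3, 3, 3, 5]     # the raw list (hypothetical data)
freq = Counter(values)             # frequency table: value -> how often it occurs
for value in sorted(freq):
    print(value, freq[value])      # e.g. 3 appears 3 times
```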
Histogram
Used to represent a single variable
Like a bar chart, but both the x and y axes are numerical.
x-axis = intervals
y-axis = absolute frequency of each interval.
The bars will be touching to show that one interval begins where the other ends.
Instead of plotting the frequency of each discrete value, like a frequency table or dot plot, a histogram arranges the data into categories and then shows how many values fall within each category. The categories are often called buckets or bins.
Bins should not overlap
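A rough sketch of binning in Python (hypothetical data and bin width):

```python
from collections import Counter

data = [1.2, 2.8, 3.1, 3.9, 4.4, 5.0, 6.7, 7.3]   # hypothetical values
bin_width = 2

# Assign each value to the bin [start, start + width); bins touch but never overlap
bins = Counter((x // bin_width) * bin_width for x in data)
for start in sorted(bins):
    print(f"[{start}, {start + bin_width}): {bins[start]}")
```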
Descriptive statistics
Ways of describing data without just providing the raw data. It’s about describing the data with a smaller set of numbers.
This would include things like summary statistics.
Inferential Statistics
Ways of gaining insight from the data set and figuring out what the data means. How can we use the data to understand what the population value might be.
The key to inferential statistics is understanding that samples do not always accurately reflect the population they came from.
A large part of inferential statistics is quantifying our uncertainty about a population by looking at a smaller sample.
Average/ Central Tendency
Average = Typical or middle value of a data set. The “central tendency” of the data
Common types:
Mean
Median
Mode
The ‘best’ measure of central tendency will depend on which measure best represents the actual data and how it is skewed (or not).
All measures should be used in combination to understand the data set.
Median
The middle number of the data set when the set is placed in numerical order. If there is an even number of values, you take the mean of the two middle numbers.
The median is useful if there are outliers that will skew the mean and make it misleading.
Mode
The most common number in the data set. If there is no most common number then there is no mode.
Typically the least used measure of central tendency
The mode is useful if there are outliers that skew the mean and if there is a single number that shows up a lot.
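A short sketch with Python's statistics module (made-up data) showing how a single outlier drags the mean but barely moves the median:

```python
import statistics

data = [2, 3, 3, 4, 5]
print(statistics.mean(data))            # 3.4
print(statistics.median(data))          # 3
print(statistics.mode(data))            # 3 - the most common value

with_outlier = data + [100]             # add one extreme value
print(statistics.mean(with_outlier))    # 19.5 - pulled hard toward the outlier
print(statistics.median(with_outlier))  # 3.5 - almost unchanged
```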
Location of central tendency and skewness
In symmetrical distributions the mean, median and mode are identical or very close.
In left skewed distributions the mean is typically to the left of the median, which is to the left of the mode.
In right skewed distributions the mean is typically to the right of the median, which is to the right of the mode.
Left and right skew
A left skew means the tail/outliers are to the left.
A right skew means the tail/outliers are to the right.
Interquartile Range (IQR)
The IQR is a measure of how spread out the data is.
It is the distance between the first and third quartile marks (the 25th to the 75th percentile).
The IQR is a measurement of the variability about the median.
IQR tells us the range of the middle half of the data.
To find the IQR:
1. Find the median of the data set
2. Find the median of each half of the data on either side of the overall median. These two values are Q1 and Q3, and the IQR is the difference between them (Q3 − Q1).
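A small sketch of the median-of-halves method in Python (hypothetical data); note that statistics.quantiles uses a slightly different interpolation, so its results can differ a little:

```python
import statistics

data = sorted([4, 4, 6, 7, 10, 11, 12, 14, 15])

# Step 1: the overall median is data[4] = 10
mid = len(data) // 2
# Step 2: median of each half (the overall median itself is excluded when n is odd)
q1 = statistics.median(data[:mid])
q3 = statistics.median(data[mid + 1:] if len(data) % 2 else data[mid:])
print(q1, q3, q3 - q1)   # 5.0 13.0 8.0
```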
Outliers
The definition of what is reasonably an outlier is subject to some interpretation based on the specific qualities of the data set.
Common definition:
An outlier is any number that is more than 1.5x the interquartile range below Q1 or above Q3
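A sketch of the 1.5 × IQR fence rule (hypothetical data; quartiles here come from statistics.quantiles, which may differ slightly from the median-of-halves method):

```python
import statistics

data = [4, 4, 6, 7, 10, 11, 12, 14, 15, 40]      # 40 looks suspicious
q1, _, q3 = statistics.quantiles(data, n=4)      # first and third quartiles
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
print([x for x in data if x < low_fence or x > high_fence])  # [40]
```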
Sample Mean
Calculated the same way as the population mean: add up the values and divide by how many there are (x̄ = ∑x / n).
Measures of Variability - Univariate
Variance
Standard Deviation
Coefficient of Variation
Sample Variance
Variance is a measure of the spread or dispersion of a set of data points around their mean. It quantifies how much the individual data points deviate from the average.
Sample variance is generally a pretty good statistic in terms of approximating the true variance of the population.
A better approximation of the population parameter can usually be gained by dividing by n-1.
This approximation is AKA ‘The unbiased sample variance’.
Dividing by just n will tend to underestimate the population variance.
Dividing by n is fine if you just want the variance/SD of the sample itself.
Written as s² (s squared).
s² = ∑(x - x̄)² / (n - 1)
Using n-1 instead of n is AKA Bessel's correction.
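A sketch contrasting the two divisors with Python's statistics module (made-up data): pvariance divides by n, variance divides by n − 1, and stdev is the square root of the n − 1 version:

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pvariance(sample))  # 4.0   - divides by n (variance of the sample itself)
print(statistics.variance(sample))   # ~4.57 - divides by n-1 (unbiased estimate)
print(statistics.stdev(sample))      # ~2.14 - square root of the n-1 variance
```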
Standard Deviation
Measure of the dispersion or spread of a set of data points around their mean. It is closely related to variance but is expressed in the same units as the original data, making it easier to interpret and compare.
The standard deviation is the square root of:
The population variance
OR
The unbiased sample variance (s², computed with the n − 1 divisor)
The square root of the sample variance (AKA the sample standard deviation) will not be an unbiased approximation of the population standard deviation.
This is because the square root function is non-linear.
SD is written as:
s
std(x) - SD of random variable x
σ (lowercase sigma) - SD of a population
Variance - Interpretation
Variance is always non-negative
Interpreting variance involves considering the magnitude of the variance value and its relationship to the data set.
Consider:
Magnitude - Higher variance means more dispersion from the mean
Units - Variance is expressed in squared units of the original data. Restore the original units by converting to standard deviation
Comparison - If one data set has a significantly higher variance than another, it implies that the observations in the first data set are more widely scattered.
Outliers - Variance is sensitive to outliers, they can inflate the variance, making it a less reliable measure of dispersion
Limitations:
Variance does not tell us the direction of variations from the mean.
It treats positive and negative differences equally.
Not robust for non-normal or heavily skewed data sets.
Coefficient of Variation (CV)
AKA relative standard deviation.
Calculated as the standard deviation divided by the mean. It’s just the standard deviation relative to the mean.
There are separate formulas for population and sample data for this measurement as well.
CV is used to compare the variation of two different data sets.
It returns a unitless number that is directly comparable across data sets.
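A quick sketch (invented data in different units) showing why CV makes spreads comparable:

```python
import statistics

heights_cm = [160, 170, 175, 180, 185]   # hypothetical heights
weights_kg = [55, 65, 70, 80, 95]        # hypothetical weights

def cv(data):
    """Coefficient of variation: standard deviation relative to the mean."""
    return statistics.stdev(data) / statistics.mean(data)

# The raw SDs are in different units; the CVs are unitless and comparable
print(round(cv(heights_cm), 3))
print(round(cv(weights_kg), 3))
```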
Standard Deviation - Interpretation
Better than variance for interpretation and comparison because it is expressed in the same units as the data.
Quantifies the typical amount of variation or “typical distance” of data points from the average.
Consider:
Magnitude - Higher SD means data points are spread farther from the mean.
Units - SD is expressed in the same units as the original data. This makes interpretation and comparison easier.
Range - SD provides a useful range around the mean (68-95-99.7). It helps us visualize where the data is falling using a single number.
Comparison - Comparing the standard deviations of different data sets allows you to assess their relative spread.
Outliers - SD is sensitive to outliers. Outliers, which are extreme values, can have a significant impact on the standard deviation.
Limitations - SD assumes a roughly normal or symmetrical distribution. If the data has a heavy skew, other measures might be more appropriate.
Mean vs. Median as Central Tendency
The measures work in pairs:
More symmetrical Data:
Mean = central tendency
Standard Deviation = Spread
More Skewed Data:
Median = central tendency
IQR = Spread
Outlier values will move the mean quite a lot, but they barely affect the median, since the median depends on the position of values in the ordered data, not on their magnitudes.
Z-scores
One of the most common measures in statistics.
A Z-score tells you how many standard deviations away from the population mean a given data point is.
This helps you tell how usual or unusual a data point is.
This can be useful for comparing data points from different distributions. The scales are different but the relative position to the mean can still be compared.
To calculate the Z-score for a data point x:
z = (x − µ) / σ
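A one-function sketch (with hypothetical population parameters):

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the population mean."""
    return (x - mu) / sigma

# Hypothetical example: scores with population mean 100 and SD 15
print(z_score(130, 100, 15))   # 2.0  - two standard deviations above the mean
print(z_score(85, 100, 15))    # -1.0 - one standard deviation below
```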
Z-scores - Interpretation
The Z-score of a data point tells you how many standard deviations from the mean the point is.
A negative Z-score indicates that the data point is below the mean, a positive Z-score indicates that the data point is above the mean
A z-score of 0 means the data point is equal to the mean.
A z-score of 1 means the data point is one standard deviation above the mean; a z-score of -1 means one standard deviation below.
A z-score of 2 indicates it is two standard deviations away, and so on.
Typically, data points with z-scores greater than 3 or less than -3 are considered extreme outliers
You can use a table to find the percentile of a data point given its z-score.
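Python's statistics.NormalDist can stand in for a printed z-table (assuming the data is roughly normal):

```python
from statistics import NormalDist

z = 1.0
print(round(NormalDist().cdf(z), 4))   # 0.8413 - about the 84th percentile
```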
Empirical Rule and Normal Distributions
The Empirical Rule is AKA the 68-95-99.7 Rule: in a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
Marginal Distribution
The distribution formed by the totals of a single variable in a two-way table. This data can be represented as counts or as percentages.
Look at the margins of the table.
Conditional Distribution
In a two-way table
The distribution of one variable given that some condition with the other variable is met.
Conditional distributions are generally represented as percentages.
As in “What percent of men prefer basketball?”
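A sketch with an invented two-way table (sport preference by gender) showing both kinds of distribution:

```python
# Hypothetical two-way table: rows are genders, columns are sports
table = {
    "men":   {"basketball": 20, "soccer": 30},
    "women": {"basketball": 25, "soccer": 25},
}

# Marginal distribution of sport: column totals across the rows
marginal = {sport: sum(row[sport] for row in table.values())
            for sport in ("basketball", "soccer")}
print(marginal)   # {'basketball': 45, 'soccer': 55}

# Conditional distribution: what percent of men prefer basketball?
men = table["men"]
print(100 * men["basketball"] / sum(men.values()))   # 40.0
```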
Scatter Plots
Scatter plots are used to plot bivariate relationships.
Being able to fit a line to the data is a good way to determine the strength of the relationship between the two variables.
The closer the line matches the data, the stronger the relationship.
The relationship can be linear or non-linear.
In a linear relationship the variables are changing at roughly the same constant rate.
If it’s non-linear, the rate of change varies in different parts of the distribution.
A line with a negative slope indicates a negative relationship between the two variables.
A positive slope indicates a positive relationship.
Correlation Coefficient
Used to quantify the strength of the linear relationship between two variables.
Denoted by the variable r
AKA the Pearson correlation coefficient
Range from -1 to 1
r = 1 is a perfect positive correlation
r = 0 indicates no linear correlation
r = -1 is a perfect negative correlation
The value that is considered significant varies based on the field.
Social sciences = |0.3|+
Hard Sciences = |0.7|+
r values don’t quantify statistical significance - only correlation
Calculated as the covariance divided by the product of the standard deviations of the two variables
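A from-scratch sketch of r = cov(x, y) / (sx · sy) on made-up data (Python 3.10+ also provides statistics.correlation):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]   # hypothetical paired observations

mx, my = statistics.mean(x), statistics.mean(y)

# Sample covariance (n-1 divisor) over the product of the sample SDs
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
r = cov / (statistics.stdev(x) * statistics.stdev(y))
print(round(r, 3))   # ~0.853 - a fairly strong positive correlation
```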
Linear Regression & Least Squares
Linear regression - The process of finding the line that best fits a set of data
The most common method is to fit the line that minimizes the sum of the squared vertical distances to the points in the data set. This is a “least-squares” regression.
The equation for a linear regression is written as ŷ = mx + b. The hat over the y indicates that it is an estimated value; it can’t be exact because the data points will not all sit directly on the line.
Residuals
A residual is the difference between the actual value of a data point and the estimated value provided by the linear regression.
For a given x value, the residual is the actual value (y) minus the estimated or predicted value (yˆ)
A negative residual means the actual value is below the estimated value.
A positive residual means the actual value is above the estimated value.
The process of finding a line of best fit is about minimizing the sum of the squared residuals
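A minimal least-squares sketch on invented data, using the closed-form slope m = ∑(x − x̄)(y − ȳ) / ∑(x − x̄)² and intercept b = ȳ − m·x̄:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]   # hypothetical paired data

mx, my = statistics.mean(x), statistics.mean(y)

m = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b = my - m * mx                                    # slope 0.8, intercept 1.8

# Residual = actual y minus predicted y-hat; least squares makes the sum of
# squared residuals as small as possible
residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
print([round(r, 2) for r in residuals])            # [-0.6, 0.6, 0.8, -1.0, 0.2]
```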
Residual Plots
A plot of the residuals in a data set. The x values stay the same as the data set but the y values become the residuals of the data set values.
Residual plots are used to gauge whether a line is a good fit for a data set or not.
A good fit will be indicated by the residual points being scattered randomly above and below y = 0.
You don’t want to see trends in the residual data. If there is a trend, you might need a better linear regression line or you might need a non-linear regression.
Experiment
Involves dependent and independent variables, with control and experimental/treatment groups.
You look for statistically significant differences between the treatment and control groups.
The independent variable (x) is AKA the explanatory variable.
The dependent variable (y) is AKA the response variable.
Observational Study
Involves collecting data and looking for existing patterns and correlations.
Observational studies can identify correlation but not causation between variables.
There are different types of observational studies
Data can be backward looking, forward looking, or based on information gathered right now.
Retrospective study
Samples past data to gain insights
Prospective study
Pick a sample and track the data from that sample over time. You can analyze the data at the end of some time period or as it is collected.
Sample Survey
Involves taking a sample of data from a given population and gathering information on the state of things right now.
Voter preference polls are a good example of this.
Longitudinal study
Can be prospective or retrospective
Involves collecting data from the same group of individuals or subjects over an extended period.
The primary goal of a longitudinal study is to observe and analyze changes and trends that occur over time within the same individuals or groups.
Researchers typically make repeated measurements or observations at multiple time points. This allows them to examine the long-term effects, developmental patterns, or causal relationships between variables
Cross-sectional study
Data is collected from different individuals or groups at a specific point in time.
Unlike longitudinal studies, cross-sectional studies focus on a single time point and aim to gather information about the characteristics, behaviors, or opinions of different individuals or groups at that particular moment.
Survey Bias Types
Response Bias
Undercoverage
Voluntary response sampling
Convenience Sampling
Non-response
Response Bias
Question phrasing or the question itself makes it unlikely that people will answer truthfully.
Ex. Have you lied to your parents in the past week? Have you ever cheated on your spouse?
Undercoverage
Responses don’t take into account a key constituency. Calling 100 random people in the phone book when cell phones are not included in the phone book. There might be something different about people who only own cell phones or who have chosen to be unlisted in the phone book.
Undercoverage will typically underestimate the % of the population with a given response.
Voluntary response sampling
Non-random sampling caused by respondents self selecting to complete a survey.
Voluntary response bias will typically overestimate the % of the population with a given response.
Convenience Sampling
Using a non-random sample because it’s available to you. Typically will overestimate the percent of the population with a given response.
Non-response
Lack of data can be a source of bias if it’s big enough
Simple random sampling
Throw the population in a bowl and have a blindfolded person pick a sample of the total population
Use a random number generator to pick members of the population. Put the population data set in alphabetical order and assign each entry a number. Then randomly generate numbers for your sample size and match them with the population data points.
Use a random digit table to pick out random numbers.
You can’t just think up numbers, you are not capable of being truly random.
Simple random samples can inadvertently introduce bias by randomly selecting a non-representative sample.
You can avoid this with Stratified Sampling and Clustered Sampling
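A simple-random-sampling sketch with Python's random module (hypothetical roster):

```python
import random

population = [f"student_{i}" for i in range(1, 501)]   # hypothetical roster of 500

random.seed(42)                          # seeded only so the sketch is repeatable
sample = random.sample(population, 25)   # 25 members drawn without replacement
print(sample[:5])
```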
Stratified sample
Type of random sampling
Take the entire population, break it into strata (different groups), and randomly sample each stratum.
In a high school this might mean breaking the student population into freshmen, sophomores, juniors and seniors and then taking a random sample of 25% of your total desired sample size from each group.
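A stratified-sampling sketch following the high-school example (hypothetical class lists; an equal share of a desired sample of 40 from each stratum):

```python
import random

# Hypothetical strata: grade level -> list of students
strata = {
    "freshmen":   [f"fr_{i}" for i in range(120)],
    "sophomores": [f"so_{i}" for i in range(110)],
    "juniors":    [f"jr_{i}" for i in range(100)],
    "seniors":    [f"sr_{i}" for i in range(90)],
}

desired_total = 40
per_stratum = desired_total // len(strata)   # 10 from each grade

sample = []
for grade, students in strata.items():
    sample.extend(random.sample(students, per_stratum))
print(len(sample))   # 40
```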
Clustered Sample
Type of random sampling
Divide the population into groups that are broadly representative and then randomly sample the groups.
An example of this would be randomly sampling classrooms that have a generally representative mix of men and women.
You randomly pick the classroom and survey everyone in it.
Voluntary sampling
Type of non-random sample
Bias is introduced because people are self selecting to fill out the survey
Convenience Sampling
Type of non-random sample
Bias is introduced when the most convenient sample does not happen to be representative. The first 100 people in the door are convenient, but may not represent the population.
Systematic Random Sampling
Can be used when simple random sampling isn’t logistically feasible.
Consists of sampling every kth member of the population, where the interval k is the population size divided by the desired sample size.
For example, to draw a sample of 100 from a population of 10,000, the interval is 10,000 / 100 = every 100th person.
You pick the first subject at random from within the first interval and then sample every 100th person after that initial, randomly picked, person.
Systematic random sampling is not foolproof. There can still be bias if you’re not careful: you need to be sure that the ordering of the population doesn’t interact with the sampling interval in a way that distorts the sample.
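A sketch of the interval method (hypothetical ordered population):

```python
import random

population = [f"person_{i}" for i in range(10_000)]   # hypothetical ordered list
sample_size = 100
k = len(population) // sample_size     # sampling interval: every 100th person

start = random.randrange(k)            # random start within the first interval
sample = population[start::k]          # every k-th person after the random start
print(len(sample))                     # 100
```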
Experiment Design
An experiment has an explanatory variable and a response variable.
The explanatory variable (x) causes the change in the response variable (y).
Experiments use randomly selected samples to infer the characteristics of the population as a whole
The random sample will then be split into control and treatment groups in some way.