Basic Statistical Concepts Flashcards
Normalization
transforms data so that it more closely follows a normal distribution (in other contexts, can also mean rescaling values to a common range)
Standardization
dividing a value by another quantity to remove that quantity's effect
Ex: dividing a count by area or population size
QQ/quantile plot
Visualization to check whether data is normally distributed (data quantiles plotted against theoretical quantiles)
negative skew = points curve beneath the line
positive skew = points curve above the line
normal = points fall on the line
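The points on a QQ plot can be computed by hand. A minimal sketch, assuming a normal distribution fitted to the data's own mean and standard deviation and the plotting position (i + 0.5) / n, which is one common choice among several; the function name and the data are made up for illustration:

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(values):
    """Pair each sorted data value with the matching normal quantile."""
    data = sorted(values)
    n = len(data)
    # Normal distribution with the same mean and SD as the data
    fitted = NormalDist(mean(data), stdev(data))
    # (i + 0.5) / n is an assumed plotting position, one of several in use
    return [(fitted.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(data)]

skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 30]  # made-up, positively skewed data
for theoretical, observed in normal_qq_points(skewed):
    side = "above" if observed > theoretical else "below or on"
    print(f"{theoretical:7.2f} {observed:5d}  {side} the line")
```

With theoretical quantiles on the x-axis, both ends of this skewed sample sit above the reference line y = x while the middle dips below it, which gives the upward curve.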
r or coefficient of correlation
looks at whether 2 variables vary together
Range for correlation coefficient and what is positive/negative/0?
-1 to 1
positive = both variables go up together
negative = one goes up as the other goes down
0 = no linear association
Standard deviation
measures how far data values are from the mean
little variation in values means small standard deviation
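A quick illustration with two made-up data sets that share the same mean but differ in spread. This sketch uses the population standard deviation (`pstdev`); `stdev` would give the sample version:

```python
from statistics import mean, pstdev

tight = [9, 10, 10, 11]   # made-up values clustered near the mean
spread = [1, 5, 15, 19]   # made-up values far from the same mean

for name, values in (("tight", tight), ("spread", spread)):
    print(name, "mean =", mean(values), "SD =", round(pstdev(values), 2))
```

Both lists have a mean of 10, but the second has a much larger standard deviation.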
Analysis of variance (ANOVA)
Parametric test to see if there are significant differences between the means of 3+ groups (categories)
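The F statistic behind a one-way ANOVA compares variation between group means to variation within groups. A minimal sketch with made-up groups; the function name is hypothetical, and turning F into a p value needs an F distribution, which is left out here:

```python
from statistics import mean

def anova_f(*groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up data: three groups with means 2, 3, and 6
print(anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7]))
```

A large F means the group means differ by more than the scatter inside each group would suggest.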
Covariance
Measures whether 2 variables vary together; dividing it by both standard deviations rescales it into a correlation coefficient (r)
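Covariance and r are closely related: r is the covariance divided by the product of the two standard deviations, which forces it into the -1 to 1 range. A minimal sketch using the population forms throughout, with made-up data and hypothetical function names:

```python
from statistics import mean, pstdev

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson_r(xs, ys):
    """Covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

xs = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]    # rises with xs
down = [10, 8, 6, 4, 2]  # falls as xs rises
print(covariance(xs, up), pearson_r(xs, up), pearson_r(xs, down))
```

Covariance depends on the units of the data, while r does not, which is why r is easier to compare across data sets.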
Kernel density (3 facts about it)
- removes statistical noise from data by smoothing it
- often uses Gaussian weighting (closer points = more weight)
- good for showing generalized densities of points
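A one-dimensional sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth (mapping software typically works in two dimensions and may use a different kernel). The function name, points, and bandwidth are made up for illustration:

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a function giving the smoothed density at any location x."""
    def density(x):
        total = 0.0
        for p in points:
            u = (x - p) / bandwidth
            # Gaussian weight: points close to x contribute the most
            total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return total / (len(points) * bandwidth)
    return density

density = gaussian_kde([1.0, 1.2, 1.5, 4.0], bandwidth=0.5)
for x in (1.2, 3.0, 4.0):
    print(x, round(density(x), 3))
```

A larger bandwidth smooths more aggressively; a smaller one keeps more local detail (and more noise).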
p value (3 facts)
- doesn’t tell you size of difference, just that there is one
- compared against a significance level (e.g., 0.05) to say if a result is significant
- used to decide whether or not to reject the null hypothesis
How to use a p value in a sentence to explain random chance and null hypothesis (hint: %)
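- if the null hypothesis were true, there would be a ___% chance of seeing results at least this extreme by random chance
- rejecting the null hypothesis at this threshold means accepting a ___% chance of falsely rejecting it when it is true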
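One way to see what a p value measures is an exact permutation test. This technique is swapped in purely for illustration and is not one of the tests listed in these cards. The sketch assumes the null hypothesis is "group labels do not matter" and counts how often a relabeling gives a difference in means at least as large as the observed one; the function name and data are made up:

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(a, b):
    """Share of all relabelings with a mean difference >= the observed one."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        # small tolerance guards against floating-point rounding
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

print(permutation_p_value([1, 2, 3], [4, 5, 6]))  # 2 of 20 relabelings -> 0.1
```

Here the result reads as: if labels did not matter, a difference this large would show up in 10% of relabelings.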
histogram
x-axis = bins (ranges of values)
y-axis = frequency in that bin
way to visualize frequency/distribution of data
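For numeric data, the groups along the x-axis are usually value ranges (bins). A minimal sketch that counts values per bin and prints a text histogram, assuming equal-width bins; the function name, bin width, and data are made up:

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each bin, keyed by the bin's left edge."""
    counts = {}
    for v in values:
        left_edge = start + ((v - start) // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

data = [1, 2, 2, 3, 7, 8, 12, 13, 14, 14]  # made-up values
for left_edge, count in histogram_counts(data, bin_width=5).items():
    print(f"{left_edge:2d}-{left_edge + 5:2d} | {'#' * count}")
```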
Z score meaning
Number of standard deviations away from the mean
Z score formula
(score - mean) / standard deviation
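The formula applied to a small made-up data set. This sketch assumes the population standard deviation (`pstdev`); swap in `stdev` for a sample:

```python
from statistics import mean, pstdev

def z_scores(values):
    """(score - mean) / standard deviation for every score."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (an assumption of this sketch)
    return [(v - mu) / sigma for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores: mean 5, population SD 2
print(z_scores(scores))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A score of 9 is 2 standard deviations above the mean, so its z score is 2.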
Coefficient of determination (r-squared)
High = good fit
Low = poor fit
How much of the variance in y is described by variance in x
Sentence using coefficient of determination
Variable x explains 80% of the variation in variable y
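A minimal sketch of where such a percentage comes from, assuming a simple least-squares line and made-up data. R-squared is computed as 1 minus the ratio of unexplained variation (squared residuals) to total variation in y; the function name is hypothetical:

```python
from statistics import mean

def r_squared(xs, ys):
    """Share of the variation in y explained by a least-squares line on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up data
print(f"x explains {r_squared(xs, ys):.0%} of the variation in y")  # 60%
```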
Kruskal-Wallis test
tests whether more than 2 samples come from the same population
non-parametric version of ANOVA
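The test statistic (H) is built from ranks rather than raw values. A minimal sketch, assuming no tied values (real implementations apply a tie correction) and leaving out the step that converts H into a p value; the function name and data are made up:

```python
def kruskal_wallis_h(*groups):
    """H statistic from pooled ranks; assumes no tied values."""
    pooled = sorted(v for g in groups for v in g)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # 1-based ranks
    n = len(pooled)
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

print(kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # well separated
print(kruskal_wallis_h([1, 4, 7], [2, 5, 8], [3, 6, 9]))  # interleaved
```

Well-separated groups give a larger H than interleaved ones.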
Central limit theorem
Distribution of sample means approaches normal as sample size increases
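A small simulation can make this visible. The sketch below draws from a strongly skewed (exponential) population and measures the skewness of the sample means, which should shrink toward 0 (the value for a normal distribution) as the sample size grows. The seed, number of draws, and function names are arbitrary choices for illustration:

```python
import random
from statistics import mean, pstdev

def skewness(values):
    """Average cubed z score; 0 for a symmetric distribution."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

def sample_means(n, draws=2000, seed=0):
    """Means of `draws` samples of size n from a skewed population."""
    rng = random.Random(seed)
    return [mean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(draws)]

for n in (1, 5, 30):
    print(n, round(skewness(sample_means(n)), 2))
```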
Mann-Whitney U
compares 2 independent samples
non-parametric
scores are ranked from small to large and then ranks of scores are compared
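The ranking step can be sketched by hand. This version gives tied scores the average of their ranks and returns the smaller of the two U values; converting U into a p value needs a table or a library and is left out. Function names and data are made up:

```python
def average_ranks(values):
    """Rank from small to large (1-based); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Smaller of the two U statistics, computed from pooled ranks."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # no overlap between samples
print(mann_whitney_u([1, 4, 5], [2, 3, 6]))  # samples overlap
```

A U of 0 means every score in one sample ranks below every score in the other.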
Sample mean
mean of a sample of the data
non-parametric statistics (list tests)
does not assume the data follow a Gaussian distribution
Mann-Whitney U, Kruskal-Wallis, Spearman’s Rho
Normal (Gaussian) distribution (3 facts)
- follows a bell curve
- uses parametric stats
- defined using the mean/standard deviation
Normal QQ plot
is like a qq/quantile plot but compares the data quantiles against the quantiles of a normal distribution
Null hypothesis
no significant difference, effect, or relationship in the population
Parametric statistics (also list tests)
assumes the data follow a Gaussian distribution
2-sample t test, ANOVA, Pearson’s R/correlation
Parsimony
Keep it simple and make it clear
Pearson’s R
measures the strength/direction of the linear relationship between 2 variables
Parametric
Residual plot
plots the residuals from a regression model
If there is an obvious pattern to the residuals, then the model might not be a good fit
Residuals
vertical distance between a point and the best-fit line (observed value - predicted value)
kind of like error
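A minimal sketch that fits a least-squares line to made-up data and lists each residual, taking residual = observed y minus predicted y (the usual convention, stated here as an assumption). Function names are hypothetical:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]  # made-up data
for x, r in zip(xs, residuals(xs, ys)):
    print(x, round(r, 2))
```

For a least-squares line with an intercept, the residuals always sum to zero (up to rounding).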
Interpreting residuals (+ and -)
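+ = observed value is above the line, so the model is underestimating rates of something (taking residual = observed - predicted)
- = observed value is below the line, so the model is overestimating rates of something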
Shapiro test
null hypothesis = samples come from a normal distribution
Spearman’s Rho
compares differences between the ranks in 2 data sets
values range from -1 to +1 (same as the r value)
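A minimal sketch using the rank-difference shortcut formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is only valid when there are no tied values. Function names and data are made up for illustration:

```python
def ranks(values):
    """1-based ranks from small to large; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Rank correlation from the squared differences between paired ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

xs = [10, 20, 30, 40, 50]
print(spearman_rho(xs, [1, 4, 9, 16, 25]))  # same order: 1.0
print(spearman_rho(xs, [5, 4, 3, 2, 1]))    # reversed order: -1.0
print(spearman_rho(xs, [2, 1, 4, 3, 5]))    # mostly the same order: 0.8
```

Because only the ranks matter, the curved (squared) relationship in the first example still scores a perfect 1.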
Square of the error
squared difference between observed and expected values; squaring keeps every error positive and weights large errors more heavily
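A short illustration with made-up observed and expected values; the function name is hypothetical:

```python
def squared_errors(observed, expected):
    """Squared difference for each observed/expected pair."""
    return [(o - e) ** 2 for o, e in zip(observed, expected)]

observed = [3, 5, 8]
expected = [2, 5, 10]
errors = squared_errors(observed, expected)
print(errors, "sum =", sum(errors))  # [1, 0, 4] sum = 5
```

The miss of 2 contributes four times as much as the miss of 1, even though it is only twice as large.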