Stats Flashcards
Continuous variable
Can take on any value within a given range.
An infinite number of possible values, limited only by our ability to measure them.
Discrete variable
Can only take on certain distinct values within a certain range.
The scale is still meaningful.
Ranked variable
A categorical variable in which the categories imply some order or relative position.
Numerical values are usually assigned.
Categorical variable
One in which the “value” taken by the variable is a non-numerical category or class.
Dot plot
Like a bar graph but with dots.
One dot per data point
Frequency table
Divide the number line into intervals.
Count the number of data points within each interval - frequency.
Relative frequency is the proportion of data points in each interval.
Guidelines for forming class intervals
(3)
- Use intervals of equal length with midpoints at convenient round numbers.
- For a small data set, use a small number of intervals.
- For a large data set, use more intervals.
Stem and leaf
e.g.:
2 1234557
3 033456
4 1234555667
5 1233
stem = tens digit
leaf = list of units digits that share that tens digit, listed in order
Summary statistics
Any set of measurements has two properties: the central or typical value and the spread about that value.
Mean
Average
Sum of data / number of data
Median
The value in the middle of all the data if it is ordered from smallest to largest.
Mode
Most common value in the data set
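These three can be computed directly; a minimal Python sketch, with a data set invented for illustration:

```python
# Mean, median, and mode of a small invented data set.
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = sum(data) / len(data)      # sum of data / number of data
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most common value

print(mean, median, mode)  # 5.0 4.0 3
```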
Interquartile range
Data are split into 4 equal groups by the quartiles.
IQR = Q3 - Q1: the spread of the middle 50% of the data.
Quartiles are like medians, but for quarters.
Box and Whisker plot
Median and interquartile range shown as the box.
Whiskers are extended to the furthest point that isn't an outlier.
Outliers are points further than 1.5x the IQR beyond the quartiles and are shown as dots.
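The outlier rule can be sketched in Python; the data values and the use of `statistics.quantiles` for the quartiles are my assumptions, and quartile conventions vary slightly between textbooks and software:

```python
# IQR and 1.5x IQR outlier fences for an invented data set.
import statistics

data = [1, 4, 5, 6, 7, 8, 9, 11, 30]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(iqr, outliers)
```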
Standard deviation
Measure of spread around the mean.
1. Calculate the mean
2. Calculate the difference between the mean and each value
3. Square the differences
4. Sum the squares
5. Divide by n - 1
6. Take the square root
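The six steps translated directly into Python; the sample data are invented for illustration:

```python
# Sample standard deviation, following the six steps.
import math

data = [4, 8, 6, 5, 3, 7]

mean = sum(data) / len(data)        # 1. calculate the mean
diffs = [x - mean for x in data]    # 2. difference from the mean
squares = [d ** 2 for d in diffs]   # 3. square the differences
total = sum(squares)                # 4. sum the squares
variance = total / (len(data) - 1)  # 5. divide by n - 1
sd = math.sqrt(variance)            # 6. square root

print(sd)
```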
Sample variance
The square of the standard deviation: same calculation, but without the final square root.
1. Calculate the mean
2. Calculate the difference between the mean and each value
3. Square the differences
4. Sum the squares
5. Divide by n - 1
Z scores
Shows how many standard deviations above the mean something is.
z = (data - mean)/std
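A one-line translation of the formula; the numbers in the example are invented:

```python
def z_score(value, mean, sd):
    """How many standard deviations `value` lies above the mean."""
    return (value - mean) / sd

# e.g. a score of 130 when the mean is 100 and the SD is 15
print(z_score(130, 100, 15))  # 2.0
```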
Bernoulli trial
(3)
- Result of each trial is a success or failure
- Probability p of success is the same in every trial
- Trials are independent.
Binomial random variable
x = number of successes
n = no. of repeated Bernoulli trials
p = probability of success
p^x (1-p)^(n-x) times the binomial coefficient nCr (n choose x)
Finding binomial coefficient
n! / (k! (n - k)!)
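Putting the coefficient and the probability formula together; `math.comb` computes n choose k, and the coin-flip numbers are invented for illustration:

```python
import math

def binomial_pmf(x, n, p):
    # P(X = x): binomial coefficient times p^x (1-p)^(n-x)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# e.g. exactly 3 heads in 5 fair coin flips
print(binomial_pmf(3, 5, 0.5))  # 0.3125
```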
Normal/Gaussian distribution
(4)
- Symmetrical about the mean
- Bell shaped
- mean, median and mode are the same
- The two tails never touch the horizontal axis
Mean in binomial distribution
mean = np
Variance in binomial distribution
variance = np(1-p)
Null hypothesis
What we assume to be true.
i.e: there is no significant difference
Alternative hypothesis
What we are testing.
ie: There is a significant difference
Type 1 error
Incorrect rejection of the null hypothesis
Type II error
Incorrect acceptance of the null hypothesis
Chi-squared test
Do the number of data in different categories fit the null hypothesis?
Look up test stat on table
Degrees of freedom = categories - 1
Chi squared equation
Chi squared = sum (O-E)^2 / E
O = observed frequency
E = expected frequency
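The equation computed directly; the observed and expected counts are invented for illustration:

```python
# Chi-squared goodness-of-fit statistic.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # degrees of freedom = categories - 1

print(chi2, df)
```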
Limits on expected numbers - Chi squared
(3)
- No expected category should be less than 1
- No more than a fifth of the expected values should be less than 5.
- It doesn’t matter what the observed values are.
What to do if expected numbers don’t fit
- Collect larger samples
- Amalgamate categories
Regression Analysis
Fits a straight line to a scatterplot.
* x is the independent variable
* y is the dependent variable.
SSE
SSE = Sum of squared differences between actual and predicted y values according to the regression line
Sum of Squares Error
Finding regression line equation
m = sum of (x - xmean)(y - ymean) over sum of (x - xmean)^2.
Gives gradient of line.
Can find intercept using mean values of x and y
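The two formulas written out; the (x, y) points are invented for illustration:

```python
# Least-squares slope m and intercept b.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
b = y_mean - m * x_mean  # line passes through (xmean, ymean)

print(m, b)
```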
Regression line always goes through…
The means of x and y
(xmean, ymean)
Sum of (x - xmean) = sum of (y - ymean) =
0
ANOVA
Analysis of variance.
Compares the difference between the predicted values and the mean (regression) with the difference between the actual values and the predicted values (error). These differences are squared and summed; R^2 = SSR / SS Total.
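A sketch of the R^2 calculation, using invented regression data; note SSR + SSE = SS Total:

```python
# R^2 = SSR / SS Total for a fitted regression line.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
m = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
b = y_mean - m * x_mean
preds = [m * x + b for x in xs]

ssr = sum((p - y_mean) ** 2 for p in preds)         # regression
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # error
sst = ssr + sse                                     # total
r_squared = ssr / sst

print(r_squared)
```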
SEM
Standard error of the mean
SD divided by square root of sample size.
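Directly from the definition; the sample values are invented:

```python
# Standard error of the mean: SD / sqrt(n).
import math
import statistics

sample = [4, 8, 6, 5, 3, 7]
sem = statistics.stdev(sample) / math.sqrt(len(sample))
print(sem)
```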
Correlation coefficient
Square root of R^2 given by the equation in ANOVA. Takes the sign of the slope.
Non-parametric tests
(5)
- Spearman’s rank correlation
- Mann-Whitney test
- Wilcoxon paired sample test
- Kruskal-Wallis
- Friedman
Non-parametric tests
Definition
For when data are not normally distributed
Spearman’s rank test
(4)
- Measures the strength of association between two variables
- Non-parametric
- Use when the variables are not normally distributed, or the data are ordinal.
- Gives an r value
Mann-Whitney test
- Non-parametric equivalent to the unpaired t-test.
- Tests for significant differences between the medians of two independent groups.
- Uses ranking
- Uses table value of U
- If calculated is lower, we reject Ho
Wilcoxon paired sample
- non-parametric equivalent to the paired t-test
- Tests for significant differences between medians of two paired observations.
- Uses table value
- If test stat smaller, we reject Ho
Kruskal-Wallis
- non-parametric one way analysis of variance
- Alternative to one-way ANOVA
- Detects differences in the medians between 3 or more treatments on different subjects.
- Extension of Mann-Whitney for more groups
- Sample size doesn’t need to be the same.
- Test statistic is compared with Chi squared distribution.
Friedman’s
- non-parametric two way analysis of variance
- non-parametric alternative to two-way ANOVA
- Detects differences in the medians between 3 or more treatments of the same subjects
- size of the sample must be the same
- Gives a test statistic that is compared to the chi-squared table
Negative skew
mean < median < mode
Positive skew
Mode < median < mean
t-test
- unpaired
- assumes equal variances
One sample t-test
Is there a difference between the group and the population?
Is the mean what it should be?
Two sample t-test
Are the means the same?
Paired samples t-test
Is there a difference between the mean at two points in time?
Can counterbalance to remove extraneous variables.
Shapiro test
Checks whether data are normally distributed
ANOVA of multiple groups
compares means of 3 or more groups.
Tells us if there is a difference, but doesn’t tell us where.
If you get a significant result, you can follow up with a post-hoc test such as Tukey's HSD to find where.
Welch test
The Student's t-test assumes equal variances; if variances are not equal, we do a Welch t-test instead.
Bonferroni Correction
When multiple tests are done, the probability of a Type I error compounds across tests.
Bonferroni corrected significance level = original alpha / number of tests
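The correction is simple arithmetic; the significance level and test count are invented for illustration:

```python
# Bonferroni correction: divide the significance level by the
# number of tests run.
alpha = 0.05   # original significance level
n_tests = 4    # number of tests performed
corrected = alpha / n_tests
print(corrected)  # 0.0125
```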