Statistics Flashcards
Define Population:
Full set of units that we are interested in
Define Sample:
A subset of units that we experiment on or observe
Why do we use a sample?
To draw inferences about the population
Why won’t we get exactly the right answer from a sample?
The role of chance: different samples give different answers
What is hypothesis testing?
Weighing the evidence against a claim: showing that something is unlikely to be true is easier than proving that it is true
What are the steps of hypothesis testing?
- Formulate a hypothesis
- Formulate a null hypothesis
- Calculate the chance that you might see your data if the null hypothesis is true (p value)
What is a p-value?
Probability of seeing data as extreme as, or more extreme than, the data observed if the null hypothesis is true
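Worked illustration of the steps above, as a minimal Python sketch. The experiment is hypothetical and not from these cards: a coin is flipped 20 times and lands heads 15 times. The null hypothesis is that the coin is fair, and the two-sided p-value is the probability of a result at least as far from 10 heads as the one observed.

```python
from math import comb

def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    expected = flips / 2
    observed_distance = abs(heads - expected)
    # Count every outcome that is as extreme as, or more extreme than,
    # the one observed (in either direction).
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_distance
    )
    # Under a fair coin every sequence of flips is equally likely.
    return extreme / 2 ** flips

p_value = binomial_two_sided_p(15, 20)  # hypothetical data
print(round(p_value, 4))
```

For this made-up example the p-value works out at roughly 0.04.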
What do you do if p<0.05 in the old school approach?
- Significant result
- Reject null hypothesis
- Accept alternative hypothesis
How do we interpret p-value in the modern approach of continuum of evidence?
- 0.1 = Weak evidence
- 0.05 = Moderate evidence
- 0.01 = Strong evidence
- 0.001 = Very strong evidence
What is wrong with the old school approach?
By using a strict cut-off we effectively interpret p<0.05 as statistical proof
- Doesn’t represent how strong the evidence is
What are the basic types of data?
- Numerical
- Categorical
What is numerical data?
Any data that can be expressed with numbers
What are the two main sub-types of numerical data?
- Continuous
- Count
What is continuous data?
Can take any value within a range
What is an example of continuous data?
Height
Blood pressure
Time
What is count data?
Takes only integer values and represents a count of discrete things
What is an example of count data?
Number of visits to A&E
Number of children
What is categorical data?
Things that do not have an inherent numerical value
What are the main subtypes of categorical data?
- Nominal
- Ordinal
What is nominal data?
Things with no inherent order
What are examples of nominal data?
Eye colour
Blood type
What is ordinal data?
Things with an inherent order
What are examples of ordinal data?
Large/Small
Education level
Age group
What is descriptive statistics used for?
To describe the data in your sample
What is inferential statistics used for?
To draw inferences about the population from the sample
Summarising categorical data:
- Data can take on 1 of a number of categories
- Number of categories is small
- Use a frequency table
What do frequency tables allow?
To see which category is most common, least common and which categories occur more frequently
What is a problem with frequency tables?
Cannot immediately see what share of the sample is contained in each category
What can you do to see what share of sample is contained in each category?
Percentages
What are types of graphical summary of categorical data?
- Bar charts
- Pie charts
What does the height of a bar chart represent?
Number of occurrences (frequency) in that category
What does grouping data turn continuous data into?
Categorical data
What can you do instead of grouping data?
Plot histograms
What is the total area of histogram?
1
What equation is used to calculate the density of histograms?
Density = proportion in bin/bin width
Are there gaps between bins in histograms?
No
What does the height of a bin in a histogram indicate?
Relative frequency of observations per unit of bin width (the density)
What does using density allow us to compare in histograms?
Bins of different widths
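A minimal Python sketch of the density calculation, assuming made-up height data and hypothetical bin edges with unequal widths. It applies density = proportion in bin / bin width and checks that the bar areas add up to 1.

```python
def histogram_density(values, bin_edges):
    """Return the density of each bin: proportion in bin / bin width."""
    n = len(values)
    densities = []
    for i, (left, right) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        last = i == len(bin_edges) - 2
        # Bins are [left, right); the final bin also includes its right edge.
        count = sum(
            1 for v in values if left <= v < right or (last and v == right)
        )
        densities.append((count / n) / (right - left))
    return densities

heights_cm = [150, 155, 158, 160, 162, 165, 168, 170, 175, 190]  # made-up data
edges = [150, 160, 170, 190]  # hypothetical bins of width 10, 10 and 20

density = histogram_density(heights_cm, edges)
# Area of each bar = density * width; the areas should sum to 1.
total_area = sum(d * (r - l) for d, l, r in zip(density, edges[:-1], edges[1:]))
print(density, total_area)
```

The last bin holds the same share of the sample as the first but is twice as wide, so its bar is half as tall.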
If a histogram has a heavy tail does it have a high or low kurtosis?
High
If a histogram has a light tail does it have a high or low kurtosis?
Low
What does location mean?
Defines where data are located in the range of possible values
What are the three common measures of average?
- Mean
- Mode
- Median
What is a mean?
Equal to the sum of the values divided by the number of values
What is a median?
- Rank data in order
- Median is the middle number
- If even number of data points, no single point so take mean of 2 middle values
What is a mode?
Most commonly occurring value
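The three measures of average can be checked with Python's standard `statistics` module. This is a small sketch on made-up count data, not data from these cards.

```python
import statistics

visits = [0, 1, 1, 2, 3, 5]  # made-up count data, already in rank order

mean_value = statistics.mean(visits)      # sum of values / number of values
median_value = statistics.median(visits)  # even n: mean of the 2 middle values
mode_value = statistics.mode(visits)      # most commonly occurring value

print(mean_value, median_value, mode_value)
```

Here the mean is 2, the median is 1.5 (the mean of the two middle values, 1 and 2) and the mode is 1.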
What is dispersion?
Technical name for the spread or variability of the data
What are the three common measures of spread?
- Standard deviation
- Interquartile range
- Range
What is standard deviation?
Equal to the square root of the mean of the squared differences between each value and the mean
What is the interquartile range?
Range covering the middle 50% of the data: 25% of the data lie above it and 25% below it
What is the range?
Simply the smallest and largest values (or the difference between them)
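A minimal Python sketch of the three measures of spread on made-up data. Two points are assumptions to be aware of: the standard deviation here divides by n, following the definition on the card, whereas the sample standard deviation (`statistics.stdev`) divides by n - 1; and quartile conventions differ between packages, so the IQR may not match other software exactly.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = sum(data) / len(data)
# Square root of the mean of the squared differences from the mean.
sd = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

# Quartiles using the default method of statistics.quantiles.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Range expressed as largest value minus smallest value.
data_range = max(data) - min(data)

print(sd, iqr, data_range)
```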
What is an alternative to histograms?
Box and whiskers plots
What is good about box and whiskers plots?
Can be easier to compare between groups
What are outliers on a box and whisker plot?
More than 1.5 IQRs above the upper quartile or below the lower quartile
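The 1.5 IQR rule can be sketched in Python as below, using made-up waiting times. The sketch flags values beyond either quartile (the usual convention), and the quartiles come from the default method of `statistics.quantiles`, which is one convention among several.

```python
import statistics

def find_outliers(values):
    """Flag values more than 1.5 IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

waiting_times = [10, 12, 13, 14, 15, 16, 18, 60]  # made-up data
outliers = find_outliers(waiting_times)
print(outliers)
```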
What level of skewness does symmetrical data have?
0
What level of kurtosis does normally distributed data have?
3
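A minimal Python sketch of moment-based skewness and kurtosis, assuming the simple (uncorrected) formulas; statistical packages often apply small-sample corrections or report kurtosis minus 3. The data are made up, and the normal sample is simulated with a fixed seed.

```python
import random

def central_moment(values, power):
    """Mean of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** power for x in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # made-up symmetrical data
right_skewed = [1, 1, 1, 2, 10]    # made-up data with a long right tail

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(50_000)]

print(skewness(symmetric), skewness(right_skewed), kurtosis(normal_sample))
```

The symmetrical data give a skewness of 0, and the simulated normal sample gives a kurtosis close to 3.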
Do continuous and count data come under numerical or categorical data?
Numerical
What are further examples of continuous data?
- Blood pressure
- BMI
- Size of an orange
What are further examples of count data?
- Number of headaches
- Number of people with diabetes
- Number of oranges
Do nominal and ordinal data come under numerical or categorical data?
Categorical
What are further examples of nominal data?
- Ethnicity
- Blood types
- Variety of orange
What are further examples of ordinal data?
- Disease severity
- Satisfaction rating
- Orange quality rating
What is the standard error on the mean?
Equal to the standard deviation of the sample divided by the square root of the sample size
When does standard error on the mean increase?
With increasing standard deviation
Why does standard error on the mean increase with increasing standard deviation?
The more variability there is in the population, the more uncertainty there is in our estimate
When does the standard error on the mean decrease?
With increasing sample size
Why does the standard error on the mean decrease with increasing the sample size?
The bigger the sample size, the more information we have and the more precise our estimate
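A short Python sketch of the standard error formula, using made-up numbers, to show both effects: a larger standard deviation increases the standard error and a larger sample size decreases it. The function names are illustrative only.

```python
from math import sqrt
import statistics

def sem_from_summary(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / sqrt(n)

def sem_from_sample(values):
    """Same formula, using the sample standard deviation of the data."""
    return sem_from_summary(statistics.stdev(values), len(values))

baseline = sem_from_summary(sd=10, n=25)
more_variable = sem_from_summary(sd=20, n=25)   # doubling the SD doubles it
bigger_sample = sem_from_summary(sd=10, n=100)  # 4x the sample size halves it

print(baseline, more_variable, bigger_sample)
```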
What is a t-test?
T-tests are used to test whether the means in two groups are different from each other
- Used for continuous data
What are inferential tests?
Testing whether the difference in our sample reflects a difference in the population
Are t-tests weakly or strongly related to the standard error on the mean?
Strongly
What does it mean that t-tests are strongly related to the standard error on the mean?
As the standard deviation of the population increases, the precision of our estimate gets worse; as the sample size goes up, the precision of our estimate gets better
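The link to the standard error can be seen in the test statistic itself: the difference in means divided by the standard error of that difference. Below is a minimal Python sketch of the equal-variance (pooled) version on made-up data. Turning t into a p-value needs the t distribution with n_a + n_b - 2 degrees of freedom, which is not in Python's standard library, so that step is left out.

```python
from math import sqrt
import statistics

def pooled_t_statistic(group_a, group_b):
    """t statistic for an unpaired t-test assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(group_b)
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    # Standard error of the difference between the two means.
    se_diff = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / se_diff

treated = [5, 6, 7, 8, 9]  # made-up data
control = [1, 2, 3, 4, 5]  # made-up data
t_stat = pooled_t_statistic(treated, control)
print(t_stat)
```

In this example the means differ by 4 and the standard error of the difference is 1, so t is 4.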
What are the assumptions of t-tests?
- Data in each group are normally distributed in population
- Variance (SD) is constant across groups
- Data points are independent of each other
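What does it mean for data points to be independent of each other?
They are unrelated: the value of one data point tells us nothing about another
How can the independence assumption of t-tests be broken?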
- Before and after data on the same people
- Small sample from the same area
- Using same piece of equipment when collecting subsets of our data
What is a t-test with unequal variances?
Version of the t-test (commonly called Welch's t-test) which assumes unequal, rather than equal, variances
When would you use a t-test with unequal variances?
Where standard deviations from two different groups are quite different
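A minimal Python sketch of the unequal-variance t statistic on made-up data, using the standard Welch formulas. Each group keeps its own variance, and the degrees of freedom are approximated with the Welch-Satterthwaite equation. As before, the p-value step needs the t distribution and is not shown.

```python
from math import sqrt
import statistics

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each group's mean, kept separate.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

similar_spread = welch_t([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])        # made-up data
different_spread = welch_t([5, 6, 7, 8, 9], [0, 10, 20, 30, 40])  # made-up data
print(similar_spread, different_spread)
```

When the two groups have the same size and variance, the result matches the equal-variance test (8 degrees of freedom here). When the spreads differ a lot, the degrees of freedom shrink.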
What are the assumptions for unpaired t-test assuming unequal variance?
- Normally distributed data in each group
- Independent data points
What are the assumptions for paired t-tests?
- Normally distributed data in each group
- Constant variance across groups
Why do we use ANOVA instead of t-tests?
- When there are more than two groups to compare
Why use ANOVA?
Look overall at the data and see if there are any differences by groups, rather than comparing individual groups to each other
What is ANOVA?
Analysis of Variance
- Partitions the variance in the data into that which can be attributed to differences between groups and that which is left over
How do you interpret an ANOVA?
- The p-value tells us how much evidence there is of some variability between groups
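A minimal Python sketch of how one-way ANOVA partitions the variance, on made-up data for three groups. It computes the between-group and within-group (left over) sums of squares and the F statistic. Converting F to a p-value needs the F distribution, which is not in Python's standard library, so that step is omitted.

```python
def one_way_anova(groups):
    """Return (F statistic, between-group SS, within-group SS)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variability attributed to differences between the group means.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability left over within each group.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between, ss_within

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
f_stat, ss_between, ss_within = one_way_anova(groups)
print(f_stat, ss_between, ss_within)
```

The two sums of squares add up to the total sum of squares around the grand mean, which is what partitioning the variance means here.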
What are post-hoc pairwise comparisons?
Comparisons between individual pairs of groups, made after an overall assessment
Carried out after ANOVA
Can you use post-hoc pairwise comparison with t-tests?
No
What are ANOVA assumptions?
- Data in each group are normally distributed in the population
- Variance (SD) is constant across groups
- Data points are independent of each other
If the data meet the assumption of equal variance, use?
T-test
If the data points are not independent (paired data), use?
Paired t-test
If the data within groups are not normally distributed, use?
Mann-Whitney test
What is the Mann-Whitney test?
An alternative to a t-test when we have non-normally distributed data in each group
Comparing continuous data between two groups
What is the general null hypothesis of a t-test?
Two groups have the same mean in the population
What is the general null hypothesis of the Mann-Whitney test?
If we select one value at random from each group, the value from the first group will be larger than the value from the second group 50% of the time
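The quantity in this null hypothesis can be estimated directly from the data. Below is a minimal Python sketch on made-up data that compares every value in the first group with every value in the second. Counting ties as half is a common convention and an assumption here; the number of wins before dividing is the Mann-Whitney U statistic. The p-value calculation is not shown.

```python
def prob_first_larger(group_a, group_b):
    """Proportion of (a, b) pairs in which a is larger, ties counted as half."""
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1
            elif a == b:
                wins += 0.5
    # wins is the U statistic for group_a; dividing by the number of pairs
    # turns it into a proportion.
    return wins / (len(group_a) * len(group_b))

group_1 = [3, 5, 7]  # made-up data
group_2 = [1, 2, 6]  # made-up data
proportion = prob_first_larger(group_1, group_2)
print(proportion)
```

A proportion near 0.5 is what the null hypothesis predicts; values far from 0.5 suggest one group tends to be larger.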
What are non-parametric tests?
Tests that make no assumptions about the form (distribution) of the data
Which test, applied where a t-test is appropriate, yields a larger p-value?
Wilcoxon test
When do you use a Wilcoxon signed-rank test?
Paired data
When can you use a Kruskal-Wallis test?
More than 2 groups
What are the alternatives to the Mann-Whitney test for non-normally distributed data that are analogous to paired t-tests and ANOVA?
- Wilcoxon signed-rank test (analogous to the paired t-test)
- Kruskal-Wallis test (analogous to ANOVA)
What test is used for categorical data?
Chi-squared test
What are assumptions for chi-squared test?
- Data points are independent
- Data are described by the binomial distribution
- Expected count of at least 5 in each cell
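A minimal Python sketch of the chi-squared calculation for a made-up 2x2 table. It works out the expected count in each cell (row total x column total / grand total), which is what the "at least 5" assumption refers to, and then the test statistic. No continuity correction is applied, and the p-value step is not shown.

```python
def chi_squared(table):
    """Return (chi-squared statistic, table of expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count if the row and column variables were unrelated.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    return statistic, expected

observed = [[20, 30], [30, 20]]  # made-up counts: rows = groups, columns = outcome
statistic, expected = chi_squared(observed)
smallest_expected = min(min(row) for row in expected)
print(statistic, smallest_expected)
```

If `smallest_expected` falls below 5, the expected-count assumption is not met.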
What is used instead of the chi-squared test if an expected count is small?
Fisher’s exact test
What is used for correlation analysis?
Pearson’s correlation coefficient, usually written r (rho is used for the population value)
What range of values can the correlation coefficient take?
-1 to 1
What do the different correlation values mean?
1 = Perfect positive correlation
0 = No correlation
-1 = Perfect negative correlation
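A minimal Python sketch of Pearson's correlation coefficient computed from its definition, with three made-up data sets chosen to land on the values 1, 0 and -1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4]                             # made-up data
r_positive = pearson_r(x, [2, 4, 6, 8])      # y rises in a straight line with x
r_none = pearson_r(x, [1, -1, -1, 1])        # no linear relationship
r_negative = pearson_r(x, [8, 6, 4, 2])      # y falls in a straight line with x
print(r_positive, r_none, r_negative)
```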
What are the assumptions of correlation analysis?
- Data points are independent
- One set of data is normally distributed for any given value of the other, with constant variance
- Relationship is linear
What is linear regression?
Similar to correlation: looks at the relationship between two continuous variables
Fits a functional form
y = bx + c
What can you get from linear regression?
- p-value
- R-squared value
What is the r-squared value?
Proportion of variance in the outcome that is explained by the exposure
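A minimal Python sketch of least-squares linear regression for y = bx + c on made-up data, returning the slope b, the intercept c and the R-squared value. The p-value for the slope needs the t distribution and is not shown.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b*x + c; returns (b, c, r_squared)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope
    c = mean_y - b * mean_x   # intercept

    # R-squared: 1 - (variation left over / total variation in the outcome).
    ss_residual = sum((y - (b * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return b, c, 1 - ss_residual / ss_total

exposure = [1, 2, 3, 4, 5]   # made-up data
outcome = [3, 5, 8, 9, 10]   # made-up data
b, c, r_squared = fit_line(exposure, outcome)
print(b, c, r_squared)
```

For these numbers the fitted line is y = 1.8x + 1.6 and R-squared is about 0.95.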
Assumptions of linear regression:
- Data points are independent
- Outcome data is normally distributed for any given value of the exposure
- Outcome data has a constant variance for all values of the exposure
- Relationship is linear
Choose a statistical test - 2 continuous variables:
If the assumptions of normality, constant variance or linear relationship are not met
Spearman’s correlation
Choose a statistical test - 2 categorical variables:
If the assumption of at least 5 expected counts in each cell is not met
Fisher’s exact test
Choose a statistical test - 1 continuous and 1 categorical variable:
- Binary categorical variable
T-tests
Choose a statistical test - 1 continuous and 1 categorical variable:
- Categorical variable with more than 2 categories
ANOVA
Choose a statistical test - 1 continuous and 1 categorical variable:
- Assumption of normality not met for a binary categorical variable
Mann-Whitney test
Choose a statistical test - 1 continuous and 1 categorical variable:
- Assumption of normality not met for a categorical variable with more than 2 categories
Kruskal-Wallis test
Choose a statistical test - 1 continuous and 1 categorical variable:
If constant variance is not met for a binary categorical variable
T-test for unequal variance
Choose a statistical test - 1 continuous and 1 categorical variable:
If independence is not met for a binary categorical variable
Paired t-test
Choose a statistical test - 1 continuous and 1 categorical variable:
If independence is not met for categorical with more than 2 categories
Repeated measures ANOVA
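The cards for 1 continuous and 1 categorical variable can be collected into a small lookup, sketched in Python below. The function name and arguments are illustrative only. The cards treat one failed assumption at a time, so the order in which the checks are made when several assumptions fail is an assumption of this sketch.

```python
def choose_test(n_categories, normal=True, equal_variance=True, independent=True):
    """Suggest a test for 1 continuous and 1 categorical variable."""
    if n_categories < 2:
        raise ValueError("need at least 2 categories to compare")
    if n_categories == 2:
        # Assumed order of checks: independence, then normality, then variance.
        if not independent:
            return "Paired t-test"
        if not normal:
            return "Mann-Whitney test"
        if not equal_variance:
            return "T-test for unequal variance"
        return "T-test"
    if not independent:
        return "Repeated measures ANOVA"
    if not normal:
        return "Kruskal-Wallis test"
    return "ANOVA"

print(choose_test(2))
print(choose_test(3, normal=False))
```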
What do confidence intervals indicate?
Range of plausible values for the thing we are trying to estimate
If p>0.05, does the 95% confidence interval for a difference include zero or not?
Includes zero
If p<0.05, does the 95% confidence interval for a difference include zero or not?
Does not include zero
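The link between the confidence interval and the p-value can be sketched in Python as below. This assumes a large-sample normal approximation (for small samples a t-based interval would be used instead), and the differences and standard errors are made up.

```python
from statistics import NormalDist

def confidence_interval(estimate, standard_error, level=0.95):
    """Normal-approximation confidence interval for an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return estimate - z * standard_error, estimate + z * standard_error

def two_sided_p(estimate, standard_error):
    """Normal-approximation p-value for the null hypothesis of zero."""
    z = abs(estimate / standard_error)
    return 2 * (1 - NormalDist().cdf(z))

# Made-up differences in means, each with a standard error of 1.
ci_a, p_a = confidence_interval(2.0, 1.0), two_sided_p(2.0, 1.0)
ci_b, p_b = confidence_interval(1.0, 1.0), two_sided_p(1.0, 1.0)
print(ci_a, p_a)
print(ci_b, p_b)
```

In the first case the interval excludes zero and p is below 0.05; in the second the interval includes zero and p is above 0.05.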
What can error bars show?
Standard errors
Standard deviations
Confidence intervals
Key information to include with figure legends for graphs?
- Meaning of different symbols
- What the error bars represent
- Provide p-value
- State what statistics are used
- Describe everything on graphs
- Sample size (n)