QDA Flashcards
3 ways to numerically summarise a categorical variable
- Frequencies or counts
- Relative frequencies
- Cumulative relative frequencies
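A minimal sketch of the three summaries above, in Python with only the standard library. The colour data and the function name summarise_categories are made up for illustration.

```python
from collections import Counter
from itertools import accumulate

def summarise_categories(values):
    """Return rows of (category, count, relative freq, cumulative relative freq)."""
    counts = Counter(values)
    total = sum(counts.values())
    categories = sorted(counts)  # fixed order so the cumulative column is reproducible
    relative = [counts[c] / total for c in categories]
    cumulative = list(accumulate(relative))
    return [(c, counts[c], r, cum)
            for c, r, cum in zip(categories, relative, cumulative)]

# Hypothetical sample of a categorical variable
colours = ["red", "blue", "red", "green", "blue", "red", "red", "blue"]
table = summarise_categories(colours)
for row in table:
    print(row)
```

The cumulative column always ends at 1, which is a quick check that the relative frequencies were computed over the right total.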
How can categorical variables be summarised visually?
Bar and pie charts
What are bar charts for and what are the y and x axes?
Representing the frequency of each category; the y axis shows the frequencies and the x axis shows the categories
What are pie charts for?
Representing the relative frequency (share of the total) of each category as a slice of the pie
When describing the contents of a numerical variable we can look at different aspects of its distribution such as:
Measures of location such as the mean
Measures of spread and variability
Extreme values
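As a rough illustration of these three aspects, the sketch below uses Python's statistics module on a made-up sample. Flagging extremes with the 1.5 x IQR rule is an assumption here; it is one common convention, not the only one.

```python
import statistics

# Hypothetical numerical variable (heights in cm) with one unusually large value
heights = [150, 152, 155, 158, 160, 161, 163, 165, 168, 210]

# Measures of location
mean_value = statistics.mean(heights)
median_value = statistics.median(heights)

# Measures of spread and variability
sd = statistics.stdev(heights)                     # sample standard deviation
q1, q2, q3 = statistics.quantiles(heights, n=4)    # quartiles
iqr = q3 - q1

# Extreme values: points beyond 1.5 * IQR from the quartiles (assumed rule)
extremes = [x for x in heights
            if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("mean:", mean_value, "median:", median_value)
print("sd:", round(sd, 2), "IQR:", iqr)
print("extreme values:", extremes)
```

Note how the single extreme value pulls the mean above the median, which is why both are worth reporting.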
When is a t.test used?
To compare means when the observations are independent and the errors are normally distributed. The test statistic is calculated from the sample mean(s)
What is the Wilcoxon rank sum test?
How does it work?
Non-parametric alternative to a t.test (used when we cannot assume a normal distribution)
Pools all measurements from both groups into one column, assigns a rank to each value, and compares the sum of the ranks in each group
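The ranking step can be sketched as follows. This is only the rank-sum statistic under stated assumptions (ties get the average rank, data are made up); it does not compute a p value, for which a statistics package would normally be used.

```python
def rank_sum(group_a, group_b):
    """Pool both groups, rank the values, and return the rank sum of group_a."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the average of the 1-based ranks i+1 .. j
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[x] for x in group_a)

# Hypothetical measurements from two independent groups
a = [1.1, 2.3, 3.0]
b = [2.3, 4.5, 5.2]
w_a = rank_sum(a, b)
w_b = rank_sum(b, a)
print("rank sum A:", w_a, "rank sum B:", w_b)
```

The two rank sums always add up to n(n + 1)/2 for n pooled values, so a very small or very large rank sum for one group suggests the groups differ in location.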
What does a scatter plot do?
What to look for and how to interpret?
Displays two numerical variables of interest, one along the x axis (independent) and one along the y axis (dependent)
Whether the relation is positive or negative; linear, quadratic or exponential; strong or weak; and whether there are outliers
Two main types of analysis
Descriptive - Describing data using numerical or graphical summaries
Inferential - Using sample data to make a conclusion on larger populations
What are the main data types?
Categorical - Attributes observed for each sampling unit (nominal or ordinal; binary when there are only two categories)
Numerical - Numerical value on a discrete or continuous scale
What is a confidence interval?
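A range of plausible values for the population mean/proportion, calculated from a sample. If the exercise was repeated many times, about 95% of the 95% intervals built this way would contain the true value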
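A minimal sketch of a confidence interval for a mean. It assumes a normal (z) critical value, which is a large-sample approximation; for small samples a t critical value is the usual choice. The sample values and function name are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Approximate CI for the mean: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% level
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width

# Hypothetical measurements
sample = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 5.1]
low, high = mean_confidence_interval(sample)
print("95% CI for the mean:", round(low, 3), "to", round(high, 3))
```

A larger sample or a lower confidence level gives a narrower interval.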
P value rule
P value <= α: Reject the null (significant)
P value > α: Fail to reject the null (not significant)
(With the usual α = 0.05, the P value should be 0.05 or less for a difference to be significant)
What does it mean to test a null hypothesis?
The null hypothesis is the default claim you are trying to find evidence against.
For example, testing that the mean has a specific value against an alternative hypothesis that it does not:
H0: μ = μ0
H1: μ ≠ μ0
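For a null hypothesis that the mean equals a specific value, the usual test statistic is the one-sample t statistic, sketched below. The data and the function name one_sample_t are made up; the sketch stops at the statistic because Python's standard library has no t distribution for the p value.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesised mean) / (sample sd / sqrt(n))."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error

# Hypothetical sample; H0 says the population mean is 5.0
sample = [5.1, 4.9, 5.3, 5.2, 5.0]
t_stat = one_sample_t(sample, mu0=5.0)
print("t statistic:", round(t_stat, 3), "on", len(sample) - 1, "degrees of freedom")
```

The further the t statistic is from zero, the stronger the evidence against H0.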
What are the type 1 and type 2 error probabilities?
α = P(type 1 error) = P(reject H0 | H0 is true)
β = P(type 2 error) = P(fail to reject H0 | H0 is false)
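To make the two probabilities concrete, the sketch below computes the type 2 error for a two-sided z test of a mean. It assumes the population standard deviation is known, which is a simplification, and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def type2_error(mu0, mu1, sigma, n, alpha=0.05):
    """P(fail to reject H0) when the true mean is mu1, for a two-sided z test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # How far the true mean sits from mu0, in standard-error units
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    return z.cdf(z_crit - shift) - z.cdf(-z_crit - shift)

# Hypothetical setting: H0 mean 100, true mean 105, sd 15
beta_small = type2_error(mu0=100, mu1=105, sigma=15, n=20)
beta_large = type2_error(mu0=100, mu1=105, sigma=15, n=80)
print("type 2 error with n=20:", round(beta_small, 3))
print("type 2 error with n=80:", round(beta_large, 3))
```

With alpha held fixed, a larger sample lowers the type 2 error. When the true mean equals mu0 the function returns 1 - alpha, the chance of correctly failing to reject H0.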
How to test for normality
Quantile-quantile (Q-Q) plot
Numerical tests for normality
Kolmogorov-Smirnov (K-S) test
Shapiro-Wilk test
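A Q-Q plot pairs each sorted observation with the normal quantile expected at its position. The sketch below computes those pairs without drawing them; the plotting positions (i - 0.5)/n are one common choice and an assumption here, and the data are made up.

```python
from statistics import NormalDist

def qq_points(sample):
    """Return (theoretical normal quantile, observed value) pairs for a Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    # Plotting positions strictly between 0 and 1 (assumed convention)
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    theoretical = [NormalDist().inv_cdf(p) for p in probs]
    return list(zip(theoretical, ordered))

# Hypothetical sample
sample = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0, 5.1, 4.7, 5.4]
points = qq_points(sample)
for theo, obs in points:
    print(round(theo, 2), obs)
```

If the sample is roughly normal, plotting these pairs gives points close to a straight line.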
How do you test for variance?
Give the definition of each test
ANOVA (analysis of variance) is used to determine if there is a statistically significant difference between the means of two or more categorical groups, by comparing the variance between groups with the variance within groups
Fisher's F test compares two variances by dividing the larger variance by the smaller variance
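The variance ratio itself is simple to compute, as in the sketch below. The data and function name are made up, and the sketch gives only the F ratio, not its p value.

```python
from statistics import variance

def variance_ratio(group_a, group_b):
    """F ratio: the larger sample variance divided by the smaller one."""
    var_a, var_b = variance(group_a), variance(group_b)
    return max(var_a, var_b) / min(var_a, var_b)

# Hypothetical samples with the same mean but different spread
wide = [10.0, 12.0, 14.0, 16.0, 18.0]
narrow = [13.0, 14.0, 15.0, 16.0, 17.0]
f_ratio = variance_ratio(wide, narrow)
print("F ratio:", f_ratio)
```

A ratio close to 1 is consistent with equal variances; the further above 1, the stronger the evidence that the variances differ.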
What is a prop test?
A test of a hypothesis about a population proportion from a sample, which also gives a confidence interval for that proportion
It can also test whether the proportions in several groups are the same
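A rough sketch of a confidence interval for a single proportion. It assumes the simple normal-approximation (Wald) interval, which is not necessarily the method a given prop test implementation uses; the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical survey: 40 "yes" answers out of 100
low, high = proportion_ci(40, 100)
print("95% CI for the proportion:", round(low, 3), "to", round(high, 3))
```

The approximation works best with large samples and proportions that are not close to 0 or 1.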
What can we use for hypothesis testing?
t.test: tests whether the mean of a sample from a population equals a hypothesised value
What is a correlation test used for?
Provide an example
Measuring the strength and direction of the linear relationship between two numerical variables, often as a pre-step to linear regression
E.g. speed vs distance
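A minimal sketch of the Pearson correlation coefficient for a speed vs distance example. The numbers are invented for illustration and the sketch does not include the significance test.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: speed and stopping distance
speed = [10, 20, 30, 40, 50]
distance = [12, 28, 55, 90, 140]
r = pearson_r(speed, distance)
print("correlation:", round(r, 3))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.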
What type of variables does linear regression use?
Two numeric (continuous) variables: one independent and one dependent
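A sketch of simple linear regression by least squares on two numeric variables. The function name fit_line and the data are made up; SSE here means the sum of squared residuals.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x, plus R squared."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    fitted = [intercept + slope * a for a in x]
    sse = sum((b - f) ** 2 for b, f in zip(y, fitted))   # unexplained variation
    sst = sum((b - my) ** 2 for b in y)                  # total variation
    r_squared = 1 - sse / sst
    return slope, intercept, r_squared

# Hypothetical data: independent variable x, dependent variable y
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept, r_squared = fit_line(x, y)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("R squared:", round(r_squared, 3))
```

R squared is the share of the variation in the dependent variable that the fitted line explains.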
What is Pearson's chi-squared test (χ²)?
Used to discover whether there is a relationship between two categorical variables
What is the difference between a one way ANOVA and two way?
One way ANOVA is a parametric test used to determine whether there are any significant differences between the means of two or more independent groups defined by one categorical variable
Two way ANOVA tests the effect of two categorical independent variables on a continuous dependent variable
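For the one way case, the F statistic compares variation between group means with variation within groups. A minimal sketch with made-up groups follows; it gives the F statistic only, not the p value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three independent groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print("F statistic:", round(f_stat, 3))
```

A large F means the group means differ by more than the within-group scatter would explain.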
How to graphically show the variance in a categorical and continuous variable?
Box plots
Name 4 diagnostic plots to test lm models
Residuals vs fitted
Normal Q-Q
Scale-location
Residuals vs leverage
What to look for in a residuals vs fitted diagnostic plot
The points should be randomly scattered around zero with no pattern; otherwise it suggests issues with the model assumptions
What to look for in a Q-Q plot
The points should lie close to a straight line; departures suggest the residuals are not normally distributed
How to analyse models in lm
Discuss coefficients
Linear relations
Whether the ratio of SSR to SSE is significant (the F-test)
R² value
Outliers
Unwanted patterns in residuals requiring transformation
Check if the residuals fit the assumption of homoscedasticity
What different types of models are there?
Linear regression
Multiple regression
ANOVA and ANCOVA
Logistic regression
When do you use ANOVA?
When do you use One way and multi-way
When all explanatory variables are categorical
One way is used when there is one factor or categorical independent variable
Multi way is when there’s more than one categorical independent variable
Different types of transformation techniques for models
Log of the dependent variable
Square of the independent variable
Reciprocal (1/x) of the independent variable
Joining categories
How to graphically represent a fully numeric dataset
What does it do?
Through plot() or pairs() on the dataset
It plots every numeric variable against every other one at once, as a matrix of scatter plots
When do you use a logistic regression?
When the dependent variable is binary; the explanatory variables can be categorical or numerical
What do you use a chi-squared test for?
When you want to find out if two categorical variables are independent
If any of the expected frequencies are less than 5, use Fisher's exact test instead
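The chi-squared statistic and the expected-frequency check can be sketched as below. The table is hypothetical, and the sketch applies no continuity correction, so its value can differ from software that corrects 2 x 2 tables by default.

```python
def chi_squared(table):
    """Return (chi-squared statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected

# Hypothetical 2 x 2 table of counts for two categorical variables
observed = [[10, 20],
            [20, 10]]
stat, expected = chi_squared(observed)
too_small = any(e < 5 for row in expected for e in row)
print("chi-squared:", round(stat, 3))
print("any expected count below 5:", too_small)
```

When the check reports an expected count below 5, the chi-squared approximation is unreliable and Fisher's exact test is the safer choice.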