Discovering Statistics Flashcards
What is validity?
The degree to which a theory/model reflects a true/accurate picture.
What is reliability?
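The replicability of results: whether the same findings are obtained when a measurement or study is repeated under the same conditions.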
What are the characteristics of a normal distribution?
Symmetrical
Bell-shaped curve
Standard deviation determines the steepness (spread) of the curve
Unimodal
Continuous
What percentage of values fit within +/- 1.96 standard deviations in a normal distribution?
95%
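The 95% figure can be checked numerically. A minimal sketch using `statistics.NormalDist` from Python's standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

# Proportion of values lying within +/- 1.96 standard deviations of the mean.
coverage = z.cdf(1.96) - z.cdf(-1.96)
print(round(coverage, 4))  # approximately 0.95
```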
What is the standard error?
The standard deviation (variability) of the sampling distribution
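For the sample mean, the standard error is usually estimated as the sample standard deviation divided by the square root of the sample size. A short sketch under that assumption; the function name and the data are made up for illustration.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))

scores = [4, 8, 6, 5, 3, 7, 9, 6]  # made-up data for illustration
se = standard_error_of_mean(scores)
print(round(se, 3))
```

A larger sample gives a smaller standard error, because the sample mean varies less from sample to sample.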
What are point estimates?
Single numbers used to guess corresponding population parameters.
What are examples of point estimates?
Measures of central tendency such as the mean, median and mode
Measures of dispersion such as range and standard deviation
Relationships such as correlations
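Several of the point estimates listed above can be computed with Python's standard `statistics` module. A small sketch on made-up data; the variable names are illustrative only.

```python
import statistics

sample = [2, 4, 4, 5, 7, 8]  # made-up data for illustration

# Measures of central tendency
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)

# Measures of dispersion
value_range = max(sample) - min(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

print(mean, median, mode, value_range, round(sd, 3))
```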
What are interval estimates?
Ranges that quantify the uncertainty around point estimates (narrower intervals indicate more precision and less uncertainty).
What are confidence intervals?
A range of values that is likely to include a population value with a certain degree of confidence. E.g. a 95% confidence interval means that, if many samples were drawn and an interval computed from each, about 95% of those intervals would include the population mean.
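The "95% of intervals" reading can be illustrated by simulation. A rough sketch, assuming a normal population with a known standard deviation so that a simple z-based interval applies; the population values, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

rng = random.Random(42)          # fixed seed so the run is repeatable
MU, SIGMA, N = 50.0, 10.0, 30    # assumed population mean, SD, and sample size
REPEATS = 2000                   # number of samples (and intervals) to draw

hits = 0
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    # z-based 95% interval; sigma is treated as known in this sketch
    half_width = 1.96 * SIGMA / math.sqrt(N)
    if mean - half_width <= MU <= mean + half_width:
        hits += 1

coverage = hits / REPEATS  # proportion of intervals containing the true mean
print(f"{coverage:.3f}")
```

The printed proportion should land close to 0.95: each individual interval either contains the population mean or does not, and the 95% describes the procedure over many samples.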
What is the t-distribution?
A distribution used to construct confidence intervals (and tests) when the population standard deviation is not known and has to be estimated from the sample. It is centred on 0, symmetrical, and its shape changes with the degrees of freedom (as df approaches infinity, the distribution becomes normal).
What are the three levels of hypothesis?
Conceptual
Operational
Statistical
What is the scientific method?
Observation
Theory
Hypothesis/predictions
Test hypothesis
Interpret data
Reach conclusions and generate more hypotheses
What is the linear model?
A model that predicts the value of an outcome from one or more predictors.
What is the equation for the general linear model?
Outcome = b0 + (b1 × predictor) + e, where b0 is the intercept, b1 is the slope and e is the error
What is b0 (intercept)?
The value of the outcome when the predictor is 0
What is b1?
The change in the outcome for every unit change in the predictor (slope)
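With a single predictor, b0 and b1 can be estimated by ordinary least squares using closed-form formulas. A minimal sketch; the function name `fit_line` and the data are made up for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares of the predictor
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope: change in outcome per unit of predictor
    b0 = mean_y - b1 * mean_x    # intercept: outcome when the predictor is 0
    return b0, b1

hours = [1, 2, 3, 4, 5]      # made-up predictor values
marks = [3, 5, 7, 9, 11]     # made-up outcomes lying exactly on 1 + 2 * hours
b0, b1 = fit_line(hours, marks)
print(b0, b1)
```

Because the made-up outcomes lie exactly on a line, the fit recovers an intercept of 1 and a slope of 2 with no error term.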
What value is used to establish the significance of b1?
The t value, which measures how many standard errors the estimate of b1 is from 0. The further it is from 0, the stronger the evidence against the null hypothesis that b1 = 0.
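For a one-predictor model, the t value is the slope divided by its standard error. A sketch under the usual OLS formulas; the function name and data are made up for illustration, and turning t into a p-value (which needs the t-distribution with n - 2 degrees of freedom) is left out.

```python
import math

def slope_t_value(x, y):
    """Return the t value for the slope of a one-predictor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares around the fitted line
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: sqrt(residual variance / Sxx), with n - 2 df
    se_b1 = math.sqrt(ss_resid / (n - 2) / sxx)
    return b1 / se_b1

x = [1, 2, 3, 4, 5]   # made-up predictor values
y = [2, 4, 5, 4, 5]   # made-up outcomes
t = slope_t_value(x, y)
print(round(t, 3))
```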
How is model fit evaluated?
R2 and adjusted R2
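R2 lies between 0 and 1: near 0 means the model explains almost none of the variance in the outcome, near 1 means it explains almost all of it (a good fit). Adjusted R2 penalises extra predictors and can fall slightly below 0.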
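R2 compares the residual sum of squares with the total sum of squares, and adjusted R2 corrects it for the number of predictors. A small sketch; the function name, the observed values, and the predicted values are made up for illustration.

```python
def r_squared(observed, predicted, n_predictors):
    """Return (R2, adjusted R2) for a fitted model's predictions."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    r2 = 1 - ss_resid / ss_total
    # Adjustment penalises R2 for each predictor in the model
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

observed = [2, 4, 5, 4, 5]              # made-up outcomes
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]   # made-up fitted values from a one-predictor model
r2, adj_r2 = r_squared(observed, predicted, n_predictors=1)
print(round(r2, 3), round(adj_r2, 3))
```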
What is the F stat?
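The statistic that indicates whether the model as a whole explains the outcome better than chance: the ratio of variance explained by the model to unexplained variance. The further F is above 1, the stronger the evidence of a relationship between outcome and predictor.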
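The F statistic can be computed as the model mean square divided by the residual mean square. A sketch under that definition; the function name and data are made up for illustration.

```python
def f_statistic(observed, predicted, n_predictors):
    """Return F = (model mean square) / (residual mean square)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_model = ss_total - ss_resid           # variance explained by the model
    ms_model = ss_model / n_predictors
    ms_resid = ss_resid / (n - n_predictors - 1)
    return ms_model / ms_resid

observed = [2, 4, 5, 4, 5]              # made-up outcomes
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]   # made-up fitted values from a one-predictor model
f = f_statistic(observed, predicted, n_predictors=1)
print(round(f, 3))
```

With a single predictor, F equals the square of the t value for b1.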
What are outliers?
A value in the data that does not follow the trend of the rest of the data.
How can outliers be detected in the GLM?
Graphs
Standardised residuals (a case whose standardised residual is more than 3 in absolute value is a likely outlier)
Cook's distance (a value greater than 1 indicates an influential case)
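The two numerical checks can be sketched for a one-predictor model. This is an illustration under stated assumptions, not a reference implementation: standardised residuals are taken here as residual divided by the residual standard deviation (some software also divides by sqrt(1 - leverage)), and the function name, data, and planted outlier are made up.

```python
import math

def outlier_diagnostics(x, y):
    """Return (standardised residuals, Cook's distances) for a one-predictor OLS fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b0 = mean_y - b1 * mean_x
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    p = 2                                   # estimated coefficients: b0 and b1
    mse = sum(e ** 2 for e in resid) / (n - p)
    # Leverage of each case in simple regression
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    std_resid = [e / math.sqrt(mse) for e in resid]
    cooks = [(e ** 2 / (p * mse)) * h / (1 - h) ** 2
             for e, h in zip(resid, leverage)]
    return std_resid, cooks

x = list(range(1, 21))             # made-up predictor values 1..20
y = [1 + 2 * xi for xi in x]       # outcomes exactly on a line...
y[-1] += 30                        # ...except one planted outlier in the last case

std_resid, cooks = outlier_diagnostics(x, y)
# Flag cases using the thresholds from the card: |residual| > 3 or Cook's distance > 1
flagged = [i for i, (r, d) in enumerate(zip(std_resid, cooks))
           if abs(r) > 3 or d > 1]
print(flagged)
```

Only the planted case is flagged, because it is both far from the fitted line and at the extreme of the predictor, which gives it high leverage.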
If outliers are present, what should be done?
A robust estimation method should be used instead of an OLS model, as robust methods are more resistant to the influence of outliers.
What are the assumptions of the linear model?
Linearity and additivity
Normally distributed errors
Independent errors
Homoscedastic errors
What are the differences between errors and residuals?
Errors are the differences between the observed values and the values predicted by the true population model; they cannot be observed.
Residuals are the differences between the observed values and the values predicted by the model fitted to the sample; they estimate the errors.
What are independent errors?
Errors in one prediction that are unrelated to errors in another
What are homoscedastic errors?
Variance of residuals should be consistent at different levels of the predictor variable.
What is heteroscedasticity?
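The variance of the residuals is not constant across levels of the predictor: the residuals are more spread out at some levels than others, often creating a funnel shape in a residual plot.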
What should be done if the assumption of normal distribution is not met?
Normality can be ignored as long as the sample size is big enough, due to the central limit theorem (a common rule of thumb is at least 30 observations). If the sample size is small, bootstrapping can be used.
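Bootstrapping resamples the observed data with replacement to approximate the sampling distribution. A minimal sketch of a percentile bootstrap confidence interval for the mean; the function name, defaults, and data are assumptions chosen for illustration, and other bootstrap variants exist.

```python
import random
import statistics

def bootstrap_ci_mean(sample, n_resamples=5000, level=0.95, seed=0):
    """Return a percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)   # fixed seed so the run is repeatable
    n = len(sample)
    # Resample with replacement, same size as the original sample, many times
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Take the middle `level` share of the sorted bootstrap means
    lower_idx = int((1 - level) / 2 * n_resamples)
    upper_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]

# Made-up, skewed small sample for illustration
reaction_times = [1.2, 0.8, 3.5, 1.1, 0.9, 7.4, 1.0, 1.3, 2.2, 0.7]
low, high = bootstrap_ci_mean(reaction_times)
print(round(low, 2), round(high, 2))
```

The interval comes from the data themselves, so it does not rely on the errors being normally distributed.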