Research methods and statistics 3 (year two) Flashcards
Explain and define P-Hacking
Methods of manipulating data or analyses to get significant results, e.g.:
Running multiple analyses
Omitting information
Selectively controlling for variables
Analysing partway through, then collecting more data ("optional stopping"; see the simulation below)
Changing the DV
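A minimal simulation (not from the lectures) of the "analyse partway through, then collect more data" strategy; group sizes and batch sizes are arbitrary. Both groups come from the same distribution, so every "significant" result is a false positive:

```python
# Optional stopping: test after every new batch of data and stop as soon
# as p < .05. The false-positive rate ends up well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
false_positives = 0
n_simulations = 2000

for _ in range(n_simulations):
    a, b = list(rng.normal(size=10)), list(rng.normal(size=10))
    for _ in range(10):                  # up to 10 extra looks at the data
        t, p = stats.ttest_ind(a, b)
        if p < .05:                      # stop and "publish"
            false_positives += 1
            break
        a.extend(rng.normal(size=5))     # otherwise collect more data
        b.extend(rng.normal(size=5))

print(f"False-positive rate: {false_positives / n_simulations:.2%}")
```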
Explain how outliers can be an issue
Outliers in small sample sizes can be the difference between a significant and non-significant result
Non-parametric correlations can combat this (e.g. if the original test is a Pearson correlation, the non-parametric equivalent is Spearman's rho; see the sketch below)
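A quick sketch of this point using made-up data with one extreme case; a single outlier can push a Pearson correlation towards "significance" while the rank-based Spearman's rho is far less affected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=15)
y = rng.normal(size=15)          # unrelated to x
x[-1], y[-1] = 6.0, 6.0          # one extreme case in a small sample

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson    r = {r:.2f}, p = {p_r:.3f}")
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")
```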
Define regression and what it tells us
A test of whether one or more predictor variables can predict variance in an outcome variable
E.g. a clinical psychologist may want to know which variables are associated with psychosis symptoms
Tells us:
If model is a good fit
If there are significant relationships between a predictor variable and an outcome variable
The direction of the relationships
Can then make predictions beyond our data
Predicts a line of best fit for association between variables
Give the linear regression equation
Yi = (b0 + b1Xi) + ei
Yi = the outcome/variable you're predicting
b0 = intercept (constant) – the mean value of the outcome variable when the predictor in the model is 0; positions the line at the intercept
b1 = the regression coefficient for the predictor Xi (also called the parameter estimate) – tells you the slope of the line of best fit
ei = error term – the amount of variance left over (unexplained) by the model
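A minimal sketch of fitting this equation with statsmodels; the data are simulated and the true b0/b1 values are arbitrary:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)   # true b0 = 2, b1 = 0.5

X = sm.add_constant(x)                     # adds the intercept column (b0)
model = sm.OLS(y, X).fit()
b0, b1 = model.params
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print("residuals e_i:", model.resid[:3])   # variance the model leaves over
```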
Define beta slope
Slope (aka beta): the number of units of change in the dependent variable for every 1-unit change in the IV
Give the assumptions for regression
Normally distributed continuous outcome
Independent data
Ratio/interval predictors
Nominal predictors with two categories (dichotomous)
No multicollinearity for multiple regression
Be careful of influential cases
Give the parameters needed to work out how well the regression model fits the data
To work out how well the model fits the data we need to know:
Sum of squares total (SST) – uses the difference between the observed data and the mean value of the outcome
Sum of squares residual (SSR) – uses the difference between the observed data and the regression line
Sum of squares model (SSM) – uses the difference between the mean value of Y and the regression line; represents the improvement due to the model (ideally as high as possible) and is used to generate the test statistic
Note that SST = SSM + SSR (see the sketch below)
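A sketch of the three sums of squares computed by hand on simulated data, checking that SST = SSM + SSR:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = fit.fittedvalues
sst = np.sum((y - y.mean()) ** 2)      # observed vs mean of outcome
ssr = np.sum((y - y_hat) ** 2)         # observed vs regression line
ssm = np.sum((y_hat - y.mean()) ** 2)  # regression line vs mean of Y
print(sst, ssm + ssr)                  # the two match
```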
Give the equation and components for generating a regression test statistic
The test statistic tells us the ratio of explained to unexplained variance in the outcome
F test (model fit) = MSm / MSr
MSm = mean squares of the model (SSM divided by its degrees of freedom)
MSr = mean squares of the residual (SSR divided by its degrees of freedom)
The F test tells us if the model is a good fit of the data – are we explaining variance? (See the sketch below.)
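A sketch of the F ratio using hypothetical sums of squares and degrees of freedom (one predictor, n = 50, so df_model = 1 and df_residual = 48):

```python
from scipy import stats

ssm, ssr = 30.0, 45.0          # made-up values for illustration
df_m, df_r = 1, 48
ms_m, ms_r = ssm / df_m, ssr / df_r
F = ms_m / ms_r
p = stats.f.sf(F, df_m, df_r)  # survival function = upper-tail p value
print(f"F({df_m},{df_r}) = {F:.2f}, p = {p:.4f}")
```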
Define proportion of total variation and give the equation
The proportion of total variation (SST) that is explained by the model (SSM) is known as the coefficient of determination and referred to as R^2
R^2 = SSM / SST
R^2 can vary between 0 and 1 and is often expressed as a %
R^2 is not that useful if you have more than one predictor variable – with more than one, use adjusted R^2
Adjusted R^2 = how effective the model is once the number of predictors is taken into account
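A sketch of R^2 and the standard adjusted-R^2 correction (which penalises extra predictors), using hypothetical values:

```python
n, k = 50, 2                   # sample size and number of predictors
ssm, sst = 30.0, 75.0          # made-up sums of squares
r2 = ssm / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f"R^2 = {r2:.2f}, adjusted R^2 = {r2_adj:.2f}")
```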
Explain when multiple regression is needed
Two or more variables to predict our outcome
To improve explanatory potential – examine which predictors are statistically significant
Give the equation for multiple regression
Yi = (b0 + b1X1i + b2X2i) + ei
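A sketch of this equation fitted as a multiple regression in statsmodels; the variable names x1/x2 are placeholders and the data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1 + 0.5 * df.x1 - 0.3 * df.x2 + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])
fit = sm.OLS(df["y"], X).fit()
print(fit.summary())   # model fit (F), adjusted R^2, per-predictor b, t, p
```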
Explain the spss output for simple regression
Variables entered/removed allows you to double check the info you put in
Model summary: gives R2 statistic – always report adjusted R square
ANOVA: tells us about our model fit (is model a better fit than just using the mean) – F-test
Coefficients: tells us about the individual predictors in our model – whether they are significant and their direction (unstandardized coefficients)
Give an example APA style writeup for simple regression
A simple regression was carried out to investigate the relationship between ——- and ——. The regression model was significant and predicted approximately -% of variance (adjusted R^2 = .-; F(-, -) = -, p = -). ——– was a significant/non-significant predictor of ——– (b = .- (s.e. = .-); -% CI - to -; t = -, p = -)
Define multicollinearity
Multicollinearity: occurs when independent variables in a regression model are highly correlated
If two or more predictor variables in the model are highly correlated with each other, they do not provide unique/independent information to the model
Can adversely affect regression estimates
Large amounts of variance explained but no significant predictors
Explain how to identify multicollinearity
Identifying multicollinearity:
Look for high correlations between predictor variables in a correlation matrix (r > .8)
r = 1 is perfect multicollinearity – a data issue
Tolerance statistic
The percentage of variance in an IV not accounted for by the other IVs
Tolerance = 1 − R^2 (from regressing that IV on the other IVs)
High tolerance = low multicollinearity
Low tolerance = high multicollinearity
Variance inflation factor (VIF)
VIF = 1/tolerance
Indicates how much the standard error will be inflated by (see the sketch below)
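A sketch of tolerance and VIF on simulated data where one predictor nearly duplicates another, so both should show low tolerance / high VIF (variance_inflation_factor is from statsmodels):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # near-duplicate of x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i, name in enumerate(["x1", "x2"], start=1):  # skip the constant
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```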
Give ways of fixing multi-collinearity issues
Fixing multi-collinearity issues
Increase sample size
Remove redundant variable
If two or more variables are important, create a variable that takes both of them into account
Give an example APA style writeup for multiple regression
A multiple regression was conducted to investigate the roles of –, – and – on —. The regression model was – and predicted -% of variance (adjusted R^2 = -; F(-, -) = -, p = -). Variance inflation factors suggest multicollinearity was not/was a concern (– = -, – = -, – = -). – was a significant/non-significant predictor of — (b = - (s.e. = -); -% CI – to -; t = -, p = -) and – was/was not a significant predictor (b = - (s.e. = -); -% CI – to -; t = -, p = -)
- Explain what a mediator is
- Links two variables
- A mediator is a variable that is affected by the IV; the mediator in turn influences the DV
- The effect of the IV on the DV (IV-DV) is partially dependent on the mediator (IV-M-DV)
- IV-DV = direct effect (c-path)
- IV-M = a-path
- M-DV = b-path
- Full mediation: inclusion of the mediator renders the direct IV-DV effect non-significant
- Partial mediation: inclusion of the mediator renders the direct IV-DV effect weaker but still significant
- Explain the difference between mediation and moderation
Mediation: a variable that accounts for an association between a predictor and a DV
- Moderator: affects the strength of the relationship between a predictor and a DV
o A moderator does not have to be associated with the IV or DV
- A mediator MUST be something that can change, e.g. age cannot be a mediator, but craving can
o The IV has to influence the mediator, and nothing can influence age
- Give some issues of the causal steps approach
- Has little or no sensitivity (needs a huge sample)
- Mathematically incorrect (a significant IV-DV association is not actually necessary for mediation)
- Is unable to detect suppression effects
- Explain why the Sobel test is a bad solution
Gives a p value for the indirect effect
- Based upon a product of the coefficients calculation
- Assumes the product of the coefficient is normally distributed – this is almost never the case
- This method also requires more participants to detect indirect effects than the methods used today
- Explain why the joint significance test is a good solution to mediation
This method ignores the IV-DV association (it doesn't have to be significant)
o If the a-path and the b-path are both significant, there is evidence of mediation
- Also gives confidence intervals for the indirect effect
- Explain how to run an SPSS analysis for mediation
Example: IV = personality disorder, M = enhancement, DV = alcohol units consumed
- Firstly we produce two regressions
1. IV-M Personality disorder to enhancement
2. M-DV enhancement (+personality disorder) to alcohol units consumed
- We control for personality disorder in the second regression so we can be sure that the mediator is predicting variance beyond that accounted for by the IV
- Analyse – regression – linear
- Regression 1: IV to M
- To do our additional test for mediation we need to take the unstandardized regression coefficient and its standard error and use them in RMediation (we can use three d.p. when entering values into the program)
- Then run the second regression M-DV but controlling for PDQ-4 (personality disorder)
o Enhancement (and PDQ-4) to Units consumed
o If enhancement is significant in this regression then there is evidence of mediation
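A sketch of these two regressions in statsmodels on simulated data; the column names just echo the example's labels (pdq4, enhance, units) and the effect sizes are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
pdq4 = rng.normal(size=200)                      # IV (personality disorder)
enhance = 0.5 * pdq4 + rng.normal(size=200)      # M, driven by IV (a-path)
units = 0.6 * enhance + rng.normal(size=200)     # DV, driven by M (b-path)
df = pd.DataFrame({"pdq4": pdq4, "enhance": enhance, "units": units})

# Regression 1: IV -> M (the a-path)
fit_a = sm.OLS(df.enhance, sm.add_constant(df.pdq4)).fit()
# Regression 2: M -> DV, controlling for the IV (the b-path)
fit_b = sm.OLS(df.units, sm.add_constant(df[["enhance", "pdq4"]])).fit()

# Both paths significant => evidence of mediation (joint significance logic)
print(fit_a.pvalues["pdq4"], fit_b.pvalues["enhance"])
```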
- Explain how to writeup mediation in APA format
The first regression (IV-M):
o The regression was significant (adjusted R² = 0.15, F(1, 225) = 41.32, p < .001)
- Explain how to use confidence intervals
A 95% confidence interval reflects how confident we can be in our regression coefficient: it expresses the precision of our estimate – 95% of samples from this population will fall within this range (if we give 95% CIs we could equally give 99% CIs, etc.)
- High precision = a "tighter" CI; this is a good thing, as it shows consistency in an effect.
- If the CI overlaps with 0 there will not be a significant effect, as the range of predicted values overlaps with no effect (0 = no change)
- If it doesn't overlap 0 you have a significant effect (p < .05)
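A sketch of building a 95% CI from a hypothetical coefficient and standard error (exactly, the multiplier is the t critical value for the residual df, roughly 1.96 for large samples):

```python
from scipy import stats

b, se, df_resid = 0.42, 0.15, 97        # made-up values
t_crit = stats.t.ppf(0.975, df_resid)
lo, hi = b - t_crit * se, b + t_crit * se
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # excludes 0, so p < .05
```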
Explain the function of some spss values given in regression
Model fit: F test
Amount of variance explained: adjusted R^2
Significance of individual predictors - betas
Explain the difference between simple multiple and hierarchical regression
Explain why we might use hierarchical regression over simple multiple
Simple/multiple give us model fit and R squared which accounts for all predictors in our model
Simple/multiple is a simultaneous model
Hierarchical models – some strategy (or specified hierarchy) which is dictated in advance by purpose/logic of research
Allows us to be more theory driven
Allows us to adjust for variables
Partitions our explained variance
Define and explain steps in a hierarchical regression
Groups of variables that we want to look at as a distinct set of predictors
Often people have variables they want to adjust for in one step e.g age and gender
May want to put questionnaire measures in one step and behavioural measures in another
Known predictors before hypothesised predictors
Fine to tabulate the full regression analysis (including all steps), but report the final model in the text
You should also describe the set-up of the regression model (i.e. which variables were entered on each step and why)
Define and explain adjusted r^2 change
This tells us how much adjusted R^2 (i.e. the amount of variance in the outcome predicted by the model) changes with the addition of a new step, after controlling for each previous step
SPSS calculates an F-change, simply an ANOVA F value telling you whether the step (R^2 change) predicts a significant amount of variance after controlling for previous steps
These should be reported
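A sketch of a two-step hierarchical model with a hand-computed R^2-change F-test; all variables are simulated placeholders:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 120
age, gender = rng.normal(size=n), rng.integers(0, 2, size=n)
predictor = rng.normal(size=n)
y = 0.3 * age + 0.5 * predictor + rng.normal(size=n)

# Step 1: covariates only; step 2: add the predictor of interest
step1 = sm.OLS(y, sm.add_constant(np.column_stack([age, gender]))).fit()
step2 = sm.OLS(y, sm.add_constant(np.column_stack([age, gender, predictor]))).fit()

r2_change = step2.rsquared - step1.rsquared
k_added = 1                                   # predictors added at step 2
f_change = (r2_change / k_added) / ((1 - step2.rsquared) / step2.df_resid)
p_change = stats.f.sf(f_change, k_added, step2.df_resid)
print(f"R^2 change = {r2_change:.3f}, F-change p = {p_change:.4f}")
```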
Explain why outliers are an issue and how they can be identified
Outliers/influential cases can have a considerable effect on our regression parameters
An outlier is not conditional on any other variable: it is extreme only on the dependent variable
Can check for outliers using box plots
Define leverage and explain how it works
Tells us about extreme data points on the X variable
Distance between Xi and all other X data points
Between 0 and 1 (lower = better)
Sum of the leverage values = number of parameters in the model (including the intercept)
Cut-off: 3 times the number of parameters divided by the number of data points, i.e. 3(k + 1)/n
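A sketch of leverage (hat) values from statsmodels on simulated data, applying the 3(k + 1)/n cut-off above; one X value is made extreme on purpose:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=40)
x[0] = 8.0                                  # extreme on the X variable
y = 0.5 * x + rng.normal(size=40)

fit = sm.OLS(y, sm.add_constant(x)).fit()
leverage = fit.get_influence().hat_matrix_diag
cutoff = 3 * 2 / len(x)                     # 3(k + 1)/n with k = 1 predictor
print(np.where(leverage > cutoff)[0])       # flags the extreme case
print(leverage.sum())                       # sums to number of parameters (2)
```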
Define and explain Cook’s distance
If any cases have high residuals, they may distort the model's accuracy
Tells us how much the predicted Y values would move on average if the data point were removed (a lot of change indicates a large amount of influence)
As with the variance inflation factor, there is no single accepted cut-off – many different ones are used
E.g. > 1, 4/number of participants, or 3 times larger than the mean
Define DFBETAs and give the cut-offs
The difference in the regression coefficient with the data point included vs. excluded
Expressed as change in standard deviations
Cut-offs:
2/√n
Value = 1
If Cook's distance comes up on the exam: 4/(n − k − 1) (k = number of independent variables)
If DFBETAs: 2/√n
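A sketch of both diagnostics from statsmodels on simulated data, applying the two cut-offs above:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
y[0] += 5                                    # one influential case

influence = sm.OLS(y, sm.add_constant(x)).fit().get_influence()
cooks_d = influence.cooks_distance[0]        # first element = the distances
dfbetas = influence.dfbetas                  # one column per coefficient

k = 1                                        # number of predictors
print(np.where(cooks_d > 4 / (n - k - 1))[0])            # Cook's flags
print(np.where(np.abs(dfbetas) > 2 / np.sqrt(n))[0])     # DFBETA flags (rows)
```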
Explain the SPSS output for hierarchical regression
Unstandardized B (and confidence intervals) – how much the outcome (BMI in this example) increases for every 1-unit increase in our IV
Standardised B – allows us to make comparisons across predictors
R2 change is used to assess how much (more) variance is explained at each step.
F-change tells us whether the amount of extra variance explained is significant.
Give an example APA style writeup for hierarchical regression
A hierarchical linear regression was conducted to examine the effects of age, gender, self-control, eating restraint and childhood attachment on BMI. Age and gender were added at step one of the model, and self-control, eating restraint and childhood attachment at step two. Variance inflation factors suggest multicollinearity was not a concern. The final regression model was significant and explained 54% of variance (F(5, 94) = 24.60, p < .001). Age was a significant positive predictor of BMI and self-control was a significant negative predictor. Gender, eating restraint, and childhood attachment were not significant predictors (see table…)
Describe reliability
• Refers to the consistency of a measure – it is essentially about whether a measure is consistent.
• Psychologists consider three types of reliability:
1. Over time (test-retest reliability)
2. Across items (internal consistency)
3. Across different researchers (inter-rater reliability)
• Reliability measures commonly take the form of correlation coefficients, but there are different methods available.
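A sketch of one of these, internal consistency, via Cronbach's alpha computed straight from its definition: (k / (k − 1)) × (1 − sum of item variances / variance of the total score); the item data are simulated:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
items = latent + rng.normal(scale=0.8, size=(200, 5))   # 5 correlated items

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)
total_var = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```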