Parametric Tests and Assumptions Flashcards
What do parametric tests assess?
What is required to run them?
- Parametric tests look at group means
- Require data to follow a normal distribution
- Can deal with unequal variances across groups
- Generally are more powerful
- Can still produce reliable results with continuous data that are not normally distributed, provided sample size requirements are met (central limit theorem)
If the data do not meet parametric assumptions, what non-parametric tests would you use?
- Non-parametric versions of those tests. For correlation, for example, a Spearman's correlation test (see the sketch below).
- Non-parametric tests assess group medians rather than means, and they don't require a normal distribution.
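A minimal sketch (Python, assuming SciPy is installed; the paired scores are hypothetical) of swapping the parametric Pearson correlation for its non-parametric counterpart, Spearman's:

from scipy import stats

# Hypothetical paired observations
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 8, 6, 9]

r, p_pearson = stats.pearsonr(x, y)       # parametric: assumes normality
rho, p_spearman = stats.spearmanr(x, y)   # non-parametric: works on ranks

print(f"Pearson r = {r:.3f} (p = {p_pearson:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {p_spearman:.3f})")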
What is the loophole with parametric tests when continuous data are not normally distributed (and therefore, according to the assumptions, you should perhaps choose a non-parametric test)?
- The loophole is the central limit theorem: if sample size requirements are met, a parametric test can still produce reliable results.
What do non parametric tests assess?
How is this different to parametric tests?
- Group MEDIANS
- Don’t require data be normally distributed
- Can handle small sample sizes
Because parametric tests assess group means, they require a larger sample size.
What is one easy question to ask ourselves when figuring out whether to choose parametric or non parametric?
What sample size are we working with?
Non-parametric tests can deal with small sample sizes; parametric tests, not so much.
What are the four parametric test assumptions?
- Additivity and linearity
- Normality
- Homogeneity of variance
- Independence of observations
What is this equation?
y(i) = b(0) + b(1)X(1) + e(i)
This is the standard linear model (that describes a straight line), and we see this when looking at additivity and linearity
What do Y(i), B(0), B(1) and e(i) stand for in the equation below?
y(i) = b(0) + b(1)X(1) + e(i)
- Y(i) = the ith person's score on the outcome variable
- B(0) = the Y intercept: the value of Y when X = 0
- B(1) = the regression coefficient for the first predictor: the gradient (slope) of the regression line and the strength of the relationship
- e(i) = the difference between the actual and predicted value of Y for the ith person
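A minimal sketch (Python with NumPy; the scores are hypothetical) showing where each of those terms lives after fitting the line:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # hypothetical outcome scores

b1, b0 = np.polyfit(X, y, deg=1)   # slope b(1) and intercept b(0)
y_hat = b0 + b1 * X                # predicted values on the line
e = y - y_hat                      # residuals e(i): actual minus predicted

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("residuals:", np.round(e, 3))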
What does the standard linear model equation describe?
Both the direction and the strength of the ASSOCIATION between the X and Y variables. There is always an error term at the end.
What does the e at the end of the standard regression equation represent?
The difference between the actual observed data point and the LINE that we drew through the data points. That's each data point's (or person's) residual or error.
In parametric tests, are we adding terms together or multiplying? Why?
We add them, because predictors do not DEPEND on the values of other variables.
We use additive data, so x1 and x2 predict Y.
The predictors (variables) and their effects, added together, lead to an outcome that is a linear function of the predictors x1 + x2.
Basically, linear and additive data say x1 and x2 predict Y.
Basically, what does linear and additive allude to?
That x1 and x2 predict y
Why are variables not multiplied in linear equations?
Because we are looking at linear relationships, which involve adding terms together, not multiplying. Adding the predictors together says that the outcome (DV) is a linear function of the predictors AND their effects (see the sketch below):
y(i) = b(0) + b(1)X(1) + e(i)
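A minimal sketch (hypothetical numbers) of additivity: the outcome is the intercept plus each predictor times its effect, with nothing multiplied together:

b0, b1, b2 = 1.0, 0.5, 2.0   # hypothetical intercept and two effects
x1, x2 = 4.0, 3.0            # one person's scores on two predictors

y_hat = b0 + b1 * x1 + b2 * x2   # 1.0 + 2.0 + 6.0 = 9.0
print(y_hat)

# A term like b3 * (x1 * x2) would instead model an interaction,
# where one predictor's effect depends on the other - that is what
# moderation analyses look at.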
How do we deal with assumptions for ANOVA?
- Independence of observations: if violated (e.g., repeated measures), use a repeated-measures ANOVA
- Normality: transform the data, or use the Kruskal-Wallis test
- Homogeneity of variances: test with Levene's test; if violated, use the Brown-Forsythe or Welch F (see the sketch below)
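A minimal sketch (Python with SciPy; the group scores are hypothetical) of those decision points:

from scipy import stats

g1 = [4.1, 5.0, 5.9, 4.7, 5.2]
g2 = [6.3, 7.1, 6.8, 7.4, 6.0]
g3 = [5.5, 5.9, 6.2, 5.1, 5.7]

# Homogeneity of variances: Levene's test (H0: equal variances)
print(stats.levene(g1, g2, g3))

# Assumptions look fine: standard one-way ANOVA
print(stats.f_oneway(g1, g2, g3))

# Normality violated: Kruskal-Wallis, the non-parametric alternative
print(stats.kruskal(g1, g2, g3))

(SciPy has no built-in Welch or Brown-Forsythe F for three or more groups; those live in other packages.)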
How do we deal with assumptions for correlations?
- Normality: use a Spearman correlation
- Linearity: if the relationship is monotonic, use Spearman; otherwise, try transforming
How do we deal with assumptions for regression?
• Continuous outcome (otherwise use nonlinear methods)
• Non-zero variance in predictors
• Independence of observations: if violated (e.g., repeated measures), use a model that accounts for the repeated measures
• Linearity: check with partial regression plots; try transforming
• Independent errors: for any pair of observations, the error terms should be uncorrelated
• Normally-distributed errors: the errors (i.e., residuals) should be random and normally distributed with a mean of 0
• Homoscedasticity: for each value of the predictors, the variance of the error term should be constant (a sketch of checking the error assumptions follows below)
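A minimal sketch (Python with statsmodels and SciPy; data simulated for illustration) of checking the error-related assumptions after fitting an OLS regression:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   # hypothetical linear data

X = sm.add_constant(x)               # adds the intercept column
model = sm.OLS(y, X).fit()
resid = model.resid

print("residual mean:", round(resid.mean(), 4))         # should be near 0
print("Shapiro-Wilk p:", stats.shapiro(resid).pvalue)   # normality of errors
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(resid, X) # homoscedasticity
print("Breusch-Pagan p:", lm_p)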
How do we deal with assumptions for multiple regression?
Refer to Multiple Regression lecture slides #19-32.
All of the regression assumptions above, plus multicollinearity: delete or combine collinear predictors (see the sketch below).
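A minimal sketch (Python with statsmodels; predictors simulated so that x2 is nearly redundant with x1) of screening for multicollinearity with variance inflation factors; a common rule of thumb flags VIFs above roughly 5-10 as candidates to delete or combine:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   # nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

for i, name in [(1, "x1"), (2, "x2")]:
    print(name, "VIF =", round(variance_inflation_factor(X, i), 2))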
How do we deal with assumptions for moderation?
- One IV must be continuous (if both X and M are categorical, use factorial ANOVA)
- Each IV and Y, and interaction term and Y, should be linear – try transforming
Why would the best central tendency measure for your data sometimes be a median, and other times be a mean?
Generally the mean is best, but the median is the preferred measure of central tendency when there are a few extreme scores in the distribution (a single outlier can have a large effect on the mean).
Or perhaps when there are some missing values in the data.
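A minimal sketch (hypothetical scores) of why a single outlier drags the mean but not the median:

import numpy as np

scores = np.array([4, 5, 5, 6, 6, 7, 40])   # one extreme score
print(np.mean(scores))     # ~10.43, pulled toward the outlier
print(np.median(scores))   # 6.0, unaffected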
What does the Gaussian distribution or bell curve mean?
Normal distribution.
With the standard linear model, how many X variables can be added to an equation for a straight line?
As many as you like!
What is Y in the standard linear model equation?
The outcome variable
What does the little i next to the y(i) in the below equation represent?
y(i) = b(0) + b(1)X(1) + e(i)
Each individual.
What does the b(0) represent in the below equation?
y(i) = b(0) + b(1)X(1) + e(i)
The Y-intercept (value of Y when X = 0).
Most importantly, it is the POINT at which the regression line crosses the Y axis.
What does b(1) represent in the below equation?
y(i) = b(0) + b(1)X(1) + e(i)
It's the regression coefficient for the first predictor: its EFFECT (regression coefficient = effect).
It's the SLOPE of the regression line, and the direction and strength of the relationship.
It's the direction and strength of the ASSOCIATION between the X and Y variables. We would repeat this for another predictor x2, so b(2) would be the effect for that X.
So that's why it's the effect.
What does the e(i) represent in the below equation?
y(i) = b(0) + b(1)X(1) + e(i)
The e(i) is the difference between the actual and predicted value of Y for the ith person.
It's the DIFFERENCE between the actual data point and the line that we drew through the data points: each person's residual or error.
Why are error terms and residuals important?
Because we can't predict everything perfectly. The true data points won't always follow a straight line; they will fall a bit off the line.
What is the outcome y telling us about x1 and x2 and the association?
That X1 and X2 predict Y, and that Y is an outcome of the additive combination of the EFFECTS of X1 and X2.
So we’ve looked at what additivity means, but how can we assess linearity? How do we know if a relationship is a straight line?
- By plotting the observed vs. predicted values (we want to see them symmetrically distributed around a diagonal line, much like a QQ plot)
- By plotting residuals vs. predicted values (we want a horizontal line with the dots symmetrically distributed around it)
When observing the plots (observed vs. predicted, and residuals vs. predicted), what would tell us the assumption is violated?
Look out for a bow shape, or more generally for the dots curving away from the diagonal line.
How do we fix linearity when it appears to be violated (a bow shape)?
- By applying a NONLINEAR transformation to variables
- By adding another regressor that is a nonlinear function: a polynomial curve (see the sketch below)
- Examine the moderators
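A minimal sketch (Python with statsmodels; curved data simulated for illustration) of the polynomial fix: adding a squared regressor keeps the model linear in its coefficients but lets it follow the curve:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 80)
y = 1.0 + 0.3 * x + 0.8 * x**2 + rng.normal(0, 1, 80)   # bowed relationship

linear = sm.OLS(y, sm.add_constant(x)).fit()
poly = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()

print("linear R^2:", round(linear.rsquared, 3))
print("with x^2 term R^2:", round(poly.rsquared, 3))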
So now that we have looked at additivity and linearity, what is normality when it comes to parametric test assumptions?
It's not only about the data being normally distributed.
But, the normal distribution is relevant to:
- Parameters (sampling distribution)
- Residuals / error terms (confidence intervals around a parameter, or null hypothesis significance testing)
Why, when looking at the assumption of normality, is it not enough to say the data are normally distributed, so that's fine?
Because the CLT says that as the SAMPLE size gets larger (approaches infinity), it is the sampling distribution, NOT the data, that approaches normality.
What does the central limit theorem say and how does this influence how we interpret normality for the parametric test assumption?
As the sample size increases towards infinity, the sampling distribution approaches normal. Take, for example, data from a uniform distribution: there is an equal probability of selecting any value from 0 to 1.
In bold: the CLT says the MEANS are normally distributed.
So the means were calculated using data from a uniform distribution, but the means themselves are NOT uniformly distributed. Instead, the means are NORMALLY distributed.
If you collect samples from distributions of whatever type, the means will be normally distributed. The CLT says it doesn't matter where your data come from: the sample means will be normally distributed, so we don't need to worry about the distribution of the raw data. That's why we look at normality in a different way for this parametric test assumption (see the sketch below).
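A minimal sketch (Python with NumPy) of that claim: draw many samples from a uniform distribution, which is clearly not normal, and look at the distribution of their means:

import numpy as np

rng = np.random.default_rng(7)
# 10,000 samples of n = 30, each value uniform on [0, 1]
means = rng.uniform(0, 1, size=(10_000, 30)).mean(axis=1)

print("mean of sample means:", round(means.mean(), 3))   # ~0.5
print("sd of sample means:", round(means.std(), 3))      # ~0.053

# A histogram of `means` piles up around 0.5 in a bell shape, even
# though every raw data point came from a flat, uniform distribution.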
What can the sample means collected from data sets be used for?
- Make confidence intervals
- Do T tests that ask if there is a difference between the means from two samples
- ANOVA where we ask if there is a difference among the means of three or more samples
- and any other statistical test that uses a sample mean
True or false: sample means will always be normally distributed?
True
For the central limit theorem the sample size needs to be at least 30: True or False?
False. This is a rule of thumb that is generally considered safe but you can break the rule - Michelle used a sample size of 20 once.
What is the fine print for the CLT?
In order for it to work at all, you have to be able to calculate a mean from your sample.
True or false: even if data itself is not normally distributed, the means from the sampling distribution are normally distributed
TRUE