Practical W2: Basics of Statistics Flashcards
One of the first things that’s super important after collecting your data is to look at it graphically by making a
histogram
There are two main ways in which a distribution can deviate from normal - (2)
- skewness
- Kurtosis
Diagram of positive and negative skew
If the skewness value is between -1 and 1 in SPSS then
it’s fine
If the skewness value in SPSS is less than -1 then
it is a negative skew = non-normal distribution
If the skewness value in SPSS is greater than 1 then
positive skew = non-normal distribution
Diagram of skewness value shown in SPSS
Kurtosis is basically looking at how
‘pointy’ your histogram is
Kurtosis tells us how much our data lies around the
ends/tails of our histogram, which helps us to identify when outliers may be present in the data.
A distribution with positive kurtosis, so much of the data is in the tails, will be very
pointy or leptokurtic
A distribution with negative kurtosis, so the data lies more in the middle, will be more
flatter or platykurtic
A normal distribution will have a kurtosis value of
0 (mesokurtic)
Characteristic of a negative skew
the tail is pointing towards the lower values and the data is clustered at the higher values
Characteristic of a positive skew
the tail is pointing towards the higher values and the data is clustered at the lower values
Diagram of mesokurtic (normal), leptokurtic and platykurtic distribution curves
Kurtosis value in SPSS between -2 and 2 is
all good, normal kurtosis
If the kurtosis value in SPSS is less than -2 then it shows
platykurtic (non-normal, issue with kurtosis)
If the kurtosis value in SPSS is greater than 2
leptokurtic (non-normal, shows issues with kurtosis)
Diagram of kurtosis value in SPSS
Are the kurtosis and skewness values here fine?
Good because the skewness is between -1 and 1 and the kurtosis values are between -2 and 2.
Are the kurtosis and skewness values fine here?
Bad because although the skewness is between -1 and 1, we have a problem with kurtosis: a value of 2.68, which is outside the -2 to 2 range
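These rule-of-thumb checks can also be sketched outside SPSS. A minimal Python example, assuming numpy and scipy are available (scipy's kurtosis defaults to excess kurtosis, where 0 is mesokurtic, matching the values discussed above); the data are illustrative:

```python
# Sketch of the skewness (-1 to 1) and kurtosis (-2 to 2) rules of thumb above.
# Assumes numpy and scipy are installed; the data are illustrative.
import numpy as np
from scipy.stats import skew, kurtosis

scores = np.random.default_rng(1).exponential(scale=2.0, size=200)  # skewed sample

sk = skew(scores)      # 0 for a perfectly symmetric distribution
ku = kurtosis(scores)  # excess kurtosis: 0 = mesokurtic, as SPSS reports it

if -1 <= sk <= 1:
    print("skewness is fine")
else:
    print("positive skew" if sk > 1 else "negative skew", "= non-normal")

if -2 <= ku <= 2:
    print("kurtosis is fine")
else:
    print("leptokurtic" if ku > 2 else "platykurtic", "= non-normal")
```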
3 ways to transform your data to make it closer to a normal distribution (sketched in code below) - (3)
- exponential
- power
- log
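A minimal sketch of these transformations in Python, assuming numpy is available; which transformation helps depends on the direction and severity of the skew, and the data are illustrative:

```python
# Sketch of the three transformations above; assumes numpy is installed.
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 15.0, 40.0])  # positively skewed toy scores

log_x = np.log(x)     # log transform: pulls in a long right tail (needs x > 0)
power_x = np.sqrt(x)  # a power transform (square root here): a milder correction
exp_x = np.exp(x)     # an exponential transform, sometimes tried for negative skew
```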
There is a tertium quid which prompts the saying that
correlation is not causation
What is tertium quid a word for?
a third factor
The tertium quid is a variable that you may not have considered that
could be influencing your result
The tertium quid (third factor) is known as a
confounding variable
Example of an unconsidered tertium quid variable that could be influencing your results - (2)
: we find that drownings and ice cream sales are correlated, we conclude that ice cream sales cause drowning. Are we correct?
No, since it is most likely that both are actually due to the weather: when it’s hotter outside people eat more ice cream and go more frequently to the pool or the beach to swim. The fact that more people go swimming is the reason why there are more drownings.
If one or both of the skewness/kurtosis values is out of range then the assumptions for
parametric tests are not satisfied
Rule out tertium quid (third factor) through
RCTs = even out confounding variables between groups
In an RCT, you randomly assign your participants to two or more groups involving - (2)
one group receives no intervention or experimental manipulation (so your control),
the other group will receive the intervention or treatment, and then you can directly compare the dependent variables.
To infer causation we need to
actively manipulate the variable we are interested in, and control against a group (condition) where this variable was not manipulated.
Example of a control condition in lesion studies - (2)
a double dissociation experiment, where one test is affected by a lesion in one area but not a second area, and then a different test is conducted which is affected by a lesion in the second area but not the first.
The only way we can actually infer causation is by comparing the two controlled situations: one where the cause (the lesion) is present and one where the lesion is absent.
Another assumption for parametric tests is having
linearity/additivity
Linearity refers to the - (2)
combined effect of several predictors, which should form a straight line or show a linear relationship
the data increasing at a steady rate, like the graph
What does this graph show?
Your cost increases steadily as the number of chocolate bars increases
This graph shows a multiplicative/non-linear relationship (not a steady but a sharp increase/change in the data), which violates the linearity assumption of
parametric tests
What does this graph show?
You might feel OK if you eat a few chocolate bars, but after that the risk of you having a stomach-ache increases quite rapidly the more chocolate you eat.
Why is it important to check for linearity in your data?
your statistical analysis will be wrong even if your other assumptions are correct because a lot of statistical tests are based on linear models.
When we talk about additivity/linearity we are referring to the combined effect of
several predictors
What is measurement error?
The discrepancy between the actual value we’re trying to measure and the number we use to represent that value.
Example of measurement error - (2)
conducting an experiment where I was measuring the length of a tree in cm and someone else in my research group measured the same tree using a different metric and got a different value from me - that’s a measurement error.
This is an example of human error, but recording instrument failure is another possibility.
What are the 2 types of measurement error? - (2)
- Systematic measurement error
- Random measurement error
Measurement error can happen across all psychological experiments from…
recording instrument failure to human error.
What is systematic measurement error?
when the error is proportional to the true value and affects the results of the experiment in a predictable direction
What is example of systematic measurement error?
for example, if I know I am 5ft2 and when I go to get measured I’m told I’m 6ft, this is a systematic error and pretty identifiable - these usually happen when there is a problem with your experiment
What is random measurement error and when does it usually occur? - (2)
when the measured values are inconsistent when repeated measures of a constant attribute or quantity are taken,
so this error happens by chance and is more related to natural variability
Example of random measurement error - (2)
my height is 5ft2 when I measure it in the morning but it’s 5ft when I measure myself in the evening.
This is because my measurements were taken at different times so there would be some variability – for those of you who believe you shrink throughout the day.
Measurement error is completely different from variance in the sense that variance is the
average spread of your data
Variance is specifically the averaged squared deviation of
each number from its mean
Variance helps us assess group differences to determine whether the populations that our samples come from
differ from each other
How to calculate variance?
Subtract the mean from each score, square these deviations, add them up (the sum of squares), then divide by N - 1: variance = Σ(x - x̄)² / (N - 1)
Example of variance in line graph (orange dots and lines are variance)
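A minimal sketch of the calculation above in Python, assuming numpy is available (ddof=1 divides by N - 1, giving the sample variance):

```python
# Sketch of the variance calculation: the averaged squared deviation of each
# number from its mean. Assumes numpy is installed; the scores are illustrative.
import numpy as np

scores = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])

deviations = scores - scores.mean()            # deviation of each score from the mean
sum_of_squares = np.sum(deviations ** 2)       # sum of squared deviations (SS)
variance = sum_of_squares / (len(scores) - 1)  # sample variance: SS / (N - 1)

print(variance, np.var(scores, ddof=1))        # both give the same value (4.8)
```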
The purpose of a control condition is to allow inferences about causality, as Field’s quote was:
only way to infer causality is through comparison of two controlled situations: one in which cause is present and one in which cause is absent
What are residuals?
difference between the observed value of the dependent variable and the predicted value (usually the mean).
GLM assumption is that residuals will be
normally distributed - observed values of a variable will be normally distributed around the predicted value.
Last assumption of GLM: Homoscedasticity which is that
residuals have constant variance at every level of x – for each level of the independent variable the amount of error or “noise” has a similar variance
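A minimal sketch of inspecting residuals from a simple straight-line fit in Python, assuming numpy is available; this is a rough eyeball check on homoscedasticity, not a formal test, and the data are simulated:

```python
# Sketch: fit a straight line, then inspect the residuals (observed minus
# predicted values) whose normality and constant variance the GLM assumes.
# Assumes numpy is installed; the data are simulated.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)  # linear trend plus noise

slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
residuals = y - (slope * x + intercept)

# Rough homoscedasticity check: residual spread at low vs high x should be similar
print(residuals[x < 5].std(), residuals[x >= 5].std())
```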
What is a dependent variable?
A dependent variable (or outcome variable) is a variable that is thought to be affected by changes in an independent variable.
What is a confounding variable? - (2)
A confounding variable is a variable which has an unintentional effect on the dependent variable.
When carrying out experiments we attempt to control these extraneous variables; however, there is always the possibility that one of these variables is not controlled and if this affects the dependent variable in a systematic way, we call this a confounding variable.
A predictor variable is a
variable that is thought to predict another variable.
What is an independent variable? - (2)
An independent variable is a variable that is thought to be the cause of some effect.
This term is usually used in experimental research to denote a variable that the experimenter has manipulated.
We cannot control for everything; especially in the sale of chocolate bars we might expect other variables to impact the popularity of chocolate, so in the LM (linear model) we can add something called - (4)
a predictor variable; these are additional variables that are related to your variable of interest.
For example, the time of year may be a predictor variable – like over Easter you may see an increase in sales
In the GLM you can plug in this predictor variable and any others to expand your model, i.e. independent variables you may not be directly interested in.
When we have several predictors in a regression it is a multiple regression.
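A minimal sketch of adding predictors in Python, assuming numpy is available; the chocolate-sales figures and the Easter indicator are made up for illustration:

```python
# Sketch: a linear model with two predictors (multiple regression), solved by
# least squares. Assumes numpy; the sales data and Easter indicator are made up.
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # predictor of interest
easter_week = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])  # extra predictor (time of year)
sales = np.array([10.0, 12.0, 18.0, 15.0, 22.0, 17.0])  # outcome variable

# Design matrix: an intercept column plus one column per predictor
X = np.column_stack([np.ones_like(sales), advertising, easter_week])
coefs, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coefs)  # intercept and a slope for each predictor
```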
The central limit theorem tells us that if we have enough participants (typically more than 30) the sampling distribution of the mean approaches a
normal distribution
The central limit theorem states that the sampling distribution of the mean approaches a normal distribution, as the sample size increases.
This fact holds especially true for
sample sizes over 30 –> N >30
As the sample size increases, the sample mean and standard deviation will be (CLT)
closer in value to the population mean μ and standard deviation σ .
The central limit theorem tells us that no matter what the distribution of the population is, the shape of the sampling distribution will approach normality as the sample size (N)
increases
How is CLT useful? - (2)
the researcher never knows which mean in the sampling distribution is the same as the population mean,
but by selecting many random samples from a population the sample means will cluster together, allowing the researcher to make a very good estimate of the population mean.
as the sample size (N) increases the (CLT)
sampling error will decrease
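A minimal simulation sketch of the CLT in Python, assuming numpy is available: sample means of a skewed (exponential) population cluster around the population mean, and their spread (the sampling error) shrinks as N grows:

```python
# Sketch of the CLT: means of samples drawn from a skewed (exponential)
# population cluster around the population mean (2.0 here), and the spread of
# those means (the sampling error) shrinks as N grows. Assumes numpy.
import numpy as np

rng = np.random.default_rng(42)

for n in (5, 30, 200):
    sample_means = rng.exponential(scale=2.0, size=(10_000, n)).mean(axis=1)
    print(n, round(sample_means.mean(), 3), round(sample_means.std(), 3))
```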
In a normal distribution the values of skew and kurtosis are
0
Definition of tertium quid
the possibility that an apparent relationship between two variables is actually caused by the effect of a third variable on them both (often called the third-variable problem)
Definition of confounding variable
a variable (that we may or may not have measured) other than the predictor variables in which we’re interested that potentially affects an outcome variable.
Confounding variable jeopardises the
reliability and validity of an experiment’s outcome
Confounding variables can be measured using reliable and
unreliable scales
A test can still measure a useful construct or variable but still not be
valid
Internal consistency is - (2) and example
It measures whether several items that propose to measure the same general construct produce similar scores.
e.g., participants expressed agreement with a statement like “enjoyed rock music” and disagreed with a statement like “I hate rock music”
DV or outcome variable is variable thought to be affected by changes in
independent variable
An independent variable is a variable that is thought to be the cause of
some effect
Reliability is whether an instrument can be
interpreted consistently across different situations
What is the ‘fit’ of a model?
The ‘fit’ of the model is the degree to which a statistical model represents the data collected
Counterbalancing can compensate for
practice effects, as it ensures that they produce no systematic variation between our conditions by counterbalancing the order in which each person participates in the conditions
Practice effects are an issue in what design?
repeated-measures design
Giving participants a break between tasks is a technique used to compensate for
boredom effects
The homogeneous variance assumption is that the variance
within each of the populations is equal
Residual variance helps us confirm how well a - (2)
regression line that we constructed fits the actual data set.
The smaller the variance, the more accurate the predictions are
The coefficient of determination is the correlation
coefficient squared: the amount of variability in one variable shared by another
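A minimal sketch in Python, assuming numpy is available and using illustrative data: square Pearson’s r to get the proportion of shared variability:

```python
# Sketch: the coefficient of determination is the squared correlation
# coefficient. Assumes numpy is installed; the data are illustrative.
import numpy as np

hours_revised = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([40, 45, 55, 58, 66, 70])

r = np.corrcoef(hours_revised, exam_score)[0, 1]  # Pearson's r
r_squared = r ** 2  # proportion of variability in one variable shared by the other
print(round(r, 3), round(r_squared, 3))
```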
The sum of squares, variance and standard deviation are all measures of the
dispersion or spread of data around the mean
The probability is p = 0.80 that a patient with a certain disease will be successfully treated with a new medical treatment. Suppose that the treatment is used on 40 patients. What is the “expected value” of the number of patients who are successfully treated?
Calculation: 32, because 80% of 40 patients is 32 (or 40 x 0.80 = 32)
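The same answer falls out of the binomial mean formula E(X) = n × p, and a quick simulation sketch in Python (assuming numpy is available) agrees:

```python
# Sketch: E(X) = n * p for a binomial count of successes, so a simulation of
# 40 patients treated with p = 0.80 should average close to 32. Assumes numpy.
import numpy as np

n, p = 40, 0.80
print(n * p)  # expected number of successfully treated patients: 32.0

rng = np.random.default_rng(7)
simulated = rng.binomial(n=n, p=p, size=100_000)  # many repeats of 40 patients
print(simulated.mean())  # approximately 32
```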
The sum of squared errors is the sum of the
squared deviances
Assumptions of parametric data - (4)
- Normally distributed data
- Homogeneity of variance: variances should be the same throughout the data
- Data measured at least at interval level
- Independence