Module 1 Flashcards
A psychological theory says that individual differences in one variable BLANK can be predicted from or causally explained by another variable
- Dependent variable
- Independent variable
Other names for independent variable
Predictor, covariate, explanatory variable, exogenous variable, X
(these terms are not perfectly synonymous, depending on context, but they are essentially
interchangeable with respect to how they are included in statistical models)
Can be experimentally manipulated or naturally occurring (e.g., the country people come from)
Other names for dependent variable:
Outcome, criterion, response variable, endogenous variable, Y
All statistical models are fundamentally ___________
descriptive,
in that they describe the nature of a
dependent variable as a function of one or more independent variables or covariates.
Models are commonly used for three things
Description
Causal Explanation
Prediction
models are also commonly used for
causal explanation:
The model represents the process(es) by which differences in independent variables influence differences in a dependent variable.
prediction
meaning that observed data is used to develop
a model for how independent variables are related to dependent variables, and then that model is used to predict dependent variable scores in future data.
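A rough sketch of prediction in this sense: a line is fitted to made-up observed data and then applied to a new case. The data values and the helper names `fit_line` and `predict` are hypothetical, and ordinary least squares is assumed here only because it is a simple model.

```python
# Minimal sketch: develop a model on observed data, then predict a future score.
# All numbers are invented for illustration.

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, new_x):
    """Apply the fitted model to a new independent-variable value."""
    return intercept + slope * new_x

observed_x = [1, 2, 3, 4]   # independent variable in the observed data
observed_y = [3, 5, 7, 9]   # dependent variable in the observed data

intercept, slope = fit_line(observed_x, observed_y)
future_score = predict(intercept, slope, 5)  # prediction for a new case
print(intercept, slope, future_score)
```

Note that nothing in the code says why x and y are related; it only uses the fitted relation to produce a predicted score.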
the primary purpose of machine learning is _____
prediction
For example, a social media company
might use data about a person to predict whether that person is likely to click on an ad for a
product.
This prediction is based on a statistical model developed using data from people who have
already clicked on the ad.
How psychologists misapply the word predict
Yet, psychologists often use language about “prediction” when presenting statistical models that are mainly meant to describe or explain the association between an independent and a dependent
variable.
For example, a researcher might report that a personality trait “predicts” whether adults suffer
from sleep disturbances.
But this “prediction” is likely meant to explain why certain people are pre-disposed to experience sleep disturbances,
and the statistical model is not necessarily going to be applied to future data to determine the
chance that a given person has a sleep disturbance.
But true statistical prediction is not concerned with “why”.
The population of interest
population:
Definition 1: The set of all entities (e.g., people, animals, cities, etc) for which a theory is
intended to apply.
Definition 2: The set of all entities to which a research study generalizes.
Definition 3: The natural (psychological) process that created the observed data.
- The sampling scheme
sample definition
finite subset of entities (or observations) drawn from a particular population.
GRE predictor confusion
GRE is supposed to predict whether a student will successfully complete grad school: the GRE score is the predictor and success is the dependent variable
- Not about causal mechanism
Good GRE score isn’t going to cause you to have a good PhD
- Define operational variables: How are we actually going to observe or measure the conceptual
variables?
Operational variable = conceptual variable + measurement error
Often, independent variables are assumed to be measured without error.
This assumption holds in experimental studies, where participants are assigned to a particular
treatment or control group. Group membership, the independent variable, is known for all
participants (regardless of whether random assignment was used).
But in a lot of psychological research, both independent and dependent variables are
characterized by measurement error. If ignored, measurement error introduces statistical bias in
model estimates.
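A small simulation sketch of how ignored measurement error can bias an estimate. It assumes classical measurement error (random noise added independently to the independent variable), a true slope of 2, and simulated data; none of these specifics come from the notes, and `ols_slope` is a hypothetical helper.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 5000
true_slope = 2.0

# Conceptual (error-free) independent variable and a dependent variable built from it.
conceptual_x = [random.gauss(0, 1) for _ in range(n)]
y = [true_slope * x for x in conceptual_x]

# Operational variable = conceptual variable + measurement error.
observed_x = [x + random.gauss(0, 1) for x in conceptual_x]

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

slope_without_error = ols_slope(conceptual_x, y)  # recovers the true slope
slope_with_error = ols_slope(observed_x, y)       # pulled toward zero
print(slope_without_error, slope_with_error)
```

Under these assumptions the slope estimated from the error-laden variable is noticeably smaller than the true slope, which is one form the bias can take.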
What are the 3 major features of study design:
population
sample
define operational variables
Continuous variables
have a scale with an infinite number of possible values.
Discrete variables
are categorical; they have a scale with a finite number of possible values.
In psychology, many continuous variables are measured on a Likert scale, which is categorical
Nominal variables
have a scale whose values have arbitrary numerical meaning.
It only makes sense to say whether two observations are equal, but we cannot say that one nominal value is “greater than” or “less than” another.
For example, membership in a treatment or control group might be numerically coded so that 0 = control and 1 = treated, but the specific numerical values chosen are arbitrary.
Ordinal variables
have a scale such that lower values are meaningfully defined to be less than
higher values, but we don’t necessarily know by how much a lower value is less than a higher
value.
Example: a Likert-type item response
frequency distribution
is a representation (either tabular or graphic) of the observed values of a
variable along with the frequency, or number of observations, occurring with each value.
Relative frequency
is the proportion (or percent = proportion × 100) of observations at a given value of a variable.
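A minimal sketch of a tabular frequency distribution with relative frequencies, using made-up Likert-type responses (the data are hypothetical).

```python
from collections import Counter

# Hypothetical responses to one Likert-type item (1 to 5).
responses = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]

frequency = Counter(responses)  # value -> number of observations
n = len(responses)
relative_frequency = {value: count / n for value, count in frequency.items()}

# Print value, frequency, proportion, and percent (proportion x 100).
for value in sorted(frequency):
    proportion = relative_frequency[value]
    print(value, frequency[value], proportion, proportion * 100)
```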
histogram
is a graph of the frequencies observed at each of several intervals (or bins) along the continuous scale of the variable
Histogram provides frequency within each bin
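A sketch of how a histogram counts observations within bins, using made-up scores. The bin width of 10 and the starting point of 50 are arbitrary choices for this example.

```python
# Hypothetical scores on a continuous scale (recorded as whole numbers).
scores = [52, 55, 61, 64, 67, 68, 70, 71, 73, 75, 78, 82, 88, 95]

bin_start = 50   # lower edge of the first bin (assumed)
bin_width = 10   # width of each bin (assumed)
n_bins = 5       # bins: 50-59, 60-69, 70-79, 80-89, 90-99

counts = [0] * n_bins
for score in scores:
    index = (score - bin_start) // bin_width  # which bin the score falls in
    counts[index] += 1

# Text version of the histogram: one star per observation in the bin.
for i, count in enumerate(counts):
    lower = bin_start + i * bin_width
    print(f"{lower}-{lower + bin_width - 1}: {'*' * count} ({count})")
```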
Distributions of continuous variables are characterized by their
centre, spread, and shape
outlier
is an unusual observation that falls well outside of the range of most of the other observations in the distribution
Outliers can occur because of…
sampling error (the outlying observation comes from a different
population than the other observations),
researcher error (e.g., a data entry mistake was made),
participant error (e.g., the participant did not follow the researcher’s instructions),
or just random chance.
Exclude an outlier when it is caused by which types of error
researcher or participant error
what is the spread
the extent of variability or individual differences in the variable
E.g. Scores are clustered from blank to blank but a notable number of people have lower scores
Unimodal
one general peak
sensitivity analysis
run the analysis both with and without the outlier, and report both sets of results
descriptive statistics
describe the centre, shape, and spread of a distribution using numerical information
parameter
numerical characteristic of a population
statistic
value calculated from the sample data that estimates a parameter
Which central tendency measure is higher than the other when asymmetric
mean gets pulled in the direction of the skewness
3 measures of spread or variability
- variance
- standard deviation
- interquartile range
the mean is more affected by blank than blank
the mean is more affected by outliers than the median
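A quick sketch with made-up scores: adding one extreme high score moves the mean much more than the median, and in the direction of the skew.

```python
from statistics import mean, median

scores = [4, 5, 5, 6, 7]            # hypothetical scores
scores_with_outlier = scores + [40]  # same scores plus one extreme value

# How far each measure of central tendency moves when the outlier is added.
mean_shift = mean(scores_with_outlier) - mean(scores)
median_shift = median(scores_with_outlier) - median(scores)

print(mean_shift, median_shift)
```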
standard deviation
represents the average amount that a score differs from the mean of a distribution
Calculate sample SD
- Compute the deviations from the mean: subtract the mean from each observed score
- Square the deviations
- Sum the squared deviations and divide by N-1 (this is the sample variance)
- Take the SQUARE ROOT OF THE VARIANCE
Sample SD is an estimate of the population SD
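The steps for the sample SD can be sketched as follows, using made-up scores and dividing by N-1. The result is compared with `statistics.stdev` from the Python standard library, which also uses N-1.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores
n = len(scores)

sample_mean = sum(scores) / n
deviations = [x - sample_mean for x in scores]       # observed score minus the mean
squared_deviations = [d ** 2 for d in deviations]    # square each deviation
sample_variance = sum(squared_deviations) / (n - 1)  # divide by N-1
sample_sd = math.sqrt(sample_variance)               # square root of the variance

library_sd = statistics.stdev(scores)  # standard library version for comparison
print(sample_variance, sample_sd, library_sd)
```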
Sample variance
Sum of the squared deviations from the mean, divided by N-1 (roughly the mean of the squared deviations)
estimate of the population variance
Why do we divide by N-1 and not N when calculating the sample SD
Dividing by N leads to a biased estimate of the population standard deviation; dividing by N-1 corrects this bias
When we calculate the sample mean we ‘use up’ one piece of information
N-1 is the degrees of freedom associated with a univariate standard deviation
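One way to see the bias, sketched for the variance: take a tiny made-up population, list every possible sample of size 2 (drawn with replacement), and average the variance estimates. The population values and the helper name `variance_estimate` are hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical tiny population
pop_mean = Fraction(sum(population), len(population))
pop_variance = sum((x - pop_mean) ** 2 for x in population) / len(population)

def variance_estimate(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    sample_mean = Fraction(sum(sample), len(sample))
    return sum((x - sample_mean) ** 2 for x in sample) / divisor

# All 9 ordered samples of size N = 2, drawn with replacement.
samples = list(product(population, repeat=2))

avg_dividing_by_n = sum(variance_estimate(s, 2) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance_estimate(s, 1) for s in samples) / len(samples)

print(pop_variance, avg_dividing_by_n, avg_dividing_by_n_minus_1)
```

In this example, dividing by N underestimates the population variance on average, while dividing by N-1 matches it exactly.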
Interquartile range
IQR is defined as Q3-Q1
Range of a distribution
difference between the maximum and minimum values
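A short sketch of the IQR and the range with made-up scores. Quartiles can be computed in several slightly different ways; this example assumes the default method of `statistics.quantiles`, so other software may give slightly different values.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical scores

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(scores, n=4)

iqr = q3 - q1                           # interquartile range
data_range = max(scores) - min(scores)  # range of the distribution

print(q1, q2, q3, iqr, data_range)
```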
Boxplot
top of box: Q3
bottom of box: Q1
hard line: median (Q2)
whiskers: max and min
outliers show up as dots
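A sketch of the numbers behind a boxplot, using made-up scores. The rule for flagging outliers (more than 1.5 x IQR beyond the box) is a common convention assumed here, not something stated in these notes; under that convention the whiskers stop at the most extreme scores that are not outliers.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical scores

q1, q2, q3 = statistics.quantiles(scores, n=4)  # bottom of box, hard line, top of box
iqr = q3 - q1

# Assumed convention: points beyond 1.5 x IQR from the box are outliers (dots).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

# Whiskers reach the smallest and largest scores that are not outliers.
non_outliers = [x for x in scores if lower_fence <= x <= upper_fence]
whisker_low = min(non_outliers)
whisker_high = max(non_outliers)

print(q1, q2, q3, whisker_low, whisker_high, outliers)
```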
Boxplot is negatively skewed if
distance from the median to Q1 is slightly greater than the distance to Q3
probability density functions
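give the probability density at each value of a continuous variable; the probability of observing a value within an interval is the area under the curve over that interval
They are used to define a hypothetical probability distribution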
Normal distribution
Normal is a population distribution
do NOT describe a sample as normal
it would make sense to say that the sample was DRAWN from a normal population distribution
Normal distribution is a function of the population mean and SD
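A small sketch of a normal population distribution defined only by its mean and SD, using `statistics.NormalDist` from the Python standard library. The mean of 100 and SD of 15 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# Hypothetical population distribution: fully specified by its mean and SD.
population = NormalDist(mu=100, sigma=15)

# Height of the density curve at the mean (a density, not a probability).
density_at_mean = population.pdf(100)

# Probability of a value within one SD of the mean: area under the curve,
# obtained here as a difference of cumulative probabilities.
prob_within_1sd = population.cdf(115) - population.cdf(85)

print(density_at_mean, prob_within_1sd)
```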
the normal distribution is a BLANK population distribution
HYPOTHETICAL population distribution
It doesn't make sense to refer to a sample as normal
Describe the sample as consistent with a normal distribution
Mean is known as the blank blank of a population distribution
first moment
what is the first moment of a population distribution
mean
the variance is known as the blank blank blank of a population distribution
second central moment
what is the second central moment of a population distribution
variance
The mean and variance are both a BLANK
average
- variance is the average of the squared deviations from the mean
why is the variance called a central moment
because it is calculated from deviations from the mean
what is the third central moment of a population distribution
skewness
third central moment
skewness
skewness
extent to which the distribution is asymmetric
skewness formula
the numerator is the sum of cubed deviations from the mean
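A sketch of one common skewness formula: the average cubed deviation from the mean, divided by the SD cubed. This version divides by N throughout and applies no small-sample correction, which is an assumption; textbooks and software differ on the details. The helper names are hypothetical.

```python
def central_moment(scores, k):
    """Average of the deviations from the mean raised to the k-th power."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** k for x in scores) / len(scores)

def skewness(scores):
    """Average cubed deviation divided by the SD cubed (no small-sample correction)."""
    variance = central_moment(scores, 2)
    return central_moment(scores, 3) / variance ** 1.5

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric scores
positively_skewed = [1, 1, 1, 2, 10]  # long tail toward high scores
negatively_skewed = [1, 9, 10, 10, 10]  # long tail toward low scores

print(skewness(symmetric), skewness(positively_skewed), skewness(negatively_skewed))
```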
What is the fourth central moment
kurtosis
kurtosis
extent to which the distribution shape is flat (negative kurtosis) or has a steep peak with thick tails (positive kurtosis)
kurtosis
fourth central moment of the population distribution
kurtosis formula
raised to the 4th power
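A sketch of one common kurtosis formula: the average deviation raised to the 4th power, divided by the squared variance. Subtracting 3 so that a normal distribution scores 0 is an assumed convention (it matches the negative/positive wording used above), and no small-sample correction is applied. The helper names are hypothetical.

```python
def central_moment(scores, k):
    """Average of the deviations from the mean raised to the k-th power."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** k for x in scores) / len(scores)

def excess_kurtosis(scores):
    """Average 4th-power deviation over the squared variance, minus 3."""
    variance = central_moment(scores, 2)
    return central_moment(scores, 4) / variance ** 2 - 3

flat = [1, 2, 3, 4, 5]                                # evenly spread scores
peaked_thick_tails = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # most scores at the centre, two extremes

print(excess_kurtosis(flat), excess_kurtosis(peaked_thick_tails))
```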
is it worse to have non-zero kurtosis or skewness
having non-zero kurtosis is more problematic than skewness (i.e., having kurtosis is worse)
distributions with strong skewness also have nonzero kurtosis