Week 2 Flashcards
Bivariate distributions
two scores for each individual
Scatter Diagram
picture of the relationship between two variables
an important reason for examining the scatter diagram is that the relationships between X and Y are not always best described by a straight line.
Regression
Trying to predict a variable Y from another variable X
Best guess of a final mark from a midterm mark - use data from the past - apply this to a new population
make predictions about scores on one variable from knowledge of scores on another variable
Regression - Galton
Individuals with unusual characteristics tended to produce offspring who were closer to average
Regression towards mediocrity - idea became the basis for a statistical procedure that described how scores tend to regress toward the mean
Why is regression important in psychological testing?
Figure out associations between different variables and measurements
Determine whether changes in test scores are related to changes in performance
make predictions about scores on one variable from knowledge of scores on another variable
difference btw regression and correlation
Regression done on the actual numbers
Correlation takes those numbers and uses standardized units
use correlation to assess the magnitude and direction of a relationship.
regression, is used to make predictions about scores on one variable from knowledge of scores on another variable.
Regression equation & Residual
gives a predicted value for Y, denoted by Y'
Y’ = bx + a
Y’ = the predicted value of Y
b = regression coefficient - slope of the line
The regression coefficient can be expressed as the ratio of the sum of cross products (the covariance term) to the sum of squares for X. Sum of squares is defined as the sum of the squared deviations around the mean.
a = value of Y when X is 0. a = ybar - bxbar
actual and predicted are rarely the same
The difference between the observed and predicted is the residual - best fitting line keeps residuals to a minimum - minimizes deviation between observed and predicted
Because residuals can be positive or negative and will cancel to 0 if averaged, the best-fitting line is most appropriately found by squaring each residual.
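The pieces above (slope from cross products over the sum of squares for X, intercept a = ybar - b*xbar, residuals summing to 0) can be sketched directly. The midterm/final marks below are invented for illustration:

```python
# Sketch of Y' = bX + a with invented midterm (X) and final (Y) marks.
xs = [60.0, 70.0, 80.0, 90.0]
ys = [65.0, 68.0, 79.0, 84.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b = sum of cross products / sum of squares for X
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sum_xx = sum((x - mean_x) ** 2 for x in xs)
b = sum_xy / sum_xx          # regression coefficient (slope)
a = mean_y - b * mean_x      # intercept: a = ybar - b * xbar

predicted = [b * x + a for x in xs]
residuals = [y - yp for y, yp in zip(ys, predicted)]  # observed - predicted
```

Note that the residuals sum to (essentially) zero, which is why the best-fitting line is defined by minimizing the squared residuals instead.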
Regression line & Principle of least squares
Used to find the regression line
Minimizes the squared deviation around the regression line
Understand:
Mean is the point of least squares for any variable. Sum of squared deviations around the mean will be less than it is around any value other than the mean.
Regression line is the running mean or line of least squares.
The least squares method in regression finds the straight line that comes as close to as many of these Y means as possible. In other words, it is the line for which the squared deviations around the line are at a minimum.
best-fitting line is obtained by keeping these squared residuals as small as possible. This is known as the principle of least squares
Σ(Y - Y')^2 is at a minimum
observed - predicted
Sum of cross Products (covariance)
Variance around each mean
How far away are all x’s from mean of x
How far away are all y's from mean of y
Covariance & the goal of regression analysis
Covariance - Whether two variables covary - does y get larger as X gets larger
The covariance is calculated from the cross products, or products of variations around each mean.
Regression analysis attempts to determine how similar the variation between two variables is by dividing the covariance by the product of the standard deviations of each variable
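As a minimal sketch with made-up numbers: the covariance is the average cross product, and dividing it by the product of the two standard deviations gives the correlation r:

```python
# Covariance from cross products, then r = cov / (sd_x * sd_y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# cross products: variations around each mean, multiplied together
cross_products = [(x - mx) * (y - my) for x, y in zip(xs, ys)]
cov = sum(cross_products) / n                       # average cross product
sd_x = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sd_y = (sum((y - my) ** 2 for y in ys) / n) ** 0.5

r = cov / (sd_x * sd_y)
```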
Intercept of the regression line = a
a = ybar - bxbar
Regression Plot
Pictures that show the relationship between variables
Common use of correlation is to determine the criterion validity evidence for a test, or the relationship between a test score and some well-defined criterion.
association between a test of job aptitude and the criterion of actual performance on the job is an example of criterion validity evidence.
normative because it uses information gained from a representative group
Correlation
Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units.
One property of the correlation coefficient is its reciprocal nature: the correlation between X and Y will always be the same as the correlation between Y and X
regression does not have this property.
eliminates the need to find the intercept
In correlation, the intercept is always 0
Correlation coefficient - describes the direction and magnitude of the relationship
assess the magnitude and direction of a relationship
Regression but with the scores standardized - r varies between -1 and +1 - no intercept value needed
Correlation between two randomly created variables will not always be 0
By chance alone it's possible to observe a correlation higher or lower than 0
null hypothesis is rejected if there is evidence that the association between two variables is significantly different from 0.
Correlation coefficients can be tested for statistical significance using the t distribution
t distribution
t distribution is not a single distribution (such as the Z distribution) but a family of distributions, each with its own degrees of freedom.
The degrees of freedom (df ) are defined as the sample size minus two, or N -2
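The test statistic itself can be sketched from the definitions above. The values of r and N here are hypothetical:

```python
import math

# t statistic for testing a correlation against 0, with df = N - 2.
r = 0.8   # hypothetical sample correlation
N = 12    # hypothetical sample size

df = N - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
# compare t to the critical value from a t table with df degrees of freedom
```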
Different kinds of correlation coefficient
Pearson's r = ratio scale, occasionally interval-like scales such as Likert items
determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable
Different kinds of correlation coefficient
Biserial r
biserial correlation expresses the relationship between a continuous variable and an artificial dichotomous variable
relationship between passing or failing the bar examination (artificial dichotomous variable) and GPA in law school (continuous variable).
Different kinds of correlation coefficient
Point biserial r
used when the dichotomous variable is "true" (such as gender)
For instance, the point biserial correlation would be used to find the relationship between gender and GPA
Different kinds of correlation coefficient
Tetrachoric r
both dichotomous variables are artificial, we might use a special correlation coefficient
Different kinds of correlation coefficient
Phi
Depends on whether variables are continuous, dichotomous (artificial or true)
both variables are dichotomous and at least one of the dichotomies is “true,” then the association between them can be estimated using the phi coefficient
Also coefficients for rank correlations
Spearman’s Rho
Rank order variables
correlation for finding the association between two sets of rank
rho coefficient (ρ) is easy to calculate and is often used when the individuals in a sample can be ranked on two variables but their actual scores are not known or do not have a normal distribution
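A sketch of rho using the shortcut formula rho = 1 - 6Σd²/(n(n² - 1)), which holds when there are no tied ranks. The data are invented:

```python
# Spearman's rho from two sets of ranks (no ties).

def to_ranks(values):
    # rank 1 = smallest value; assumes no ties in this sketch
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

xs = [10, 20, 30, 40]   # scores on one variable
ys = [1, 3, 2, 4]       # scores on another variable

rx, ry = to_ranks(xs), to_ranks(ys)
n = len(xs)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d_squared / (n * (n ** 2 - 1))
```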
One whole family of correlation coefficients involves dichotomous variables.
true dichotomous because they naturally form two categories - gender
artificially dichotomous because they reflect an underlying continuous scale forced into a dichotomy. Passing or failing a bar examination is an example of such an artificial dichotomy;
Residual
Y - Y’
Observed - predicted
The difference between the predicted and the observed values is called the residual.
sum of the residuals always equals 0
sum of the squared residuals is the smallest value according to the principle of least squares [Σ(Y - Y′)² = smallest value].
Standard Error of Estimate
How far apart are my predicted and observed
standard deviation of the residuals
measure of the accuracy of prediction
most accurate when the standard error of estimate is relatively small.
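A sketch of the standard error of estimate as the standard deviation of the residuals; one common definition uses N - 2 in the denominator (matching the df for the correlation test). Scores are invented:

```python
# Standard error of estimate: sqrt(sum((Y - Y')^2) / (N - 2)).
xs = [60.0, 70.0, 80.0, 90.0]
ys = [65.0, 68.0, 79.0, 84.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

residuals = [y - (b * x + a) for x, y in zip(xs, ys)]  # observed - predicted
see = (sum(e ** 2 for e in residuals) / (n - 2)) ** 0.5
# smaller see -> predicted and observed scores are closer together
```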
Coefficient of Determination r^2
What percentage of variation in Y that is known as a function of knowing X
How much is accounted for
Coefficient of Alienation
Sqrt (1-r^2)
How not associated the variables are
r^2 is the coefficient of determination
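Both quantities fall straight out of r. With a hypothetical correlation of r = 0.6:

```python
import math

r = 0.6                             # hypothetical correlation
determination = r ** 2              # proportion of variation in Y known from X
alienation = math.sqrt(1 - r ** 2)  # degree of non-association
```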
General Multivariate Models: Linear Combination
Multiple X variables and regression coefficients
relationship among combinations of three or more variables
study the relationship between many predictors and one outcome, as well as the relationship among the predictors.
multiple regression, and the goal of the analysis is to find the linear combination of the three variables that provides the best prediction of law school success.
law school GPA = .80 (Z scores of undergraduate GPA) + 1.54 (Z scores of professor ratings) + 1.03 (Z scores of age)
reason for using Z scores for the three predictors is that the coefficients in the linear composite are greatly affected by the range of values taken on by the variables.
standardized regression coefficients
When the variables are expressed in Z units, the coefficients, or weights for the variables, are known as standardized regression coefficients
raw regression coefficients
weights in the model are called raw regression coefficients
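For the one-predictor case, the relationship between the two kinds of weights is simple: the standardized coefficient is the raw coefficient rescaled by the ratio of the standard deviations, beta = b * (sd_x / sd_y), and with a single predictor it equals r. A sketch with invented data:

```python
# Raw slope b vs standardized slope beta = b * (sd_x / sd_y).
xs = [60.0, 70.0, 80.0, 90.0]
ys = [65.0, 68.0, 79.0, 84.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sum_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sum_xx = sum((x - mx) ** 2 for x in xs)
sum_yy = sum((y - my) ** 2 for y in ys)

b = sum_xy / sum_xx                          # raw regression coefficient
beta = b * (sum_xx ** 0.5 / sum_yy ** 0.5)   # standardized coefficient
r = sum_xy / (sum_xx * sum_yy) ** 0.5

# with one predictor, beta and r are the same number
```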
Discriminant Analysis
When the task is to find the linear combination of variables that provides maximum discrimination between categories, the appropriate multivariate method is discriminant analysis.
attempts to determine whether a set of measures predicts success or failure on a particular performance evaluation
For example, say that two groups of children are classified as “language disabled” and “normal.” After a variety of items are presented, discriminant analysis is used to find the linear combination of items that best accounts for differences between the two groups
Shrinkage
Regression equation - tendency to overestimate the relationship, particularly if the sample of subjects is small
Shrinkage is the amount of decrease observed when a regression equation is created for one population and then applied to another
regression equation is developed to predict first-year college GPAs on the basis of SAT scores.
Although the proportion of variance in GPA might be fairly high for the original group, we can expect to account for a smaller proportion of the variance when the equation is used to predict GPA in the next year’s class
Cross Validation
One way to ensure that proper inferences are being made is to use the regression equation to predict performance in a group of subjects other than the ones on which the equation was developed.
standard error of estimate can be obtained for the relationship between the values predicted by the equation and the values actually observed
Correlation-Causation Problem
Just because two variables are correlated does not necessarily imply that one has caused the other
Third Variable Explanation
the apparent relationship between viewing and aggression actually might be the result of some variable not included in the analysis.
Restricted Range
circumstances in which the ranges of variability are restricted.
relationship between scores on the Graduate Record Examination GRE quantitative test and performance during the first year of graduate school in the math department of an elite Ivy League university.
No students had been admitted to the program with GRE quantitative scores less than 700.
most grades given in graduate school were A’s.
might be extremely difficult to demonstrate a relationship even though a true underlying relationship may exist.
Correlation requires variability. If the variability is restricted, then significant correlations are difficult to find.
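The effect can be simulated: a strong X-Y relationship in the full group shrinks when the sample is restricted to high X values (like only looking at admitted students with high GRE scores). The data below are simulated, not real admissions data:

```python
import random

random.seed(0)

def corr(a, b):
    # Pearson r from raw scores
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

xs = [random.gauss(0, 1) for _ in range(500)]
ys = [x + random.gauss(0, 1) for x in xs]   # population r is about .71

r_full = corr(xs, ys)
kept = [(x, y) for x, y in zip(xs, ys) if x > 1.0]   # only the top scorers
r_restricted = corr([x for x, _ in kept], [y for _, y in kept])
# r_restricted comes out well below r_full: restriction hides the relationship
```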
Factor Analysis
Trying to find some common factors amongst complex, intercorrelated datasets
How many factors do you need to explain the most variance?
How do developmental psychologists find underlying dimensions when we can only observe specific behaviors
How often does baby cry, sensitivity to lights, excessive fear of strangers
Some behaviors will cluster together
Sensitivity to pain and crying
Sea monster analogy
Visible parts move together and others move independently - intuitive correlation
Correlations between parts we can see = observable behaviors
We can infer about their underlying nature = theoretical constructs
Factor analysis
a statistical method that looks at how lots of different observations correlate and determines how many theoretical constructs could most simply explain what you see
linear combinations of variables that maximize the prediction of some criterion
matrix that shows the correlation between every variable and every other variable
Find the linear combinations, or principal components, of the variables that describe as many of the interrelationships among the variables as possible
first component will be the most successful in describing the variation among the variables, with each succeeding component somewhat less successful. Thus, we often decide to examine only a few components that account for larger proportions of the variation
find the correlation between the original items and the factors. These correlations are called factor loadings
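A sketch of the component step using a hypothetical 3-variable correlation matrix (the numbers are invented: behaviours 1 and 2 cluster, behaviour 3 mostly stands alone). The eigenvalues give each component's share of the variation, and scaling the eigenvectors by the square roots of the eigenvalues gives the factor loadings:

```python
import numpy as np

# Hypothetical correlation matrix for three observed behaviours.
R = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

eigvals, eigvecs = np.linalg.eigh(R)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # put the largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()      # proportion of variation per component
loadings = eigvecs * np.sqrt(eigvals)    # correlations of variables with components
```

The first component carries the clustered pair (high loadings for variables 1 and 2), and each succeeding component accounts for less.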
Measurement
What's the point - you usually have a choice
Trade-off between complexity and precision
Nominal, ordinal, interval, ratio
Least complex to most complex
Least precise to most precise
Program (nominal), percentile rank (ordinal), McMaster grade (interval), final percentage (ratio)
Three more correlation concepts
Bidirectionality of Predictions
X correlates with Y, Y correlates with X just as much
Three more correlation concepts
Restriction of Range
school level and knowledge of geography
Restrict to grade 3 - can't see a correlation
Does GRE predict performance in grad school? - doesn't correlate well with grad school performance - not a great predictor - restriction of range - we only see GREs that were high
Would need to let EVERYONE in to test this
Always inspect scatterplots - range restrictions, outliers, nonlinearities (curves)
Regression to the Mean
Father and biological sons
If a father is taller than average - we predict his son will also be taller than average
Height is genetically linked, so will be correlated
However, we predict that the son will be a little shorter - closer to average - than his dad
If a father is shorter than average - predict that son will also be shorter than average
Positive correlation
However we predict that son will be a little taller - closer to average - than his dad
The taller the father, the more we expect the son to fall short of his father's height (and vice versa for shorter fathers)
regression to the mean - grades
Midterm grades are positively, but not perfectly correlated with final exam grades
If you do better than average on the midterm you would do better than average on the final - but probably do a little worse
Worse than average on the midterm - better on the final - best prediction is still less than average
What if one test is easier than the other - problem - transform into z scores if things are normal
FURTHER FROM MEAN ON ONE - CLOSER ON NEXT ONES
Regression to mean only happens when stats are
imperfectly correlated
Remember a perfect correlation is +1/-1
Correlation: y = rx
(if x and y are in standardized units - z scores - and r is the correlation coefficient)
Trying to predict y in z units from x in z units
If two scores are perfectly positively correlated what is the relationship btw x and y
What if x was one sd higher than the mean
What if x was two sds higher than the mean
If two scores have a correlation of 0.5, what is the relationship btw x and y
What if x was one sd higher than the mean
What if X was two SDs higher than the mean
Y = 0.5x - son will be 0.5 sds away from the mean
Taller dad x - 2sds from mean
Y = 0.5(2)
Son predicted to be one SD above the mean - 1 SD shorter than dad
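The worked example above is just y = r * x applied in z units; the father/son correlation of 0.5 is hypothetical:

```python
# Predicting Y in z units from X in z units: y = r * x.
def predict_z(z_x, r):
    # best prediction of Y (in SDs from the mean) given X (in SDs from the mean)
    return r * z_x

# Father 2 SDs above the mean, hypothetical r = 0.5:
son = predict_z(2.0, r=0.5)          # son predicted 1 SD above the mean

# With a perfect correlation there is no regression to the mean:
son_perfect = predict_z(2.0, r=1.0)  # prediction stays 2 SDs above the mean
```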
Statistical Concepts and Rationality
If you understand the actual concepts of the normal distribution (lots of people are near the middle, fewer on the outside), plus how correlation works, including restriction of range and regression to the mean, you are in a position to act more rationally than the vast majority of the population
Spearman’s Early Studies
Spearman actually worked out most of the basics of contemporary reliability theory and published his work in a 1904 article entitled “The Proof and Measurement of Association between Two Things.”
Reliability
Does a test measure something the same way - do we get the same results every time
We dont have perfect measures - trying to measure things that are difficult to measure
Does our depression meter come up with the same thing -
What are some reasons people do better or worse in an exam than they “should”?
The test itself
The test taker
Not feeling well etc
The environment
Room was hot, loud, coughing, alarm
How the test was scored
Essay - unfair - TA scoring unreliable
True vs. Observed Scores
Theoretical idea of a true score
Imagine taking a test and receiving a score - observed score
We use it to estimate some theoretical TRUE score
Basics of Test Score Theory
Classical test score theory assumes that each person has a true score that would be obtained if there were no errors in measurement.
observed for each person almost always differs from the person’s true ability or characteristic
If it is a reliable test, the observed score should be pretty close to this theoretical true score
If the test isn’t very reliable, we would expect the observed score might be not all that close to the true score
Imagine being IN THEORY able to take that same test over and over, and receiving an observed score X each time
In the real world, we can't really do this because of practice effects etc.
We could plot the distribution of all those observed X values
Turn out normally distributed - plot and find that they are clustered around the mean - basic normal distribution
Normally distributed - know lots of things about this already
Need to know mean
Need to know SD
Can describe hundreds of things with 2 numbers
The mean of this distribution is the theoretical true score
The observed scores are normally distributed around the true score
Small SD - tightly clustered around the mean
Small variance - observed score is probably a good guess for true score
Why would an observed score differ from a true score
Error - nothing is perfect
X = T + E
Observed score X is true score plus the error
All theoretical - we can't actually calculate this directly - error can be positive or negative
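The thought experiment of retaking the same test over and over can be simulated. The true score and error SD below are invented; the point is that the observed scores distribute normally around the true score, so their mean approximates it:

```python
import random

# Classical test theory sketch: observed score X = true score T + error E.
random.seed(1)
true_score = 50.0   # hypothetical true score
error_sd = 5.0      # hypothetical spread of measurement error

# "Retake" the same test many times; error is random, positive or negative.
observed = [true_score + random.gauss(0.0, error_sd) for _ in range(10_000)]
mean_observed = sum(observed) / len(observed)
# mean_observed lands close to true_score; the SD around it reflects error
```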