Final Exam Review Flashcards
ratio
has a TRUE ZERO, unlike the other scales of measurement (nominal, ordinal, interval)
most psychological tests
tend to be ORDINAL, but we treat them as INTERVAL
central tendency
statistic that indicates the average or midmost score between the extreme scores in a distribution
○ Mean: the arithmetic average of scores in a distribution
○ Median: most useful when there are outliers
○ Mode: when two scores occur with the highest frequency, it is called BIMODAL
variability
an indication of the degree to which scores are scattered or dispersed in a distribution
○ Range
the difference between the highest and lowest scores
○ Interquartile range
the difference between the third and first quartiles of a distribution
○ Semi-interquartile range
the interquartile range divided by 2
○ Average deviation
the average of the absolute deviations of scores from the mean
○ Variance
the arithmetic mean of the squared differences between the scores in a distribution and their mean
○ Standard deviation
the square root of the average squared deviations around the mean (the square root of the variance); the TYPICAL DISTANCE OF SCORES FROM THE MEAN
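A minimal Python sketch of these variability measures; the scores are made up for illustration and NumPy is assumed to be available:

```python
import numpy as np

scores = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical test scores

score_range = scores.max() - scores.min()          # range
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                                      # interquartile range
semi_iqr = iqr / 2                                 # semi-interquartile range
avg_dev = np.mean(np.abs(scores - scores.mean()))  # average deviation
variance = np.mean((scores - scores.mean()) ** 2)  # mean of squared deviations
sd = np.sqrt(variance)                             # standard deviation (2.0 here)

print(score_range, iqr, semi_iqr, avg_dev, variance, sd)
```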
skewness
the nature and extent of asymmetry in a distribution
positive skew
relatively few scores fall at the high end of a distribution (most scores at low end), hard test
negative skew
relatively few scores fall at the low end of a distribution, most scores at high end, easy test
kurtosis
the steepness of a distribution in its center
platykurtic
relatively flat
leptokurtic
relatively peaked
mesokurtic
somewhere in the middle
normal curve
bell shaped, smooth, mathematically defined curve that is highest in its center and perfectly symmetrical
Area under the normal curve
can be divided into areas of defined units of standard deviations
What percent of scores fall between one standard deviation above and below the mean?
68% of scores fall between one SD above and below the mean
What percent of scores fall between two standard deviations above and below the mean?
95% of scores fall between two SDs above and below the mean
What percent of scores fall between three standard deviations above and below the mean?
About 99.7% of scores fall between three SDs above and below the mean
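These three areas can be verified numerically; a quick check, assuming scipy is installed:

```python
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)  # proportion within k SDs of the mean
    print(f"within ±{k} SD: {area:.2%}")
# within ±1 SD: 68.27%
# within ±2 SD: 95.45%
# within ±3 SD: 99.73%
```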
Positive correlation
as one variable increases or decreases, so does the other
Negative correlation
as one variable increases, the other decreases
Weak correlation
variables do not have strong relationship with one another
restriction of range
leads to weaker correlations
Correlation coefficient
varies in magnitude between -1 and +1
correlation of 0
no relationship
Pearson r
a method of computing correlation when both variables are linearly related and continuous
Coefficient of determination:
the variance that variables share with one another (found by
squaring r)
Spearman rho
a method for computing correlation, used primarily when sample sizes are small or variables are ordinal in nature
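A short sketch of these three correlation statistics; the paired scores are hypothetical and scipy is assumed to be available:

```python
from scipy.stats import pearsonr, spearmanr

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 8]

r, _ = pearsonr(x, y)      # linear correlation (continuous variables)
r_squared = r ** 2         # coefficient of determination: shared variance
rho, _ = spearmanr(x, y)   # rank-order correlation (ordinal/small samples)

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}, rho = {rho:.2f}")
```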
raw score
unaltered measurement
Standardized score
raw score that has been converted from one scale to another scale, where the
latter scale has some arbitrarily set mean and standard deviation
What are standard scores good for?
○ Scores are easier to interpret
○ Can compare individuals across different studies
○ Can make highly skewed data easier to work with (via normalized standard scores)
Z score
conversion of raw score into a number indicating how many SDs the raw score is
above or below the mean
z = (X - X̄) / s
T score
can be called a fifty plus or minus ten scale, mean set at 50, SD set at 10
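A tiny sketch of both standard-score conversions; the raw score, mean, and SD are hypothetical values:

```python
raw, mean, sd = 65.0, 50.0, 10.0

z = (raw - mean) / sd  # SDs above (+) or below (-) the mean
t = 50 + 10 * z        # T score: rescaled to mean 50, SD 10

print(z, t)  # 1.5 65.0
```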
Normalizing a distribution
involves “stretching” the skewed curve into the shape of a
normal curve and creating a corresponding scale of standard scores
meta analysis
a family of techniques to statistically combine information across studies to
produce single estimates of the data under study
○ Estimates are in the form of effect size which is often expressed as a correlation
coefficient
○ Useful because it examines the relationship between variables across many separate
studies
○ Important consideration: quality of population
Psychological testing assumptions:
○ 1. Psychological states and traits exist
○ 2. Psychological states/traits can be quantified or measured
○ 3. Test-related behavior predicts non-test-related behavior
■ Responses predict real-world behavior as well as future behavior
○ 4. Tests and other measurement techniques have strengths and weaknesses
■ Appreciate limitation of tests
○ 5. Various sources of error are part of the assessment process
■ Error: the long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
■ Error variance: component of a test score attributable to sources other than the trait or ability being measured
○ 6. Testing and assessment can be conducted in a fair manner
■ Some problems are more political than psychometric
○ 7. Testing and assessment benefit society
traits
a trait is any distinguishable, relatively enduring way in which one individual varies from another
Relatively stable, but may change over time
The nature of the situation influences how traits are manifested
states
less enduring
constructs
an informed, scientific concept developed to describe or explain behavior
Cannot see/touch constructs, but can infer their existence from overt behavior
Constructs -> traits -> states
reliability
the CONSISTENCY of the measuring tool, the precision with which the test measures and the extent to which error is present in the measurement
validity
test measures what it intends to measure
Reliability is NECESSARY
but not SUFFICIENT for validity
Norm referenced testing
deriving meaning from test scores by evaluating an individual test taker’s score and comparing it to a group of test takers
Norms
test performance data from a specific group of test takers, designed for use as a reference when evaluating individual test scores
Normative sample
the reference group to which test takers are compared
standardization sample
the group of test takers to whom a test is administered for the purpose of establishing norms
Stratified Sampling:
Sampling that includes different subgroups, or strata, from the population
Stratified-random Sampling
stratified sampling in which every member of the population has an equal chance of being included in the sample
Purposive sampling:
arbitrarily selecting a sample that is believed to be representative of the population; does not use probability methods
Incidental/convenience sample
sample that is convenient or available for use
○ May not be representative of the population
○ Generalization of findings from convenience samples must be made with caution
Standardization:
the process of administering a test to a representative sample of test takers for the purpose of establishing norms
IN ORDER TO STANDARDIZE A TEST:
○ Standardize the administration (including the instructions)
○ Recommend a setting for the administration and the required materials
○ Collect and analyze data
○ Summarize data using descriptive statistics (ex. Measures of central tendency
and variability)
○ Clearly describe the standardization sample characteristics
stratified sampling method
involves the division of a population into smaller groups known as strata. In stratified random sampling, the strata are formed based on members’ shared attributes and/or characteristics
What are some cultural considerations in test construction/standardization?
○ Become aware of the cultural assumptions on which the test is based
○ Consider consulting with members of the particular cultural communities regarding the appropriateness of particular assessment techniques, tests, or test items
○ Strive to incorporate assessment methods that complement the worldview and lifestyle of assessees who come from a specific cultural and linguistic population
○ Be aware of equivalence issues across cultures, including equivalence of the language used and the constructs measured
○ Score, interpret, and analyze assessment data in its cultural context with due consideration of cultural hypotheses as possible explanation for findings
criterion-referenced test
test takers are evaluated as to whether they meet a set standard or threshold (ex. a driving exam, performance on a licensing exam)
Random Error:
a source of error in measuring a targeted variable caused by
unpredictable fluctuations and inconsistencies of other variables in the measurement process (ex. Noise)
systematic error
a source of error in measuring a variable that is typically constant or proportional to what is presumed to be the true value of the variable being measured
test construction
variation may exist within items on a test or between tests
○ (ex. Item sampling, or content sampling)
test administration
sources of error from the testing environment
○ Also, test taker variables such as pressing emotional problems, physical discomfort, lack of sleep, and effects of drugs/medication
○ Examiner related variables such as physical appearance and demeanor may
play a role
Test Scoring and Interpretation:
computer testing reduces error in test scoring but many tests still require expert interpretation (ex. Projective tests)
○ Subjectivity in scoring can enter into behavioral assessment
test retest reliability
an estimate of reliability obtained by correlating pairs of scores from the same people on 2 different administrations of the same test
● Most appropriate for variables that should be stable over time (ex. personality) and not appropriate for variables expected to change over time (ex. mood)
● Estimates tend to decrease as time passes
● With intervals over 6 months, the estimate of test-retest reliability is called the coefficient of stability
parallel forms
for each form of the test, the means and variances of observed test scores are equal
alternate forms
different versions of a test that have been constructed so as to be parallel
○ DO NOT meet the strict requirements of parallel forms but typically item content
and difficulty are similar between tests
coefficient of equivalence
the degree of the relationship between various forms of a test
● Reliability is checked by administering two forms of a test to the same group
○ Scores may be affected by error related to the state of test takers (ex. practice, fatigue) or item sampling
○ Split-half reliability + Spearman-Brown formula
split-half reliability
obtained by correlating two pairs of scores obtained from
equivalent halves of a single test administered once.
○ 3 STEPS:
1. Divide the test into equivalent halves
2. Calculate a Pearson r between scores on the two halves of the test
3. Adjust the half-test reliability using the Spearman-Brown formula
Spearman-Brown formula
allows a test developer/user to estimate internal consistency reliability from a correlation between two halves of a test
● Also used to estimate the effect of shortening (or lengthening) a test; reflects how well homogeneous items correlate with one another
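A hedged sketch of the three split-half steps plus the Spearman-Brown correction; the 0/1 item matrix is invented (rows = test takers, columns = items), and an odd/even split stands in for "equivalent halves":

```python
import numpy as np

items = np.array([
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
])

# Step 1: divide the test into equivalent halves (odd vs. even items).
half_a = items[:, 0::2].sum(axis=1)
half_b = items[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: Spearman-Brown adjustment to estimate full-length reliability.
r_full = (2 * r_half) / (1 + r_half)
print(r_half, r_full)
```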
inter-item consistency
degree of relatedness of items on a test
○ Form of measuring test consistency without developing an alternate form of the test
○ Able to gauge the homogeneity of a test
○ Ideal in some cases because it is cost efficient
coefficient alpha
mean of all possible split-half correlations, corrected by the Spearman-Brown formula
○ Most popular approach for measuring internal consistency
○ Values range from 0 to 1
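A minimal coefficient alpha sketch using the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores); the 5 x 4 matrix of item scores is hypothetical:

```python
import numpy as np

items = np.array([
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
    [4, 3, 4, 4],
    [2, 2, 1, 2],
], dtype=float)

k = items.shape[1]                         # number of items
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of total test scores
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(alpha)  # ~0.96 for this made-up data
```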
average proportional distance (APD)
Focuses on the degree of difference between scores on test items. It involves averaging the differences between scores on all of the test's items and then dividing that average by the number of response options on the test, minus one
inter-scorer reliability
the degree of agreement/consistency between two or more scorers with regard to a particular measure
○ Often used with behavioral measures
○ Guards against biases or idiosyncrasies in scoring
○ Coefficient of inter-scorer reliability: scores from different raters are correlated with one another
Understand how homogeneity vs heterogeneity of test items impacts reliability
● The more homogeneous a test is, the more inter-item consistency it can be expected to have
● Test homogeneity is desirable because it allows relatively straightforward test score interpretation
Know the relation between range of test scores and reliability
● IF THE VARIANCE OF EITHER VARIABLE IN A CORRELATIONAL ANALYSIS IS RESTRICTED BY THE SAMPLING PROCEDURE USED, THEN THE RESULTING CORRELATION COEFFICIENT TENDS TO BE LOWER
● IF THE VARIANCE OF EITHER VARIABLE IN A CORRELATIONAL ANALYSIS IS INFLATED BY THE SAMPLING PROCEDURE USED, THE RESULTING CORRELATION COEFFICIENT TENDS TO BE HIGHER
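A quick simulation of how restricting the range lowers a correlation coefficient; the data are randomly generated purely for illustration (NumPy assumed):

```python
import numpy as np

gen = np.random.default_rng(0)
x = gen.normal(size=2000)
y = 0.6 * x + gen.normal(scale=0.8, size=2000)  # true correlation ~ .6

full_r = np.corrcoef(x, y)[0, 1]
keep = x > 1.0                                  # keep only high scorers
restricted_r = np.corrcoef(x[keep], y[keep])[0, 1]

print(full_r, restricted_r)  # restricted r is noticeably lower
```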
What is the impact of a speed test or power test on reliability?
● Speed test: designed so that test takers cannot finish all the items in the time allotted; items are uniformly easy. Single-administration estimates (ex. split-half) are misleading for speed tests, so test-retest or alternate-forms estimates should be used
● Power test: the time limit is long enough to attempt every item, but item difficulty is varied and some items are so difficult that no test taker earns a perfect score
Classical Test Theory CTT (AKA True-Score Model):
the most widely used model because of its simplicity
■ CTT assumptions are more readily met than Item Response Theory (IRT)
■ Problematic assumption of CTT has to do with equivalence of items on a
test
■ Typically yields longer tests
true score
value that according to classical test theory genuinely reflects an
individual’s ability (or trait) level as measured by a particular test
Item Response Theory (IRT):
Provides a way to model the probability that a person with X ability will be able to perform at a level of Y
○ Refers to a family of methods and techniques
○ Incorporates considerations of item difficulty and discrimination
Generalizability Theory
A person’s test scores vary from testing to testing because of variables in the testing situation
○ Cronbach encouraged test developers and researchers to describe the details of
the particular test situation or universe leading to a specific test score
○ A universe is described in terms of its facets, including the number of items in the test, the amount of training the test scorers have had, and the purpose of the test
administration
Standard Error of Measurement
the amount of error inherent in an observed score or measurement
○ Usually, the higher the reliability, the lower the SEM value
Be able to calculate the confidence interval if given the standard error of measurement and the confidence level index
● CI = X ± (z value for the confidence level) × SEM
○ Confidence interval = observed score plus/minus the z score for the chosen confidence level multiplied by the standard error of measurement
Be able to use confidence interval information to interpret test scores
● Confidence intervals give the interval within which the test taker's true score is assumed to lie
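A worked sketch of the calculation above, with hypothetical values; SEM = SD * sqrt(1 - reliability) is the standard formula for the standard error of measurement:

```python
import math

sd, reliability = 15.0, 0.96
sem = sd * math.sqrt(1 - reliability)  # 15 * 0.2 = 3.0

observed, z = 100.0, 1.96              # z = 1.96 for a 95% confidence level
lower = observed - z * sem
upper = observed + z * sem
print(f"95% CI: {lower:.2f} to {upper:.2f}")  # 94.12 to 105.88
```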
Standard error of difference:
a measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant
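A sketch using the usual formula for the standard error of the difference between two scores, sqrt(SEM₁² + SEM₂²); the SEM values are hypothetical:

```python
import math

sem_1, sem_2 = 3.0, 4.0
se_diff = math.sqrt(sem_1**2 + sem_2**2)
print(se_diff)  # 5.0
```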
face validity
a judgment concerning how relevant the test items appear to be
○ If a test appears to measure what it is supposed to measure (“on the face of it”), it could be high in face validity
○ A perceived lack of face validity may contribute to a lack of confidence in a test
Content Validity:
how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
○ “Do the items in the test adequately represent the content that should be included in the test?”
test blueprint
plan regarding the types of information to be covered by the
items, the # of items tapping each area of coverage, the organization of the items
in the test, etc.
○ Typically established by recruiting a team of experts on the subject matter and
obtaining expert ratings on the degree of item importance as well as scrutinize
what is missing from the measure
Criterion-Related validity:
A judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest (i.e. the criterion)
Construct Validity (umbrella for validity)
The ability of a test to measure the theorized construct (ex. intellect, personality) that it aims to measure. A measure of validity that is arrived at by executing a comprehensive analysis of:
○ 1. How scores on a test relate to other test scores and measures
○ 2. How scores on the test can be understood within some theoretical framework
for understanding the construct that the test was designed to measure
The Validity Coefficient:
correlation coefficient between test scores and scores on the criterion measure
○ Validity coefficients are affected by restriction or inflation of range
Incremental Validity:
degree to which an additional predictor explains something about the criterion measure that isn’t explained by predictors already in use
○ “To what extent does a test predict the criterion over and above other variables?”
Understand what constitutes good face validity and what happens if it is lacking/why we might not want a test to be face valid
● If a test seems subjectively relevant and transparent from the perspective of the test taker, it has good face validity
● One may not want a test to be face valid when its transparency would make it easy for test takers to slant or fake their responses
criterion
the standard against which a test/test score is evaluated
● An adequate criterion is valid for the matter at hand, valid for the purpose it is being used, and uncontaminated, meaning it is not part of the predictor
Concurrent validity
an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
Predictive validity
an index of the degree to which a test score predicts some criterion, or outcome, measure in the future. Tests are evaluated as to their predictive validity.
Base Rate:
extent to which the phenomenon exists in a population
Hit Rate:
accurate identification (true-positive/negative)
Miss Rate:
failure to identify accurately (false positive/false negative)
False positive
a miss wherein the test predicted that the test taker did possess the
particular characteristic or attribute being measured when in fact the test taker did not
False negative:
is a miss wherein the test predicted that the test taker did not possess the particular characteristic or attribute being measured when the test taker actually did
What happens to the validity coefficient when you restrict or inflate the range of scores?
● Using a full range of test scores enables you to obtain a more accurate validity coefficient, which will usually be higher than the coefficient you obtained using a restricted range of scores
● Increased range = higher validity coefficient
Why is incremental validity important?
● An additional predictor is worth adding only if it explains something about the criterion over and above the predictors already in use
If a test has high construct validity, what does this tell you about the test?
● It means the test is a valid measure of the construct, thus a good test
● Be familiar with the different types of evidence for construct validity
Evidence of homogeneity:
how uniform a test is in measuring a single concept
Evidence of changes:
some constructs are expected to change over time (ex. reading rate)
Evidence of pretest/posttest changes:
test scores change as a result of some
experience between a pretest and posttest (ex. therapy)
Evidence from distinct groups:
scores on a test vary in a predictable way as a
function of membership in some group (ex. Scores on the psychopathy checklist for prisoners vs. civilians)
Convergent evidence:
correlate highly in the predicted direction with scores on older, more established tests designed to measure the same constructs
Discriminant evidence:
showing little relationship between test scores and other variables with which scores on the test being construct validated should not theoretically be correlated
factor analysis
A new test should load on a common factor with other tests of the same construct
bias
A factor inherent in a test that systematically prevents accurate, impartial measurement
Implies systematic variation in scores
○ Prevention during test development is the best cure for bias
Fairness:
the extent to which a test is used in an impartial, just, equitable way
How do bias and fairness relate? Can you have an unbiased, yet unfair test?
A test cannot be fair if it is biased; however, a test can be free of bias and still be unfair (ex. if it is used in an inequitable way)
rater error
a judgment resulting from the intentional or unintentional misuse of a rating scale
○ Raters may be too lenient (leniency error/generosity error), too severe (severity error), or reluctant to give ratings at the extremes (central tendency error)
■ Ex. leniency error: a teacher who is an easy grader
■ Ex. severity error: movie critics who pan everything they review
■ Ex. central tendency error: an employer rates most employees toward the middle of a 1-10 scale rather than at 1 or 10