GIGA PRACTICE Flashcards
2 main categories of tests
Ability tests vs Personality tests
Ability test def
Measures skills in terms of speed, accuracy, or both.
=> The faster or the more accurate your responses, the better your scores on a particular characteristic.
What are the 3 types of ability tests?
Achievement, Aptitude and Intelligence tests
Achievement test def
Measures previous learning.
- E.g. A test that measures or evaluates how many words you can spell correctly is called a spelling achievement test.
Aptitude test def
Measures potential for acquiring a specific skill.
- A spelling aptitude test measures how many words you might be able to spell given a certain amount of training, education, and experience.
Intelligence test def
Measures potential to solve problems, adapt to changing circumstances, and profit from experience.
Types of personality tests
Structured (objective) and Projective tests
Structured personality tests def
Provide self-report statements that require the subject to choose between two or more alternative responses, such as “True” or “False”; “Yes” or “No”.
Reliability def
Degree to which test scores are FREE OF MEASUREMENT ERRORS.
-> There are many ways a test can be reliable (e.g., test results may be reliable over time).
A psychological test must be (3)
(1) Objective: reflect reality - not what we want reality to be
(2) Reliable: provide us with the same reading anytime, use instrument under the same conditions
(3) Valid: measure what we want to measure
How do Psychological Tests differ from Other Measurement Tools? (2)
(1) Focus on intangible, theoretical CONSTRUCTS (e.g. psychological attributes) unlike tools measuring physical properties (e.g. rulers, scales).
(2) For most of them, you need to have some SPECIALIZED KNOWLEDGE for proper interpretation unlike physical measurements (e.g. ruler).
Construct def
Unobservable, theoretical abstract concept. Measured indirectly through behaviours, responses or test results
E.g. intelligence, anxiety, self-esteem
Defining Characteristics of Psychological Tests (5)
(1) Representative SAMPLE of behaviors
(2) OBSERVABLE and MEASURABLE actions
(3) Thought to measure a PSYCHOLOGICAL ATTRIBUTE
(4) Behavioral samples obtained under STANDARDIZED conditions
(5) Have RULES for SCORING.
A construct is hypothesized to explain _________________________________
the covariation between observed behaviors
Kinds of Purposes for Testing (4)
(1) Classification
(2) Promoting Self-Understanding and Self-Improvement
(3) Planning, Evaluation and Modification of Treatments and Programs
(4) Scientific Inquiry (Quantification, Hypothesis testing)
Types of scales (4)
Nominal, Ordinal, Interval, Ratio
Types of Norms (3)
(1) DEVELOPMENTAL Norms
(2) WITHIN-GROUP Norms
(3) CRITERION-REFERENCED Norms (= norms without a norm sample)
Developmental Norms def
Typical level of performance at each of the AGE groups or grade levels that make up the test’s target population.
-> Age-equivalent or grade-equivalent scores are assigned based on the MEDIAN RAW SCORE for that chronological age or grade level.
-> Median = TYPICAL score = norm
Within-Group Norms (3)
(1) Percentiles
(2) Z-scores
(3) Transformed standard scores
Standard Deviation def
A measure of the average distance of scores from the mean.
Transformed Standard Score formula
Bz + A
B = desired SD
A = desired Mean
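A minimal Python sketch of the Bz + A transformation (all numbers invented for illustration; B = 10 and A = 50 yield a T-score):
```python
# Sketch: raw score -> z-score -> transformed standard score (Bz + A).
# The norm-sample values below are invented for illustration.
import statistics

norm_sample = [12, 15, 18, 20, 25]   # hypothetical norm-sample raw scores
mean = statistics.mean(norm_sample)   # M = 18
sd = statistics.stdev(norm_sample)    # SD of the norm sample

x = 20                                # one person's raw score
z = (x - mean) / sd                   # standard (z) score

t_score = 10 * z + 50                 # Bz + A with B = 10, A = 50 (T-score)
print(round(z, 2), round(t_score, 1))
```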
Percentiles disadvantages (2)
(1) Magnifies differences near mean; minimizes differences at extremes
(2) Some common statistical analyses are NOT possible with percentiles
Standard score disadvantages (2)
(1) Unfamiliar to many non-specialists
(2) Interpretation difficult when distribution not normal
Criterion-Referenced Norms def
Evaluate performance relative to an absolute criterion or standard rather than performance of other individuals.
-> An absolute vs relative evaluation
Within-Group Norms: Criticisms (2)
(1) Only meaningful if the standardization (norm) sample is representative
(2) Within-group comparisons encourage competition
Requirement for Criterion-Referenced Norms
Define content of domain narrowly and specifically.
E.g. Driving skills, 8th grade math curriculum
Criterion-Referenced Norms: Issues (3)
(1) Can elements of performance be specifically defined?
-> Hard to clearly define what “good” or “bad” performance looks like.
-> Criterion-referenced norms require a clear standard (e.g., scoring 80% on a test to pass), but creating these standards can be challenging because it’s hard to decide what knowledge or skills are essential.
(2) Focus on minimum standards
-> e.g., “Did you pass?”
-> Ignore how much better one person is compared to others.
(3) Absence of relative knowledge
-> You don’t know how someone performs compared to others.
Developmental norms cons
Often interpreted inappropriately
-> Overgeneralization, misinterpreting median…
What is an elevated score?
A score at least 2 z-scores (i.e., 2 SDs) above the mean.
Properties of scales (3)
(1) Magnitude
(2) Equal Intervals
(3) Absolute 0
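For reference, the standard mapping of these properties onto the four scale types:
-> Nominal: none of the three; Ordinal: magnitude only; Interval: magnitude + equal intervals; Ratio: all three (magnitude, equal intervals, absolute 0).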
McCall’s T/T-score
Same as standard (z) scores, except that M = 50 and SD = 10.
Interquartile range
Interval of scores bounded by the 25th and 75th percentiles.
-> bounded by the range of scores that represents the middle 50% of the distribution.
Stanine system
Converts any set of scores into a transformed scale, which ranges from 1 to 9.
M = 5, SD = 2
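A minimal Python sketch using the common shortcut stanine ≈ 2z + 5, rounded and clipped to 1-9 (an approximation; exact stanines use fixed percentile bands):
```python
# Sketch: approximate stanine from a z-score (stanine ~ 2z + 5,
# rounded, then clipped to the 1-9 range).
def stanine(z: float) -> int:
    return max(1, min(9, round(2 * z + 5)))

print(stanine(0.0))   # 5 -> the mean falls in the middle stanine
print(stanine(1.2))   # 7
print(stanine(-3.0))  # 1 -> extreme scores are clipped at the floor
```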
Overselection
Selecting a higher percentage from a particular group than would be expected on the basis of the representation of that group in the applicant pool.
Tracking
Developmental norms. Tendency to stay at about the same level relative to one’s peers.
Big Data
Revolution in social science research.
= Data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time.
Pearson Correlation Coefficient def
QUANTITATIVE description of the DIRECTION and STRENGTH of a straight-line relationship between 2 variables.
Correlation Coefficient Range
-1 to 1
We cannot use Pearson’s r for _____
Non-linear relationships
-> Non-linear relationships cannot be described, regardless of their strength.
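A quick Python illustration of this limitation (toy data): a perfect curvilinear relationship that Pearson’s r reports as roughly zero:
```python
# Sketch: Pearson's r misses a strong but non-linear relationship.
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2                    # y depends perfectly on x, but not linearly

r = np.corrcoef(x, y)[0, 1]   # measures straight-line association only
print(round(r, 4))            # ~0.0 -> r reports "no relationship"
```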
Classical Test Theory (CTT): Assumptions (4)
(1) Each person has a true score that would be obtained if there were no errors in measurement. Observed test score (X) = True test score (T) + Error (E)
(2) Measurement errors are random
(3) Measurement error is normally distributed
(4) Variance of OBSERVED scores = Variance of true scores + Error variance
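A short simulation sketch of assumptions (1)-(4) (all numbers invented; the last line previews the reliability coefficient as var(T)/var(X)):
```python
# Sketch: X = T + E, with random, mean-zero, normally distributed error.
import numpy as np

rng = np.random.default_rng(0)
true = rng.normal(100, 15, 100_000)   # T: true scores (SD = 15)
error = rng.normal(0, 5, 100_000)     # E: random error (mean 0, SD = 5)
observed = true + error               # X = T + E

print(observed.var())                 # ~250 = 225 + 25
print(true.var() + error.var())       # variance decomposition, assumption (4)
print(true.var() / observed.var())    # ~.90 = reliability coefficient
```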
A person’s true score def
The hypothetical or ideal measure of a person’s attribute we aim to capture with a psychological test.
=> FREE FROM ERROR
-> Expected score over an INFINITE number of independent administrations of the test
Mean error of measurement = ____
Errors are ____ with each other
True scores and errors are _______
0; UNcorrelated; UNcorrelated
Two tests are parallel if: (3)
(1) EQUAL observed score MEANS
-> Comes from the assumption that True scores would be the same
(2) EQUAL ERROR VARIANCE
(3) SAME CORRELATIONS with other tests
Random error characteristics (3)
(1) Random
(2) Cancels itself out
(3) Lowers reliability of the test
Systematic error characteristic
Occurs when source of error always increases or decreases a true score
-> DOESN’T LOWER RELIABILITY of a test since the test is RELIABLY INACCURATE by the same amount each time
Sources of Measurement Error (3)
(1) CONTENT Sampling Error
(2) TIME Sampling Error
(3) Other Sources of Error (e.g. observer differences)
Reliability Coefficient def
Proportion of VARIANCE in OBSERVED test scores accounted for by variability in TRUE scores.
Standard Error of Measurement (SEM) def
Amount of uncertainty/error expected in an individual’s observed test score.
**=> Corresponds to the SD of the distribution of scores one would obtain by repeatedly testing a person.**
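For reference, the standard CTT formula, with s = SD of observed scores and r = the reliability coefficient:
SEM = s × √(1 - r)
E.g., s = 15 and r = .91 -> SEM = 15 × √.09 = 4.5.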
Spearman-Brown formula def
Predicts the effect of lengthening or shortening a test on reliability.
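The formula, with r = current reliability and k = (new length) / (old length):
r_new = (k × r) / (1 + (k - 1) × r)
E.g., doubling (k = 2) a test with r = .70 gives (2 × .70) / (1 + .70) ≈ .82.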
Test reliability is usually estimated with what methods? (4)
(1) Test-retest
(2) Alternate (Parallel) Forms
(3) Internal consistency
(4) Interrater (agreement between raters)
Test-Retest method is an example of ____ sampling
time
-> Higher when the construct being measured is expected to be STABLE than when it is expected to CHANGE
Alternate (Parallel) Forms method is an example of ____ sampling
item
How High Should INTERNAL CONSISTENCY Coefficients Be? (*do not confuse with other coefficients)
Higher for “narrow” constructs
Lower for “broader” constructs
-> Very high may indicate insufficient sampling in the domain
E.g. Medium internal consistency is bad for a narrow construct (panic disorder), but not so bad for a broad construct (Neuroticism)
What’s the older approach used to estimate the internal consistency of a test?
Split-half method
What’s the contemporary approach used to estimate the internal consistency of a test?
CRONBACH’S ALPHA = AVERAGE OF ALL POSSIBLE SPLIT-HALF RELIABILITIES
Unaffected by how items are arranged in the test
-> Most general method of finding estimates of reliability through internal consistency.
(Kuder-Richardson also a possibility)
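A minimal Python sketch of the alpha computation from a persons × items score matrix (toy data invented for illustration):
```python
# Sketch: Cronbach's alpha = (k / (k - 1)) * (1 - sum(item vars) / var(total)).
import numpy as np

scores = np.array([   # rows = persons, columns = items (toy data)
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 3, 4],
])

k = scores.shape[1]                          # number of items
item_vars = scores.var(axis=0, ddof=1)       # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(round(alpha, 3))                       # high -> items covary strongly
```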
Kappa formula
Interrater Agreement
Proportion of the potential agreement following CORRECTION FOR CHANCE.
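The formula, with P_o = observed proportion of agreement and P_e = proportion of agreement expected by chance:
Kappa = (P_o - P_e) / (1 - P_e)
E.g., P_o = .80 and P_e = .50 -> Kappa = .30 / .50 = .60.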
Domain Sampling Model conceptualizes reliability as the ratio of the variance of the observed score on the _____ test to the variance of the _______.
shorter, long-run true score
Test-Retest Method: Problems
CARRYOVER EFFECTS: Occurs when the first testing session influences scores from the second session.
When there are carryover effects, the test-retest correlation usually ________ the true reliability.
OVERESTIMATES
-> This can happen because the participant REMEMBERS items or patterns from the first test, so their performance on the second test is less independent than it should be.
What method provides one of the most rigorous assessments of reliability commonly in use?
Parallel Forms Method
Problems with Split-Half method (2)
(1) The two halves may have different variances.
(2) The split-half method also requires that each half be scored separately, possibly creating additional work.
KR20 Formula
Equivalent of alpha for tests with dichotomous items (e.g. right/wrong)
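For reference, with k = number of items, p_i = proportion passing item i, q_i = 1 - p_i, and s² = variance of total test scores:
KR20 = [k / (k - 1)] × (1 - Σ p_i q_i / s²)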
Sources of measurement error: (3)
(1) Time sampling: The same test given at different points in time may produce different scores, even if given to the same test takers.
(2) Item sampling: The same construct or attribute may be assessed using a wide pool of items.
(3) When different observers record the same behavior: Different judges observing the same event may record different numbers.
How do we assess measurement error associated with item sampling?
Parallel forms, Internal consistency
What to Do about Low Reliability? (3)
(1) Increase the # of Items
(2) Throw out items that run down the reliability (by running a factor/discriminability analysis)
(3) Estimate what the true correlation would have been (CORRECTION FOR ATTENUATION)
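The correction-for-attenuation formula, with r_12 = observed correlation between the two measures and r_11, r_22 = their reliabilities:
r̂_12 = r_12 / √(r_11 × r_22)
E.g., r_12 = .30 with r_11 = r_22 = .60 -> r̂_12 = .30 / .60 = .50.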
Kappa stat range
-1 to 1 (in practice, usually 0 to 1).
Kappa = 0 is considered poor -> means the agreement is no better than chance (negative values mean worse than chance).
Kappa = 1 represents perfect, complete agreement.
When random error is HIGH on both tests, the correlation between the scores will be _____ compared to when the random error is ___.
lower; small
Difference Score def
Subtracting one test score from another
-> e.g., scores on two different attributes, or the same attribute measured at two points in time
Why are difference score unreliable?
Difference scores are unreliable because the random errors of both scores compound, while the shared true-score variance is subtracted out.
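One standard CTT formula for the reliability of a difference score (assuming the two tests have equal variances), with r_11, r_22 = the two reliabilities and r_12 = the correlation between the tests:
r_dd = [0.5 × (r_11 + r_22) - r_12] / (1 - r_12)
E.g., r_11 = r_22 = .80 and r_12 = .70 -> r_dd = (.80 - .70) / .30 ≈ .33, far below either test’s reliability.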
What do we mean when we say that “Validity is NOT a yes/no decision”
- It comes in degrees and applies to a particular USE and a particular POPULATION
- It is a process: An ongoing, dynamic effort to accumulate evidence for a sound scientific basis for proposed test score interpretations
3 Types of Validity
Content, Criterion, Construct
Subtypes of Criterion validity
Concurrent, Predictive
Subtypes of Construct validity
Convergent, Divergent
A test with high face validity may: (3)
(1) Induce cooperation and positive motivation before and during test administration
(2) Reduce dissatisfaction and feelings of injustice among low scorers
(3) Convince policymakers, employers, and administrators to implement the test
-> but sometimes a test with low face validity elicits more honest responses
Types of criteria (2)
Objective & Subjective criteria
Objective criterion
Observable and Measurable
E.g., Number of accidents, days of absence
Subjective criterion
Based on a person’s judgement
E.g., Supervisor ratings, peer ratings
What happens if the criterion measures FEWER dimensions than those measured by the test?
This decreases the evidence of validity because the criterion has UNDERREPRESENTED some important characteristics (= criterion DEFICIENCY).
Criterion contamination def
If the criterion measures MORE dimensions than those measured by the test