testing and measurement 2 Flashcards
6 Steps to Test Development
1) defining purpose
2) preliminary design issues
3) item prep
4) item analysis
5) standardizing & research
6) prep of final product
Step one of test development
Statement of purpose, simple one sentence
-include the characteristic being measured and the target population
Preliminary Design Issues
Step two:
Mode of administration, length, item format, number of scores, score reports, administrator training and background research
Mode of Administration
Group or Individual
Item Format
multiple choice, true/false, agree or disagree, or constructed by the responder (written answers)
Number of Scores
Related to length, how many scores
Score Reports
computer generated, hand written? total score, norms, subgroups
Administrator Training
Extensive professional training to administer, score and interpret? How will that be provided? Or no training?
Background Research
standard lit on things being studied, and study of clinicians who would use the test
Anatomy of a Test Item
Stimulus, Response Format (Conditions Governing Response), Scoring Procedures
Stimulus
the question being asked
Response Format
how can the person respond? Multiple choice, T/F, or constructed (meaning any way you want)
Constructed Response
The person taking the test responds in any way they choose: written responses, free response
Conditions Governing the Response
what influences response, time limit, can the administrator ask for clarification, answer sheet or writing etc
Scoring Procedures
Partial credit, correct/incorrect, constructed response
Two Types of Test Items
Selected-Response Test Items, Constructed Test Items
Selected-Response Test Items
multiple choice, forced choice, likert format, true/false items
Scoring Selected-Response Items
correct/incorrect, sometimes using weighted questions
Constructed Response Example Items
Essay Test, Performance Assessment, Portfolio
Scoring Constructed-Response Items
need to have inter-rater reliability, and conceptualizing a scheme for scoring
Holistic Score
scoring constructed response items by the rater giving them one whole score
Analytic Scoring
constructed response item scoring where the rater assesses different dimensions of the test (and they might even be rated by different people)
Point System
Point system of scoring Constructed Response Items, awarding points for certain predetermined aspects of things
Automated Scoring of Constructed Response Items
Using sophisticated computers to judge free responses by simulating human judgement
Suggestions for Writing Selected-Response
Extensive List: but keep it simple, get to the point, don’t give away the answer
Suggestions for Writing Constructed-Response
Task Should be clear, specific about scoring system when item made, use a sufficient number of items
Pros of Selected Vs. Constructed
Scoring reliability, takes less time to get more information, the scoring is more efficient
Pros of Constructed Vs. Selected
easier to understand how test taker thinks, they can explore more personal difference (oddities that wouldn’t come up in selected response)
Item Analysis
involves item tryout, statistical analysis and item selection (figuring out which items are ‘good’ or ‘bad’)
Item Tryout
two stages, formal and informal
Formal Item Tryout
administering test items to samples of target population
Informal Item Tryout
very small groups of the population asked what they think about the items, or think aloud as they complete them
Item Difficulty
percent of examinees who get an item right
P-Values
the item difficulty levels are often called this, meaning the p (percentage) who got it right
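A minimal sketch of how p-values could be computed from tryout data. The response matrix and the function name `item_p_values` are made up for illustration; they are not from any particular test manual.

```python
# Hypothetical scored responses: rows are examinees, columns are items
# (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]


def item_p_values(matrix):
    """Return the proportion of examinees answering each item correctly."""
    n_examinees = len(matrix)
    n_items = len(matrix[0])
    return [sum(row[i] for row in matrix) / n_examinees for i in range(n_items)]


p_values = item_p_values(responses)
print(p_values)  # [0.8, 0.8, 0.2, 0.8]
```

Here the third item (p = .2) is much harder than the others (p = .8).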
Item Discrimination
Item’s ability to differentiate statistically between groups of examinees
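One common way to quantify discrimination is the D index: the proportion correct in a high-scoring group minus the proportion correct in a low-scoring group. The sketch below assumes that definition; the data and the name `discrimination_index` are hypothetical.

```python
def discrimination_index(high_group, low_group):
    """D = proportion correct among high scorers minus proportion correct among low scorers."""
    p_high = sum(high_group) / len(high_group)
    p_low = sum(low_group) / len(low_group)
    return p_high - p_low


# Hypothetical responses to one item (1 = correct) from the top and bottom
# scorers on the whole test.
high = [1, 1, 1, 0, 1, 1, 1, 1]  # 7 of 8 correct
low = [0, 1, 0, 0, 1, 0, 0, 1]   # 3 of 8 correct

d = discrimination_index(high, low)
print(d)  # 0.5
```

A positive D means the item favors the high-scoring group; a D near zero or below suggests the item does not separate the groups.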
Distractor Analysis
a distractor is an incorrect or non-preferred answer option; analyzing which distractors examinees choose can reveal misinterpretation of the question, etc.
Factor Analysis
Used to identify which items cluster together on the same underlying factor, so items can be grouped into scales that yield more meaningful scores
Item Selection
Choosing items so as to 1) increase the reliability of the test, 2) reach the right average difficulty, 3) discriminate between groups, 4) keep in mind that the maximum possible D (discrimination) occurs when P (difficulty) is at its midpoint, and 5) make sure the content is actually covered; don’t eliminate important items
Standardizing Program
shows the norms for the test
Equating Programs
making sure tests equate to one another
Publishing Tests Materials
Technical Manuals, Score Reports, Supplementary Materials, Test Completed, Administrator Training
Continuing Research on New Tests
Updating new norms and discovering applicability
Two Classical Theories of Intelligence
Spearman’s g, and Thurstone’s primary mental abilities
Spearman’s Theory of Intelligence
g is general intelligence, common to all mental tests. The s factors are specific to each individual test/subtest.
Two factor, g and s, theory
Thurstone
Primary Mental Abilities theory of intelligence
Primary Mental Abilities Theory of Intelligence
Thurstone’s, originally 9 mental abilities, a multiple-factor theory
The Original Nine Primary Mental Abilities
Spatial, Perceptual (speed of perception), numerical, verbal, memory, words, induction (finding a rule or principle to solve a problem), Reasoning (arithmetic), Deduction (factor weakly defined calling for application of a rule)
Hierarchical Model
Compromise, different intelligences are arranged with some more important than others
Cattell
Fluid and Crystallized Intelligence
Hierarchical Characteristics
Complex factor analysis, separate intelligences, some better than others
One Vs. Many
argument of intelligences, Spearman says 1, Thurstone says many
Gc
crystallized intelligence by Cattell, sum of everything one has learned, mental skills, education, relationships etc.
Gf
General fluid intelligence is the raw mental power, potential for intelligence
Additional Factors for Cattell & Horn’s Model
short and long term memory, visual and auditory skills, processing speed on easy tasks, decision speed (problem solving tasks) and quantitative reasoning
Vernon’s Model
Hierarchy, all under g, which splits into two major group factors, v:ed (verbal:educational) and k:m (spatial:mechanical), and then some of the other skills cluster under these (numbers, psychomotor, reading)
Carroll’s Summary
Three-stratum theory
Three Stratum Theory
g at the top, then Gc and Gf (as well as others, some like Thurstone’s), third level there are more specific abilities
Developmental Theories
1) stages, 2) stages happen in the same order for all people (if not the same time), 3) stages are cumulative and not reversible
Piaget Theory of Cognitive development
4 stages
Sensorimotor
no object permanence, lack of input
birth-2 yrs
Preoperational
use words to symbolize, lacks principles of conservation
2-6 yrs
Concrete Operational
Uses principles of conservation and reversibility
7-12 yrs
Formal Operational
Mature Adult thinking in terms of hypotheses, cause and effect
12+ yrs
Information Processing Model
theory of intelligence that focuses on how people process information, modeled on computer processing
Biological Models
brain functioning, as the basis for understanding human intelligence
Assimilation
putting things into your schemas, all four legged animals are dogs to kids
Accommodation
changing your perception to fit reality, horses aren’t dogs
Howard Gardner
Theory of Multiple Intelligences, at least 8 intelligences
Gardner’s 8 Multiple Intelligences
Spatial, Linguistic, Logical-mathematical, Bodily-kinesthetic, Musical, Interpersonal, Intrapersonal, Naturalistic
3 Things to Remember about Group Differences
1) Distributions mostly overlap, even if averages are slightly different, 2) a difference doesn’t tell us why, 3) differences are always changing, and may not last forever
Differences in Intelligence by Sex
minimal in terms of total scores, some difference in verbal and spatial skills. More males tend to perform very high or very low
Group Age differences
steep increase: 0-12
Maximum: 16-20
Level: 25-60
Period of Decline: 60+
Flynn Effect
Increase in IQ scores of 3 pts per decade (meaning 12 pts across generations)
Group intelligence by Race
Hispanic and Native American lower than white by 1/2 to 1 SD
Black about 1 SD below white
Asian about 1 SD above on non verbal
Number of Chromosomes
23 pairs, 46 total
DNA affects intelligence
Behavioral Genetics
the genetic and environmental basis for differences in psychological traits
Heritability of Intelligence
about .6: roughly 60% of the variation in general intelligence within a population is related to genes
Common Features of Individual Intelligence Tests
1) most are individually administered
2) administration requires advanced training 3) wide range of age and ability (w/ start and stop rules) 4) establish rapport with examinee 5) free response 6) immediate scoring 7) about 1 hr for test 8) opportunity for observation
Two Main Uses of Intelligence Tests
Clinical/School/Counseling or for research
Sir Francis Galton
used simple measures, studied heritability, and invented the bivariate distribution (scatterplot)
Alfred Binet
Father of intelligence testing, Binet scale 1905, mental age
Lewis Terman
revised the Binet scale into the Stanford-Binet
Intelligence Quotient
(Mental Age / Chronological Age) x 100 = IQ
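A small worked example of the ratio formula above. The function name `ratio_iq` and the ages are chosen for illustration only.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological (true) age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return mental_age * 100 / chronological_age


print(ratio_iq(10, 8))   # 125.0 (mental age ahead of true age)
print(ratio_iq(8, 10))   # 80.0  (mental age behind true age)
print(ratio_iq(9, 9))    # 100.0 (mental age equals true age)
```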
Arthur Otis
Revised Stanford-Binet and invented Army Alpha and Beta for group testing
David Wechsler
Invented Wechsler tests, with standardized IQ score of M=100 and SD= 15
Frequency Distribution for Wechsler Scale (3 SD)
55 is 3 SD below, 145 is 3 SD above, 99.7% fall within these numbers, .3% fall outside (below or above)
99.7% on 100 mean IQ
fall between 55 and 145
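The 55 to 145 band can be checked with a short sketch, assuming IQ scores follow a normal distribution with M = 100 and SD = 15. The helper name `band` is hypothetical; `NormalDist` is from Python's standard library.

```python
from statistics import NormalDist

# Deviation IQ scale, assumed normal with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)


def band(k):
    """Return (lower, upper, share of scores) for the range within k SDs of the mean."""
    lower = iq.mean - k * iq.stdev
    upper = iq.mean + k * iq.stdev
    return lower, upper, iq.cdf(upper) - iq.cdf(lower)


for k in (1, 2, 3):
    lower, upper, share = band(k)
    print(f"+/-{k} SD: {lower:.0f} to {upper:.0f}, {share:.1%} of scores")
```

For k = 3 this gives 55 to 145 and about 99.7%, leaving roughly .3% outside the band.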
Validity of IQ Tests
Predicts school success, with validity correlations of about .60 to .70
Army Alpha
Created by Otis, Army intelligence tests for literate recruits
Army Beta
Created by Otis, Army intelligence tests for illiterate or immigrant recruits
Number of Group Mental Ability Tests given per year
50 million
Achievement tests and group Mental Ability Tests
given together they show differences between ability and achievement
Four Major Uses of Group Mental Ability Tests
1) in schools 2) Predicting success 3) job selection 4) research in social and behavioral sciences
8 Common Characteristics of GROUP mental ability tests
1) given to large groups 2) content similar to individual 3) multiple choice, machine scored 4) fixed time limit/number of questions 5)45-60 min OR 2.5- 3 hours 6) one total score, several subscores 7) normative samples very large 8) main purpose is prediction
Start-Stop Rules
For individual tests, tells the person where to start or stop the given questions
Multilevel Tests
group tests are multilevel, there are different tests for different ages or grades
Otis-Lennon School Ability Test
most widely used group mental ability test; measures school ability across grade levels
OLSAT 8 Structure
there are 7 levels for Kindergarten through grade 12, about 1 hour, test levels overlap to assess students above or below
OLSAT 8 Framework & Items & Philosophy
Uses Vernon’s hierarchical model, looks for V:ed, and there are 8 items with cluster
OLSAT 8 Item Clusters
Verbal comp, verbal reason, pictorial reasoning, figural reasoning, quantitative reasoning
OLSAT 8 Scores
total score: verbal + nonverbal and these three scores are used to find SAI:
School Ability Index: M= 100, SD= 16
Normed by age to 3 months
Used to predict performance
OLSAT 8 Norms
Normed for both fall and spring, and for Socio economic status, geographic region and ethnicity
Two Cautions about OLSAT 8 Norms
1) We don’t know who wasn’t there, thus excluded from norms 2) We don’t know motivation of students (how hard were they trying)
OLSAT 8 Reliability (total, verbal/nonverbal, SEM)
Internal Total= .89 to .94
Verbal and Nonverbal= .81 to .90
No Test-Retest Data
Standard Error of Measurement- 5.5 to 5.8
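A rough sketch of how a standard error of measurement is used. It assumes the classical formula SEM = SD x sqrt(1 - reliability) and the usual reading that about two thirds of observed scores fall within 1 SEM of the true score. The reliability of .88 below is a hypothetical input for illustration, not a figure from the OLSAT manual, and the function names are made up.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def score_band(score, sem, n_sems=1):
    """Return the (low, high) band of +/- n_sems SEMs around an observed score."""
    return score - n_sems * sem, score + n_sems * sem


# Hypothetical reliability of .88 on a scale with SD = 16.
sem = standard_error_of_measurement(16, 0.88)
print(round(sem, 2))  # 5.54

# Band around an SAI of 100 using the upper reported SEM of 5.8.
low, high = score_band(100, 5.8)
print(round(low, 1), round(high, 1))  # 94.2 105.8
```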
OLSAT 8 Validity
Criterion related with the Stanford Achievement Test, no factor analysis though
College Admissions Tests 3 Purposes
1st: Selection of Students
2nd: placement into courses in college
3rd: describe the college (our students average score = ____)
SAT 2005
Critical Reading, Math, Writing (as of 2005)
attempts to measure general abilities developed in school
Takes 3 hours, 20 min
SAT Test Items
Critical Reading, Mathematics, Writing
Critical Reading SAT
Sentence Completion, Reading passages
25, 25, 20 = 70 minutes total in 3 sections
Mathematics SAT
Multiple choice, grid-in
25, 25, 20 = 70 minutes total in 3 sections
Writing SAT
essay, multiple choice
25 (essay), 25, 10 (multiple choice) = 60 minutes total
HSGPA, SAT = college success
correlate about equally with success, about .5 each
SAT Scores and Norms
M= 500 SD= 100, range= 200 to 800
Percentile norms adjusted annually, norms determined nationally
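A minimal sketch of the M = 500, SD = 100 scale, assuming a section score is a linear transformation of a z-score that is then held inside the 200 to 800 range. The function name `sat_scaled` is hypothetical, and actual SAT scaling and equating are more involved than this.

```python
def sat_scaled(z, mean=500, sd=100, floor=200, ceiling=800):
    """Convert a z-score to the SAT-style scale and keep it within the reported range."""
    raw = mean + sd * z
    return max(floor, min(ceiling, raw))


print(sat_scaled(0))     # 500   (exactly average)
print(sat_scaled(1.5))   # 650.0 (1.5 SD above the mean)
print(sat_scaled(-3.5))  # 200   (would be 150, held at the floor)
```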
Reliability of total scores, main tests, subscores and writing in SAT
Total= .95
Main (math and reading)= .85-.90
Subscores = .65- .85
Writing = .6
SAT Validity
actually uses predictive validity, compares to Freshman GPA
Multiple regression: predict FGPA from HSGPA alone, from SAT alone, then from both together
Weakness of SAT Validity
can only be compared to those who go to college
SAT Correlations to FGPA
FGPA & SAT= .5
FGPA & HSGPA= .5
HSGPA and SAT combined & FGPA= .6 (incremental validity)
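The jump from .5 to .6 can be illustrated with the standard two-predictor formula for the multiple correlation. The correlation between HSGPA and SAT (.39 below) is an assumed value chosen for illustration, since the cards do not report it, and `multiple_r` is a made-up name.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: each predictor's correlation with the criterion
    r_12: correlation between the two predictors
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


# SAT-FGPA = .5 and HSGPA-FGPA = .5 (from the cards); SAT-HSGPA = .39 (assumed).
R = multiple_r(0.5, 0.5, 0.39)
print(round(R, 2))  # 0.6
```

The more the two predictors overlap with each other, the less the second one adds: with the same inputs, a predictor intercorrelation near 1 would leave the combined correlation close to .5.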
GRE (Graduate Record Examinations)
as of August 2011 content
Verbal Reasoning, Quantitative Reasoning, Analytical Writing
GRE Reliability for Verbal, Quantitative, Analytical
Verbal= .93
Quantitative= .94
Analytical Writing = .79
Culture Fair Tests
Trying to create tests that are fair to people across cultures
Raven’s Progressive Matrices
Example of Culture Fair attempt, lots of non-verbal, measures nonverbal g well, 3 versions (colored, standard, advanced). Uses lots of pattern completion
Three Generalizations of Culture Fair tests
1) Mainly measure figural/spatial intelligence (not general)
2) less predictive for jobs and school than verbal
3) still present group differences
Six Generalizations about Group Mental Ability Tests
1) same content as Individual, 2) Reliability good for total, less for sub-scores 3) predictive validity between .30-.60, 4) sub-group validity generally poor 5) validity is lowered by range restriction and imperfect reliability in the criterion 6) Culture-Fair tests still don’t exist
6 reasons for Clinical Neuropsychological (CN) Assessment
1) Diagnosis, 2) Finding strengths and weaknesses 3) vocational and educational planning 4) treatment planning and evaluation 5) forensics 6) research
Fixed Battery Neuropsychological tests
standardized tests given to everyone with fixed cut offs
Example: Halstead-Reitan Neuropsychological Battery
Impairment Index
uses a cut off point to determine if there is or is not neuropsychological deficit
General Neuropsychological Deficit
reflects severity of neuropsychological deficit
-good test-retest reliability, discriminates those with brain damage from those without with 80% accuracy
Flexible Batteries of Neuropsychological Tests
Varies by reason for referral, clinical data, ability to cooperate, information obtained, tailored on individual basis
Mini-Mental State Examination (MMSE)
Structure and scores
most routinely administered
11 questions, 30 points
Scores 24-30 are in normal range (but may still have impairment)
MMSE Assesses What:
Orientation, Attention/Concentration, Language, Cognitive Flexibility, Constructional Ability and Immediate or brief delay recall
Premorbid Intelligence/Achievement
Intelligence before the onset of impairment, school records or Wechsler often used
Continuous Performance Tasks (CPT) (How it Works)
one of many ways to evaluate attention
- measures the ability to respond to sequentially presented target stimuli and not respond to non-target stimuli over a long period and in the face of boredom
The Continuous Performance Tasks Measures
Ability to maintain alertness/vigilance or sustained attention
Continuous Performance Tasks Brain Areas
reticular formation and the frontal lobes
Wechsler Adult Intelligence Scale (WAIS-IV) Working Memory Subset
Another way to evaluate attention
WAIS - IV Working Memory Subset item types
Arithmetic, Digit Span (forward and backward) and letter-number sequencing
Brain Areas the WAIS-IV Working Memory Subset tests
frontal lobes, particularly the dorsolateral prefrontal cortex
Trail Making Test (Part A Halstead-Reitan)
measures attention
draw a line connecting 25 numbered circles as quickly as possible without lifting the pencil
Most Frequent Linguistic Impairment
Naming ability, called dysnomia
Dysnomia Assessment
assessed by procedures that require the naming of line drawings on visual confrontation
Boston Naming Test (structure and brain area)
60 line drawings to be named, increasing difficulty
Brain Area: left temporal lobe
Controlled Oral Word Association Test (COWAT)
Looks at Verbal Fluency in two categories: 1) Letter (phonemic) and 2) Semantic (category)
Letter (Phonemic) part of COWAT Assesses what skills and what brain area
Controlled Oral Word Association Test
60 seconds
Words that begin with letters (no proper names or repeating)
Measures Frontal Lobe
Semantic (Category)
60 seconds
categories
brain area: temporal lobe
Block Design for finding Constructional Apraxia (test)
WAIS-IV Subtest which requires the person to reproduce a 2X2 or 3x3 design with red and white blocks
Constructional Apraxia
Inability to assemble or copy 2 or 3 dimensional objects (Visual Spatial)
Hooper Test of Constructional Apraxia
30 common objects that have been cut into pieces (visually) and examinees need to reassemble in their heads and name object
Spatial Neglect
Inattention to one side of space (usually the left side, following right parietal lobe damage)
Line Bisection Spatial Neglect Test
Examinee is asked to bisect lines on a page placed at midline
Clock Drawing Spatial Neglect Test
examinee first is asked to draw a clock to command (as in 10 min after 11) and then to copy a clock drawing with the hands already set
Memory & Neuropsychological Evaluation
Memory is the most frequent complaint by persons referred for neurological evaluation
Example of Verbal Memory Test (& brain area)
verbal = left hemisphere
California Verbal Learning Test (CVLT-II)
Rey Complex Figure Test (what it's for and brain area)
Measures Nonverbal memory- right hemisphere
RCFT
CVLT - II (Name, Structure, Function)
California Verbal Learning Test (verbal memory)
list, distractor list, immediate and long delay recall, yes/no long delay recognition trial and finally forced choice long-delay recognition trial
Dementia
progressive/incurable disorder marked by memory loss and disturbances of higher mental functions (5% of older adults)
Dementia Diagnosis
1) a noticeable decline from previous social or occupational functioning (not related to medical/psychiatric conditions) 2) significant impairment of memory function 3) at least one of the following: aphasia, apraxia, agnosia, executive dysfunction
Aphasia
impairment of language
Apraxia
impaired ability to carry out purposeful, learned movements (decline in motor skills)
Agnosia
inability to identify familiar object/faces
Executive Dysfunction
difficulty planning etc
Pseudodementia
related to psychiatric condition, and cognitive impairments similar to dementia
Characteristics of Pseudodementia
-depressed mood, no language impairment, better recognition than recall memory, gives up easily but will persist with encouragement, says "I don’t know" rather than giving a wrong answer, problems often improve with encouragement, retesting or antidepressants
3 Areas of Motor Functioning Tested
Grip Strength, Fine Motor Coordination, Motor Speed
Hand Dynamometer (main test and what tested)
Measures Grip Strength
Part of Halstead-Reitan
Grooved Pegboard Test
Fine motor coordination
Finger Tapping
Measures Motor Speed
part of Halstead-Reitan
WAIS-IV Psychomotor Tests
Motor Speed
Uses cancellation, coding and symbol search
Supervisory Cognitive Processes
involved in the organization and execution of complex thoughts and behaviors
Part of Executive Functions
Three Processes Underlying Executive Functions
1) working memory 2) inhibition and switching 3) sustained and selective attention
Executive Function Tasks Used to Measure
Tower Test, Trail Making Test (Part B), Stroop Interference Test, Wisconsin Card Sorting Test
Stroop Interference test
word reading/color naming/incongruent color naming
Stroop Effect
also called the interference effect: when the word "red" is printed in green ink, people tend to say the word instead of the ink color
Stroop Interference Test Activates Parts of Brain
dorsolateral prefrontal cortex (preparing to exert conscious control) and anterior cingulate
Anterior Cingulate
involved in consciously regulating conflicting cues and inhibiting responses that are incorrect
Wisconsin Card Sorting Test
Most frequently used executive function test
4 stimulus cards: 128 response cards
sorting into categories, examinee is not told what the criterion is, continues until 6 categories are completed (10 correct sorts each)
MMPI-2
Minnesota Multiphasic Personality Inventory
most frequently used objective inventory
used to assess psychological state
BDI-II
Beck Depression Inventory
most widely used self-report of depression
assesses psychological state
Malingering
faking deficits for secondary gain
MMPI-2 Fake Bad Scale
used for malingering
TOMM
Test of Memory Malingering
50-item recognition test (two learning trials and an optional retention trial)
Supplementary Information to Evaluate Neuropsychological Assessment
Medical History, Psychiatric History, Psychosocial History, School Records, Collateral Information (family, friends, caretakers) and behavior observations