Assessment and Testing Flashcards
Measurement
process of determining dimensions of an attribute or trait
assessment
processes and procedures for collecting info about human behavior
eg: tests, inventories, interview data, observation, rating scales
appraisal
implies going beyond measurement to making judgments about human attributes and behaviors
- used interchangeably with evaluation
Interpretation
making a statement about the meaning or usefulness of measurement data based on the counselor’s knowledge or judgment
measures of central tendency
distribution of scores can be measured using:
mean: symbolized by M or X (with horizontal line on top)
median: middle score
mode: most frequent score
these 3 fall in same place when distribution is symmetrical
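A quick Python illustration with a made-up score list (values are arbitrary, just to show the three measures):

```python
# Hypothetical score distribution for demonstrating central tendency.
from statistics import mean, median, mode

scores = [70, 75, 80, 80, 85, 90, 100]
print(mean(scores))    # ~82.86
print(median(scores))  # 80 (middle score)
print(mode(scores))    # 80 (most frequent score)
```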
skew
refers to the degree to which a distribution of scores is not normally distributed
mode=top curve
median=middle score
mean=pulled in the direction of the extreme scores (which is represented by the tail)
a negative skew is with the tail pointing to left; positive is pointing to right (think of how values increase/decrease on horizontal axis)
Standard Deviation
- describes the variability w/in a distribution of scores
- is essentially the square root of the average squared deviation from the mean
- an excellent measure of dispersion of scores
- Use ‘SD’ to signify standard deviation from a sample
- use sigma (think cursive ‘o’ without the first part) for population variability
variance is measured how?
the SD squared (SDˆ2)
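Python's statistics module distinguishes the population measures (sigma) from the sample measures (SD); a made-up score set illustrates both:

```python
from statistics import stdev, pstdev, variance, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data; population mean is 5
print(pstdev(scores))     # 2.0 -> population SD (sigma)
print(pvariance(scores))  # 4.0 -> variance = SD squared
print(stdev(scores))      # sample SD, which uses n-1 in the denominator
```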
normal bell curve
distributes scores into 6 equal parts–3 above the mean, 3 below–such that:
34% & 34%=68% of scores fall within 1 standard deviation of the mean
13.5% & 13.5% (cumulative: 95%) fall within 2 standard deviations
2% & 2% (cumulative: ~99%) fall within 3 standard deviations
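The 68/95/99 breakdown can be verified with Python's NormalDist (the exact values are ~68.3%, ~95.4%, and ~99.7%, which the card rounds):

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)

def pct_within(k):
    # Percent of scores within k standard deviations of the mean
    return round((nd.cdf(k) - nd.cdf(-k)) * 100, 1)

print(pct_within(1))  # 68.3
print(pct_within(2))  # 95.4
print(pct_within(3))  # 99.7
```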
standardized scores
are scores converted from the individual’s raw score that allow for comparison bw individuals and bw the same individual’s various scores (ie vocab and math)
they basically represent the person’s distance from the mean in terms of standard deviation
two most commonly used standardized scores:
z-score: the mean=0, standard deviation=1, range is -3 to +3 [the ‘z’=zero]
T score: mean=50, standard deviation=10. Transforming this score eliminates negative numbers (unlike Z score)
[the “T”=Ten]
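The two conversions as formulas, using an assumed example (raw score of 115 on a test with mean 100, SD 15):

```python
def z_score(raw, mean, sd):
    # z: distance from the mean in SD units (mean=0, SD=1)
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    # T = 50 + 10z: shifting and scaling z eliminates negative numbers
    return 50 + 10 * z_score(raw, mean, sd)

print(z_score(115, 100, 15))  # 1.0
print(t_score(115, 100, 15))  # 60.0
```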
stanine
from STAndard NINE
converts the distribution into 9 parts, with a mean of 5 and an SD of ~2
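One common convention (assuming you already have a z-score) doubles z, adds 5, and clamps the result to the 1–9 range:

```python
def stanine(z):
    # Stanines have a mean of 5 and an SD of ~2; values are clamped to 1-9
    return max(1, min(9, round(z * 2 + 5)))

print(stanine(0))    # 5 (average performance)
print(stanine(1.5))  # 8
print(stanine(-3))   # 1 (clamped at the floor)
```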
correlation coefficient
- measures reliability
- ranges from -1.00 to +1.00 (perfect correlation)
- shows the relationship bw two sets of numbers, but nothing about cause and effect
- if the reliability coefficient is high (>=.70), then it’s reliable
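Pearson's r can be computed directly; the data below are made up to show a perfect positive correlation:

```python
from math import sqrt

def pearson_r(x, y):
    # Correlation: covariance divided by the product of the two SDs
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

print(pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))  # 1.0 (perfect)
```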
bivariate vs multivariate
- correlation bw 2 variables=bivariate
- ” bw 3 or more variables=multivariate
reliability
- a necessary psychometric property of tests and measures
- consistency of a test or measure
- the extent to which a measure is free from error (if the instrument has little error, it’s reliable)
stability
- test-retest reliability using same instrument
- 2 weeks is sufficient bw test administrations
Equivalence
- alternate forms of the same test administered to same group
- the comparability of the forms, intervening events, and experiences will influence reliability
spearman-brown formula
may use this to see how reliable a split half test would be had you not split it in two
other name for the spearman-brown formula?
prophecy formula
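The formula itself (where r_half is the correlation bw the two halves): full-test reliability = 2r / (1 + r):

```python
def spearman_brown(r_half):
    # Projects split-half reliability up to the full-length test
    return 2 * r_half / (1 + r_half)

print(spearman_brown(0.6))  # 0.75 -> the full test is more reliable than either half
```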
internal consistency
- this is a split-half method where the test is divided into halves and the correlation bw these halves is calculated
- determined by measuring inter-item consistency. the more homogenous the items the more reliable the test
what are the different formulas used to determine internal consistency and when are they used?
Kuder-Richardson formula used if the test has dichotomous items (ie true/false, yes/no)
Cronbach alpha coefficient is applied for nondichotomous items (ie multiple choice, essay)
what is used to determine reliability?
correlation coefficient
- if the reliability coefficient is high (>=.70), then it’s reliable
Kuder-Richardson formula
Denoted as KR-20 or KR-21
Kuder-Richardson formula used to measure internal consistency; if the test has dichotomous items (ie true/false, yes/no)
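A minimal KR-20 sketch with made-up dichotomous (0/1) item data; KR-20 = (k/(k-1)) × (1 − Σpq / variance of total scores):

```python
from statistics import pvariance

def kr20(responses):
    """responses: one list of 0/1 item scores per examinee."""
    k = len(responses[0])  # number of items
    n = len(responses)     # number of examinees
    totals = [sum(person) for person in responses]
    # Sum of p*q across items, where p = proportion answering correctly
    pq = sum(
        (p := sum(person[i] for person in responses) / n) * (1 - p)
        for i in range(k)
    )
    return (k / (k - 1)) * (1 - pq / pvariance(totals))

data = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
print(round(kr20(data), 3))  # 0.667
```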
Cronbach alpha coefficient
Used to measure internal consistency
- is applied for nondichotomous items (ie multiple choice, essay)
True vs error variance
Coefficient of determination
Coefficient of non determination
If 2 tests given and the correlation bw them is .9 (for example), then the true variance measured in common is .9^2=81%.
coefficient of determination=degree of common variance (81%)
coefficient of nondetermination=the unique variance, not common (19%=error variance)
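The arithmetic from the card, spelled out:

```python
r = 0.9                        # correlation between the two tests
determination = r ** 2         # common (true) variance
nondetermination = 1 - r ** 2  # unique variance (error)
print(round(determination, 2))     # 0.81 -> 81% shared
print(round(nondetermination, 2))  # 0.19 -> 19% not shared
```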
Standard error of measurement (SEM)
Another measure of reliability helpful in interpreting test scores
- helps determine the range in which a person’s score probably falls
- aka “Confidence Band” or “confidence limits”
Ex:
A person scores a 92 on a test, and the SEM is 5. On a normal curve, 1 SEM above will be 97 and 1 below will be 87, which is where his score will fall 68% of the time. 95% of the time his score will fall between 82 and 102 (2 SEMs away from his obtained score of 92).
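The confidence bands from the example, computed (score of 92 and SEM of 5, as given above):

```python
score, sem = 92, 5
band_68 = (score - sem, score + sem)          # ~68% confidence band (±1 SEM)
band_95 = (score - 2 * sem, score + 2 * sem)  # ~95% confidence band (±2 SEM)
print(band_68)  # (87, 97)
print(band_95)  # (82, 102)
```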
Validity
Degree to which a test measures what it’s supposed to measure
Face validity
The instrument looks valid (i.e., a math test has math items)
Content validity
The test contains items drawn from the domain of items which could be included
Ex: two professors of Psychology 101 devise an exam that covers the content that they both teach
Predictive validity
Predictions made by the test are confirmed by later behavior
Ex: the scores of the GRE predict later grade point average
Concurrent validity
The results of the test are compared with other tests, results or behaviors at/about the same time
Ex: scores of an art aptitude test may be compared to grades already assigned to students in an art class
Construct validity
Refers to the extent that a test measures a hypothetical construct such as anxiety, creativity, etc.
Convergent validation
Occurs when there’s high correlation between the construct under investigation and others
Discriminant validation
Occurs when there is no significant correlation between the construct under investigation and others
A test may be reliable but not valid, but valid tests are reliable. True or false?
T
Another name for True variance
Coefficient of determination
Another name for error variance
Coefficient of non determination
Power based tests
No time limits or very generous ones (ie NCE)
Speed based test
Timed, emphasis on speed and accuracy (ie intelligence, ability, aptitude)
Norm referenced assessment
Comparing individuals to others
Criterion referenced assessment
Comparing an individuals performance to some predetermined criteria such as NCE cutoff score
Ipsatively interpreted assessment
Comparing results on test within the individual. May also compare an individual’s score on one test with another.
A maximal performance test may generate a person’s ______ on __________
Best performance; aptitude or achievement test
A typical performance may occur on what types of test?
An interest or personality test
what is meant by regression toward the mean and what is another name for it?
aka “statistical regression”
if a person scores very high (>=85%) or very low (<=15%) on a pretest then they will probably score closer to the mean on the post test.
Why? Because of the error resulting from chance, personal, and environmental factors.
defn of intelligence
ability to think in abstract terms; to learn.
also called general or cognitive ability
Intelligence tests
Stanford-Binet Intelligence scales
Wechsler adult intelligence scale (WAIS-IV)
Wechsler intelligence scale for children (WISC-V)
Cognitive abilities test
specialized (intelligence) ability tests
Kaufman assessment battery for children - II
System of multicultural pluralistic assessment (SOMPA). Measures medical, social systems and pluralistic factors
SMAG:
SAT (scholastic aptitude test)
Miller Analogies Test (MAT)
ACT (american college test)
Graduate record exam (GRE)
what do achievement tests measure and example of them
measures what a person has already learned/experienced
- used diagnostically (K-12 achievement tests)
- National assessment of educational progress (NAEP) is a national measure of academic performance
[there’s the national level and the state “levels” below]
- California Achievement tests
- Iowa Test of basic skills
- Stanford achievement test
specialized achievement tests
- general education development (GED)
- college board’s advanced placement program
- college-level examination program (CLEP)
what do aptitude tests measure?
also called ability tests, aptitude measures one’s potential to learn; used to predict future performance
examples of aptitude tests
-Differential aptitude test (DAT)
- O*NET ability profiler (formerly General Aptitude Battery Test, GATB)
- ASVAB
- Career ability placement survey (CAPS)
Projective tests. what do they do and examples
present an unstructured task and the person projects processes, needs, anxieties…
ex:
Rorschach
TAT (thematic apperception test)
Rotter incomplete sentences blank
Draw a person test
types of personality inventories
- minnesota multiphasic personality inventory
- California psychological inventory (CPI)
- NEO Personality inventory
- Beck Depression Inventory
- MBTI
examples of Interest tests
- Strong interest inventory
- self-directed search
- career assessment inventory
- campbell interest and skill survey
- ONET interest profiler
Intrusive vs unobtrusive measurement
Intrusive: reactive measurement where the person being measured knows they’re being watched and this knowledge affects their performance.
- Ex: questionnaires, surveys, observation
Unobtrusive: nonreactive where data is collected without the person’s awareness or without changing the natural course of events.
- Ex: reviewing existing records or unobtrusive observation
Semantic differential
refers to a scale that asks respondents to report where they fall on a range bw two affective polar opposites.
- ex: Very bad ____ _____ _____ Very good
- adjective pairs usually have an evaluative, potency, and activity underlying structure that serves as a secondary analysis
Observation as appraisal technique
- Observing samples from a stream of behavior
- may use schedules, coding systems, or record forms
Case/historical study
Analytical and/or diagnostic investigation of a person or group
Rating scales
Used to report the degree to which an attribute or characteristic is present
Sociometry
- Used to identify isolates, rejectees , or stars (popular ppl)
- Requires revealing personal feelings about each other
Social desirability
Tendency for test takers to respond in ways that are perceived to be socially desirable
Grade and age equivalent scores
Scores on an achievement test are often reported as grade equivalent scores. I.e., if a student earns the score on a test that the average sixth grader earns, then he has a grade equivalent score of six.
Age equivalent scores work similarly. For age, an individual’s score is compared to the average score of others at a given age. So if a 7.5-year-old student earned a score equivalent to the average eight-year-old’s, then 8 would be his age equivalent score.
Percentile ranks
Indicate the percent of people who scored at or below a given score. So if I score at the 35th percentile, then I scored as high as or higher than 35% of the people and 65% scored higher than me.
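Percentile rank as a computation (made-up score list; this version counts scores strictly below the given score):

```python
def percentile_rank(score, all_scores):
    # Percent of scores falling below the given score
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
print(percentile_rank(75, scores))  # 40.0 -> higher than 40% of the group
```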
Assessment resources
Mental Measurements Yearbook
- from Buros Institute
- has critical reviews of tests and lists published references for each test
- 20th edition published in 2017
Tests in Print IX (2016)
- Has information on approximately 3000 testing instruments
A comprehensive guide to career assessment
- Published by national career development Association
- edited by Kevin Stoltz and Susan Barclay in 2019
Association for Assessment and Research in Counseling (AARC)
One of 18 divisions of the ACA
The ________ index indicates the percentage of individuals who answered each item correctly
difficulty. A 0.5 difficulty index (also called a difficulty value) would mean that 50% of those tested answered the question correctly, while 50% did not.
For example, you might set a cutoff at a difficulty index of .25 to screen out the lower 75% you do not wish to admit into a program.
Item difficulty ranges from 0.0 to 1.0. The higher the index number, the easier the question is to answer.
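The difficulty index is simply the proportion of examinees answering an item correctly (1 = correct, 0 = incorrect; responses below are made up):

```python
def difficulty_index(item_responses):
    # Proportion answering correctly; a higher value means an easier item
    return sum(item_responses) / len(item_responses)

print(difficulty_index([1, 0, 1, 1]))     # 0.75 -> fairly easy item
print(difficulty_index([0, 0, 1, 0, 0]))  # 0.2  -> hard item
```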
vertical vs horizontal testing
a vertical test would have versions for various age brackets or levels of education (e.g., a math achievement test for preschoolers and a version for middle school children).
A horizontal test measures various factors (e.g., math and science) during the same testing procedure.
what is a test battery?
In a test battery, several measures are used to produce results that could be more accurate than those derived from merely using a single source. (ie horizontal test)
What does Inter-rater testing assess? What are other names for it?
When is it used?
Assesses reliability in qualitative research
Other Names: Inter observer, scorer reliability
Used with subjective tests to determine whether the scoring criteria are such that two people who graded or assessed the responses will produce roughly the same score
- reliability is calculated by correlating the scores assigned by several raters
What is the acceptable reliability coefficient for job selection?
> =.8
Francis Galton
Felt intelligence was a single or unitary factor;
Said intelligence was normally distributed like height or weight, and it was primarily genetic
Differences between fluid intelligence and crystallized intelligence
Fluid intelligence is flexible (used to reason through novel problems); crystallized intelligence is rigid, accumulated knowledge that does not readily change
Charles Spearman
Felt intelligence was best explained via two factor theory—a general ability G and a specific ability S
JP Guilford
Isolated 120 factors that added up to intelligence;
known for his thoughts on convergent and divergent thinking
The Stanford-Binet IQ test is standardized, T/F?
T
Simon and Binet pioneered the first IQ test to_____
identify children with an intellectual disability so they could be taught separately
What happens to a test’s reliability if you increase or decrease its length.
Increasing a test’s length raises reliability, shortening a test’s length decreases reliability