Exam One Flashcards
Psychological testing
Refers to all possible uses, applications, and underlying concepts of psychological and educational tests.
Psychologists’ responsibility around test administration
Duty to select fair (representative), appropriate, updated, reliable, and valid tests, as scores drive decision-making
Types of psychological tests
- Achievement- refers to previous learning (course material)
- Aptitude- refers to the potential for learning or acquiring a specific skill (SAT)
- Intelligence- general potential to solve problems, adapt, think abstractly, and learn from experience
Types of personality tests
Structured/Objective- multiple choice, true/false,
or Likert scale format, usually self-report
Projective- test materials or required response
(or both) are ambiguous (Rorschach)
How to evaluate utility of tests
Aspects of psychometric soundness
- reliability (consistency)
- validity (accuracy)
Test construction
- item creation and/or selection
- logical vs. theoretical vs. empirical considerations
Test administration
-variation in scores due to administrator, examinee,
and/or random error
Early antecedents for tests
Han Dynasty - Test batteries used for work-related evals
Ming Dynasty- testing rounds in testing centers used to nominate public officials
British missionaries- civil service test system
US- American Civil Service Commission
Darwin/Galton
Darwin
-On the Origin of Species: Evolution acts upon individual differences (survival and reproduction of the fittest)
Galton
-Documented individual differences in cognitive and physical abilities
-Founder of eugenics (selective reproduction of individuals with “desirable” traits)
Cattell
-Individual differences in cognitive and physical abilities
-Coined the term “mental tests”
Experimental psychologists
Donders
- reaction time tests
- cognitive psych experiment
Wundt
- First psych lab
- Sensation and perception
This era drove the scientific approach to psych testing (which requires rigorous experimental control)
Intelligence tests
Binet-Simon scale- first intelligence test, first use of standardized sample
Stanford-Binet scale- US revision; standardization sample of 1,000; revised and newly added items
Group tests- developed in response to WWI by Yerkes; Army Alpha and Army Beta (1917)
Wechsler Intelligence Tests - included nonverbal subscale of intelligence (“performance”)
standardized sample
norm-based sample = comparing score to other people
representative sample
comprises individuals similar to those for whom the test is to be used
Mental age
a child’s test performance expressed as the chronological age at which that performance is average (e.g., a 6-year-old who performs like the average 8-year-old has a mental age of 8)
Personality tests
- measures traits
- Woodworth Personal Data Sheet- screened military recruits for likelihood of “shell shock”
- Rorschach
- Thematic Apperception Test
Modern personality tests
Objective tests- no assumptions made about the meaning of a test response
MMPI, CPI, 16PF- based on factor analysis (finds the minimum number of dimensions needed to account for a large number of variables)
Descriptive statistics
Statistics describing the sample or population.
measures of central tendency and variance
-can be used with ANY type of data
-including experimental or non-experimental data
Inferential statistics
Statistical procedures that allow inferences to be made from the sample to the population.
-can support causal inference, but largely only with experimental data
-type of data dictates type of analysis used
-must be careful of data distribution
(parametric vs. nonparametric)
Nominal data
Categorical data; no mathematical meaning
(dichotomous if two categories)
gender, political party, religion, species, team
Ordinal data
Indicates order- cannot know how far apart each item is (no equal intervals)
first to last; most to least
-basketball standings, sibling-line position, IQ scores
Interval data
Data with equal intervals between values but no true zero.
- temperature in degrees, SAT scores
- most psychological measures; Likert scale
Ratio data
Interval data with true zero.
most physical measures- height, weight,
speed, distance, volume, area
Normal distribution
Bell shaped, symmetry around central tendencies
- most stat procedures in PSYC assume normally
distributed scores
- parametric stats are based on symmetrical
(normal) distributions
Characteristics of parametric distributions
-approximate symmetry
-the distribution can be divided into standard deviation units
-the size of the deviation can be mathematically defined on any measure that is interval or ratio in nature
Skew
The degree of departure from symmetry
Positively skewed- most scores fall on the low (left) side; the tail extends right.
Negatively skewed- most scores fall on the high (right) side; the tail extends left
Bimodal- 2 areas of the curve at equal frequencies with a dip in between
Variance
The spread of differences among people in a distribution on measure X
- arises from natural, random differences among Ss
- environmental variations
- measurement error
- researcher error (overt, covert)
Percentile ranks and how to calculate
Percentile rank- the percentage of scores that fall below a particular score within a distribution
Calculate:
- divide number of cases below the score of interest by total number of cases in the group
- multiply results by 100
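A minimal Python sketch of these two steps (the score distribution is hypothetical):

```python
def percentile_rank(scores, score_of_interest):
    """Percentage of scores in the distribution falling below a given score."""
    below = sum(1 for s in scores if s < score_of_interest)
    return (below / len(scores)) * 100

# Hypothetical distribution of 10 exam scores
scores = [4, 7, 8, 10, 11, 12, 14, 15, 18, 21]
print(percentile_rank(scores, 14))  # 6 of 10 scores fall below 14 -> 60.0
```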
Standard scores
z-scores: raw scores converted to a scale with a fixed mean and standard deviation
-score measured in SD units (the deviation of a score from the mean in SD units)
Calculating a z-score
- Find difference between observed score and mean for the distribution
- Divide difference by SD of distribution
Mean exam score= 11.05 (SD is 7.01)
For a score of 14, z score is .42
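A minimal sketch of this calculation, using the exam numbers from the card above:

```python
def z_score(x, mean, sd):
    """Deviation of a score from the mean, expressed in SD units."""
    return (x - mean) / sd

# Exam example above: mean = 11.05, SD = 7.01
print(round(z_score(14, 11.05, 7.01), 2))  # (14 - 11.05) / 7.01 = 0.42
```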
Norms
Allow for evaluation of one’s performance relative to a larger group
Norm-referenced tests
-each test taker’s performance evaluated against standardized sample
-typically used for the purpose of making comparisons
with a larger group
-norms should be current, relevant, and representative
of the group to which the individual is being compared
Criterion-referenced tests
-represent predetermined level of performance to be reached (“benchmarks”)
-scores are compared to a preset “criterion score” (not
compared to others)
-No Child Left Behind
Correlations v. regression
Correlation assesses the magnitude and direction of a relationship. Regression is used to make predictions about scores on one variable from knowledge of scores on another variable. These predictions are obtained from the regression line (line of best fit).
Correlation coefficient (r)
-strength of association between variables
-Ranges between -1.0 and +1.0
-Calculating correlation between 2 variables for entire group, not 1 individual
-Reflects the amount of variability that is shared between 2 variables
+/- .10: weak, +/- .30: moderate, +/- .50: strong
p-value
Indicates whether the association is greater than what would be expected by chance.
shared variance (r2)
Common variance, effect size, coefficient of determination
Correlation does not equal causation
- Mediating variables may explain the relationship
- Relationships can be bidirectional (thus both would be causal)
- Causality can be inferred only under experimental manipulations
Experimental conditions
Experiments:
- random assignment of participants
- manipulation of at least one independent variable
Coefficient of determination
Correlation coefficient squared and then converted into a percentage; indicates effect size
Coefficient of alienation
A measure of nonassociation between two variables: subtract r² from 1 (where r² is the coefficient of determination)
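A sketch tying r, the p-value, r², and 1 - r² together (the paired scores are made up for illustration):

```python
from scipy.stats import pearsonr

# Hypothetical paired scores (e.g., SAT and freshman GPA for 6 students)
sat = [1050, 1200, 1380, 990, 1130, 1460]
gpa = [2.8, 3.1, 3.6, 2.5, 3.0, 3.8]

r, p = pearsonr(sat, gpa)   # correlation coefficient and its p-value
determination = r ** 2      # shared variance / effect size
alienation = 1 - r ** 2     # nonassociation, per the definition above

print(f"r = {r:.2f}, p = {p:.4f}, r^2 = {determination:.2f}, 1 - r^2 = {alienation:.2f}")
```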
Statistical significance
Conventionally, p < .05
Reliability
refers to the accuracy, dependability, consistency, or repeatability of test results
Classical test theory
-Assumes each person has a true score (T) that
would be obtained with no errors in measurement
-Because measurement instruments are imperfect, the observed score (X) for each person almost always differs from the person’s true ability
-Difference between observed and true score = measurement error (E)
-T (true score) = X (observed score) - E (measurement error)
-Major assumptions- errors occur randomly and are normally distributed
- error cannot be eliminated
-some error is systematic
Standard error of measurement
Provides estimate of how much individual’s score would be expected to change on re-testing with same/equivalent form of test
- If a person took the test an infinite number of times, the average of those scores would estimate true ability/knowledge (T, the true score); the standard deviation of all those scores = SEM
- creates a confidence band within which a person’s true score would be expected to fall
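In practice, SEM is usually estimated from the test's SD and its reliability coefficient as SD x sqrt(1 - r). A sketch with hypothetical numbers:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement estimated from test SD and reliability."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability = .90, observed score = 100
s = sem(15, 0.90)
low, high = 100 - 1.96 * s, 100 + 1.96 * s  # 95% confidence band
print(f"SEM = {s:.2f}; true score likely falls between {low:.1f} and {high:.1f}")
```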
Domain sampling method
Instead of testing your ability to spell every possible word, we select a random sample of words.
T- % correct in spelling all words in English language
X- % correct in spelling all words in sample
- As sample gets larger (the closer T and X are), reliability increases and error decreases
- Because we do not know T:
- Calculate the correlation between all sampling times (Xs)
- Correlations are then averaged to predict T
Item response theory
A newer method, generally preferred over CTT; IRT uses an adaptive method to assess ability
-Test increases in difficulty if the previous Q is answered correctly
-Test decreases in difficulty if the previous Q is answered incorrectly
-Level of “ideal” ability is heavily sampled
Overall result is a more reliable estimate of ability
Measurement error affecting reliability
- questionable measurement precision
- item sampling
- construction of test items
- factors related to test environment
- varying judgments or beliefs of raters/observers
- scoring of the test (objectivity of evaluator)
- difficulty of the test
- factors related to test-taker
Measures to assess reliability
- test-retest
- parallel forms (ideal but rarely used)
- internal consistency reliability (single test, most frequently used)
Test-retest
The same test is administered to the same person at different points in time.
-also called time sampling method
-only useful when assessing stable traits
Reduce carryover or practice effects
-The interval between measurement must be considered:
-shorter intervals -> higher carryover
-Be careful of developmental milestones
Parallel forms
Compares scores on two different measures of the same quality
-also called equivalent or alternate forms method
A rigorous assessment of reliability
- carryover effects are eliminated
- greater sampling of domain
- Generally underutilized
- difficult to get people “back in the door”
Internal consistency
Extent to which different items on a test measure the same attribute or trait (e.g., scores from two halves of the test are correlated with each other)
Methods to assess internal consistency
- split-half
- KR20 (Kuder & Richardson)
- Cronbach alpha (coefficient alpha)
Split-half reliability
One test is split into two equal halves
Each half is compared to the other
-can be split randomly, first/second halves, or odd/even
The Spearman-Brown formula is used to correct for the halved test length, which raises the reliability estimate
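The formula itself is 2r / (1 + r), where r is the correlation between the two halves; a minimal sketch:

```python
def spearman_brown(r_half):
    """Estimate full-test reliability from the correlation between two halves."""
    return (2 * r_half) / (1 + r_half)

print(round(spearman_brown(0.70), 2))  # half-test r of .70 -> full-test estimate .82
```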
KR20 reliability
-Simultaneously considers all possible ways of splitting the test (avoids the arbitrariness of split-half methods)
-Only appropriate for tests in which items are
dichotomous (0 - incorrect/ 1- correct)
-finds the proportion of people who got each item right v. wrong
Coefficient alpha
Cronbach alpha: considered to be the most general and rigorous formula for determining reliability estimate through internal consistency
-can be used on Likert scales, when items can’t be classified as “right” or “wrong”
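A sketch of coefficient alpha on an examinee-by-item score matrix (the data are made up); with 0/1 items this same calculation reduces to KR20:

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha: (k / (k - 1)) * (1 - sum(item variances) / total variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5 examinees x 4 dichotomous items (1 = correct, 0 = incorrect)
data = [[1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 1, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
print(round(cronbach_alpha(data), 2))  # ~0.79 for this toy matrix
```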
Inter rater reliability
Measure of reliability in behavioral observation studies
- code a behavior from observational or behavioral study- compare degree of overlap among different observers
- Start with an ethogram- operational definitions of variables
Kappa statistic
indicates actual agreement as corrected by level of chance agreement among different raters
1.0= perfect agreement between observers
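The usual (Cohen's) formula is (observed agreement - chance agreement) / (1 - chance agreement); a sketch with hypothetical proportions:

```python
def kappa(p_observed, p_chance):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: raters agree on 85% of codes; chance agreement is 50%
print(kappa(0.85, 0.50))  # 0.7 -> substantial agreement
```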
Reliability coefficients
Range from 0.0-1.0
1.0= perfect reliability
-if the coefficient is .90, then 10% of the variation in scores is attributable to measurement error
.90 and above= test highly reliable
.70 - .89 = moderate
Validity
Extent to which a test measures the quality it purports to measure
-test is accurately reflecting whatever construct, trait, or characteristic that it claims to measure
Evidence for validity comes from showing the association between the test and other variables.
Face validity
- Based on logical vs. statistical analysis
- The appearance that a test measures what it purports to at a surface level
Content validity
Evidence that the content of a test adequately represents the conceptual domain it is designed to cover
- test items are a fair sample of the total potential content and relevant to construct being tested
- based on logical analysis v. statistical analysis
- Construct underrepresentation: failure to capture important components of a construct
- Construct irrelevant variance- scores are influenced by factors irrelevant to the construct
Can a test be content valid without being face valid?
- depression measures
- child abuse queries
Criterion validity
Extent to which a test corresponds with a particular criterion (standard against which test is compared)
- typically used when objective is to predict future performance on an unknown criterion
examples:
- premarital test -> marriage success
- SAT -> college freshman GPA
Sub classes of criterion validity
Predictive- test or measure predicts future performance/success in relation to a particular criterion; correlation (r) describes the extent to which one variable is predictive
SAT -> success in college
Concurrent- criterion measure is taken at the same time as the test
-correlation (r) describes the extent to which one variable correlates with another at the same time
work samples -> job performance
Validity coefficient
Relationship between a test and a criterion-
usually Pearson r
-tells the extent to which the test is valid for
making statements about the criterion
Less consensus regarding size of VCs
- coefficients of .60 or higher are rare
- .30 - .40 considered to be acceptable
- even tests with lower validity coefficients can yield useful information
- the correlation between cholesterol and heart disease is quite low, but it has important predictive consequences for reducing mortality rates
Construct validity
Process used to establish meaning of a test through a series of studies
- simultaneously define a construct and develop tests to measure it
- look for correlation between the test and other measures
Convergent evidence for construct validity
evidence that a test measures the same attribute as do other measures that purport
to measure the same construct
- tests should correlate well (highly) if believed to measure same construct
-what measures should a new depression measure/Health Index/reading ability test correlate with?
Discriminant evidence for construct validity
evidence that a test measures something different from what other available tests measure
-test would not correlate with unrelated tests
Incremental validity
Measure of unique information gained through
using a test
-how much does information from test add to what is already known?
-how well does it improve the accuracy of decisions?
-based on logical analysis vs. statistical analysis
Test item writing
- define clearly what you want to measure (operational definition)
- Generate an item pool (more items than you will end up including)
- Avoid long/difficult Qs
- Avoid items that convey 2 or more ideas
- Consider making positively and negatively worded items
- Be mindful of diversity
Item format- dichotomous
- 2 alternatives for each item
- overall less reliable and therefore less precise
Item format- Likert
- rating scale with a continuum of alternatives to indicate agreement
- may or may not contain a neutral point
- is open to factor analysis
Item format- polytomous
- multiple alternatives for each item
- probability of selecting correct answer by chance is lower
- diminishing returns
Item format- category
- rating system typically using more alternatives (1-10)
- heavily context dependent (reduces validity)
- diminishing returns
Item analysis
General term for a set of methods used to evaluate test items
Item difficulty
- asks what percent got item right
- usually want ID to fall between chance level and 100% (usually .30-.70)
- if 84% get #1 correct, ID is .84.
- the higher the number, the easier the item
Calculating item difficulty
Calculating optimal item difficulty:
- Subtract chance from 100% success (1.0)
- Divide by 2
- Add this value to chance
If chance is .25 (4 alternatives):
- (1.0-.25) / 2 = .75/2 =.375
- .375 + .25 = .625
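The same three steps as a function (chance level is the only input):

```python
def optimal_item_difficulty(chance):
    """Midpoint between chance-level success and 1.0, per the steps above."""
    return chance + (1.0 - chance) / 2

print(optimal_item_difficulty(0.25))  # 4 alternatives -> 0.625
```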
Item discriminability
Determines whether people who have done well on a particular item have also done well on the entire test
The Extreme Group method
- type of item analysis
- compares those who did well to those who did poorly
- calculation of a discrimination index - find the difference in the proportion of people in each group who got each item correct
- Higher Discrimination Index= Higher Discriminability
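A minimal sketch of the discrimination index (the proportions are hypothetical):

```python
def discrimination_index(p_upper, p_lower):
    """Difference in proportion correct between high and low overall scorers."""
    return p_upper - p_lower

# Hypothetical: 90% of top scorers vs. 40% of bottom scorers got the item right
print(discrimination_index(0.90, 0.40))  # 0.5 -> item discriminates well
```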
Point Biserial method
- type of item analysis
- Correlation between a dichotomous and a continuous variable (individual item versus overall test score)
- Is less useful on tests with fewer items
- point biserial correlations closer to 1.0 indicate better questions
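A sketch using SciPy's point-biserial function (the item and total scores are made up):

```python
from scipy.stats import pointbiserialr

# Hypothetical: 0/1 scores on one item vs. total test scores for 8 examinees
item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [38, 22, 35, 40, 25, 33, 28, 36]

r_pb, p = pointbiserialr(item, total)
print(f"point-biserial r = {r_pb:.2f}")  # closer to 1.0 -> better item
```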
Limitations of item analysis
- item analysis can tell us about the quality of a test, but it doesn’t help students learn
- Purposes of tests are varied, and may emphasize ranking students over identifying weaknesses or gaps in knowledge
- If teachers feel they need to “teach to the test” the outcomes of a test may be misleading and indicate more mastery than actually exists
Relationship between examiner and test taker
- role of feedback (type of feedback given to test taker)
- role of race and gender of tester on test taker
- role of language of the test taker (tests are highly linguistic)
2 types of stereotype threat
- anxiety over how one will be evaluated and how well s/he will perform
- for members of a stereotyped group, pressure to disconfirm negative stereotypes
Stereotype threat hypotheses
- STT depletes working memory
- STT leads to reduced effort and, in turn, reduced performance
- STT causes physiological arousal that can disrupt performance
Response acquiescence
tendency to agree with items or give the response perceived to be expected
Expectancy effects
(Rosenthal effects):
- can influence what interviewer expects out of interviewee
- told a child is “smart” or “bad” ahead of time
- giving examinee “benefit of the doubt” because he/she is pleasant
Subject effects
(Hawthorne effects):
- can influence what subject expects out of test/interview
- may act in accordance with those expectations
Empirical findings about the manner in which tests are administered
- the less personalized the modality, the more likely information is to be disclosed
- will disclose even more when confidentiality of responses is ensured
Advantages of computerized administration
- responses automatically recorded (reduces error)
- standardization ensured
- precisely timed responses
- examiner bias controlled
Structured interview
- specific set of Qs
- standardized- Qs are printed and administered using exact phrasing
Unstructured interview
- use transitional phrases or playback/restatement/summarizing/clarifying/understanding statements
- goal is to lead to elaboration by the interviewee with minimum effort by the interviewer to maintain the flow
Clinical interview v. assessment interview
Clinical interview- used when you will likely continue seeing the client in therapy. Assessment interview- conducted to gather information to answer a referral question (e.g., “does this child have ASD?”)
-In an assessment interview, you are more likely to use standardized tests (intelligence, personality, paper-pencil) and to talk to multiple sources
Interview validity
- Seek convergent or even divergent validity
- Correlate interview data with other measures (GPA, job performance, etc.)
- Usually moderate validity coefficients (.40)
Errors that bias interview validity
- early impressions “stick” even if evidence to the contrary emerges
- One prominent characteristic of interviewee biases interviewer’s judgments
- misunderstanding of cultural differences
Interview reliability
-interviewer reliability coefficients are quite variable
-Unstructured interviews have the lowest reliability, though they may lead to fairer outcomes than other assessment tools
-Interviews vary in their standardization-they can focus on different areas of importance
-Structured interviews provide higher reliability estimates
-but don’t provide as much or as varied information as unstructured or semi-structured interviews