Week 3, Measurement, Key Terms Flashcards
Measurement
The assignment of numbers to objects or events according to a set of rules
Indicators
In psychological measurement we do not measure constructs directly (try putting a finger on IQ…).
Instead we measure the characteristics or properties associated with individuals.
We measure indicators (signs that point to something else).
Why not measure organizational constructs directly?
We lose specificity as we move from the micro to the macro level – it is easier to do direct measurement at the individual level than at the organizational level
Scales of Measurement
Psychological measurement varies in precision.
Differences in precision are reflected in the types of scales on which particular characteristics are being measured.
Four levels of measurement
Nominal
Ordinal
Interval
Ratio
Nominal measurement
Lowest level of measurement
Represent differences in kind
Individuals are assigned or classified into qualitatively different categories
Merely labels
Frequently used to identify or catalog individuals and events
Ex.
Social Security numbers (SS#)
Assign 1 to males and 2 to females
The classes must be mutually exclusive
Ordinal Measurement
Not only allows classification by category, but also provides an indication of magnitude
Rank ordered according to greater or lesser amounts of some dimension
If (a>b) and (b>c) then (a>c)
In top-down selection this may be all the information we need to know
Interval Measurement
Equal intervals between scale points add other useful properties
Scores can be transformed in any linear fashion without altering the relationships between the scores
Allows two scores from different tests to be compared directly on a common metric
Standardization
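A minimal sketch of what standardization does, using made-up scores – a linear transformation to z-scores puts two tests on a common metric without changing the relationships among the scores:

```python
import statistics

def to_z_scores(raw):
    """Linear transformation of raw scores to z-scores (mean 0, SD 1)."""
    mean, sd = statistics.mean(raw), statistics.stdev(raw)
    return [(x - mean) / sd for x in raw]

# Hypothetical scores from two tests with very different metrics
quiz_scores = [10, 12, 14, 16, 18]       # short quiz scored out of 20
exam_scores = [300, 380, 460, 540, 620]  # long exam scored out of 800

# After standardization the scores sit on a common metric; the rank order
# and the correlations with other variables are unchanged.
print(to_z_scores(quiz_scores))
print(to_z_scores(exam_scores))
```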
Ratio Measurement
Highest level of measurement
In addition to equality, transitivity, additivity, the ratio scale has a natural or absolute 0 point.
Height, distance, & weight are all ratio scales
Don’t see these scales much in psych measurement
Psychological Measurement
Principally concerned with individual differences in traits, attitudes, or behaviors.
Trait – a descriptive label applied to a group of interrelated behaviors
Based on standardized samples of individual behavior we infer the position or standing of the individual on the trait in question
Systematic Nature of Measurement
TEST - a systematic procedure for measuring a sample of behavior.
Procedures are systematic in order to minimize the effects of unwanted contaminants (error or bias)
What is the difference between a personality “test” and a test of cognitive ability?
Found in:
Mental Measurements Yearbook
&
Publishers
&
3rd Party (e.g. Rocket-Hire)
&
Authors* (Taking the Measure of Work)
Classifying tests
Content
Tests may be classified in terms of the task inherent in the scale
Ex. cognitive ability tests
Achievement
Aptitude
vs.
Non-cognitive instruments (or inventories)
Tests may also be classified in terms of the efficiency with which they can be administered.
E.g.
Individual vs. Group
Speed vs. Power – designed to prevent perfect scores (we always want variability on measurement tools)
Speed test – more items than you can answer in the allotted time
Power test – you can take as long as you want to answer the items, and the score is the number of correct answers. The downside: the time people take becomes its own source of variance – someone could spend 24 hours on the test because they want to do their best, which adds too much unwanted variance to the scores
Likert Scales
When I am stressed, sometimes I get high.
A. strongly disagree
B. disagree
C. agree
D. strongly agree
Self-report measure
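A minimal sketch of how a Likert-type self-report item might be scored (the anchor-to-number mapping and the reverse-keying below are illustrative conventions, not part of the lecture):

```python
# Map the four anchors to numbers; 4-point scale, direction is an arbitrary convention
ANCHOR_VALUES = {"strongly disagree": 1, "disagree": 2, "agree": 3, "strongly agree": 4}

def score_item(response, reverse_keyed=False):
    """Turn one self-report response into a number, flipping reverse-keyed items."""
    value = ANCHOR_VALUES[response.lower()]
    return (len(ANCHOR_VALUES) + 1 - value) if reverse_keyed else value

def scale_score(responses, reverse_keyed_items=frozenset()):
    """Sum the item scores into a single scale score."""
    return sum(score_item(resp, i in reverse_keyed_items) for i, resp in enumerate(responses))

# A hypothetical respondent answering three items on the same scale;
# the second item (index 1) is worded in the opposite direction, so it is reverse-keyed.
print(scale_score(["agree", "strongly disagree", "disagree"], reverse_keyed_items={1}))
```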
Behavioral Observation
The other end of the continuum
Past behavior is the best predictor of future behavior…
Issue of Obtrusiveness:
-Observer effect (often loosely likened to the Heisenberg uncertainty principle)
–When people see that you’re paying attention to them, their behavior will change
-Hawthorne effect
–Turn the heat up – performance went up; turn the lights up – performance went up; turn the heat down – performance went up. WHY? Because people knew their performance was being observed
Can be cumbersome with a large N (sample size)
To capture behavior you must be there when it occurs
Naturalistic observation
Situational Judgment Test
The purpose is to identify a respondent’s intentions
Presents the person with a series of relevant incidents, and asks what he/she would do in that situation
The typical question is “What would you do if…?”
Often used to assess intelligence in a more “real world” fashion
Can assess a variety of constructs
Theory Based
Goal setting theory
Intentions or goals are the immediate precursor of a person’s behavior
Added benefit of content validity
Attitudes>Intentions>Behavior
Assessment Centers
Simulate the situation in which the individual will be performing
Predicts how successful that person will be in the actual situation
Exercises vary in fidelity and immersion
Assessment Center Examples
AT & T developed and operated the Advanced Management Potential Assessment Program (AMPA) for itself and the Bell System Operating Companies. The program was used by all the Bell System companies from 1979 through 1983.
Dr. Rich’s example: a study he conducted in the early 2000s in which he and his team immersed executives in situations all over Baltimore to test their adaptability. For example, they were told to talk to a man about a problem; when they got to him, they realized he was deaf. Some people just gave up because they couldn’t use sign language; others grabbed a napkin and a pen so they could communicate with him.
The CEO could then see who was needed at the company and who wasn’t – like the person who gave up when they couldn’t figure out a situation.
Psychometrics
RELIABILITY
If measurement procedures are to be useful, they must produce dependable scores
Consistency
Freedom from unsystematic (random) errors of measurements
Methods to assess reliability
Test Re-test
Parallel (alternate) forms
Internal consistency
-Split half
–Splitting a test in half – you can split the test any way (e.g., odd items vs. even items)
-Kuder-Richardson 20 (KR-20)
–For tests with right and wrong (dichotomously scored) answers
-Alpha
–The average of all possible split-half reliabilities (see the sketch below)
-Omega
Test re-test is a good way to estimate reliability.
The downside to giving someone the same test twice is the practice effect – people do better the second time because they have already taken it once.
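A minimal sketch of two of the internal-consistency estimates above – an odd-even split-half stepped up with the Spearman-Brown formula, and coefficient alpha – using invented item data:

```python
import statistics

def split_half_reliability(item_matrix):
    """Odd-even split-half reliability, stepped up with the Spearman-Brown formula."""
    odd_totals = [sum(person[0::2]) for person in item_matrix]
    even_totals = [sum(person[1::2]) for person in item_matrix]
    r_half = statistics.correlation(odd_totals, even_totals)  # Python 3.10+
    return 2 * r_half / (1 + r_half)

def cronbach_alpha(item_matrix):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(item_matrix[0])
    item_variances = [statistics.pvariance([person[i] for person in item_matrix]) for i in range(k)]
    total_variance = statistics.pvariance([sum(person) for person in item_matrix])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Rows = people, columns = scores on four items of the same scale (made-up data)
data = [
    [4, 3, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 4, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
]
print("split-half (Spearman-Brown):", round(split_half_reliability(data), 2))
print("coefficient alpha:", round(cronbach_alpha(data), 2))
```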
Issues Related To Reliability
No fixed value indicates an acceptable level of reliability
Reliabilities often range from .70 -.90
Range of scores (need variability)
-A wide range of scores supports reliability; a restricted range lowers it
Sample size & number of items
-The more observations (items) you have, the higher the reliability – see the Spearman-Brown sketch below
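The Spearman-Brown prophecy formula (a standard psychometric result, not from the slide) makes the point concrete: if a test with reliability r_xx is lengthened to k times as many parallel items, the predicted reliability is

r_{new} = \frac{k \, r_{xx}}{1 + (k - 1) \, r_{xx}}

For example, doubling a test with r_xx = .60 gives (2 × .60) / (1 + .60) = .75.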
Reliability & Validity
Theoretically it would be possible to develop a perfectly reliable measure whose scores were completely uncorrelated with any other variable.
This measure would have no practical value.
It would be highly reliable but would have no validity.
Limit on validity
Validity is reduced by the unreliability in a set of measures
Ex. performance appraisal
-Typical reliabilities are low (.60)
-Sets a cap on possible criterion validity
-We can statistically correct for this type of unreliability (correction for attenuation; see below)
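The textbook correction for attenuation in the criterion (not from the slide) divides the observed validity by the square root of the criterion reliability:

r_{corrected} = \frac{r_{xy}}{\sqrt{r_{yy}}}

For example, an observed validity of .30 with a criterion reliability of .60 corrects to .30 / \sqrt{.60} ≈ .39.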
What is Validity ?
The extent to which a measurement procedure actually measures what it is designed to measure
Degree to which evidence and theory support the interpretation of test scores for their intended purpose
The process of gathering and evaluating data to assess this is called validation.
Really concerned with two issues: (1) what a test measures and (2) how well it measures it.
Validity
Tests scores are typically used to draw inferences about applicant behavior in situations beyond the testing environment
Test user must be able to justify the inferences drawn by having a cogent rationale or empirical support linking the test score to the inferred outcome
Nobody cares about the test score – what they care about are the consequences (inferences)
Validation Strategies
Content - Related Evidence
Criterion - Related Evidence
Construct - Related Evidence
Standards (1999, 2014)
Standards
Standards for Educational & Psychological Testing (2014).
Sources of validity evidence based on:
Test Content
Response Processes
Internal Structure
Relations to other Variables
Consequences of Testing
Content Validity
The content of the test is drawn from the domain of interest
Content Validation
Concerned with whether or not a measurement procedure contains a fair sample of the domain of situations it is supposed to represent
-Ex. suppose your first test had items drawn completely from texts that were not assigned for reading or covered in the lecture…
Our domain is usually job performance
Can also be other aspects of work, ex. Training proficiency
MUST provide evidence that a selection procedure samples knowledge or skills required for a job
MUST be based on accurate job information – you NEED A JOB ANALYSIS
MAY restrict job content domain to important or frequent activities (minimize the irrelevant)
In conducting a content validation study:
Content strategies are relatively data-free
Need a panel of SMEs (subject matter experts) to rate each item on its relevance to the job
Can be quantified with a content validity index (CVI) – see the sketch below
Most of the inferences of validity are supported by the documentation surrounding the development of the test
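A minimal sketch of one common way to quantify content validity – an item-level CVI computed as the proportion of SMEs who rate the item as relevant (the rating scale and cutoff below are assumptions for illustration):

```python
def item_cvi(relevance_ratings, relevant_cutoff=3):
    """Item-level content validity index: the proportion of SMEs who rate the
    item as relevant (here, >= 3 on a hypothetical 1-4 relevance scale)."""
    relevant = sum(1 for rating in relevance_ratings if rating >= relevant_cutoff)
    return relevant / len(relevance_ratings)

# Hypothetical relevance ratings from a panel of five SMEs
sme_ratings = {
    "item_1": [4, 4, 3, 4, 3],  # looks job-relevant (CVI = 1.0)
    "item_2": [2, 3, 1, 2, 2],  # questionable relevance (CVI = 0.2)
}
for item, ratings in sme_ratings.items():
    print(item, item_cvi(ratings))
```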
Criterion Validity
The criterion variable is a measure of some attribute or outcome that is of primary interest
The choice of the criterion and the measurement procedures used to obtain criterion scores are of central importance
Companies often overlook good measurement of the criterion (they use cheap, easily accessible criteria)
Requires data – nothing complex, but it needs data, roughly at least 100 subjects.
“G” = trait
If we get a statistically significant relationship between the predictor and the criterion, that is evidence of criterion validity (see the sketch below)
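A minimal sketch of the criterion-related logic – correlate predictor scores with a criterion measure (the data are invented and far smaller than the ~100 subjects a real study would need):

```python
import statistics

# Hypothetical predictor scores (e.g., a cognitive ability test given at hire)
test_scores = [105, 98, 112, 90, 120, 101, 95, 110]
# Hypothetical criterion scores (e.g., supervisor performance ratings a year later)
performance = [3.8, 3.1, 4.2, 2.9, 4.5, 3.5, 3.0, 4.0]

# The observed criterion-related validity coefficient (Python 3.10+)
validity = statistics.correlation(test_scores, performance)
print(round(validity, 2))
```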
Feasibility
Job is reasonably stable and not in a period of rapid evolution
Relevant, reliable and uncontaminated criterion measure
Contaminated: Measuring things other than performance
Based on a sample that is reasonably representative
Statistical power
Observed validities are typically around .30, so the sample needs to be large enough to detect a relationship of that size
Predictive & Concurrent Validity
Predictive – data on the selection procedure are collected at the time applicants are hired; after employees’ performance levels have stabilized, criterion data are collected. The applicants’ scores on the measure being validated are not used in the hiring decision!
Concurrent – the predictor and criterion data are collected on job incumbents at approximately the same time
Construct Validity
Am I measuring what I intended to measure?
Specifying the meaning of the construct
Distinguishing it from other constructs
Indicating how the construct should relate to other variables
Nomological Network
Conducting construct validation
Analysis of internal consistency
Factor analysis (establishing that items or item clusters share common variance)
Establishes that it is one construct
Correlations of a new procedure with established measures of the same construct (convergent validity) and with measures of unrelated constructs (divergent evidence) – see the sketch below
Establishes what that construct is
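A minimal sketch of the convergent/divergent logic with invented scores – the new measure should correlate strongly with an established measure of the same construct and weakly with a measure of an unrelated construct:

```python
import statistics

# Hypothetical scores for the same six people on three measures
new_conscientiousness = [4.1, 3.2, 4.8, 2.9, 3.6, 4.4]  # the new measure being validated
old_conscientiousness = [4.0, 3.0, 4.6, 3.1, 3.5, 4.5]  # established measure of the same construct
vocabulary_test = [25, 26, 24, 21, 25, 23]               # measure of an unrelated construct

# Convergent evidence: the new measure should correlate strongly with the established one (Python 3.10+)
print("convergent:", round(statistics.correlation(new_conscientiousness, old_conscientiousness), 2))
# Divergent evidence: it should correlate weakly with the unrelated measure
print("divergent:", round(statistics.correlation(new_conscientiousness, vocabulary_test), 2))
```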
Construct Validation – Advanced Methods
Factor invariance
Does factor structure change when conditions change (when moderators are present)
Constructs/items are different around the world
They are interpreted differently
For example does factor structure change across cultures
Big Five versus Chinese Personality Assessment Inventory
Etic vs. Emic