Psychometrics Final Flashcards
What is reliability a measure of?
- The consistency and stability of measurement over time
- time is not always a factor
What is the basic idea behind true score theory?
- Observed score = true score + random error
- Postulated idea about how reality works
- Foundation of assessment
- Must hold true for measurement to work
- We never observe the construct itself
- We are assuming observable behavior relates to underlying psychological constructs
- The true score is assumed to be constant, so the greater the error, the less the observed score reflects that true score (see the sketch below)
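A minimal simulation sketch of this idea (hypothetical numbers, not from class), showing observed scores scattering around a fixed true score once random error is added:

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 50                          # the constant we can never observe directly
random_error = rng.normal(0, 5, 1000)    # noise centered on zero
observed = true_score + random_error     # observed score = true score + random error

print(observed.mean())   # close to 50: the average recovers the true score
print(observed.std())    # about 5: the error shows up as spread around the true score
```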
What are the two different types of error discussed in class and how do they differ
Random error
- Will impact everyone but it will not be in the same direction.
- Does not affect the mean, but muddies the water (because the errors all fall in different directions)
- Increases the variability around the average
- “Noise”
- Will always occur in measurement
- Will affect each person in a sample individually
- Will impact everyone differently
- You will always have random error
Systematic error
- Will impact everybody in a similar, singular way: the same direction for everyone, though at different levels of impact.
- Changes the average
- Called “Bias”
- There is not always systematic bias because it can be controlled for
Which form of error causes a change in the mean (average) score observed?
- Systematic error causes a change in the mean score
Why is one form of error called bias and the other called noise?
- Systematic is called bias
- Affects the average
- Random error is called noise
- This makes the picture murky and makes true ability harder to pick up (see the sketch below)
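A hedged illustration with made-up numbers of why one error is "noise" and the other "bias": random error widens the spread around the mean, while systematic error shifts the mean itself.

```python
import numpy as np

rng = np.random.default_rng(1)
true_scores = np.full(500, 70.0)                    # everyone has the same true score

noisy = true_scores + rng.normal(0, 8, 500)         # random error: a different push for each person
biased = true_scores - 5 + rng.normal(0, 8, 500)    # systematic error: the same downward push for all

print(noisy.mean(), noisy.std())     # mean stays near 70, spread grows -> "noise"
print(biased.mean(), biased.std())   # mean drops toward 65 -> "bias"
```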
What are some ways to reduce measurement error?
Mnemonic: "Pilots Thorough Double-Checking Saves Multiple-Lives" (Pilot testing, Thorough training, Double-check the data, Statistical correction, Multiple measures)
Pilot testing
- Way to have some foresight that we won’t normally have
- We can oftentimes identify potential systematic errors
- Addresses systematic error and random error
Thorough Training
- Especially when you have multiple individuals collecting data or multiple raters
- Data is ambiguous, subjective, or open to interpretation
- Addresses systematic error and random error
Double Check the Data
- Not just plausibility, but possibility as well.
- Possibility is easy to check in SPSS
- Plausibility is more difficult and arguably more important.
- Ex: incredible effects that were not quite expected, which happened when she did not reverse-code scores
- You need to really know the construct and the expectations; do not let the numbers dictate your thinking, use your knowledge and judgment as well
- Addresses systematic error
Statistical Correction
- Can range from simple (a mean adjustment) to complex (statistical adjustment)
- Adjust mean
- Ex: mean score adjustment when the afternoon class scored lower on an exam because of construction (see the sketch after this list)
- Statistical modeling of error
- Addresses systematic error
Multiple Measures
- Administer multiple measures of the same construct; you can triangulate between them to look for systematic error (method bias).
- Ex: assessing intelligence in children. You can ask parents, ask teachers, and run a formal assessment. If two correspond and one is significantly different from the others, the divergent one could reflect a systematic bias.
- Addresses systematic error and random error
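A rough sketch of the simple end of statistical correction, the mean adjustment from the construction example above. The scores are invented, and it assumes the entire mean difference is attributable to the systematic error:

```python
import numpy as np

morning = np.array([82, 75, 90, 68, 88], dtype=float)    # class tested under normal conditions
afternoon = np.array([74, 70, 81, 60, 79], dtype=float)  # class tested during construction

# Attribute the whole mean difference to the systematic error and add it back
adjustment = morning.mean() - afternoon.mean()
afternoon_corrected = afternoon + adjustment

print(adjustment)                  # estimated size of the systematic error
print(afternoon_corrected.mean())  # now matches the morning-class mean
```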
Know the differences between the following:
- Inter-rater reliability
- Test-retest reliability
- Alternate-form reliability
- Internal consistency reliability
Inter-rater reliability
- 2 independent raters
- 1 time
- 1 assessment
- Assesses the amount of agreement between raters
- Categorical
- Percent of shared agreement between raters
- Continuous
- Correlation between the observers
Test-retest reliability
- 1 rater
- 2 points in time
- 1 assessment
- Used to assess consistency of a measure from one time to another
- Timing is critical
Alternate-form reliability / Parallel Forms
- 1 rater
- 1 point in time
- 2 assessments
- Used to assess consistency of the same knowledge base across 2 assessments
Internal consistency reliability
- 1 rater
- 1 point in time
- 1 assessment
- Looks at consistency across items within the measure at the same time
- 3 types
- Average inter-item correlation
- Split-half reliability
- Cronbach's alpha
Inter-rater Reliability
- 2 independent raters
- 1 time
- 1 assessment
- Assesses the amount of agreement between raters (see the sketch below)
- Categorical
- Percent of shared agreement between raters
- Continuous
- Correlation between the observers
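A small sketch of both versions of inter-rater reliability with invented ratings: percent agreement for categorical codes, and a correlation for continuous scores.

```python
import numpy as np

# Categorical ratings: percent of shared agreement between two independent raters
rater1 = np.array(["yes", "no", "yes", "yes", "no", "yes"])
rater2 = np.array(["yes", "no", "no",  "yes", "no", "yes"])
print((rater1 == rater2).mean())   # proportion of cases coded the same way

# Continuous ratings: correlation between the two observers
scores1 = np.array([4.0, 3.5, 5.0, 2.0, 4.5])
scores2 = np.array([4.5, 3.0, 5.0, 2.5, 4.0])
print(np.corrcoef(scores1, scores2)[0, 1])   # high r -> raters order people similarly
```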
Test-retest Reliability
- 1 rater
- 2 points in time
- 1 assessment
- Used to assess consistency of a measure from one time to another
- Timing is critical
Alternate Form Reliability
- 1 rater
- 1 point in time
- 2 assessments
- Used to assess consistency of the same knowledge base across 2 assessments (sketched in the code below, together with test-retest)
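Computationally, test-retest and alternate-form reliability both reduce to correlating two score vectors from the same people; a sketch with invented scores:

```python
import numpy as np

# Test-retest: same measure, same people, two points in time
time1 = np.array([20, 25, 30, 18, 27], dtype=float)
time2 = np.array([22, 24, 31, 17, 28], dtype=float)
print(np.corrcoef(time1, time2)[0, 1])     # stability from one occasion to the next

# Alternate form: two parallel forms, same people, one sitting
form_a = np.array([15, 19, 12, 22, 17], dtype=float)
form_b = np.array([14, 20, 13, 21, 18], dtype=float)
print(np.corrcoef(form_a, form_b)[0, 1])   # consistency across the two forms
```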
Internal consistency reliability
- 1 rater
- 1 point in time
- 1 assessment
- Looks at consistency across items within the measure at the same time
- 3 types
- Average inter-item correlation
- How well each item compares to the other items
- Correlate each item with all the other items
- Should not see a lot of variation
- Assumes unidimensionality
- Tells you which items are problematic
- Split-half reliability
- Randomly split the items into two halves, sum each half, and correlate the two sums
- Cronbach's alpha
- Do every possible split half and average them
- Most stable
- Tells you how much of a problematic effect an item has
Average inter-item correlation
- Part of internal consistency reliability
- How well each item compares to the other items
- Correlate each item with all the other items
- Should not see a lot of variation
- Assumes unidimensionality
- Tells you which items are problematic (see the sketch below)
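A sketch of the average inter-item correlation on a toy item matrix (rows are respondents, columns are items; the data are invented):

```python
import numpy as np

# Rows = respondents, columns = items on a scale assumed to be unidimensional
items = np.array([
    [4, 5, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
], dtype=float)

r = np.corrcoef(items, rowvar=False)      # item-by-item correlation matrix
pairs = r[np.triu_indices_from(r, k=1)]   # each unique item pair once
print(pairs.mean())                       # average inter-item correlation
print(pairs)                              # a pair that stands out flags a problematic item
```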
Split-half reliability
- Part of internal consistency reliability
- Randomly split the items into two halves, sum each half, and correlate the two sums
- If you have a lot of poor items it will show a bad correlation
- With short measures that have only a few items, one bad item can throw it off (see the sketch below)
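A split-half sketch: simulated item responses (a toy latent trait plus noise), with the items randomly split into two halves, each half summed, and the sums correlated. The value changes with the particular split, which is the instability that Cronbach's alpha smooths out.

```python
import numpy as np

rng = np.random.default_rng(2)
trait = rng.normal(0, 1, 30)                          # latent trait for 30 toy respondents
items = trait[:, None] + rng.normal(0, 1, (30, 8))    # 8 items = trait + random error

# Randomly split the items into two halves, sum each half, correlate the sums
order = rng.permutation(items.shape[1])
half1 = items[:, order[:4]].sum(axis=1)
half2 = items[:, order[4:]].sum(axis=1)
print(np.corrcoef(half1, half2)[0, 1])    # split-half reliability for this particular split
```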
Cronbach's Alpha
- Part of internal consistency reliability
- Do every possible split half and average them (computed below via the variance formula)
- Most stable, preferred method
- Tells you how much of a problematic effect an item has
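A sketch of Cronbach's alpha using the standard variance formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score), which is equivalent to the "average every split half" description above; the data are simulated, not from class.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items score matrix (variance formula)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()     # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)      # variance of the total (summed) score
    return (k / (k - 1)) * (1 - item_var / total_var)

rng = np.random.default_rng(3)
trait = rng.normal(0, 1, 100)
items = trait[:, None] + rng.normal(0, 1, (100, 6))   # 6 toy items tapping one trait
print(cronbach_alpha(items))
```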
Systematic error
- No shift in variability
- Affects the average
- This is called bias
- Random error can occur without systematic error
- When we have systematic error, we will always have random error
Distinction between Validity and Reliability
- Reliability
- Consistent measurement over time, raters, or forms.
- Validity
- You are measuring what you think you are measuring
- Much more important than reliability
- You could reliably measure the wrong thing
- A measurement is considered valid when the test overlaps with the constructs of interest
- you can have something that is reliable and not valid, but not the other way around
- scale example
Why is it imprecise to say that a test is valid?
- Validity is a matter of degree.
- No test is valid or not valid.
- “If a test is not valid then it does not exist”
- Measurement being valid is not a yes or a no, but to what degree
- construct underrepresentation
- construct irrelevant variance
- Valid measurement
- How does increased or decreased validity play into the relationship between these?
construct underrepresentation
- the aspects of the construct that our test does not tap into
- these are the parts of the construct we end up not knowing about
construct irrelevant variance
- factors that influence responses on the test that go beyond the actual construct itself
- includes random and systematic error
- pulling away from a perfect true score
- Sits in the test itself (the part of the test that falls outside the construct)
Valid measurement
- the overlap between the construct and the test, i.e., what remains after construct underrepresentation and construct irrelevant variance are removed
Increased validity
- the construct and the test overlap more
- minimizing construct underrepresentation and construct irrelevant variance
Decreased validity
- the construct and the test overlap less
- construct underrepresentation and construct irrelevant variance pull away from the valid measurement
Content validity
- Examination of aspects of the test itself to ensure that we have as much overlap as possible with the construct
- How well our measurement is tapping into the actual construct
- Good match between test content and the domain = High content validity
- Conceptual as opposed to statistical
How to assess content validity
Describe the content domain
- In terms of boundaries and structure
- Boundary refers to which aspects are operative within the construct, i.e., what falls inside it and what falls outside.
- Structure has to do with the relative weight or importance of each of these aspects within the construct (how important each of them is)
Determine the areas of the content domain that are measured by each test item
- Do not want a single item to tap into multiple constructs
Compare the structure of the test with the structure of the content domain
- Is our measure representing the construct appropriately?
- First, the relative number of items within the measure can reflect each aspect's weight within the construct
- Or, we can create our scoring rubric to reflect the structure of the construct.
Additional info:
- You can adjust scoring to address structural issues
- For a boundary issue, the weights will change
- For a large boundary issue, the test may need to change
- This is not a statistical process
Construct validity
- Examination of the relationship between test scores and those of other measures.
- This is a statistical process where we look to see the amount of overlap.
Differences between content and construct validity
Why is an assessment of content validity seen less in psychological measurement than in educational testing?
- Educational content is much less ambiguous than constructs we deal with in psychology
- It is a confirmatory process rather than a statistical one
Methods of determining construct validity
Correlational study
- Correlations between measures of certain behaviors (that are either related to or unrelated to our construct of interest) and our test
- Generally done; tends to be the preferred approach (see the code sketch at the end of this card)
- Convergent validity
- Discriminant validity
Factor analysis
- Analyzing which groups of items “hang” together
- Work best when dealing with constructs with multiple aspects
Experimental manipulation
- Manipulate the construct of interest (e.g. induce fear) and see if it relates to different scores on our test
- Why don't we do that all the time? We can't randomly assign people to conditions like suicide or trauma
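A hedged sketch of the correlational approach with invented variables: convergent validity expects a strong correlation with a measure of a related construct, and discriminant validity expects little or no correlation with an unrelated one.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

new_anxiety_test = rng.normal(0, 1, n)                        # the test being validated (toy scores)
worry_scale = 0.8 * new_anxiety_test + rng.normal(0, 0.6, n)  # measure of a related construct
shoe_size = rng.normal(0, 1, n)                               # variable unrelated to the construct

print(np.corrcoef(new_anxiety_test, worry_scale)[0, 1])   # should be high -> convergent validity
print(np.corrcoef(new_anxiety_test, shoe_size)[0, 1])     # should be near zero -> discriminant validity
```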