First Exam Flashcards
Reliability
How consistent the entire instrument is; the closer the reliability coefficient is to 1, the more reliable the instrument
Psychometric theory looks at 2 things
One side looks at the entire test (reliability) and the other side looks at item quality (dichotomous & non-dichotomous)
How do you construct an instrument?
- looking at the entire test (reliability) and item quality (non-dichotomous & dichotomous)
The entire test has 4 different types of reliability
-inter-rater, test-retest, internal consistency, parallel forms
Non-dichotomous and how it relates to variance
-you want higher variance to get a better normal curve; the more items you add, the more you increase the variance
Validity
Accuracy: how well a test measures what it is intended to measure (all of probability is based on infinity)
Reliability and error
Error can affect the consistency of scores
2 types of error
Systematic error & random error
Systematic error
Errors that occur consistently because of a particular characteristic of the person being tested (e.g., reading proficiency)
Random error
Errors that occur by chance (blackout, distraction); more common than systematic error
Different types of random error
Content differences, subjective scoring and temporal instability
Content differences (content based)
Non-standardized administrations (may inadvertently speak differently when administering test) ex: court ordered testing or a child using restroom during test
Subjective scoring
Rater differences: raters' subjective views of the client may differ
Temporal instability
Things change from day to day. Ex: one day the test taker had the flu; or the first day of testing went well, but on the second day there was an earthquake and performance went down
What are some ways to decrease measurement error
Writing clear items, making test instructions easily understood, adhering closely to the prescribed conditions for administering an instrument, training raters, and making subjective scoring rules as explicit as possible
Where does most measurement error come from?
Most of it comes from the person administering the test, but it decreases as the administrator becomes more experienced
Test-retest reliability (coefficient of stability)
When you take a single group of subjects and you repeatedly test on the same instrument at different times
What is the gold standard for test-retest reliability
2 weeks between the first test and the second test; this gap gives optimal test-retest reliability
In test-retest reliability what is the difference between the shorter and longer gap?
The longer the time gap, the lower the correlation; the shorter the gap, the more similar the factors that contribute to error
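The coefficient of stability is just the Pearson correlation between the two administrations. A minimal sketch in Python; the five examinees' scores below are hypothetical:

```python
import math

def pearson_r(x, y):
    # Pearson correlation between two lists of scores
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# hypothetical scores from the same five people, two weeks apart
time1 = [10, 12, 14, 16, 18]
time2 = [11, 13, 13, 17, 19]
print(round(pearson_r(time1, time2), 3))  # 0.962
```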
Artificial inflation
When researchers use the shorter gap to get a better correlation
Parallel forms reliability
Assessing if two forms of the same instrument produce similar results when testing the same person (sometimes hard to achieve)
What is form A & form B (parallel forms reliability)
How reliable the two forms are with one another; having two versions helps eliminate practice effects
What is a key problem with parallel forms reliability?
Difficult to randomly divide and hard to create large number of items
What is a key part of parallel forms reliability?
Developing a large number of items and then randomly dividing them into two tests
Coefficient of equivalence
How correlated a person's scores are across two different forms of the same test
When should the two forms for parallel forms reliability be sent out?
They should be administered at least 2 weeks apart
What happens if the correlation between the two testings is lower than .2?
There is significant measurement error
What happens if you administer the forms on the same day for parallel forms reliability?
The test may reflect state rather than trait, and you will not have a statistically significant difference
Internal consistency reliability
How related items are within the entire scale and within the subscales
What do we want with internal consistency reliability?
The content should be similar for the reliability to be high; you need an adequate number of items, and you want the items to appropriately reflect the underlying construct
Different types of internal consistency reliability
Split-half reliability, Kuder-Richardson #20 (KR-20), Cronbach's alpha
Split half reliability
Split the examinees' scores into two halves and then correlate the scores of the two halves
How does split half reliability look like in speeded tests?
An odd-even split may produce artificially high internal consistency if the examinee runs out of time
How to get good idea of split half reliability
Take the odd-numbered items as one half and the even-numbered items as the other, then correlate the halves; this gives a better idea of split-half reliability
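One standard detail worth adding: the correlation between two half-tests underestimates the reliability of the full-length test, so it is usually corrected with the Spearman-Brown formula. A sketch:

```python
def spearman_brown(r_half):
    # Spearman-Brown correction: estimated full-test reliability
    # from the correlation between the two half-tests
    return 2 * r_half / (1 + r_half)

# e.g. if the odd and even halves correlate at .70:
print(round(spearman_brown(0.70), 2))  # 0.82
```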
What are some problems with split half reliability?
Natural order of test taking (content is not the same with the first half as the second half) & Issue of a timed test (some people don’t get to the second half)
Kuder-Richardson #20 (KR-20)
Formula that provides a split-half-style reliability estimate under the assumption that the questions are scrambled
How does Kuder-Richardson stop a confound in your test?
By stopping the natural order
The drawbacks of KR-20
Only works with dichotomous scoring systems (only allows for right-or-wrong question responses)
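The KR-20 formula is (k/(k-1)) * (1 - sum(p*q) / total-score variance), where p is the proportion answering an item correctly and q = 1 - p. A minimal sketch with hypothetical 0/1 item data:

```python
def kr20(responses):
    # responses: one row of 0/1 item scores per examinee
    n = len(responses)      # examinees
    k = len(responses[0])   # items
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    var_total = sum((t - mean) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion correct, item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# hypothetical data: 5 examinees x 4 dichotomous items
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0]]
print(round(kr20(data), 3))  # 0.8
```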
Cronbach's alpha
Can be used to assess internal consistency for those tests that have different scoring systems
When and how can Cronbach's alpha be used?
Can be used on any scoring system and allows for scrambling of the questions; used more than any other measure of internal consistency; equivalent to the average of all possible split-half correlations
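Cronbach's alpha generalizes KR-20 to any scoring system: alpha = (k/(k-1)) * (1 - sum of item variances / total-score variance). A minimal sketch with hypothetical 1-5 Likert responses:

```python
def cronbach_alpha(responses):
    # responses: one row of item scores per examinee (any scoring system)
    n = len(responses)
    k = len(responses[0])

    def pvar(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(row) for row in responses]
    item_vars = [pvar([row[j] for row in responses]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvar(totals))

# hypothetical data: 5 respondents x 3 Likert items
likert = [[3, 4, 3],
          [2, 2, 3],
          [4, 5, 4],
          [1, 2, 2],
          [5, 4, 5]]
print(round(cronbach_alpha(likert), 3))  # 0.929
```

With 0/1 data this function returns the same value as KR-20, which is why alpha is described as the more general measure.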
Internal consistency & Cronbach's alpha
High coefficient alpha does not always mean that you are measuring only one factor or latent construct (unidimensionality)
What do we assume in internal consistency?
We assume unidimensionality, but many tests are inadvertently multidimensional
What do multidimensional tests look like?
More than one factor is being measured (ex: an AP History test measures knowledge, but also writing ability)
How will Cronbach's alpha be increased or artificially inflated?
If the test takers are a homogeneous group; you need heterogeneity in the group (alpha will be more accurate with a general group of people)
Interrater reliability
Assessing the degree of consistency between multiple raters
2 kinds of interrater reliability
Kendall's coefficient of concordance & Cohen's kappa
Kendall’s coefficient of concordance
Degree of consistency amongst raters that rank order people/objects
Rank order consistency. Ex: Miss Universe; different judges rank contestants 1, 2, 3, 4, 5 to see if their rankings correlate with one another
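Kendall's W compares the rank sums each object receives across raters: W = 12S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the rank sums from their mean, m is the number of raters, and n the number of objects (no tied ranks assumed). A sketch with hypothetical judge rankings:

```python
def kendalls_w(rankings):
    # rankings: one list of ranks per rater, all ranking the same n objects
    m = len(rankings)      # raters
    n = len(rankings[0])   # objects ranked
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_rs = sum(rank_sums) / n
    s = sum((rs - mean_rs) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# hypothetical: 3 judges rank 4 contestants (1 = best)
ranks = [[1, 2, 3, 4],
         [1, 3, 2, 4],
         [2, 1, 3, 4]]
print(round(kendalls_w(ranks), 3))  # 0.778
```

W ranges from 0 (no agreement) to 1 (perfect agreement among the judges).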
Cohen's kappa
Degree of consistency amongst raters that classify items into discrete categories
Example of Cohen's kappa
Two raters each classify the same group of 30 people as depressed or not depressed; Cohen's kappa quantifies how consistently they place patients in the same category
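Cohen's kappa corrects raw percent agreement for chance agreement: kappa = (po - pe) / (1 - pe), where po is observed agreement and pe is the agreement expected by chance. A sketch with hypothetical depressed/not-depressed classifications from two raters:

```python
def cohens_kappa(rater_a, rater_b):
    # rater_a, rater_b: category labels assigned to the same subjects
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    cats = set(rater_a) | set(rater_b)
    # chance agreement from each rater's marginal proportions
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# hypothetical labels for 10 patients: "d" = depressed, "n" = not depressed
a = ["d", "d", "d", "n", "n", "n", "n", "n", "d", "n"]
b = ["d", "d", "n", "n", "n", "n", "n", "d", "d", "n"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```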
Normal Curve
The probability that an observation under the normal curve lies within 1 SD of the mean is approx. 0.68, within 2 SD approx. 0.95, and within 3 SD approx. 0.997
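These values can be checked from the standard normal CDF, Phi(z) = (1 + erf(z / sqrt(2))) / 2; the probability of falling within z SDs of the mean is Phi(z) - Phi(-z). A sketch:

```python
import math

def within_sd(z):
    # P(|X - mean| <= z * SD) for a normal distribution
    phi = lambda x: (1 + math.erf(x / math.sqrt(2))) / 2  # standard normal CDF
    return phi(z) - phi(-z)

for z in (1, 2, 3):
    print(z, round(within_sd(z), 4))  # 1 0.6827, 2 0.9545, 3 0.9973
```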
Why is SEM important for testing?
SEM is based on the idea that you cannot test an individual an infinite number of times. Measurement error is always present
How is JND difficult to apply to psychological constructs?
It is used to determine a level of sensory difference (like hearing or sight), but there is variability in the expression of disorders in humans, so a comparable threshold is hard to pin down
What is item analysis and how is it related to test construction?
Examining item quality to map the construct we have defined. We then look at dichotomous and non-dichotomous measures to determine item quality (variance, covariance, etc.)
How does one construct a test?
Need to determine what area or domain you want to examine; homogenous content; tests made for repeated use require validation
Scaling models
Unidimensional, subject centered methods, stimulus centered methods and response centered approaches
Subject centered methods
The test developer's primary interest is locating the individual at different points on the continuum (Likert scale)
Stimulus centered methods
Psychophysics & JND: present tones to determine the absolute threshold for experiencing a sensation; it is not always clear where the difference lies, and not everyone agrees on what the difference is. Requires subject competency to report the JND
Response centered approaches
Each respondent is asked to rank order his or her preference for a set of stimuli or to rank order a set of statements in terms of their proximity to his or her own personal beliefs. Allows to scale psychological distance between separated categories
Heterogeneity
Difference in character or content
Homogenous
Same character or content
Meta-analysis
Combining multiple studies with the same research question
Bivariate
Split a variable into 2 parts
Inferential statistics
Take sample data and make inferences on the population
Descriptive statistics
Looks at trends in the sample and understand them based on the sample itself
Assessment
An overall testing score interpreted in the context of history (holistic)
Testing
A quantitative score with no larger context
Niche building
Creating, seeking out, and ending up in environments that reinforce your traits; we do this consciously and unconsciously
Reliability and standard error of measurement
As the reliability of the instrument increases, the standard error of measurement goes down; if your test is getting consistent results, your error will of course go down
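This relationship is the standard SEM formula: SEM = SD * sqrt(1 - r), where r is the reliability coefficient. A sketch using the IQ scale's SD of 15; the reliability values are illustrative:

```python
import math

def sem(sd, reliability):
    # standard error of measurement: SD * sqrt(1 - reliability)
    return sd * math.sqrt(1 - reliability)

# IQ scale (SD = 15): higher reliability -> smaller SEM
print(round(sem(15, 0.90), 2))  # 4.74
print(round(sem(15, 0.95), 2))  # 3.35
```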
Achievement tests
They are trying to determine if a specific skills set or knowledge base has been acquired
Popham & Husek (1969)
Learned that you cannot use traditional reliability since you are not interested in how someone does in comparison to a group of others— you are interested in how someone performs in regard to a specific criterion
Criterion
Anything that has real-world implications
Ex: if a lawyer fails the bar exam they cannot become a lawyer; these tests affect your real life because they affect you moving forward in a profession
2 objectives achievement tests scores can give you
Relative position of the examinees score in a distribution of scores (z score) & the degree to which the person has attained the goal of a specific instruction (ex: comp exam)
Z score
Measured in terms of standard deviations from the mean, Relative position of the examinees score in a distribution of scores
Proportion correct score
Percentage of correct answers from a randomly determined number of test items (you don’t need to know how others performed if you know the percentage of correct answers obtained)
Criterion referenced tests
Look at development (all of these tests are arbitrary), a test that measures a student’s performance against a set of predetermined standards or criteria.
Domain score
The proportion of items in the domain that the examinee answers correctly
Mastery allocation
Cutoff score that classifies examinees into two categories master vs. non-master (ex: EPPP)
What does a z score allow for
Allows for comparison across variables that are calibrated or scaled differently; it is independent of scaling and calibration
What do z scores do for the WAIS/WISC (IQ tests) & MMPI?
These have different scoring systems, which makes their raw scores not directly comparable, but z-scores allow comparison across them
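A sketch of that comparison: z = (x - mean) / SD. WAIS IQ scores use mean 100 and SD 15; MMPI T-scores use mean 50 and SD 10 (the individual scores below are hypothetical):

```python
def z_score(x, mean, sd):
    # standard score: distance from the mean in SD units
    return (x - mean) / sd

# hypothetical examinee: WAIS IQ of 115, MMPI T-score of 65
print(z_score(115, 100, 15))  # 1.0 -> 1 SD above the mean
print(z_score(65, 50, 10))    # 1.5 -> 1.5 SDs above the mean
```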
Absolute error
Using an examinee's mean score as a representation of his or her true universe score
How is absolute error calculated
By summing all the error variance
How criterion referenced reliability is examined
The lower the error the better the examinees score represents his domain referenced true knowledge
Reliability of classification
-does the observed match onto what we predict? We want to know who passes and fails, as well as how they are classified
Predicted
What we expect to happen
Observed
What has actually happened
The percentage of items people are getting correct will affect
The reliability of achievement tests
Where is the true reliability
It is in the middle of the score distribution, not the tails; we want error in the middle to be lower to show better reliability
Should reliability be high?
Yes, and all types of reliability should be similarly high
Homogenous samples and reliability
Homogeneous samples have lower reliability than heterogeneous samples
Self report and reliability problems (2 major components)
Literal meaning and pragmatic meaning
Literal meaning
Semantic understanding of sentence structure
Pragmatic meaning
Inferences about the question's intent
Issues with reliability & self report
Ex: "How are you doing?" leads to interpretation by the participant in the conversation; this can cause issues with reliability because the client may interpret the question differently
Self reports and reference periods
When asked to respond about something that occurred last week vs. last year, we find differential responding
Differential responding
Respondents interpret a shorter reference period as implying frequency and a longer one as implying intensity of the event
Self reports and question context
Respondents change their answers based on the researcher's affiliation, or the response categories themselves can change the way a patient responds
Self report and context
Preceding questions in a survey or questionnaire influences the ways in which respondents evaluate items
Internet & psychological testing
The Internet provides a cheaper and faster way to update tests, translate tests, and interpret scores quickly; it can reach more respondents quickly, provides access to test materials cheaply, and allows those in rural areas to be tested
Internet and ethical considerations
Test security (keeping the testing items secured), tests that may discriminate, language barriers, minors taking tests, not giving informed consent accurately, how to give feedback to individuals, and how to deal with emotional trauma from results
Psychologists should use what type of tests?
Tests whose validity and reliability have been established for the population being tested
How do you evaluate whether an item or test question is good?
Done through statistical analysis of the test questions
Intrinsic traits
Qualities that are inherent to something or someone and are not dependent on external circumstances
Difference between multidimensional and unidimensional in Cronbach's alpha
Multidimensional tests have a lower Cronbach's alpha; unidimensional tests have a higher one