Week 1: Psychometrics Flashcards
The quantitative assessment of latent (hidden, concealed, or not yet manifested) psychological constructs is also known as:
Psychometrics
Precision and accuracy are examples of _____ properties (think very general)!
psychometric properties
Representativeness is a sample quality.
What does this quality define?
Define Kaplan’s paradox of sampling which relates to this.
Representativeness describes how well a sample reflects the population.
The paradox of sampling is that we can’t test representativeness, and if we could, we wouldn’t need a sample in the first place!
Name the quality which describes the degree of systematic (or random) error present in the sample. This quality can produce over or underestimates of population values, and comes in many different forms.
Biasedness
Describe the instability of psychological attributes present in the population from which the sample is drawn. How is this expressed?
The degree of homogeneity/non-homogeneity among members of the population reflects the instability of that psychological attribute in that population, e.g. if the psychological attribute were stable, the population would also be homogeneous on it.
Standardised questionnaires are an example of _____ scoring whereas an assessor's judgement of a vignette or a projective test is an example of ____ scoring
Standardised questionnaires are an example of objective scoring whereas an assessor's judgement of a vignette or a projective test is an example of subjective scoring.
IQ is an example of a _____ score
standardised
Z-scores, T-scores and area transformations (quartiles, deciles, percentiles) are all examples of:
Standardisations and scale transformations
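These scale transformations can be sketched in code. A minimal illustration with made-up raw scores; T-scores use mean 50, SD 10, and IQ-style scaling uses mean 100, SD 15:

```python
import statistics

def z_score(x, mean, sd):
    """Standardise a raw score: distance from the mean in SD units."""
    return (x - mean) / sd

def t_score(z):
    """Rescale a z-score to the T metric (mean 50, SD 10)."""
    return 50 + 10 * z

def iq_score(z):
    """Rescale a z-score to the IQ metric (mean 100, SD 15)."""
    return 100 + 15 * z

# Hypothetical raw test scores for a norm group
raw = [12, 15, 18, 21, 24]
mean = statistics.mean(raw)   # 18.0
sd = statistics.pstdev(raw)   # population SD of the norm group

z = z_score(21, mean, sd)
print(round(z, 2), round(t_score(z), 1), round(iq_score(z), 1))
```

An area transformation (quartile, decile, percentile) would instead rank the score against the rest of the norm group.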
How would a sample size limitation affect the errors inferred from the test results?
Error tends to be inversely related to sample size (standard error shrinks with the square root of n), therefore a limited sample size could mean that a large error is inferred.
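The inverse relationship can be illustrated with the standard error of the mean, SE = SD / sqrt(n); the SD of 15 below is a made-up, IQ-like value:

```python
import math

# Standard error of the mean shrinks with the square root of n,
# so quadrupling the sample only halves the error.
sd = 15  # made-up, IQ-like score SD
for n in (25, 100, 400):
    print(f"n={n}: SE = {sd / math.sqrt(n):.2f}")
```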
In psychometric testing, the degree to which a claim is correct or true is known as ____. This also reflects the appropriateness, usefulness or meaningfulness of test scores and their interpretations.
validity
The levels of logical biases or statistical errors in the test construction and conclusions/outputs will greatly affect the ____ of a psychometric test
validity
Why do assessments of validity constructs focus on scores/ data/ outcomes and functions?
Because they are measurable. If the outcome or function operates in the way that we claimed it would, then the construct is valid.
In statistics, unknowns are seen as _____, in the same category as mistakes
In statistics, unknowns are seen as errors
Why is construct validity also known as factorial validity? What does this validity type relate to?
Because all the items measuring a construct should load onto one factor. If they do, the test is measuring the construct (or factor) that it claims to.
What is operationalisation, in regards to creating psychometric tests?
Operationalisation is a way of constructing psychometric tests which allows for empirical assessment of constructs or variables in the test.
High levels of correlation (statistical relation) between
(a) items that make up the same or related constructs,
or
(b) tests that measure the same or related constructs
describes which type of validity?
Convergent validity
Low levels of correlation between
(a) items that make up unrelated constructs, or
(b) tests that assess unrelated constructs describes which type of validity?
Discriminant validity
Concurrent validity and predictive validity are both examples of ____-____ validity. This validity type refers to the degree to which a test correlates with one or more parallel or outcome criteria.
Criterion-related validity
Concurrent validity represents which kind of criterion-related validity?
Explain what long vs short or parallel forms of a personality measurement represent.
Concurrent validity represents criteria which are in the present.
Long versus short forms are different versions of the same assessment, with more or fewer questions (eg 500 items vs 50 items).
Parallel forms are different tests which use the same criterion eg both are measuring neuroticism.
Although psychological testing has existed since ancient times, the systematic approach currently used has only been developed over the past ___ years
100
Valid construct or assessment consistency across different settings, e.g. samples, populations, age, cultures, time-periods, etc, refers to ____ validity
external validity
The degree to which a test score reflects a construct’s or phenomenon’s natural behaviour in the world is known as ____ validity, a type of external validity.
ecological validity
The degree to which a relationship is not equal, eg Nik can mark my assignments but I can't mark his, is known as _____
asymmetry
The degree of confidence in the nature of asymmetric causal relations (treatment and outcome relationships) between the measured constructs is known as internal validity.
What does this mean?
Internal validity represents the degree of confidence in the nature of asymmetric relations between the measured constructs:
How confident are we that some constructs that we measure cause other constructs that we also measure?
When apples fall to earth, the term gravity is used to describe that behaviour. This is a good example of a ____
construct
Validity testing of a test using a select group of people (eg university students) to generalise a criterion back to the general population presents statistical problems. Please explain.
Restricting the range of test scores restricts (attenuates) the correlation between test scores and criterion measures. Eg if all scores are above 60, the correlation is computed over a narrow cluster of scores, whereas without restricting that range, the correlation would reflect the full spread from the lowest test scores to the highest criterion levels.
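A small simulation can illustrate this attenuation. Everything here is hypothetical: made-up score distributions and a hand-rolled Pearson correlation:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)
# Simulated aptitude scores and a noisily related criterion (eg job performance)
test = [random.gauss(60, 10) for _ in range(5000)]
crit = [t + random.gauss(0, 10) for t in test]

full = pearson(test, crit)
# Keep only high scorers, as if validating on a selected student sample
kept = [(t, c) for t, c in zip(test, crit) if t > 60]
restricted = pearson([t for t, _ in kept], [c for _, c in kept])

print(round(full, 2), round(restricted, 2))  # restricted is noticeably smaller
```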
Why do naturalistic designs, such as questionnaires, interviews and participant observations have good external validity (EV) and bad internal validity (IV)?
Naturalistic designs tend to have good EV and bad IV because in the real world there are so many variables, errors and unknowns.
The less chance there is of confounding in a study, the higher the ____ validity
internal
How does a normative score relate to a standardised score?
A normative score is a standardised score which is ranked against the other scores
What are the following 2 examples of?
- A measure of job performance
- A GPA used to select employees.
They are both examples of criterion-related validity.
A concurrent criterion would be a measure of job performance.
A predictive criterion would be a GPA used to select employees.
An aptitude test followed by a future job performance test to test the validity of that aptitude test is an example of what type of criterion-related validity?
An aptitude test followed by a future job performance test to test the validity of that aptitude test is an example of predictive (criterion-related) validity.
A psychometric measure designed for a clinical population may not have good external validity. Explain why this is a good thing.
We may not want external validity outside of the clinical population in question; we may want the questionnaire items to measure that population only.
What is an asymmetric causal relationship?
Give 2 examples.
An asymmetric causal relationship is an irreversible causal relationship. Eg A causes B, and the process cannot be reversed by eliminating or reducing A.
Another example is that when Nico marks our exams, he will have an effect on the students but they will have no effect on him, ie A causes an effect but A is not affected by that effect.
What is internal validity?
The degree to which asymmetrical causal relationships are consistently represented in the measure. eg. A construct represented by some item measurements consistently causes another construct represented by other item measurements.
Why does naturalistic design have good external validity but poor, or sometimes no, internal validity?
Naturalistic designs have good EV but bad IV because the natural world is so unpredictable, it’s unlikely that the asymmetrical causal relationships will stay constant.
Why do experimental designs have good IV (internal validity) but bad EV (external validity)?
Experimental designs have good IV because the conditions in an experimental design are controlled, and bad EV because the conditions in an experimental design are unlikely to be mirrored in non-experimental conditions.
What is content validity?
Content validity is reflected by scores or test outputs representing the content area or domain which they claim to.
Sampling bias, cluster bias, systematic error (accuracy bias) and ceiling/floor effects are threats to which type of validity?
Sampling bias, cluster bias, systematic error (accuracy bias) and ceiling/floor effects are threats to content validity.
What are ceiling/floor effects and what might be causing them?
Ceiling/floor effects occur when responses cluster at either the top or the bottom of the score range. One possible cause is that the scale is too narrow.
- The degree of consistency or stability of measurement scores across time or context
- The degree of absence of construct fluctuations that are unaccounted by the measurement’s scores (output)
- The degree of random error (unreliability) in the observed variability (changes) of measurement scores
These all represent what quality of psychometric tests?
Reliability.
Classical Test Theory (CTT), can be represented by the following equation: X = T + E
What does the equation mean?
CTT purports that people (objects/entities) have a true score (T) whereas measurements (and individuals) have errors (E).
An observed score (X), is the sum of the true score + error
The equation X = T + E can also be expressed in terms of variance (σ^2), in the following equation: σ^2X = σ^2T + σ^2E.
What does this mean?
The variance of the observed scores equals the variance of the true scores plus the variance of the errors.
In classical test theory (CTT), the theoretical reliability of a measurement can be expressed via the reliability index (r) as r = σ^2T / σ^2X (reliability = variance of the true score / variance of the observed score).
If r=0.9, how much of your observed score is representative of your true score, and how much is error?
If r = 0.9, 90% of the variance in your observed scores is true-score variance and 10% is error variance.
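A quick simulation of the CTT decomposition, with made-up parameters: true scores with variance 225 and errors with variance 25, so the theoretical reliability is 225/250 = 0.9:

```python
import random
import statistics

random.seed(0)
# CTT: observed score X = true score T + random error E
true = [random.gauss(100, 15) for _ in range(10000)]  # sigma_T^2 = 225
err = [random.gauss(0, 5) for _ in range(10000)]      # sigma_E^2 = 25
obs = [t + e for t, e in zip(true, err)]

var_t = statistics.pvariance(true)
var_e = statistics.pvariance(err)
var_x = statistics.pvariance(obs)

r = var_t / var_x  # share of observed variance that is true-score variance
print(round(r, 2))                        # close to 0.9
print(round(var_x - (var_t + var_e), 1))  # close to 0: the variances add
```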
Idiosyncratic and generic are 2 sources of individual measurement error. Elaborate on these.
Idiosyncratic = language/ mood/ fatigue/ memory
Generic = lying to yourself or others (faking: social desirability, impression formation and self-deception)
The assumption that error (eg test-retest) is random, or 'noise', is a questionable assumption in psychological testing because… This is a problem with _____ testing in CTT.
Psychological circumstances may have changed between the initial and the following test. This is a problem with reliability testing in CTT.
The multiplicity problem with reliability testing in classical test theory (CTT) refers to:
Interactions are multiplicative effects, eg biological factors interacting with psychological well-being. However, CTT asserts that all errors (variation) add to the 'true score', rather than multiplying with it.
The concept of a true score relies on the assumption that a (psychological) construct exists.
This is an example of an issue with which theory?
CTT (classical test theory)
Data metrics: variables create columns, people create rows. This can be transposed into Q metrics, so that what?
Therefore, analysing individual sources of measurement error is also known as _ analysis.
People create the columns and variables create the rows; hence Q analysis.
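The R-to-Q transposition can be shown with a toy data matrix (the responses are invented):

```python
# R metric: rows are people, columns are variables (items).
# Transposing gives the Q metric: rows are variables, columns are people,
# so analyses compare people rather than items.
data = [
    [3, 5, 4],  # person A's responses to items 1-3
    [2, 5, 1],  # person B's responses to items 1-3
]
q_metric = [list(col) for col in zip(*data)]
print(q_metric)  # [[3, 2], [5, 5], [4, 1]]
```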
Metrics are always which type of data?
Metrics are always quantitative data
Endophenotypic error is another word for which type of idiosyncratic individual measurement error in psychometric testing?
Endophenotypic error is another word for memory error
Acquiescence and nay-saying bias, 2 of the generic individual sources of measurement error exposed using Q analysis, refer to:
Acquiescence bias refers to agreeing with what is suggested in the test, whereas nay-saying bias is the opposite, disputing everything suggested.
Random responses and mid-point or extreme responses (floor/ ceiling) are all examples of which type of Q-analysis source of individual measurement error?
They are all examples of generic sources of individual measurement error
Content-related, format-related and administration-related errors are all types of which type of measurement error (also known as R-analysis)?
Content-related, format-related and administration-related errors are all types of item/scale measurement error (rather than individual/response error)
The degree of homogeneity in responses to scale-items which measure the same construct reflects which type of reliability?
Internal consistency reliability
Cronbach’s alpha coefficient is a _____ coefficient.
rα= 0 means that only ____ variance is present in the measurement, whereas rα= 1 means that only true scores are present
Cronbach’s alpha coefficient is a reliability coefficient.
rα= 0 means that only error variance is present in the measurement, whereas rα= 1 means that only true scores are present
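Cronbach's alpha can be computed from the item variances and the variance of the total scores: alpha = k/(k-1) * (1 - sum of item variances / variance of totals). A sketch with invented item responses:

```python
import statistics

def cronbach_alpha(items):
    """items: list of per-item score lists, all over the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each person's total
    item_vars = sum(statistics.variance(i) for i in items)
    return k / (k - 1) * (1 - item_vars / statistics.variance(totals))

# Hypothetical 3-item scale answered by 5 people; items that rise and fall
# together (internal consistency) yield a high alpha.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 5],
    [4, 2, 5, 1, 4],
]
print(round(cronbach_alpha(items), 2))  # high: the items covary strongly
```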
Test-retest reliability is also known as ____ reliability
Test-retest reliability is also known as temporal reliability
Dropouts / non-response rates (bias)
Temporal instability of constructs
Optimal time-interval
These are all issues which could affect the ____ reliability coefficient
temporal
Inter-rater reliability, which is measured using Cohen's Kappa coefficient, refers to which process involving experts?
Inter-rater reliability (Cohen's Kappa coefficient) refers to the process of at least 2 experts independently rating the same test measures. Their ratings are then compared for agreement.
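Cohen's Kappa corrects the observed agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A sketch with two hypothetical raters:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters' categorical ratings."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: probability both raters pick a category independently
    expected = sum(c1[cat] * c2[cat] for cat in c1) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical clinicians classifying the same 10 cases
r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "no"]
r2 = ["yes", "yes", "no", "no", "no", "no", "yes", "yes", "yes", "no"]
print(round(cohens_kappa(r1, r2), 2))
```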
An index of the average degree of random error is known as _ _ _
SEM (standard error of measurement) is an index of the average degree of random error
± 1 SEM, ± 2 SEM, ± 3 SEM refer to standard error measurements from the observed (mean) score, which represent confidence intervals that the true score sits within those SEMs. What are the intervals?
There is a ~68% chance that the true score lies within ±1 SEM of the observed (mean) score, a ~95% chance that it lies within ±2 SEM, and a ~99.7% chance that it lies within ±3 SEM.
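The SEM is commonly computed as SD * sqrt(1 - r), and the intervals follow the 68-95-99.7 rule of thumb. A sketch with made-up values (an IQ-like test with SD 15 and reliability 0.91):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD of scores times sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD 15, reliability 0.91, so SEM = 15 * 0.3 = 4.5
s = sem(15, 0.91)
obs = 110  # an individual's observed score
for k, conf in ((1, "~68%"), (2, "~95%"), (3, "~99.7%")):
    print(f"{conf}: {obs - k * s:.1f} to {obs + k * s:.1f}")
```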
What is the difference between validity and reliability?
Validity represents the degree to which scores represent the variable being measured, whereas reliability refers to the consistency of that measurement across time (temporal), test items and researchers.