Chapter 3: Reliability and Validity Flashcards

1
Q

What is reliability?

A

The stability or consistency of a test.
Tells us whether the test provides good measurement.
This matters because tests are used in decision-making.

2
Q

Why is consistency in a test important?

A

Because an inconsistent test means our test doesn't provide a good measure of stable traits or attributes; basically, we could end up making bad decisions.

3
Q

What is classical test theory?

A

A psychometric theory of measurement
Most commonly used approach to measurement in psychology.

x = T + e
x is observed score on the test
T is the true score
e is the error of measurement
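
To make the equation concrete, here is a minimal Python sketch (not from the text; all numbers invented) simulating the decomposition. It also previews the idea, stated in card 24, that reliability is the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up true scores for 1,000 examinees plus random measurement error.
T = rng.normal(50, 10, size=1000)   # true scores, SD = 10
e = rng.normal(0, 5, size=1000)     # error: mean 0, uncorrelated with T
x = T + e                           # observed scores

# Reliability under CTT = true-score variance / observed-score variance.
print(round(T.var() / x.var(), 2))  # close to 10**2 / (10**2 + 5**2) = 0.80
```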

4
Q

Classical Test Theory's equation for error

A

e = x - T

5
Q

Assumptions of Classical Test Theory

A
  1. The mean error of measurement is 0
  2. True scores and errors are uncorrelated
  3. Errors on different measures are uncorrelated
6
Q

Test-retest method

A

Administer the same test to the same group of people at two different points in time and correlate the two sets of scores.

7
Q

Methods of Estimating Reliability

A
  1. Test-retest method
  2. Parallel forms
  3. Split-half methods
  4. Internal consistency methods
8
Q

Test-Retest Reliability

A

Give the same group of people the same test at two different points in time, then correlate the two sets of scores by computing a correlation coefficient (this reliability coefficient is better thought of as a stability coefficient).
Measures the stability of scores over time.

9
Q

Pearson Product Moment Correlation (r)

A

The most common correlation coefficient.
Used when two sets of scores are continuous and normally distributed.
Correlation coefficients can vary from 0 (no relationship) to +1 or −1 (perfect positive or negative relationship).

A coefficient of .70 or above is generally needed for acceptable reliability.
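
A minimal Python sketch of a test-retest correlation (the scores are invented for illustration):

```python
import numpy as np

# Invented scores for the same six people tested at two points in time.
time1 = np.array([12, 18, 15, 22, 9, 17])
time2 = np.array([14, 17, 16, 21, 10, 18])

r = np.corrcoef(time1, time2)[0, 1]  # Pearson product-moment r
print(round(r, 2))  # should be .70 or above to call the scores reliable
```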

10
Q

Test-retest methods: Error & Issues

A

Error is due solely to measurement error.
Some issues:
– Carryover effects (depend on the interval between tests)
– Memory
– Stability of the construct
– Fatigue
– Reactivity (people may learn about the topic between tests)
– Motivation (people may not be motivated when taking the test a second time)
– Choosing a suitable interval between tests is difficult: wait too long and the person may have changed, but retest too soon and there will be carryover effects

Problems with the method:
– Time-consuming
– Expensive

11
Q

Alternate Forms Reliability (aka equivalent forms)

A

Give a test to a group of people, then after a suitable amount of time give them a different form of the test, and correlate the scores.
The forms must be administered either at different times or in succession.
To counterbalance order effects, half take form A then B, and half take B then A.

12
Q

Alternate forms methods: Error & Issues

A

Error is due to test content and perhaps the passage of time (if the forms are not given back to back).
Some issues:
– Each form needs the same number and type of items
– Item difficulty must be the same on each form
– Variability of scores must be the same on each form
– Item sampling
– Temporal aspects

Developing an equivalent alternative form can be extremely time-consuming and sometimes impossible.
Example: it is easy to come up with equivalent tests of math knowledge, but it is nearly impossible to come up with two equal tests that assess depression, because only a limited number of items relate to depression while there are infinitely many math questions you can ask.

13
Q

Alternate forms methods: Bonuses

A

Bonuses:
– Shorter interval between administrations
– Carryover effects are lessened
– Reactivity is partially controlled

14
Q

Split Half Methods

A

Give the test to a group of people, split it in half (usually odd vs. even items), then correlate the scores on the two halves.
Concerned with internal consistency.
Determines to what extent the test is composed of homogeneous items.
Some psychologists think tests should be homogeneous; others don't care whether items are homogeneous or heterogeneous, only how well the test works.

15
Q

The reliability of the split half method

A

From the viewpoint of item sampling (not temporal stability), the longer the test, the higher its reliability.

The Spearman-Brown formula (allows us to estimate the reliability of the entire test from a split-half administration):

estimated r = [ k (obtained r) ] / [ 1 + (k − 1)(obtained r) ]

k is the number of times the test is lengthened or shortened.

For split-half administration, k is 2.
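
A small Python sketch of the Spearman-Brown formula above (the .60 split-half correlation is a made-up input):

```python
def spearman_brown(obtained_r, k=2):
    """Projected reliability when a test is lengthened k times (k = 2 for split halves)."""
    return (k * obtained_r) / (1 + (k - 1) * obtained_r)

# A split-half correlation of .60 projects to a full-test reliability of .75.
print(round(spearman_brown(0.60), 2))       # 0.75
print(round(spearman_brown(0.60, k=4), 2))  # 0.86: lengthening again helps less
```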

16
Q

Split-half methods: Error & Issues

A

Error is due to differences in item content between the halves of the test.
Some issues:
– Deciding which split-half reliability estimate to use

17
Q

Split-half methods: bonuses

A

Bonus:
– Carryover, reactivity, and time are minimized

18
Q

The Rulon Formula

A

An alternative to the Spearman-Brown formula:

estimated r = 1 − (variance of differences / variance of total scores)

Four scores are generated for each person: odd items, even items, difference (odd − even), total (odd + even).

If scores were perfectly consistent, there would be no variance in the differences, so the "variance of differences" would be 0 and r would equal 1.
The ratio of the two variances reflects the proportion of error variance; when this is subtracted from 1, we get the proportion of "true" variance, i.e., the reliability.
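
A minimal Python sketch of the Rulon formula, with invented odd/even half scores:

```python
import numpy as np

def rulon(odd_scores, even_scores):
    """Rulon estimate: 1 - var(differences) / var(totals)."""
    odd = np.asarray(odd_scores, dtype=float)
    even = np.asarray(even_scores, dtype=float)
    diff = odd - even      # difference score per person
    total = odd + even     # total score per person
    return 1 - diff.var(ddof=1) / total.var(ddof=1)

# Made-up odd- and even-half scores for five people.
print(round(rulon([10, 12, 8, 14, 9], [11, 12, 7, 13, 10]), 2))  # 0.95
```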

19
Q

Why do we want variability, and how do we increase it?

A

Variability of scores among individuals, that is, individual differences, makes statistical calculations such as the correlation coefficient possible.

For greater variability, increase the range of responses and create a test that is neither too easy nor too difficult.
Also increase the number of items: a 10-item true-false scale can theoretically yield scores from 0 to 10, but a 25-item scale can yield scores from 0 to 25; that, of course, is precisely the message of the Spearman-Brown formula.

20
Q

Internal Consistency Methods

A

Examines the items.
Give the test to a group, compute the correlations among all items, compute the average of these intercorrelations, then use a formula such as coefficient alpha to estimate the reliability.

21
Q

Two assumptions of Internal Consistency Method

A

First, interitem reliability, like split-half reliability, is meaningful only if the test is made up of homogeneous items that all assess the same thing.

Second, if each item were perfectly reliable, we would obtain only two test scores.
Example: on a 100-item test you would score either 0 or 100.

In the real world, items are not perfectly reliable or consistent with each other, which results in individual differences and variability.

22
Q

Types of internal consistency measures

A

Estimates the reliability of a test based on the number of items in the test (k) and the average intercorrelation among test items.

– Coefficient alpha: calculates the mean reliability coefficient one would obtain from all possible split halves.
Most widely used method of internal consistency.
Requires only one test administration.
Included in most statistical packages.
Used with multi-point items, e.g., the response "never" is given 5 points and "occasionally" is given 4.
A value of .80 or above is suggested for reliability (sometimes too harsh on short tests, because reliability increases as the number of items increases).

– Kuder-Richardson Formula 20 (K-R 20): used with dichotomous items (right/wrong, true/false, yes/no).
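
A Python sketch of coefficient alpha in its usual variance-based computational form (an assumption here, since the card describes alpha via split halves). The 0/1 data are invented, and on dichotomous items like these alpha reduces to K-R 20:

```python
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = people, columns = items.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Made-up 4-item dichotomous test answered by five people.
data = [[1, 1, 1, 0],
        [1, 0, 1, 1],
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0]]
print(round(cronbach_alpha(data), 2))  # 0.74
```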

23
Q

Takeaway points on reliability

A

– No such thing as "the" reliability; different methods assess consistency from different perspectives
– Reliability coefficients apply to data, NOT the instrument
– Any reliability is only an estimate of consistency
– Depends more on what one is trying to do with the test scores than on the scores themselves

24
Q

What do all the methods of reliability stem from?

A

All stem from the notion that a test score is composed of a "true" score plus an "error" component, and that reliability reflects the ratio of true-score variance to total (observed) score variance; if reliability were perfect, the error component would be zero.

25
Q

Generalizability Theory (G theory)

A

– Developed by Cronbach (1972)
– CTT is often referred to as the "parent" of G theory
– Error can come from a variety of sources
– Systematically vary the sources of error and study them experimentally

A second approach to reliability.
Does not assume that a person has a single "true" score on intelligence, or that error is basically of one kind; instead it argues that different conditions may result in different scores, and that error may reflect a variety of sources.

26
Q

G Theory: Lyman's (1978) five major sources of error

A

The individual taking the test
The influences of the examiner
The test items
Temporal consistency (intelligence is stable, mood is not)
Situational aspects (noise)

27
Q

CTT vs. G Theory: Main Advantage of G Theory

A

– G theory allows us to disentangle sources of error
– Separates error into systematic error and random error
– Focuses on the types of conditions to which we can expect results to generalize
– Focuses on our ability to generalize from one set of measures to a set of other plausible measures

– CTT, by contrast, has only a random error component

28
Q

Scorer Reliability

A

– Compute a correlation coefficient to indicate the percentage of agreement between scorers
– An objectively scored test can have very high scorer reliability
– If your test cannot be objectively scored, you need to train your scorers
– A subjectively scored test is limited by its scorer reliability
– To improve reliability, use test items that can be objectively scored

29
Q

Rater Reliability

A

– Same as scorer reliability, but now dealing with ratings
– Want to make sure that raters agree above chance
For example, suppose that two faculty members independently read 80 applications to their graduate program and rate each application as “accept,” “deny,” or “get more information.”

30
Q

Interobserver Reliability

A

Determine the level of agreement between the two observers.

Percentage agreement = [ (A + D) / (A + B + C + D) ] × 100

Coefficient kappa = (Po − Pe) / (1 − Pe)
Po is the observed proportion of agreement
Pe is the expected (chance) agreement
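
A Python sketch of both indices, assuming a 2×2 table in which cells A and D are the agreements (all counts invented):

```python
def percent_agreement(a, b, c, d):
    """(A + D) / (A + B + C + D) * 100, where A and D are the agreement cells."""
    return (a + d) / (a + b + c + d) * 100

def cohen_kappa(a, b, c, d):
    """Kappa = (Po - Pe) / (1 - Pe), with Pe derived from the marginals."""
    n = a + b + c + d
    p_o = (a + d) / n
    # Chance agreement: product of "yes" marginals plus product of "no" marginals.
    p_e = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return (p_o - p_e) / (1 - p_e)

# Invented counts: both raters say yes = 40, rater 1 only = 10,
# rater 2 only = 10, both say no = 20.
print(percent_agreement(40, 10, 10, 20))      # 75.0
print(round(cohen_kappa(40, 10, 10, 20), 2))  # 0.47
```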

31
Q

Correction for attenuation

A

– Uses statistical means to estimate the correlation we would obtain if we had perfectly reliable tests.

When reliability is less than perfect, we say there is "noise in the system."
There is a statistical way to remove the "static," called the correction for attenuation:

estimated r = r12 / √(r11 × r22)

estimated r is the "true" correlation between two measures if both the test and the second measure were perfectly reliable
r12 is the observed correlation between the test and the second measure
r11 is the reliability of the test
r22 is the reliability of the second measure
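
A one-function Python sketch of the correction for attenuation (the coefficients are made-up values):

```python
from math import sqrt

def correct_for_attenuation(r12, r11, r22):
    """Estimated 'true' correlation if both measures were perfectly reliable."""
    return r12 / sqrt(r11 * r22)

# Observed r of .40 between a test (reliability .70) and a criterion (reliability .80).
print(round(correct_for_attenuation(0.40, 0.70, 0.80), 2))  # ~0.53
```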

32
Q

Standard error of measurement

A

– Index of the amount of inconsistency or error expected in an individual’s test score
– Used to assess reliability from the individual point of view
– By calculating this value, we can estimate the probability of an individual’s score falling within a certain interval
– As the standard deviation decreases and the reliability coefficient increases, the SEM is smaller

Knowing the reliability coefficient for a particular test tells us the stability of the test

If we knew the test-retest reliability was .92 over a 6 month period then we could conclude that the measure is fairly stable over that period of time

The psychometrician is more interested in the test than in the subjects who took the test

The person that uses the test (teacher, clinical psychologist, etc.) cares more about the individual

These people assess reliability from the individual point of view by computing the standard error of measurement (SEM)

If you test someone many times and then take the mean of all their test scores, the mean will be the “true” score because error deviations are assumed to cancel each other out (for every lucky guess there is an unlucky guess)
It is usually not possible to have someone take the same test enough times to determine the SEM directly, so we use a formula to estimate it:

SEM = SD √(1 − r11)
SD is the standard deviation of the test scores
r11 is the reliability coefficient

The smaller the SD (and hence the SEM), the narrower the range within which the "true" score probably lies.
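
A minimal Python sketch of the SEM formula; the SD of 15 and reliability of .91 are invented, IQ-style values:

```python
from math import sqrt

def sem(sd, r11):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - r11)

# Made-up IQ-style values: SD = 15, reliability = .91.
s = sem(15, 0.91)
print(round(s, 1))  # 4.5
# Roughly 68% of the time the true score falls within 1 SEM of the observed score.
print(f"observed 100 -> true score likely in [{100 - s:.1f}, {100 + s:.1f}]")
```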

33
Q

SE of differences

A

– Compare test scores from two different measures
– Uses the SEM for both measures
– Calculate how different the scores need to be to judge one score as better than the other

If a student received a 108 on a math test and a 112 on a spelling test, we cannot conclude that she did better on the spelling test, because there is "noise" (unreliability) in both tests: her actual spelling score might be lower (e.g., 107) and her actual math score higher (e.g., 113).
To compare her two scores within a reliability framework, we can use the standard error of differences (SED), which reflects how much difference scores deviate on average:

SED = √( SEM1^2 + SEM2^2 )

which, when both tests are on the same scale with the same SD, is equal to

SED = SD √( 2 − r11 − r22 )

SEM1 and r11 refer to the first test; SEM2 and r22 refer to the second test.
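
A Python sketch of both forms of the SED, using invented values; note they agree when the two tests share the same SD:

```python
from math import sqrt

def sed_from_sems(sem1, sem2):
    """Standard error of differences from the two tests' SEMs."""
    return sqrt(sem1 ** 2 + sem2 ** 2)

def sed_from_reliabilities(sd, r11, r22):
    """Equivalent form when both tests are on the same scale with the same SD."""
    return sd * sqrt(2 - r11 - r22)

# Made-up values: both tests have SD = 15; reliabilities are .91 and .84.
sem1 = 15 * sqrt(1 - 0.91)   # 4.5
sem2 = 15 * sqrt(1 - 0.84)   # 6.0
print(round(sed_from_sems(sem1, sem2), 1))               # 7.5
print(round(sed_from_reliabilities(15, 0.91, 0.84), 1))  # 7.5
# A 4-point gap like 108 vs. 112 is far less than 7.5, so neither score is "better."
```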

34
Q

Reliability of Difference Scores

A

If we are more interested in the relationship of pairs of scores rather than individual scores we must inquire into the reliability of difference scores:

r-difference = [ 1/2 (r11 + r22) − r12 ] / [ 1 − r12 ]
r11 is the reliability of the first test
r22 is the reliability of the second test
r12 is the correlation between the two tests

As the correlation between the two tests (r12) approaches the average of their reliabilities, the reliability of the difference scores drops rapidly.
The point here is that we need to be very careful when we make decisions based on difference scores. We should also reiterate that to compare the difference between two scores from two different tests, we need to make sure that the two scores are on the same scale of measurement; if they are not, we can of course change them to z scores, T scores, or some other scale.
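
A Python sketch of the formula, looping over invented correlations to show how difference-score reliability collapses as r12 rises:

```python
def difference_score_reliability(r11, r22, r12):
    """Reliability of difference scores between two tests."""
    return (0.5 * (r11 + r22) - r12) / (1 - r12)

# Two tests, each with made-up reliability .90: the difference score's
# reliability collapses as the correlation between the tests (r12) rises.
for r12 in (0.0, 0.4, 0.8):
    print(r12, round(difference_score_reliability(0.90, 0.90, r12), 2))
# 0.0 -> 0.9, 0.4 -> 0.83, 0.8 -> 0.5
```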

35
Q

What is Validity?

A

– Does the test measure what it is intended to measure?
– Is it valid for the purpose we intend to use it for?
– We discuss different types of validity, but it all falls under one umbrella.
– We want to make an overall evaluation of the interpretations of the test
Validity is best thought of as a unitary process with somewhat different but related facets.

36
Q

What are all the types of validity?

A

Content validity
– face validity
Criterion validity
– concurrent validity
– predictive validity
Construct validity
– convergent validity
– divergent validity

37
Q

Content Validity

A

Content: How well does the test measure, in a representative way, every element of the attribute?
Content validity refers to the question of whether the test adequately covers what is being measured and is particularly relevant to achievement tests.
Need to make sure the test is truly representative and relevant to what we are testing
– Two ways to assess
– Subjective – ask experts (SMEs) to judge relevance and representativeness of the items
– Empirical – use statistical methods

Consider a test in this class that will cover the first five chapters. Should there be an equal number of questions from each chapter, or should certain chapters be given greater preeminence? Certainly, some aspects are easier to test, particularly in a multiple-choice format. But would such an emphasis reflect “laziness” on the part of the instructor, rather than a well thought out plan designed to help build a valid test?

Taxonomies
Help achieve content validity by carefully planning the test's construction
Mostly used in educational tests

Face Validity:
When a test appears valid to the people taking it
Concerned with how test takers perceive the attractiveness and appropriateness of a test
A test could “appear” valid, but may not be valid

38
Q

Criterion Validity

A

Criterion: How well does a test predict some external criterion measure?
– Match test scores with an independent criterion
– Types of criteria: Contrasted groups, GPA, Worker performance ratings
If a test is said to measure intelligence, we must show that scores on the test parallel or are highly correlated to intelligence as measured in some other way – that is, a criterion of intelligence
A test can never be better than the criterion it is matched against, and the world simply does not provide us with clear, unambiguous criteria.

Criteria:
Contrasted groups (one type of criterion) are groups that differ significantly on the particular domain.
For example, in validating an academic achievement test we could administer the test to two groups of college students, matched on relevant variables such as age and gender, but differing on grade point average, such as honors students vs. those on academic probation.

Two Types
– Predictive – test given first and criterion scores measured at a later time
Example: we want the SAT to predict college GPA. We would need to administer the test to an unselected sample, wait for them all to finish college, and then correlate scores with GPA. It is unlikely we could get an unselected group (the sample will probably be more homogeneous), and most researchers don't want to wait that long.
– Concurrent – test given at the same time that the criterion scores are collected
When the criterion and test scores are collected at the same time. The main purpose of such concurrent validation is to develop a test as a substitute for a more time-consuming or expensive assessment procedure.

39
Q

Construct Validity

A

Construct: How well does a test measure the attribute it claims?
– What makes it different is that it is a process that takes place within a theoretical framework
If we wish to validate a test of intelligence, we must be able to specify in a theoretical manner what intelligence is, and we must be able to hypothesize specific outcomes.
– Look for the correspondence between the theory and the observed data
Construct validity is an umbrella term that encompasses any information about a particular test; both content and criterion validity can be subsumed under this broad term.

Test scores are a function of three aspects:
1. The test items
2. The person responding
3. The context
The inferences made from the scores are very important.

Although we speak of validity as a property of a test, validity actually refers to the inference that is made from the test scores
we infer how well the person will perform on a future task (predictive or criterion validity)
whether the person possesses certain knowledge (content validity)
or a psychological construct or characteristic related to an outcome, such as spatial intelligence related to being an engineer (construct validity)

Two Types:
Convergent Validity – High correlation with another test measuring the same construct
D. P. Campbell and Fiske (1959) and D. P. Campbell (1960) proposed that to show construct validity, one must show that a particular test correlates highly with variables, which on the basis of theory, it ought to correlate with
Divergent Validity – Low correlation with another test measuring a different construct
They also argued that a test should not correlate significantly with variables that it ought not to correlate with
Multitrait-multimethod matrix: assesses both convergent and discriminant validity

40
Q

Methods for Assessing Construct Validity

A

Group differences

The statistical notion of correlation and its derivative, factor analysis (a statistical procedure designed to elucidate the basic dimensions of a data set)

Internal consistency of the test
Here we typically try to determine whether all of the items in a test are indeed assessing the particular variable, or whether performance on the test might be affected by some other variable.

Studies of change over occasions
Is there a change in test scores over time, or with different examiners?

Studies of process
Focuses on observing how subjects perform on a test, rather than just what they score.

41
Q

What is the relationship between validity and reliability?

A

Another way that reliability and validity are related is that a test cannot be valid if it is not reliable. In fact, the maximum validity coefficient between two variables is equal to:

maximum validity = √(r11 × r22)

where r11 again represents the reliability coefficient of the first variable (for example, a test)
and r22 the reliability coefficient of the second variable (for example, a criterion).
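
A quick Python check of this ceiling, with invented reliabilities:

```python
from math import sqrt

def max_validity(r11, r22):
    """Ceiling on the validity coefficient given the two reliabilities."""
    return sqrt(r11 * r22)

# Made-up values: test reliability .81, criterion reliability .64.
print(round(max_validity(0.81, 0.64), 2))  # 0.72 at most
```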

42
Q

Interpreting a validity coefficient

A

– There is no standard value to surpass
– Determine if the validity is statistically and/or practically significant
– Squaring the validity coefficient gives an estimate of how much overlap there is between the test and the criterion
– Use the correlation of the test scores to make predictions of the criterion
The purpose of administering a test such as the SAT is to make an informed judgment about whether a high-school senior can do college work, and to predict what that person’s GPA will be. Such a prediction can be made by realizing that a correlation coefficient is simply an index of the relationship between two variables, a relationship that can be expressed by the equation
Y = bX + a
Y might be the GPA we wish to predict
X is the person’s SAT score
b and a reflect other aspects of our data
– Use an expectancy table
Expectancy tables can be more complex and include more than two variables – for example, if gender or type of high school attended were related to SAT scores and GPA, we could include these variables into our table, or create separate tables.
– Break the test and the criterion down into categories
– Use the standard error of estimate to find the margin of error
In talking about reliability, we talked about “noise in the system,” that is lack of perfect reliability. Similarly with validity we ordinarily have a test that has less than perfect validity, and so when we use that test score to predict a criterion score, our predicted score will have a margin of error. That margin of error can be defined as the SE of estimate which equals:
SE of estimate = SD √(1 − r12^2)
SD is the standard deviation of the criterion scores
r12 is the validity coefficient
If the test had perfect validity (r12 = 1.00), the SE of estimate would be zero; there would be no error, and the predicted criterion score would be correct.
If the test were not valid at all (r12 = 0), the SE of estimate would equal the SD.

In general, validity coefficients are significantly lower than reliability coefficients, because we do not expect substantial correlations between tests and complex real-life criteria.
For instance, many factors other than intelligence determine your grades (student-teacher relationship, motivation, social life, etc.).
A test may correlate significantly with a criterion, but the significance may reflect a very large sample, rather than practical validity.
Even though an r of .40 looks rather large, and is indeed quite acceptable as a validity coefficient, its explanatory power (16%) is rather low – but this is a reflection of the complexity of the world, rather than a limitation of our tests.
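
A Python sketch of the SE of estimate with invented numbers, showing the two limiting cases described above:

```python
from math import sqrt

def se_of_estimate(sd_criterion, r12):
    """Margin of error when predicting a criterion score from a test score."""
    return sd_criterion * sqrt(1 - r12 ** 2)

# Made-up values: criterion (GPA) SD = 0.6, validity coefficient r12 = .40.
print(round(se_of_estimate(0.6, 0.40), 2))  # 0.55, barely below the SD itself
print(se_of_estimate(0.6, 1.0))             # 0.0: perfect validity, no error
# Squaring the validity coefficient gives the overlap: 0.40 ** 2 = .16 (16%).
```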

43
Q

Validity: Bandwidth and Fidelity

A

Cronbach and Gleser (1965) used the term bandwidth to refer to the range of applicability of a test – tests that cover a wide area of functioning such as the MMPI are broad-band tests; tests that cover a narrower area, such as a measure of depression, are narrow-band tests.

These authors also used the term fidelity to refer to the thoroughness of the test. The two aspects interact: given a fixed amount of test material (such as a set number of items), as bandwidth increases, fidelity decreases.

44
Q

Decision Theory

A

Once the test is shown to be valid on a measure, we can use it to predict a criterion. Since no test is 100% valid, our predictions will have errors.

The test and reality produce four categories (illustrated below with a tuberculosis screening test):

Category A consists of individuals who on the test are positive for TB and indeed do have TB. These individuals, from a psychometric point of view, are considered “hits” – the decision based on the test matches the real world.
Category B consists of individuals for whom the test results indicate that the person does not have (is negative for) TB, and indeed they do not have TB – another category that represents “hits.”
Category C consists of individuals for whom the test results suggest that they are positive for TB, but they do not have TB; these are called false positives.
Category D consists of individuals for whom the test results are negative. They do not appear to have TB but in fact they do; thus they are false negatives.

45
Q

Sensitivity

A

Proportion of correctly identified positives
(i.e., how accurately does a test classify a person who has a particular disorder?)
sensitivity = true positives (A) / [ true positives (A) + false negatives (D) ]

46
Q

Specificity

A

Proportion of correctly identified negatives
(i.e., how accurately does a test classify those who do NOT have the particular condition?)
specificity = true negatives (B) / [ true negatives (B) + false positives (C) ]

47
Q

Predictive Value (Efficiency)

A

The ratio of true positives to all positives
predictive value = true positives (A) / [ true positives (A) + false positives (C) ]
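
Putting the three indices together in one Python sketch, using card 44's categories (A = true positives, B = true negatives, C = false positives, D = false negatives; counts invented):

```python
def sensitivity(a, d):
    """True positives / (true positives + false negatives)."""
    return a / (a + d)

def specificity(b, c):
    """True negatives / (true negatives + false positives)."""
    return b / (b + c)

def predictive_value(a, c):
    """True positives / (true positives + false positives)."""
    return a / (a + c)

# Invented screening counts: A = 90 true positives, B = 850 true negatives,
# C = 50 false positives, D = 10 false negatives.
print(sensitivity(90, 10))                 # 0.9
print(round(specificity(850, 50), 2))      # 0.94
print(round(predictive_value(90, 50), 2))  # 0.64
```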

48
Q

What would the ideal test have in terms of sensitivity, specificity, and predictive value?

A

An ideal test would have a high degree of sensitivity and specificity, as well as high predictive value, with a low number of false positives and false negative decisions.

49
Q

Reducing Errors

A

– The more valid the measure or procedure on which decisions are based, the fewer the errors.
– The more comprehensive (larger sample size) the database available on which to make decisions, the fewer the errors.
– Use sequential strategies: first use a cheap test, which will likely yield many false positives, then use a more expensive test to identify the true positives.
– Change the decision rules. Admissions officers would rather admit someone who might fail than reject someone who would have succeeded, so they lower or waive standards.
The problem is that more people will fail: false negatives are reduced, but false positives are increased.
If you do the opposite and raise the standards, false positives decrease and false negatives increase.
– Consider the type of error we are willing to tolerate.
Do you want to let people in who might fail, or keep people out who might have succeeded?
It might be fine to let everyone into college and have some not graduate, but you wouldn't want to let just anyone be an astronaut.
– Selection ratio
One of the issues that affects our decisions and the kinds of error we tolerate; refers to the number of individuals we need to select from the pool of applicants.
If you have 100 scholarships and 100 applicants, you barely need to look at the applications, but if you only have 2 scholarships you can be very demanding (which will probably result in a high number of false negatives).
– Base rate
The naturally occurring frequency of a particular behavior.
When the base rate of the criterion deviates significantly from a 50% split, the use of a test or procedure with only slight or moderate validity can result in increased errors.
– Sample size
Another aspect that influences validity is the size of the sample studied when a test is validated.
Recall that whether a correlation coefficient is statistically significant (different from zero) is a function of sample size: with a small sample of N = 10 we would need a correlation of at least .63 to conclude that the two variables are significantly correlated, but with a large sample of N = 150 the correlation would only need to be .16 or larger.
– Validity considered in terms of generalizability
What we have discussed above might be termed the "classical" view. Currently, validity focuses on validating a test for a specific application, with a specific sample, in a specific setting; it is largely based on theory, and construct validity seems to be rapidly gaining ground as the method of choice.
If we correlated GPA and SAT scores at one school, we would not expect exactly the same correlation at another school. We expect a certain amount of stability of results across studies; when we don't obtain such stability, we need to identify the various sources of the differing results.

50
Q

Validity from an individual point of view: Primary Validity

A

Primary validity is basically similar to criterion validity: how well a test correlates with a criterion, e.g., how well the test predicts how you will do on a job.

If someone publishes a new academic achievement test, we would want to see how well the test correlates with GPA, whether the test can in fact separate honors students from nonhonors students, and so on.

It is called primary because if a test does not have this kind of basic validity, we must look elsewhere for a useful measure.

51
Q

Validity from an individual point of view: Secondary Validity

A

If the evidence indicates that a test has primary validity, then we move on to secondary validity that addresses the psychological basis of measurement of the scale.

To obtain information on secondary validity, on the underlying psychological dimension that is being measured, Gough (1965) suggested four steps:
(1) reviewing the theory behind the test and the procedures and samples used to develop the test
(2) analyzing the item content from a logical-clinical point of view (Is a measure of depression made up primarily of items that reflect low self-esteem?)
(3) relating scores on the measure being considered to variables that are considered to be important, such as gender, intelligence, and socioeconomic status
(4) obtaining information about what high scorers and low scorers on the scale are like psychologically.

Reviewing the theory behind the test makes secondary validity akin to construct and content validity.