Reliability of a test Flashcards
dependability or consistency of the instrument or scores obtained by the same person when re-examined with the same test on different occasions, or with different sets of equivalent items
Reliability
index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance
Reliability Coefficient
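The ratio defined above can be sketched numerically. A minimal Python illustration, using made-up variance figures (80 and 20 are hypothetical, not from the cards):

```python
# Reliability coefficient = true-score variance / total variance.
# The variance values below are hypothetical, chosen for illustration.
true_variance = 80.0    # variance due to real differences among testtakers
error_variance = 20.0   # variance due to random measurement error
total_variance = true_variance + error_variance

reliability_coefficient = true_variance / total_variance
print(reliability_coefficient)  # 0.8
```

With these numbers, 80% of the observed score variance reflects true differences, so the reliability coefficient is .80.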
score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also the error
Classical Test Theory (True Score Theory)
refers to the component of the observed test score that does not have to do with the testtaker’s ability
Error
Factors that contribute to consistency
stable attributes
Factors that contribute to inconsistency
characteristics of the individual, test, or situation, which have nothing to do with the attribute being measured, but still affect the scores
Goals of Reliability:
EEDT
✓ Estimate errors
✓ Devise techniques to improve testing and reduce errors
useful in describing sources of test score variability
Variance
variance from true differences
True Variance
variance from irrelevant random sources
Error Variance
all of the factors associated with the process of measuring some variable, other than the variable being measured
Measurement Error
- difference between the observed score and the true score
Measurement Error
Source of error variance that refers to variation among items within a test as well as to variation among items between tests
- The extent to which testtaker’s score is affected by the content sampled on a test and by the way the content is sampled is a source of error variance
Item Sampling/Content Sampling
Source of error variance involving the testtaker’s motivation or attention, the testing environment, etc.
Test Administration
Source of error variance that can be minimized by employing objective-type items amenable to computer scoring of well-documented reliability
Test Scoring and Interpretation
source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in measurement process (e.g., noise, temperature, weather)
Random Error
source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
- has a consistent effect on the true score
- the SD does not change, but the mean does
Systematic Error
________ refers to the proportion of total variance attributed to true variance
Reliability
The _____ the proportion of the total variance attributed to true variance, the ________ the test
greater - more reliable
___________ may increase or decrease a test score by varying amounts; consequently, the consistency of the test score, and thus the reliability, can be affected
Error variance
Error: Time Sampling
Test-Retest Reliability
an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the test
Test-Retest Reliability
appropriate when evaluating the reliability of a test that purports to measure an enduring and stable attribute such as personality trait
- established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores
Test-Retest Reliability
the longer the time that passes between administrations, the greater the likelihood that the reliability coefficient will be low
Test-Retest Reliability
happens when the test-retest interval is short: the second administration is influenced by the first because testtakers remember or practiced the previous test = inflated correlation/overestimation of reliability
Carryover Effects
scores on the second session are higher due to their experience of the first session of testing
Practice Effect
test-retest with a ______ interval might be affected by other extraneous factors, thus resulting in a _____ correlation
longer - low
problem of absences in the second session (solution: remove the first-session tests of the absent testtakers)
Mortality
statistical tools for Test-Retest Reliability
Pearson R, Spearman Rho
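As a sketch of how a test-retest coefficient would be computed with Pearson R, assuming two hypothetical score lists for the same five people (the numbers are illustrative, not from the cards):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

first = [10, 12, 9, 15, 14]    # hypothetical first-administration scores
second = [11, 13, 9, 14, 15]   # same people retested later
print(round(pearson_r(first, second), 3))
```

A high positive correlation between the two administrations indicates good test-retest reliability; here the two orderings agree closely, so r is near 1.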
Error: Item Sampling (immediate), Item Sampling changes over time (delayed)
Parallel Forms/Alternate Forms Reliability
established when at least two different versions of the test yield almost the same scores
- has the most universal applicability
Parallel Forms/Alternate Forms Reliability
each form of the test, the means, and the variances, are EQUAL; same items, different positionings/numberings
Parallel Forms
simply a different version of a test that has been constructed so as to be parallel
Alternate Forms
- technique to avoid carryover effects for parallel forms, by using different sequence for groups
- can be administered on the same day or different time
Counterbalancing
most rigorous and burdensome, since test developers create two forms of the test
- main problem: difference between the two tests
- test scores may be affected by motivation, fatigue, or intervening events
- means and variances of the observed scores must be equal for the two forms
- Statistical Tools: Pearson R or Spearman Rho
Parallel Forms/Alternate Forms Reliability
used when tests are administered once - consistency among items within the test - measures the internal consistency of the test which is the degree to which each item measures the same construct
Internal Consistency (Inter-Item Reliability)
Error: Item Sampling Homogeneity
Internal Consistency (Inter-Item Reliability)
measurement for unstable traits
- if all items measure the same construct, then it has a good internal consistency
Internal Consistency (Inter-Item Reliability)
if a test contains items that measure a single trait (unifactorial)
Homogeneity
degree to which a test measures different factors (more than one factor/trait)
- Heterogeneity
______ homogenous = _____ inter-item consistency
more - higher
used for inter-item consistency of dichotomous items (intelligence tests, personality tests with yes or no options, multiple choice); unequal variances, dichotomously scored
KR-20
used if all the items have the same degree of difficulty (speed tests); equal variances, dichotomously scored
KR-21
used when two halves of the test have unequal variances and on tests containing non-dichotomous items, unequal variances
Cronbach’s Coefficient Alpha
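A minimal sketch of the coefficient alpha computation, using a hypothetical 5-person, 4-item response matrix. With 0/1 (dichotomous) data like this, the same formula reduces to KR-20:

```python
def cronbach_alpha(scores):
    """Coefficient alpha: (k / (k-1)) * (1 - sum(item variances) / total variance).
    `scores` is a list of examinees, each a list of item scores."""
    k = len(scores[0])          # number of items

    def var(values):            # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    item_vars = [var([person[i] for person in scores]) for i in range(k)]
    total_var = var([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical dichotomous (0 = wrong, 1 = right) responses
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))  # 0.8
```

Higher alpha means the items vary together (measure the same construct); values near zero mean the items behave independently.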
measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
Average Proportional Distance
obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered ONCE
Split Half Reliability
Error: Item sample: Nature of Split
Split-Half Reliability
useful when it is impractical or undesirable to assess reliability with two tests or to administer a test twice
- cannot just divide the items in the middle because it might spuriously raise or lower the reliability coefficient, so just randomly assign items or assign odd-numbered items to one half and even-numbered items to the other half
Split-Half Reliability
allows a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test, as if each half had been the length of the whole test (assuming the halves have equal variances)
Spearman-Brown Formula
estimates how many more items are needed in order to achieve the target reliability
Spearman-Brown Prophecy Formula
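Both uses of the formula can be sketched in a few lines. The r values below are hypothetical:

```python
def spearman_brown(r, n):
    """Predicted reliability when the test length is changed by factor n."""
    return n * r / (1 + (n - 1) * r)

def length_factor_needed(r, r_target):
    """Prophecy form: how many times longer the test must be to reach r_target."""
    return r_target * (1 - r) / (r * (1 - r_target))

# Correcting a hypothetical split-half correlation of .70
# (each half is only half the full length, so n = 2):
print(round(spearman_brown(0.70, 2), 3))           # 0.824
# How many times longer must a test with r = .70 be to reach .90?
print(round(length_factor_needed(0.70, 0.90), 2))  # 3.86
```

Note the diminishing returns: doubling a .70-reliable test only lifts it to about .82, while reaching .90 would require nearly quadrupling its length.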
counterpart of the Spearman-Brown formula; based on the ratio of the variance of the differences between the odd and even splits to the variance of the total, combined odd-even score
- if the reliability of the original test is relatively low, the developer could create new items, clarify test instructions, or simplify the scoring rules
- equal variances, dichotomously scored
Rulon’s Formula
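A sketch of that ratio in code, with hypothetical odd-half and even-half scores for five examinees:

```python
def rulon(odd_scores, even_scores):
    """Rulon's split-half reliability: 1 - Var(odd - even) / Var(total)."""
    def var(values):            # population variance
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / len(values)

    diffs = [o - e for o, e in zip(odd_scores, even_scores)]
    totals = [o + e for o, e in zip(odd_scores, even_scores)]
    return 1 - var(diffs) / var(totals)

# Hypothetical half-test scores
odd = [5, 7, 4, 8, 6]
even = [6, 7, 5, 7, 5]
print(round(rulon(odd, even), 3))  # 0.833
```

The intuition: if the two halves agree, the odd-even difference scores barely vary, the ratio is small, and the reliability estimate approaches 1.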
Error: Scorer Difference
Inter-Scorer Reliability
the degree of agreement or consistency between two or more scorers with regard to a particular measure
- used for coding nonverbal behavior
- observer differences
Inter-Scorer Reliability
determines the level of agreement between TWO or MORE raters when the method of assessment is measured on a CATEGORICAL SCALE
Fleiss Kappa
two raters only
Cohen’s Kappa
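A minimal sketch of Cohen's kappa for two raters, with hypothetical categorical ratings; kappa corrects the observed agreement for the agreement expected by chance:

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(r1)
    labels = set(r1) | set(r2)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    p_chance = sum((r1.count(c) / n) * (r2.count(c) / n) for c in labels)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical ratings from two scorers
rater1 = ["yes", "yes", "no", "yes", "no", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.333
```

Here the raters agree on 4 of 6 cases (.67 observed), but chance alone would produce .50 agreement, so kappa is only .33.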
two or more raters; based on observed disagreement corrected for disagreement expected by chance
Krippendorff’s Alpha
Tests designed to measure one factor _____ are expected to have _____ of internal consistency and vice versa
(Homogenous) - high degree
trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experience
Dynamic
barely changing or relatively unchanging
Static
– if the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower
Restriction of range or Restriction of variance
when the time limit is long enough to allow testtakers to attempt all items
Power Tests
generally contains items of a uniform level of difficulty with a time limit
Speed Tests
Reliability should be based on performance from two independent testing periods using _______ and _________ or split-half-reliability
test-retest - alternate-forms
designed to provide an indication of where a testtaker stands with respect to some variable or criterion
Criterion-Referenced Tests
As individual differences ______, a traditional measure of reliability would also_______, regardless of the stability of individual performance
decrease - decrease
everyone has a “true score” on test
Classical Test Theory
genuinely reflects an individual’s ability level as measured by a particular test
True Score
estimate the extent to which specific sources of variation under defined conditions are contributing to the test scores
Domain Sampling Theory
_______ is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
Test reliability
based on the idea that a person’s test scores vary from testing to testing because of the variables in the testing situations
Generalizability Theory
test situation
Universe
number of items in the test, amount of review, and the purpose of test administration
Facets
According to ____________, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained (Universe score)
Generalizability Theory
developers examine the usefulness of test scores in helping the test user make decisions
Decision Study
the probability that a person with X ability will be able to perform at a level of Y in a test
Focus: item difficulty
Item Response Theory
a system of assumption about measurement and the extent to which item measures the trait
Latent-Trait Theory
The ______ is used to focus on the range of item difficulty that helps assess an individual’s ability level
computer
attribute of not being easily accomplished, solved, or comprehended
Difficulty
degree to which an item differentiates among people with higher or lower levels of the trait, ability etc.
Discrimination
can be answered with only one of two alternative responses
Dichotomous
3 or more alternative responses
Polytomous
provides a measure of the precision of an observed test score
Standard Error of Measurement
Standard deviation of errors as the ________ of error
basic measure
Index of the amount of inconsistency or the amount of the ______ error in an individual’s score
expected
The higher the reliability, the ______
lower Standard Error of Measurement
a range or band of test scores that is likely to contain true scores
Confidence Interval
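The SEM and the confidence interval built from it can be sketched together. The scale below is hypothetical (an IQ-style score with SD = 15 and reliability = .91):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(15, 0.91)          # hypothetical SD and reliability
observed = 100             # hypothetical observed score

# 95% confidence interval around the observed score (z = 1.96)
low, high = observed - 1.96 * s, observed + 1.96 * s
print(round(s, 2), round(low, 1), round(high, 1))  # 4.5 91.2 108.8
```

This shows the inverse relationship on the card above: with reliability .91 the SEM is only 4.5 points, so the 95% band around a score of 100 spans roughly 91 to 109; a less reliable test would widen that band.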
can aid a test user in determining how large a difference should be before it is considered statistically significant
Standard Error of the Difference
refers to the standard error of the difference between the predicted and observed values
Standard Error of Estimate
a range or band of test scores that is likely to contain the true score
- tells us the relative position of the true score within the specified range and confidence level
- the larger the range, the higher the confidence
Confidence Interval
If the reliability is low, you can increase the number of _____ or use factor analysis and item analysis to increase internal consistency
items
nature of the test will often determine the reliability metric
Reliability Estimates
detects true positive
Test Sensitivity
detects true negative
Test Specificity
proportion of the population that actually possess the characteristic of interest
Base Rate
no. of available positions compared to the no. of applicants
Selection ratio
one of the Four Possible Hit and Miss Outcomes– predict success that does occur
True Positives (Sensitivity)
one of the Four Possible Hit and Miss Outcomes – predict failure that does occur
True Negatives (Specificity)
one of the Four Possible Hit and Miss Outcomes – predicted success that does not occur
False Positive (Type 1)
one of the Four Possible Hit and Miss Outcomes – predicted failure but succeed
False Negative (Type 2)
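The four outcomes above form a 2x2 table from which sensitivity, specificity, and the base rate are computed. A sketch with hypothetical counts for a selection test:

```python
# Hypothetical counts for 100 applicants
true_positive = 40    # predicted success, and success occurred
false_negative = 10   # predicted failure, but succeeded (Type 2)
true_negative = 35    # predicted failure, and failure occurred
false_positive = 15   # predicted success, but failed (Type 1)

total = true_positive + false_negative + true_negative + false_positive

sensitivity = true_positive / (true_positive + false_negative)   # detects true positives
specificity = true_negative / (true_negative + false_positive)   # detects true negatives
base_rate = (true_positive + false_negative) / total             # proportion who actually succeed

print(sensitivity, specificity, base_rate)  # 0.8 0.7 0.5
```

Sensitivity is computed over all who actually succeeded, specificity over all who actually failed; the base rate depends only on the population, not on the test's predictions.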