Measurement Flashcards
What is validity?
The degree to which a test measures what it is supposed to measure. Unlike reliability, validity does not require that scores be consistent from one occasion to the next. However, just because a measure is reliable, it is not necessarily valid.
Criterion Validity- to measure this, researchers must calibrate it against a known standard or against itself.
Consequential Validity- the positive or negative social consequences of a test. For an assessment to have consequential validity, it must not have unexpected negative social consequences.
Internal consistency- how consistently the items on a test measure a single construct or concept; it requires only a single administration of the test to one group of people, so no time interval is needed.
Define content validity.
Content Validity
Also known as logical validity, content validity is a verification that the method of measurement actually measures what it is expected to measure. It is a type of validity that focuses on how well each question taps into the specific construct in question. Establishing content validity requires recognized subject matter experts to evaluate whether test items assess the defined content, along with more rigorous statistical tests than are used for face validity. The content validity ratio (CVR) and content validity index (CVI) are used for this purpose.
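A minimal sketch in Python of how Lawshe's content validity ratio could be computed for one item, assuming each expert simply rates the item as essential or not; the panel size, votes, and item values below are hypothetical, and treating the scale-level CVI as the mean of item CVRs is only one common convention.

# Lawshe's CVR = (n_e - N/2) / (N/2), where n_e = experts rating the
# item "essential" and N = total number of experts on the panel.
def content_validity_ratio(essential_votes, total_experts):
    half = total_experts / 2
    return (essential_votes - half) / half

print(content_validity_ratio(8, 10))    # hypothetical: 8 of 10 experts say essential -> 0.6

# One common convention for a scale-level CVI: the mean of the item CVRs.
item_cvrs = [0.6, 0.8, 0.4, 1.0]        # hypothetical item-level CVR values
print(sum(item_cvrs) / len(item_cvrs))  # 0.7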
Define Construct Validity.
Construct validity refers to the extent to which a higher-order construct, such as help seeking, teacher stress, or dyslexia, is accurately represented in the particular study. Construct validity is fostered by having a good definition and explanation of the meaning of the construct of interest. Construct validity evidence involves the empirical and theoretical support for the interpretation of the construct.
Define criterion validity.
To measure the criterion validity of a test, researchers must calibrate it against a known standard or against itself. Comparing the test with an established measure is known as concurrent validity; testing it over a period of time is known as predictive validity. In psychometrics, criterion (or concrete) validity is the extent to which a measure is related to an outcome. Criterion validity is often divided into concurrent and predictive validity.
Define concurrent validity.
Concurrent validity refers to a comparison between the measure in question and an outcome assessed at the same time, whereas predictive validity is the degree to which scores on the test are consistent with a criterion measured later. Concurrent validity is established when the scores from a new measurement procedure are directly related to the scores from a well-established measurement procedure for the same construct; that is, there is a consistent relationship between the scores from the two measurement procedures. Criterion validity is a good test of whether newly applied measurement procedures reflect the criterion on which they are based. When they do not, this suggests that new measurement procedures need to be created that are more appropriate for the new context, location, and/or culture of interest.
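A minimal sketch of how concurrent validity evidence is often quantified: correlate scores on the new measure with scores on a well-established measure of the same construct collected at the same time. The data below are hypothetical, and SciPy is assumed to be available.

from scipy.stats import pearsonr

new_measure = [12, 15, 9, 20, 17, 11, 14, 18]   # hypothetical scores on the new test
established = [30, 35, 22, 44, 40, 27, 33, 41]  # same people, established test

# A strong positive correlation is evidence of concurrent validity.
r, p = pearsonr(new_measure, established)
print(f"concurrent validity coefficient r = {r:.2f} (p = {p:.3f})")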
Define consequential validity.
Consequential validity refers to the positive or negative social consequences of a particular test. For example, the consequential validity of standardized tests includes positive attributes such as improved student learning and motivation and ensuring that all students have access to equal classroom content. Consequential validity in testing describes the aftereffects and possible social and societal results of a particular assessment or measure. For an assessment to have consequential validity, it must not have unexpected negative social consequences.
What is reliability?
The consistency or stability of test scores. If scores are reliable, they will be similar on every occasion.
Identify ways of computing reliability.
Test-retest- the reliability of test scores over time
Give the test once, give it a second time, and correlate the two sets of scores. A high positive correlation indicates score reliability.
Parallel Forms-consistency of a group of individuals' scores on alternative forms of a test designed to measure the same characteristic. The forms are identical except for the specific items: same number of items, same difficulty level, items measuring the same construct; each form is administered, scored, and interpreted the same way.
The two sets of scores are then correlated; a high, positive correlation is desired.
Interrater reliability-the degree of agreement or consistency between 2 or more scorers, judges, or raters.
(Percentage agreement, Cohen's Kappa, Generalizability; see the sketch below)
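A minimal sketch of percentage agreement and Cohen's kappa for two raters coding the same cases; the codes below are hypothetical.

from collections import Counter

rater_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_b = ["yes", "no", "no",  "yes", "no", "yes", "yes", "no", "yes", "yes"]
n = len(rater_a)

# Percentage agreement: proportion of cases the raters coded identically.
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Cohen's kappa corrects observed agreement for agreement expected by chance.
pa, pb = Counter(rater_a), Counter(rater_b)
p_expected = sum((pa[c] / n) * (pb[c] / n) for c in set(rater_a) | set(rater_b))
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"percentage agreement = {p_observed:.2f}, Cohen's kappa = {kappa:.2f}")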
What are the two types of criterion validity?
Concurrent-comparison between the measure in question and an outcome assessed at the same time.
This is established when the scores from a new measurement procedure are directly related to the scores from a well-established measurement procedure for the same construct.
Predictive- tells you how well a certain measure can predict future behavior; scores on one test are consistent with a criterion measured at a later time.
Which kinds of tests have more inter-item consistency, and why?
Homogeneous tests, because the items focus on one construct and sample a narrower content area.
What are the two indexes of internal consistency?
Cronbach's Alpha- aka coefficient alpha, this coefficient tells you the degree to which the items are interrelated. It should be greater than or equal to .70 for research purposes and somewhat greater in value for clinical testing purposes (decisions about single people). It provides an estimate of the average of all possible split-half correlations.
Split-Half Reliability-splitting the test into 2 equivalent halves and then assessing the consistency of scores across the 2 halves of the test, specifically by correlating the scores from the 2 halves. There are two ways to split: down the middle (not recommended) or use the odd-numbered items for one half of the test and the even-numbered items for the other (items can also be randomly assigned to one half or the other). Then score each half, compute the correlation between the scores of the two halves, and adjust the computed correlation using the Spearman-Brown formula (see the sketch below).
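A minimal sketch of both internal-consistency indices on a small hypothetical item-response matrix (rows are people, columns are items); NumPy is assumed to be available.

import numpy as np

scores = np.array([
    [4, 5, 4, 5, 3, 4],
    [2, 3, 2, 2, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 2, 3, 3, 2, 3],
    [4, 4, 5, 4, 4, 5],
    [1, 2, 1, 2, 1, 2],
])
n_people, k = scores.shape

# Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals).
item_vars = scores.var(axis=0, ddof=1)
total_var = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Split-half reliability: odd-numbered items vs. even-numbered items, then the
# Spearman-Brown correction r_full = 2r / (1 + r) to estimate full-length reliability.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_halves = np.corrcoef(odd_half, even_half)[0, 1]
split_half = 2 * r_halves / (1 + r_halves)

print(f"Cronbach's alpha = {alpha:.2f}")
print(f"split-half reliability (Spearman-Brown corrected) = {split_half:.2f}")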
What are the differences between Classical Test Theory and IRT?
CTT-a theory about test scores that introduces 3 concepts: the test score (often called the observed score), the true score, and the error score. It uses a common estimate of measurement precision that is assumed to be equal for all individuals regardless of their attribute levels.
The longer the test, the more reliable it is. (A small sketch of the observed = true + error decomposition appears after this answer.)
IRT-item response theory, a general statistical theory about examinee item and test performance and how performance relates to the abilities that are measured by the items in the test.
A shorter yet more reliable test can be designed.
In IRT, item difficulty and examinee performance are expressed on a common ability (trait) level scale.
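A minimal sketch of the classical test theory decomposition (observed score = true score + error), using simulated hypothetical data to show that reliability can be viewed as the ratio of true-score variance to observed-score variance.

import random
random.seed(0)

true_scores = [random.gauss(50, 10) for _ in range(1000)]   # hypothetical true scores
observed = [t + random.gauss(0, 5) for t in true_scores]    # observed = true + error

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# CTT reliability = true-score variance / observed-score variance.
reliability = variance(true_scores) / variance(observed)
print(f"estimated reliability = {reliability:.2f}")   # expected near 100 / (100 + 25) = 0.80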
Describe IRT and discuss its parameters and the differences among the 1-, 2-, and 3-parameter models.
Item difficulty is expressed in terms of trait level.
1 parameter-the item's difficulty only
2 parameter-within this model, item discrimination and difficulty level
3 parameter-difficulty, discrimination, and guessing (the likelihood of correctly answering the item purely by guessing); see the sketch below
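A minimal sketch of the three logistic IRT models, assuming the standard item response function P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))); the parameter values below are hypothetical.

import math

def p_correct(theta, b, a=1.0, c=0.0):
    # b = difficulty, a = discrimination, c = guessing (lower asymptote)
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

theta = 0.5                                    # examinee ability (trait) level
print(p_correct(theta, b=0.0))                 # 1-parameter: difficulty only
print(p_correct(theta, b=0.0, a=1.7))          # 2-parameter: + discrimination
print(p_correct(theta, b=0.0, a=1.7, c=0.2))   # 3-parameter: + guessing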
What is a cut score?
A selected point on the score scale of a test. The points are used to determine whether a particular test score is sufficient for some purpose. Cut scores should be based on a generally accepted methodology and reflect the judgment of qualified people.
Angoff Method
Ask experts to rate the probability that a minimally competent person will get each item correct. This can be used with tests that are not multiple choice. The cut score is computed from the expected scores for the individual questions.
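A minimal sketch of an Angoff-style cut score: each expert estimates, for every item, the probability that a minimally competent examinee would answer it correctly, and the cut score is taken as the average across experts of the summed item expectations. The ratings below are hypothetical.

expert_ratings = [
    [0.6, 0.7, 0.5, 0.8, 0.4],   # expert 1, items 1-5
    [0.5, 0.8, 0.6, 0.7, 0.5],   # expert 2
    [0.7, 0.6, 0.5, 0.9, 0.4],   # expert 3
]

# Each expert's expected score for a minimally competent examinee.
expert_expected_scores = [sum(ratings) for ratings in expert_ratings]

# Cut score: average of the experts' expected scores.
cut_score = sum(expert_expected_scores) / len(expert_expected_scores)
print(f"Angoff cut score = {cut_score:.1f} out of {len(expert_ratings[0])} points")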