Ch 6 - Item Statistics Flashcards

1
Q

Test items

A

units that make up a test and the means through which samples of test takers' behaviour are gathered

2
Q

Item analysis

A

general term; refers to all the techniques used to assess the characteristics of test items and evaluate their quality during the process of test development and test construction

3
Q

Qualitative item analysis

A

relies on the judgements of reviewers concerning the substantive/stylistic characteristics of items, as well as their accuracy and fairness

	○ Appropriateness of item content and format to the purpose of the tests and the populations for which it's designed
	○ Clarity of expression
	○ Grammatical correctness
	○ Adherence to some basic rules for writing items that have evolved over time
4
Q

Quantitative item analysis

A

variety of statistical procedures designed to ascertain the psychometric characteristics of items based on the responses obtained from the samples used in the process of test development

5
Q

Bias in the context of psychometrics

A

measurement bias: systematic error that enters into scores and affects their meaning in relation to what the scores are designed to measure/predict

6
Q

Steps of test development

A
  1. Generate the item pool - create test items and their administration/scoring procedures
  2. Submit the item pool to qualitative analysis by experts
  3. Revise/replace items that are problematic
  4. Try the items on samples that are representative of the intended pop
  5. Evaluate the results through quantitative item analysis
  6. Add/modify/delete items as needed
  7. Conduct additional trial administrations to check whether item statistics remain stable across samples (AKA cross-validation)
  8. Determine the length of the test, the sequencing of items, and the scoring/administration procedures
  9. Administer the test to a new sample - representative of the pop - in order to develop normative data
  10. Publish the test, along with the administration/scoring manual and intended uses, development procedures, standardization data, reliability/validity studies, and materials needed for test administration, scoring, and interpretation

**Steps apply mostly to paper-and-pencil tests; for CAT (computerized adaptive testing) the procedures are different - they rely more on item banking
Tests also need to go through this process again when they are revised - due to the changing norms/criteria/Flynn effect mentioned in ch 3

7
Q

Selected-Response Items

AKA Objective or fixed-response items

A

Closed-ended in nature - a limited number of alternatives from which the respondent can choose

In ability tests:
• MC, true false, ranking, matching
• Usually scored as pass-fail

In personality tests:
• Dichotomous (true false, yes no, like dislike, etc)
• Polytomous (more than 2 options)
Scaled in terms of degree of acceptance, intensity of agreement, frequency, etc

8
Q

Forced-Choice items

A

The respondent needs to choose which option represents them the most (or the least)
Each of the options represents a construct

9
Q

Ipsative scores in the context of forced choice

A

Resulting scores are ipsative in nature: essentially ordinal numbers that reflect test taker’s rankings of the constructs assessed by the scales within a forced choice format test

10
Q

Advantages of Selected-Response Items

A

• Ease and objectivity of scoring - enhances reliability, saves time
• Make efficient use of testing time
• Can be administered individually, but also collectively
• Can be easily transformed into numerical scales - facilitates quantitative analysis

11
Q

Disadvantages of Selected-Response Items

A

• Issue of guessing (can be up to 50% in dichotomous items)
• Similarly, wrong answers can happen due to inattention, haste, etc
• Items can be misleading
• Can be more easily manipulated because of demand characteristics
○ Many personality inventories use validity scales to account for that
• Preparing selected-response items is difficult and requires great skill
○ Carelessly constructed items can include:
§ Options not grammatically related with the question
§ Options susceptible to more than 1 interpretation
§ Options so implausible that they can be easily dismissed
• Selected-response items are less flexible

12
Q

Constructed-Response Items

AKA free-response items

A

Variety is limitless - constructed responses may involve writing samples, free oral responses, performances of any kind, and products of all sorts

In ability tests
• Essay questions
• Fill-in-the-blanks
• Thorough instructions and procedural rules are indispensable for standard administration of free-response items
○ Time limits
○ Medium, manner or length of the required response
○ Whether access to materials/instruments is permitted

In personality tests
• Interviews
• Biographical data
• Behavioural observations
• Projective techniques (AKA performance-based measures of personality)
○ Responses to ambiguous stimuli
○ Respondents can respond freely, revealing aspects of their personality

13
Q

Advantages of Constructed-Response Items

A

• Provide richer samples of the behaviour of examinees
• Offer a wider range of possibilities/creative approaches to test/assess
• Elicit authentic samples of behaviour

14
Q

Disadvantages of Constructed-Response Items

A

• Scoring is more time consuming and complex because of the presence of subjectivity
○ Even with scoring rubrics
• Checking for inter-rater reliability is essential
• Scorers need constant monitoring and thorough training
• Projective responses are even more susceptible to subjective scoring errors
• Because of the length of time it takes to answer them, fewer items can be answered in the same amount of time than with selected-response items
○ Shorter tests are more prone to content sampling errors and produce less consistent scores
○ Lower reliability
• Response length can vary - therefore the number of scorable elements also varies

15
Q

Meaning of “discrimination” in psychometrics

A

Considered a desirable feature of test items. It refers to the extent to which items elicit responses that accurately differentiate test takers along the dimensions that tests are designed to evaluate

16
Q

Item validity

A

most important aspect of quantitative item analysis
• Whether a specific item carries its own weight within a test by eliciting information that advances the purpose of the test

17
Q

Item discrimination

A

way to refer to item validity statistics
• Refers to the extent to which an item accurately differentiates among test takers with regard to the trait/behaviour the test is supposed to measure

18
Q

For ability tests, item analysis for validity includes item validity, discrimination, AND? (2)

A

Item difficulty

Item fairness

19
Q

How is Item Difficulty Gauged? (CTT)

A

At the beginning, test specifications drawn up by experts in the field can be used as difficulty criteria
Once it’s administered to a group: quantitative indexes can be obtained (normative perspective)
• Using the % of test takers who answer an item correctly (AKA proportion/percentage passing, “p”)
• The higher p, the easier the item is
• P is an ordinal number (like percentile ranks), therefore it’s often converted to a Z score
• Once we have a Z score for items the difficulty of items can be compared across various groups by administering anchor items (common set of items) to 2+ groups
• Formulas to estimate the difficulty of additional items across the groups in question can be derived based on the established relationships among the anchor items - AKA absolute scaling
○ Allows for the difficulty of items to be placed on a uniform numerical scale
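
A minimal sketch (hypothetical data, not from the chapter) of how p values can be computed and then placed on a z-score scale with the inverse normal transformation; the direction convention here (higher z = harder item) is one common choice:
```python
import numpy as np
from scipy.stats import norm

responses = np.array([   # rows = test takers, columns = items, 1 = pass
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
])

p = responses.mean(axis=0)   # proportion passing each item
z = norm.ppf(1 - p)          # p = .50 maps to z = 0; harder items get higher z
print(p, z)
```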

20
Q

Explain this: “For any given group/test, the average score on a test is the same as the average difficulty of its items”

A
  • Ex: in a classroom test designed to evaluate how much of the content students grasped, there will be items that everyone gets (p = 1), others that only the average student gets (p = 0.7), and very few, if any, that no student gets (p = 0), so that the average grade will be around 0.7-0.8.
    • In a test designed to identify the top 10% of students, we expect most items to have a p value of around 0.1, so that the average score will be about 0.1
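
A quick numerical check of this statement, using a hypothetical 0/1 response matrix (a sketch, not from the text):
```python
import numpy as np

responses = np.array([   # 4 test takers x 3 items, 0/1 scoring
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
])

p = responses.mean(axis=0)             # item difficulties: [0.75, 0.75, 0.25]
totals = responses.sum(axis=1)         # total scores: [2, 1, 3, 1]
print(totals.mean(), p.sum())          # both 1.75: mean total score = sum of p values
print(totals.mean() / 3, p.mean())     # both ~0.58: mean proportion correct = mean p
```
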
21
Q

Distractors

A

the incorrect alternatives in multiple choice items
Can have great influence on item difficulty:
• The number of distractors influences the probability of guessing right/wrong
• The plausibility of the distractors to test takers who don't know the right answer significantly influences the difficulty of the item

Analyses of distractors need to be conducted:
• Proportion of times respondents choose each distractor
• To detect possible flaws and eventually replace the distractors that don't work correctly
• If a distractor is never chosen, or is chosen more often than the right answer, it's not working

22
Q

Is Item Difficulty a Relevant Concept in Personality Testing?

A
  • Selected response: The ability of the test takers to understand the items (reading and vocabulary abilities) must be taken into consideration so that their answers are more truthful
    • Projective tasks: require some proficiency in the mode of answering (talking/writing)
23
Q

Item Validity

A

Refers to the extent to which items elicit responses that accurately differentiate test takers in terms of the behaviours, knowledge, or other characteristics that a test is designed to evaluate

• Discriminating power: most basic quality of test items 
• Validity indexes/indexes of item discrimination - obtained using some criterion of the test takers' standing on the construct that the test assesses. Can be:
	○ Internal criteria (ex: total score on the test) - increases homogeneity of test (increases reliability due to interitem consistency)
		§ Often for tests evaluating a single construct/trait
		§ Based on the assumption that all test items should correlate highly with the construct of interest, and with each other
	○ External criteria (ex: age, education, diagnostic, etc) - increases score validity
		§ Often used for tests evaluating many different aspects/constructs
		§ The correlation between the items and test scores is not expected to be high
	○ Combination of both
24
Q

Index of discrimination statistic (D) (CTT)

A

○ Used for hand calculations when a computer is not accessible
○ For assessing the validity of items
○ Mainly applied to pass/fail items in ability tests, but other types of binary scoring are also possible
○ Test takers must be divided into criterion groups based on test scores or an external criterion
§ Usually the upper and lower thirds of test takers are taken as the groups to be compared
§ The % of people passing the item is used to calculate the difference in the % of test takers in the upper and lower criterion groups who pass a given item
○ Can range from +100 to -100 when expressed in percentages (i.e., from +1 to -1 as proportions)
§ A positive D indicates that more individuals in the upper criterion group than in the lower group passed the item (the most desirable values of D are those closest to +1)
§ A negative D indicates that the item in question discriminates in the opposite direction and needs to be fixed/discarded
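
A hedged sketch of how D could be computed from a 0/1 response matrix when the criterion groups are defined internally by total score (the upper/lower 27% split mentioned later in this deck is used as the default fraction; the function name is just illustrative):
```python
import numpy as np

def discrimination_index(responses, fraction=0.27):
    """responses: 0/1 array, rows = examinees, columns = items."""
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    n = max(1, int(round(fraction * len(totals))))
    lower = responses[order[:n]]    # lowest-scoring examinees
    upper = responses[order[-n:]]   # highest-scoring examinees
    return upper.mean(axis=0) - lower.mean(axis=0)   # D = p(U) - p(L) for each item
```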

25
Q

2 other correlational indexes (other than D) to measure item validity

A

○ Most widely used classical test theory methods for expressing item validity
○ The type of coefficient chosen depends on the nature of the 2 variables that are to be correlated (AKA the item scores and the criterion measures)
§ When item scores are dichotomous, and criterion measure is continuous - point biserial (rpb) is best
§ When item and criterion measures are both dichotomous - phi coefficient is best
Both of these can range from -1 to +1 and are interpreted in the same way as a Pearson r
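
For illustration only (hypothetical scores, standard SciPy routines): the point-biserial r correlates a dichotomous item with a continuous criterion, and phi is simply a Pearson r computed on two dichotomous variables:
```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

item = np.array([1, 0, 1, 1, 0, 1, 0, 1])               # 0/1 item scores
criterion = np.array([27, 14, 30, 22, 18, 25, 11, 29])  # continuous criterion (e.g., total score)
r_pb, _ = pointbiserialr(item, criterion)               # point-biserial correlation

other_item = np.array([1, 0, 1, 0, 0, 1, 0, 1])         # a second dichotomous variable
phi, _ = pearsonr(item, other_item)                     # phi = Pearson r on 0/1 data

print(round(r_pb, 2), round(phi, 2))
```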

26
Q

3 types of tests regarding speed

A

Speed: speed of performance
Tests can be classified in 3 types
• Pure speed tests
○ Simply measure the speed with which test takers can perform a task
○ Difficulty is manipulated mainly through timing
○ Score is often the number of items completed in the allotted time
• Pure power tests
○ Have no time limits
○ Difficulty is manipulated by increasing or decreasing the complexity of items
○ Items are in ascending order of difficulty
○ Only the best respondents can answer all items
• Tests that blend speed and power

In any test that’s closely timed, the p value is a function of the position of items within the test rather than of their intrinsic difficulty/validity

27
Q

Item-test regression

A

To construct one, it's necessary to calculate the proportion of individuals at each total score level who passed a given item
Item-test regression graphs combine info on both item difficulty and item discrimination - they allow you to visualize how each item functions within the group that was tested
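
A small sketch (hypothetical data) of the computation behind such a graph: the proportion passing one item at each total-score level:
```python
import numpy as np

total = np.array([3, 5, 5, 7, 7, 7, 9, 9, 10, 10])   # total test scores
item = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1])      # 0/1 scores on one item

for score in np.unique(total):
    mask = total == score
    print(score, item[mask].mean())   # proportion passing the item at this score level
```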

28
Q

Item Response Theory

A

Variety of models that can be used to design/develop new tests and to evaluate existing ones
• IRT differs from classical theory in:
○ The mathematical formulas they employ
○ The number of item characteristics they account for
○ The number of trait/ability dimensions they specify as the objective of measurement
○ The methods they use, which differ depending on whether items are dichotomous or polytomous

29
Q

Main differences between CTT and IRT

A

• Test Length
• Comparison of scores:
○ CTT: compares total test score
○ IRT: focuses on the scores on individual items since the items can be different for each examinee (CAT)
§ Goals of IRT:
□ Generate items that provide the maximum amount of information possible concerning the ability/trait levels of examinees who respond to them in one fashion or another
□ Give examinees items tailored to their abilities
□ Reduce the number of items needed to pinpoint any given test taker’s standing on the ability while minimizing measurement error

30
Q

Shortcomings of CTT

That IRT attempts to overcome

A

• CTT indexes of item difficulty and item discrimination are group dependent: their values may change when computed for samples of test takers who differ from the ones used for the initial item analyses in some aspect of the construct being measured
○ The characteristics obtained through IRT are assumed to be invariant and provide a uniform scale of measurement that can be used with different groups
• For tests of fixed length developed with CTT, the trait/ability estimates (AKA the scores) are test dependent: they are a function of the specific test items selected for inclusion in a test. Comparisons of scores derived from different tests are therefore not possible without equating procedures
○ With IRT, estimates of abilities/traits are independent of the particular item set administered to examinees - trait estimates are linked to the probabilities of examinees' item response patterns - they can be compared without equating
• In CTT, the reliability of scores is usually gauged by means of the standard error of measurement (SEM), which is assumed to be of equal magnitude for all examinees. BUT in reality, accuracy is not equal throughout the score range - it depends on how well suited test items are to examinees’ trait or ability levels.
When IRT is combined with adaptive testing procedures, the standard errors of trait/ability estimates resulting from a test administration depend on the particular set of items selected for each examinee - these SEM estimates vary appropriately at different levels of the trait dimensions and convey more appropriate info about the accuracy of measurement

31
Q

Unidimensional IRT models assume:

A

• That the items comprising a test measure a single trait
• The item responses of test takers depend only on their standing with regard to the trait being measured
None of these assumptions can ever be fully met, but they can be met enough for the model to be workable

32
Q

Common features of IRT models

A

• IRT is based on the prediction that a person's performance on any test item is a function of one or more traits/abilities
○ The models seek to specify the relationship between the response to items and the traits that underlie them
○ IRT models can be evaluated in terms of how well they predict this relationship
• IRT employs tests and item response data from large samples known to differ on the ability/trait that the test is designed to assess, not necessarily representative of a defined pop
• After item/test score data is collected, it’s used to derive estimates of item parameters that will place test takers/items along a common scale for the ability/trait dimension

33
Q

Item parameters

A

the numerical values that specify the form of the relationships between the abilities/traits being measured and the probability of a certain item response

34
Q

Item difficulty parameters (IRT)

A

express the difficulty of an item in terms of the ability scale position where the probability of passing the item is 0.5 (for a dichotomous item)

35
Q

item characteristic curve (ICC)

A

is the graphic representation of a mathematical function that relates item response probabilities to trait levels, given the item parameters that have been specified
○ Ex: the ICC of a dichotomous ability test item expresses the expected relationship between ability level and the probability of passing the item

36
Q

Item information function

A

reflects the contribution an item makes to trait or ability estimation at different points in the trait or ability continuum.
• Helps to decide whether and where to incorporate items into a test
• The test information function corresponds to the CTT notion of score reliability
• Test information functions are used to obtain standard errors of estimation at each level in the trait/ability scale
○ Can be used to create CIs for the ability estimates, in a similar way to how the traditional standard errors of measurement in CTT are used to create confidence intervals for obtained scores
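
For concreteness, a sketch under the two-parameter logistic (2PL) model, where item information has the simple closed form I(theta) = a^2 * P(theta) * (1 - P(theta)); the chapter does not give this formula explicitly, so treat it as an illustration of the idea:
```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Information contributed by one 2PL item at ability level theta."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)
info = item_information_2pl(theta, a=1.5, b=0.0)
se = 1 / np.sqrt(info)   # more information -> smaller standard error of the ability estimate
print(np.round(info, 3), np.round(se, 3))
```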

37
Q

Accessibility of testing

A

• Writers need to be aware of the principles of universal design: the aim of making a product usable by older people and by people with disabilities (for test items - ensuring that their format is accessible to everyone regardless of age, language, gender, ethnicity, disability, etc)
○ CAT allows for even more accessibility than before

38
Q

Qualitative Analysis of Item Bias

Evaluating items on the basis of fairness is done with judgmental procedures (2)

A

1- During the initial phase of construction (items are being written/generated)
§ Screening out any stereotypical depictions of subgroups
§ Eliminating items whose content may be offensive to minorities or place them at a disadvantage
§ Ensuring that subgroups are appropriately represented
2- Once the items have been administered and item performance data has been analyzed for subgroups
§ Items that show subgroups differences in difficulty, discrimination or both are examined/modified/discarded

39
Q

Quantitative analysis of item bias

A

Sometimes defined simply as the difference in the relative difficulty of test items for individuals in diverse demographic groups
• A view not shared by all professionals
Many specialists differentiate item bias from differential item functioning (AKA DIF): which occurs when different groups that have the same standing on a trait differ in the probability of responding to an item in a specified manner
• BUT the two terms are still often used interchangeably

40
Q

Assessing Differential Item Functioning (DIF)

A

Involves analysis of item difficulty and discrimination for subgroups
• This analysis is more complicated because subgroups differ in their average performance and variability (especially on ability tests)

41
Q

What happens when group differences are found in test scores (when assessing DIF):

A

○ Item difficulty statistics become confounded by valid differences between groups in the ability that a test measures
○ Correlational indexes of item discrimination are affected by the differences in variability within the groups being compared

42
Q

Specialized methods have been designed for the proper assessment of DIF
One of the most commonly used: the Mantel-Haenszel (MH) technique

A

• Each of the groups in question is divided into subgroups based on total test score
• And item performance is assessed across comparable subgroups
○ Caveat: the total score (an internal criterion) may be insensitive to differences in item functioning across groups
○ Caveat no 2: its ability to detect DIF is dependent on the use of very large groups
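
A hedged sketch of the MH common odds ratio for a single item (hypothetical inputs): examinees are stratified by total score, a 2x2 table (group x correct/incorrect) is formed in each stratum, and the tables are pooled; a value near 1.0 suggests little or no DIF:
```python
import numpy as np

def mh_odds_ratio(item, group, total):
    """item: 0/1 responses; group: 0 = reference, 1 = focal; total: total test scores."""
    num = den = 0.0
    for k in np.unique(total):
        m = total == k
        a = np.sum((group[m] == 0) & (item[m] == 1))  # reference group, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))  # reference group, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))  # focal group, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))  # focal group, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    return num / den if den else np.nan
```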

43
Q

IRT: provides better techniques than CTT for investigating DIF

A
  • IRT: identifies anchor items that show no DIF across the groups of interest
    • If the parameters / ICCs for 2 groups for a given item are the same, it may be inferred that the item functions equally well for both groups
44
Q

Iterative Process - process of test construction is iterative

A

(we write items, try them out, modify them, etc)
• Writing items clearly is difficult - we need to make sure that they are clear, well written, straightforward, and have a single meaning (the alpha coefficient is relevant here, since unclear items would lower its value)
○ The item should have the same meaning for everyone, and the variation will come from the respondent’s answers

45
Q

Semi-structured interview

A

§ The interviewer has some discretion about questions, but is also guided in which questions to ask
§ The answers to some questions will dictate which questions should be asked next

46
Q

unstructured interview

A

where it’s a discussion and the interviewer asks questions and records the responses
§ Issue; how can we extract meaningful information from that type of exchange?
§ Strengths/advantages:
□ Rich in data (however, this depends on the interviewer's skills - even with training there are still huge sources of individual differences)
§ Weaknesses:
□ Hard to process the data (again, there is a big element of subjectivity)

47
Q

Variations for Likert scales

A

□ Many variations possible for Likert scales
® Generally have from 3 to 9-10 response categories (rarely more than that, because the distinctions become too narrow/subjective) - usually an odd number
® The odd number signals that the middle category is neutral/don't know/undecided/etc.
® Those with an even number of categories have no midway response (usually between 2 and 8 options) - used when we don't want "don't know" answers - a further way to restrict the answers
® Can have verbal descriptors for the levels as well as numerical values (typically integers) along with them (ex: Strongly Agree is a 5)
◊ Whether the integers are shown to the examinee or not, they are understood as being at the ordinal level of measurement
◊ The numbers associated with the responses are arbitrary (ex: in different questionnaires, a 1 could be Strongly Disagree or Strongly Agree)
◊ The numbers could even be something like 0, 2, 4, 6, 8, or any set of 5 numbers

48
Q

influence of numbers on Likert scales

A

® Schwarz conducted a study: if the numerical values are present, they might make a difference in how responses are coded (the examinee seeing the numbers might change their answer)
◊ Ex: 2 groups were each administered one of two scales with the same question - one version ran from 0 to 10, the other from -5 to +5
◊ Same question, same end points, but the numbers representing each category changed between the 2 conditions
} Of those presented with the 0-to-10 scale, about 30% answered in the lower half (0-5)
} In the other condition, only about 15% answered in the -5-to-0 half
◊ These results were interpreted as indicating a difference between a unipolar scale (1st option: there is only less or more success) and a bipolar scale (2nd option: there is NO success (failure), not just low or some success - it has a different interpretation/connotation)
49
Q

Item difficulty: p (CTT)

A

Can range from 0 to 1
If the p value for a particular item is very close to 0 or to 1, the item is worthless and needs to be deleted/revised because it does not indicate differences among the respondents (everyone passes it, or no one does)

s2i (variance for an item) = p x q
• Where p = proportion passing the item
• Where q = proportion not passing the item (= 1 - p)
• p and q MUST add up to 1.0
• The max value of s2i (0.25) is obtained when p and q are both 0.5
• You have the maximum range of individual differences, as measured by item variance, when p = 0.5
○ As p heads towards 0, the item variance gets smaller (when no one passes the item, the variance is also 0, AKA there are no individual differences)
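
A quick check of the p x q relation (a sketch, not from the lecture):
```python
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    q = 1 - p
    print(p, round(p * q, 2))   # item variance peaks at 0.25 when p = q = 0.5
```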

50
Q

3 contexts for using p values

A

1- • General knowledge/ability test
○ If you’re constructing an ability/knowledge test used on a general population where you want to make the rate of passing about in the middle (p = more or less 0.5)
The items should have p values of about 0.3-0.7 so that they average to about 0.5 (therefore an average responder will have a score in the middle of the distribution)

2- • Goal: out-of-level testing (the purpose of administering the test is to identify those who have the highest level of skill/knowledge, you’re not too worried about the rest)
○ Ex: awarding of scholarship, etc
○ Example: goal is to identify the top 20%
§ The average p value will be about 0.2 on average (the items are harder)
§ The resulting distribution will be positively skewed (80% will not pass the item (q), while 20% will pass it (p))
§ The range of p values for individual items will be about from 0.1 to 0.3 so that it averages to 0.2

3- • Goal: diagnostic testing (opposite scenario of out-of-level; we want to identify those at the lower end of the distribution)
○ Ex: a school system creating a tutoring program that can accommodate only 20% of its students - we will want to identify the students in the lower 20% of the distribution of grades
○ Called compensatory attribution
§ The average p value will have to be about 0.8
§ The distribution will be negatively skewed
§ The test will be relatively easy and about 80% will pass and 20% will not

51
Q

Item discrimination index (CTT)

A

difference in the passing rate for each individual item between 2 groups, defined as
• D = p (U) - p (L)
• Item difficulty for upper group - item difficulty for lower group (AKA the p value of the higher group should be higher than for the lower group)

• *D is NEVER L - U, ALWAYS U - L
52
Q

Construct-relevant distinction

A

• The upper group should have a higher standing/skill/performance on a construct related to what the test is supposed to measure than the lower group

53
Q

2 ways to define the U and L groups to calculate D

A
  1. Some distinction external to the test (group membership on some variable that should be relevant for what the test is measuring, and upper/lower is determined by the score on that external variable) - pre-defined groups existing outside of the test
    2. Internal distinction: groups are defined NOT by something external to the test, but on the basis of total scores on the test (an internal measure), and this score is the only thing determining upper/lower
54
Q

Kelley’s idea about the U and L groups

A

What counts as a higher vs a lower score? You can take the highest 27% and lowest 27% of scores (any proportion, basically)
Kelley (1939) made the observation that if the lower and upper groups are defined as the lower/upper 27% of the scores, this split gives you the highest overall variance for the D statistic
• It's measuring a wider range of individual differences
This is the optimal definition for an internal distinction between upper and lower groups

55
Q

Why is rit the most flexible item-level statistic (among p, D, and rit)?

A

Unlike p and D, rit can accommodate a wider range of response formats
• Can be calculated for:
○ Dichotomous (0-1) - p and D too
○ Likert response scale - not p and D
○ Partial credit metric - from 0 to 0.5, 0.75, etc - not p and D

56
Q

What does it mean if rit is positive?

A

If rit > 0, it means that higher scores on the item tend to go with higher scores on the test overall (which also tends to increase rjj and alpha)
Item 1 should then be positively correlated with items 2, 3, 4, etc.

57
Q

If a correlation is still negative after correcting for wording, then…

A
  • Getting higher points on a particular item means having an overall lower score on the test - something is wrong, this item presents a problem (maybe it's not written clearly, maybe it's from the wrong domain)
    • Items with negative correlations with the total score will also tend to have negative correlations with other items in the test
      • It will drag rjj and alpha down (in the same way that positive correlations brought them up)
58
Q

What effect can deleting an item with a negative correlation (rit) have on alpha?

A

Deleting an item with a negative correlation with the total score might increase alpha for the remaining items even though you just shortened the test (the opposite of what the Spearman-Brown prophecy would predict, but removing an item with bad psychometric properties can actually increase reliability)
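
A hedged sketch of Cronbach's alpha and "alpha if item deleted", which is how this effect is usually checked in practice (the item-score matrix is assumed to be a plain respondents-by-items array):
```python
import numpy as np

def cronbach_alpha(scores):
    """scores: array, rows = respondents, columns = items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

def alpha_if_deleted(scores):
    """Alpha recomputed with each item removed in turn."""
    return np.array([cronbach_alpha(np.delete(scores, j, axis=1))
                     for j in range(scores.shape[1])])
```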

59
Q

What does it mean if total item correlations are “corrected”?

A

It means that for each item, the procedure calculates the total score across all the other items but not that item (e.g., for item 1, the total excludes item 1) - the total score excludes the contribution of the item being correlated
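
A sketch of that computation (hypothetical input): each item is correlated with the total of the other items, so it does not inflate its own correlation:
```python
import numpy as np

def corrected_item_total(scores):
    """scores: array, rows = respondents, columns = items."""
    total = scores.sum(axis=1)
    r = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]                 # total score excluding item j
        r.append(np.corrcoef(scores[:, j], rest)[0, 1])
    return np.array(r)
```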

60
Q

What happens to alpha if you remove an item with a high rit correlation from a test?

A

It drops significantly (the size of the drop also depends on the number of items on the test; fewer items = bigger drop)

61
Q

What happens to alpha if the item with the lowest (not negative) rit correlation is removed from the test? What if that item was negative?

A

Alpha does not change
If an item had a negative correlation, alpha would probably go up if we removed it, because the item disagrees with the general tendency of the test

62
Q

If we ask a computer for the average score of each item, and the scoring of those items is dichotomous (0-1), what will the average look like? What does this number correspond to?

A
  • For a binary item like correct/incorrect, 0/1, etc
    • The mean will be a proportion between 0 and 1
    • The proportion of cases who have a score of 1 (ex: average=0.3, 30% of answers to this item were coded as 1, “right”)
    • The average = the proportion passing (AKA p value for each item, AKA item difficulty)
63
Q

What does a higher item difficulty means in terms of the difficulty level of that item?

A
  • Higher numbers = easier items

• Lower numbers = tougher items

64
Q

3 item statistics of classical vs modern test theories

A
Classical: 
	• Difficulty (p)
	• Discrimination (D)
        • Item total correlation (rit)
Modern: 
	• Difficulty (NOT the same as p in CTT)
	• Discrimination (NOT the same as D in CTT)
        • Guessing
65
Q

2 requirements for item response theory analyses (which are computer intensive)

A

1- You need large sample sizes - in the thousands

2- You need large banks of items - in the hundreds

66
Q

Most examples of IRT are in which domain?

A

Most examples of IRT testing are in government testing - screening thousands of applicants - applications to companies, programs, etc

67
Q

Explain the general spirit/mindset of IRT/LTT

A

IRT is all about relating the probability of getting an individual item correct to latent ability
• Or estimating the status of an individual on the underlying construct/latent variable
• DOES NOT deal with simple total scores
Unlike factor analysis, it tries to locate each person's standing on the factor and tries to predict/generate what level of ability is required to pass an item

68
Q

What are ICCs?

A

an S-shaped function (sigmoidal function)
In IRT analyses, for each item in the bank, the computer can generate its own ICC
Go check example of ICC in notes

69
Q

How to locate difficulty in the ICC? (IRT)

A

Difficulty: level of ability needed to have a 50% chance of passing the item

• In the ICC:
○ Locate 0.5 on the Y axis (probability of a correct response) and project a horizontal line until it reaches the ICC curve
○ Drop down to the X axis to find the corresponding level of ability (on the X axis, scores are z scores; 0 = average ability)

70
Q

How to locate discrimination in the ICC? (IRT)

A

• Discrimination: the slope of the tangent line to the ICC at the point corresponding to the item's difficulty
○ (the dashed line is the discrimination line)
○ The steeper the line, the more discriminating the item: the probability of getting the item correct rises faster as ability level increases
○ Higher discrimination is a desirable characteristic of an item
○ Discrimination: how quickly the predicted probability of getting an answer correct increases with ability level

71
Q

Guessing in the ICC (IRT)

A

• Guessing: relevant only for items where it's possible for examinees who don't know anything about the subject to guess the correct answer
○ For some items, guessing is not a possibility (ex: a math exam)
• The guessing parameter is the value of the ICC curve at the lowest ability level on the graph (virtually no knowledge) (in the example it's -3 SD, but it could be lower on some graphs)
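
A hedged sketch tying the three parameters together, using one common parameterization (the three-parameter logistic model; a = discrimination, b = difficulty, c = guessing/lower asymptote):
```python
import numpy as np

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response at ability level theta (z-score scale)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 13)
p = icc_3pl(theta, a=1.2, b=0.5, c=0.2)   # rises from ~0.2 (the guessing floor) toward 1
print(np.round(p, 2))
# Note: when c > 0, the curve passes through (1 + c) / 2 (not 0.5) at theta = b.
```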

72
Q

Tailored testing / personal reliability coefficient

A

Unlike CTT, where the same test is given to everyone in the sample, in MTT the items are selected for each person and are tailored to the examinee
• For example, the first item given will probably be one with a 50% difficulty level
• If the person gets it right, they are given a tougher item; once they get one wrong, the difficulty is lowered, and so on, until the person is passing about half and failing about half of the items given to them - that point corresponds to their ability level. The computer keeps giving them questions at their ability level until they have answered enough questions to reach a sufficient level of reliability
○ The computer will calculate for each examinee, their personal reliability coefficient (person-level rxx)
• In CTT, there is no way to calculate a reliability coefficient for each examinee (or it would be very difficult)
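
A very rough sketch of the staircase idea described above (not a full IRT-based CAT, which would use maximum-likelihood or Bayesian ability estimates); the item bank and response model here are hypothetical:
```python
import numpy as np

rng = np.random.default_rng(0)
bank = np.sort(rng.normal(size=200))   # hypothetical item difficulties on a z-score scale

def staircase(true_ability, n_items=20, step=5):
    idx = len(bank) // 2               # start with a middling item
    administered = []
    for _ in range(n_items):
        b = bank[idx]
        p_correct = 1 / (1 + np.exp(-(true_ability - b)))   # simple 1PL response model
        correct = rng.random() < p_correct
        administered.append((round(b, 2), bool(correct)))
        # move to a harder item after a correct answer, an easier one after an error
        idx = min(idx + step, len(bank) - 1) if correct else max(idx - step, 0)
    return administered

print(staircase(true_ability=0.8)[:5])
```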