Test Construction Flashcards

1
Q

Item Characteristic Curve

A

A graphical representation of a test item's difficulty, discrimination, and chance of a false positive. Difficulty (degree of the attribute needed to pass the item): indicated by the position of the curve along the X axis. Discrimination (ability to differentiate between high and low scorers): indicated by the slope of the curve. Chance of false positives (probability of getting the answer correct by guessing): indicated by the Y-intercept of the curve.
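As a rough illustration of how the three properties shape the curve, here is a minimal sketch assuming a three-parameter logistic form. The card does not name a specific model, so the function `icc_3pl` and the parameter names `a` (discrimination), `b` (difficulty), and `c` (guessing) are assumptions chosen for this example.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at trait level theta.

    a: discrimination (controls the slope), b: difficulty (shifts the
    curve along the X axis), c: guessing (lower bound of the curve).
    All names are illustrative assumptions, not taken from the card.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When theta equals the difficulty b, the probability is halfway between c and 1.
p_mid = icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.2)
# For a very low trait level, the probability approaches the guessing value c.
p_low = icc_3pl(theta=-50.0, a=1.5, b=0.0, c=0.2)
```

Raising `a` steepens the curve, raising `b` moves it to the right, and raising `c` lifts its lower end.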

2
Q

Criterion-Related Validity Coefficient

A

A value that indicates the strength of the correlation between test scores and performance on a chosen criterion.

3
Q

Test Characteristic Curve

A

A graphical representation of the expected number of test items a participant answers correctly as a function of the level of the construct measured by the test.

4
Q

Item difficulty

A

AKA item difficulty index or ‘p’. Defined as the proportion of examinees who answer the item correctly (reflects how much of the attribute an individual must possess to pass the item).

5
Q

What are the item difficulty (p) ranges?

A

0 to 1. A value of 0 means that no one passed the item (too hard) and 1 means that everyone passed (too easy). Average item difficulty should be 0.5.
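The index can be computed directly from scored responses. This is a minimal sketch, assuming each response is coded 1 (correct) or 0 (incorrect); the function name `item_difficulty` and the data are made up for illustration.

```python
def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly."""
    return sum(responses) / len(responses)

# Five of eight hypothetical examinees answered correctly.
p = item_difficulty([1, 1, 0, 1, 0, 0, 1, 1])
```

Here `p` comes out above 0.5, so this hypothetical item is slightly on the easy side.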

6
Q

With item difficulty, what are the floor and ceiling effects?

A

A floor effect occurs when a test is too hard to distinguish among people at the low end of a distribution (scores pile up at the bottom), while a ceiling effect occurs when a test is too easy to distinguish among people at the high end of a distribution (scores pile up at the top).

7
Q

What is item discrimination?

A

The ability of the item to unambiguously separate out those who fail from those who pass. Can be visually represented with discrimination as the slope of the curve. Steeper slopes indicate more discrimination.

8
Q

How is item discrimination assessed?

A

Index D (item discrimination index): the proportion of high scorers who answered the item correctly minus the proportion of low scorers who answered the item correctly.

9
Q

What are the D ranges?

A

-1 to +1; it is desirable to have positive values of D, which would indicate that more high-scoring examinees (rather than low-scoring examinees) answered the item correctly.
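A minimal sketch of the index, assuming D is taken as the high-scoring group's proportion correct minus the low-scoring group's, which matches positive values being desirable. The function name `discrimination_index` and the response data are hypothetical.

```python
def discrimination_index(upper, lower):
    """D = proportion correct in the upper group minus proportion correct in the lower group.

    upper, lower: lists of 1/0 item responses for high and low total scorers.
    """
    p_upper = sum(upper) / len(upper)
    p_lower = sum(lower) / len(lower)
    return p_upper - p_lower

# Hypothetical item: 3 of 4 high scorers and 1 of 4 low scorers answer correctly.
d = discrimination_index(upper=[1, 1, 1, 0], lower=[1, 0, 0, 0])
```

A negative result would mean low scorers outperformed high scorers on the item, which usually flags the item for review.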

10
Q

Ratio measure

A

A level of measurement describing a variable with attributes that have all the qualities of nominal, ordinal, and interval measures as well as a true zero point; measurement of physical objects is an example of ratio measure.

11
Q

Interval measure

A

A level of measurement describing a variable whose attributes are rank-ordered and have equal distances between adjacent attributes, with no true zero point; the Fahrenheit temperature scale is an example of this, because the distance between 17 and 18 is the same as the distance between 89 and 90.

12
Q

Nominal scale

A

A variable whose attributes are simply labels for groups and have no ranked relationship; gender would be an example of a nominal scale of measurement because male does not imply more gender than female.

13
Q

Item Response Theory

A

IRT focuses on determining specific parameters of test items. Makes use of characteristic curves, which provide info about item difficulty, item discrimination, and the probability of false positives.

14
Q

Assumptions of IRT

A

Single underlying trait, relationship between trait and item response can be displayed in item characteristic curve, and requires large sample size.

15
Q

Computer Adaptive Assessment

A

Uses IRT; customizes test to the examinee’s ability level.

16
Q

Classical Test Theory

A

CTT, AKA Classical Measurement Theory, is an approach to testing that assumes that individual items are as good a measure of a latent trait as other items; thus, CTT focuses on the reliability of a set of items. In CTT, item and test parameters are sample-dependent.

17
Q

Kappa Coefficient

A

Measures the degree to which judges agree. Measure of inter-rater reliability. Increases when raters are well-trained and aware of being observed. Applicable only with nominal, ordinal, or discontinuous data.

18
Q

Ranges of Kappa Coefficient

A

-1 to +1; .80 - .90 indicates good agreement
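To make the coefficient concrete, here is a minimal sketch assuming Cohen's kappa for two raters assigning nominal categories. The cards do not say which kappa variant is meant, so that choice, the function name `cohens_kappa`, and the ratings are assumptions.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement from each rater's marginal category proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings: the raters agree on 3 of 4 cases.
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "y", "n", "y"])
```

Raw agreement in this example is 75%, but kappa is lower because half of that agreement would be expected by chance alone.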

19
Q

Convergent Validity

A

Indicates the degree of correlation between two instruments that are intended to measure the same thing

20
Q

Metric Data

A

A term used to refer to interval/ratio data

20
Q

Continuous Data

A

A term used to refer to interval/ratio data

21
Q

Internal Consistency

A

A measure indicating the extent to which items within an instrument are correlated with each other; internal consistency indicates the extent to which the given items measure the same construct.

22
Q

Kuder-Richardson Formula 20

A

A method of evaluating internal consistency reliability; used when test items are dichotomously scored; used when test items vary in difficulty; indicates the degree to which test items are homogenous; falsely elevates internal consistency when used with timed tests
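A minimal sketch of the computation, assuming the common form KR-20 = (k / (k - 1)) * (1 - sum(p * q) / variance of total scores) with the population variance of total scores. The variance convention, the function name `kr20`, and the response matrix are assumptions made for illustration.

```python
from statistics import pvariance

def kr20(scores):
    """KR-20 for dichotomously scored items.

    scores: list of rows, one per examinee, each a list of 1/0 item scores.
    """
    k = len(scores[0])                       # number of items
    n = len(scores)                          # number of examinees
    totals = [sum(row) for row in scores]    # total score per examinee
    pq_sum = 0.0
    for item in zip(*scores):                # iterate over item columns
        p = sum(item) / n                    # item difficulty
        pq_sum += p * (1 - p)                # item variance p * q
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

# Hypothetical 4 examinees x 3 items.
reliability = kr20([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
```

Because each item contributes its own p * q term, the formula accommodates items that vary in difficulty.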

23
Q

Single-Subjects Designs

A

Involve one or more participants and focus on assessing variables within an individual rather than between individuals. They are idiographic (differences within a participant) rather than nomothetic (differences between participants).

24
Q

2 types of single-subject designs

A

Case study (describes an individual by using tests or naturalistic observation) or experimental (determine how the introduction of a factor affects behavior)

25
Q

Problems with single-subject designs

A

Autocorrelation (when measured on the same variable multiple times, the variable becomes correlated with itself); Time-intensive (multiple assessments or intense observations are time-consuming); Generalizability (may not generalize); Practice effects (scores may increase from practice)

26
Q

Nomothetic

A

An approach to personality that focuses on groups of individuals and tries to find the commonalities between individuals.

27
Q

Multicollinearity

A

Very high multiple correlations among some or all predictors in an equation.

28
Q

Quantitative Research

A

Systematic empirical exploration of relationships; deductive, rather than inductive. Involves the collection and statistical analysis of quantitative data, whose results can often be generalized.

29
Q

Reliability

A

Refers to the consistency or repeatability of data; pertains to quantitative research.

30
Q

ANOVA

A

Tests for differences in the mean scores of groups based on one or more variables. DV must be continuous and IV must be categorical. Tests the null hypothesis that the means of the groups are equal.

31
Q

ANOVA assumptions

A

Independence of observations (each participant in only one cell); Normality (distribution of scores clusters around the mean with fewer observations falling farther from the mean; AKA bell-shaped curve); and Homogeneity of Variance (variance of every group is the same as the variance of every other group; AKA homoscedasticity).

32
Q

2 types of ANOVA

A

One-Way ANOVA (tests the main effect of one IV); or Two-Way ANOVA (tests main effects of first IV (A), second IV (B), and the interaction of the two (A*B))

33
Q

Interaction effect (ANOVA)

A

The effect of one IV on the DV differs depending on the level of the other IV

34
Q

What are F-ratios?

A

Ratios of effect variance to error variance. In One-Way ANOVA, there is one F-ratio of the effect of the IV. In a Two-Way ANOVA, there are three F-ratios (main effect A, B, and interaction effect)
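For the one-way case, the ratio can be worked out by hand as the between-groups mean square divided by the within-groups mean square. The sketch below assumes that standard decomposition; the function name `one_way_f` and the three small groups are hypothetical.

```python
def one_way_f(groups):
    """F-ratio for a one-way ANOVA: effect variance divided by error variance."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    # Effect (between-groups) sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Error (within-groups) sum of squares: scores around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Three hypothetical groups with means 2, 3, and 4.
f_ratio = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

In this example the between-groups mean square is 3 and the within-groups mean square is 1, so the ratio is 3.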

35
Q

Advantages of Two-Way ANOVA over One-Way ANOVA:

A

Includes interaction effects; increases power; reduces familywise error rate.

36
Q

Heterogeneity

A

The violation of the assumption of homogeneity, such that the variances of the groups are not equal. ANOVA is robust to such a violation, if there are no outliers, sample sizes are large and fairly equal, sample sizes within levels are relatively equal, and the hypothesis is two-tailed.

37
Q

Chi-Square Test

A

Statistical method of testing for an association between categorical variables; specifically, it tests for the equality of expected and observed frequencies or proportions.
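The comparison of observed and expected frequencies can be sketched as follows, assuming the usual Pearson statistic for a contingency table, with each expected count taken as row total times column total divided by N. The function name `chi_square_statistic` and the 2 x 2 table are made up for illustration.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical 2 x 2 table; every expected count is 15.
chi2 = chi_square_statistic([[10, 20], [20, 10]])
```

The statistic is zero when observed and expected counts match exactly and grows as they diverge.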

38
Q

MANOVA

A

An extension of ANOVA methods to cover cases where there is more than one DV and where the DVs cannot simply be combined. The MANOVA combines the DVs in such a way as to maximize differences between groups. In addition to identifying whether changes in the IV have a significant effect on the DV, the technique seeks to identify the interactions among the IVs and the DVs, if any.

39
Q

ANCOVA

A

A general linear model with one continuous DV and one or more IVs, plus one or more covariates. ANCOVA is a merger of ANOVA and regression for continuous variables. ANCOVA tests whether IVs have an effect after removing the variance for which one or more covariates account; the inclusion of covariates can increase statistical power because it accounts for some of the variability.

40
Q

Dichotomous/Continuous Variables

A

Continuous variables assume an intermediate value between two other values and there can be an infinite amount of possible values between those two values. Dichotomous variables have only two values (yes or no)

41
Q

Point-biserial correlation

A

Examines the relationship between a dichotomous variable and a continuous variable. Can only be used with TRUE dichotomous variables.

42
Q

Biserial correlation coefficients

A

Examine the relationship between an artificially created (made from a continuous variable) dichotomous variable and a continuous variable.

43
Q

Spearman’s rho

A

This correlation coefficient is used when measuring the relationship between two ranked variables giving a rank-order correlation.

44
Q

Pearson’s r

A

This correlation coefficient is used when measuring the relationship between two continuous variables

45
Q

Eigenvalue

A

Measures the amount of variance in a set of tests or items that can be accounted for by an underlying factor. Used in factor analysis and principal components analysis. Often converted into percentages to determine percentage of variance in a set of test items accounted for by an underlying factor. Factor analysis will provide same # of eigenvalues as there are items or tests.

46
Q

What do large eigenvalues indicate?

A

An underlying factor is explaining a large amount of variance in a set of items or tests.

47
Q

Inferential statistics

A

Deal with formulating conclusions and making inferences from collected data

48
Q

Multiple regression

A

A form of regression analysis in which two or more predictor variables are used to predict a single criterion variable.

49
Q

Factor Analysis

A

A statistical technique that identifies underlying patterns in a data set.

50
Q

Goals of factor analysis

A

Identify underlying factors that are responsible for variation in a set of items, variables, or tests; Reduce a large set of variables to a smaller number of underlying factors.

51
Q

What are factor loadings?

A

Produced by factor analysis (along with eigenvalues) and provide a measure of the correlation between an item and an underlying construct. Higher factor loadings indicate that the underlying factor is accounting for a large amount of variance in the item.

52
Q

When is factor rotation used?

A

To aid in interpretation of factor loadings from a factor analysis

53
Q

Discriminant analysis

A

A statistical method utilized to predict group membership.

54
Q

Path Analysis

A

A correlational technique that tests directional hypotheses among multiple IVs and multiple DVs simultaneously.

55
Q

Moderator Variable

A

Changes the relationship between a predictor and a criterion variable; equivalent to an interaction effect in ANOVA; background variables such as gender, and SES, are common moderators. When moderator variables are present, a test has differential validity (validity differs depending on the level of the moderator, such as whether one is male or female)

56
Q

Cross-Validation

A

Administering a test to a new sample, one that is different from the original validating sample, so as to evaluate the test’s validity on another sample of subjects

57
Q

Criterion Contamination

A

A misleading increase in a test’s validity, in which raters give subjects scores on the criterion variable after being privy to the subject’s scores on the predictor variable.

58
Q

Discrete variable

A

A variable that is measured on either nominal or ordinal scales

59
Q

Shrinkage

A

A result of cross-validation in which there is a decrease in the validity coefficient due to sample differences.

60
Q

Confounding variable

A

A variable that affects the dependent or criterion variable, but is of no interest to the researcher

61
Q

One-tailed test

A

AKA directional test; tests for rejection in only one tail; greater chance of rejecting the null hypothesis when the effect is in the predicted direction

62
Q

Two-tailed test

A

AKA non-directional test; tests for rejection in both tails; able to reject the null hypothesis in either tail, but because alpha is split between the tails, each tail has a smaller chance of rejecting the null hypothesis

63
Q

T-Score

A

Standardized score that allows for a participant’s score to be compared to the norm group. Mean of 50 with a standard deviation of 10
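A minimal sketch of the conversion, assuming the raw score is first standardized against the norm group's mean and standard deviation and then rescaled to a mean of 50 and a standard deviation of 10. The function names and the norm-group values are hypothetical.

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the norm-group mean, in standard deviation units."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Rescale the standardized score to a mean of 50 and a standard deviation of 10."""
    return 50 + 10 * z_score(raw, mean, sd)

# Hypothetical norm group with mean 100 and SD 15: a raw score of 115 is one SD above the mean.
t = t_score(raw=115, mean=100, sd=15)
```

A raw score equal to the norm-group mean always maps to a T-score of 50.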

64
Q

z-score

A

Mean of 0 and standard deviation of 1

65
Q

What are the %s and T-scores for one standard deviation from the mean?

A

68% of scores, or T-score between 40-60.

66
Q

What percentage of scores and the T-scores fall within two standard deviations from the mean?

A

95% or T-scores between 30-70

67
Q

Stanine Score

A

Mean of 5 and standard deviation of 2.

68
Q

Trend Analysis

A

An extension of ANOVA. Identifies trends in data when the levels of the IV are ordered from lowest to highest (e.g., one group is given 5 mg of meds, a second group gets 10 mg, and a third group is given 15 mg).

69
Q

Linear trend

A

Means are arranged in a line

70
Q

Quadratic trend

A

Means arranged in a U shape

71
Q

Cubic trend

A

Means arranged around two points of inflection

72
Q

Quartic trend

A

Means arranged around three points of inflection

73
Q

Quintic

A

Means arranged around four points of inflection

74
Q

Multiple correlation (R)

A

Deals with the correlation between an optimally weighted linear combination of predictors and a criterion. (Multiple regression deals with defining optimal weighting and is a test of prediction).

75
Q

Eta (η)

A

A universal measure of relationship that can be used regardless of the form of the relationship; it is obtained by computing the variance in Y about any curve of the relationship. Eta is a universal measure of relationship because it (1) applies regardless of the form of the relationship, (2) can be used with either a predicted curve of a relationship or a best-fitting curve obtained after the data are collected, and (3) applies equally well to continuous or categorical independent variables.

76
Q

Type I Error

A

Rejecting the null hypothesis when it is true. The probability of a Type I error (alpha) is usually set to 0.05 in the social sciences.

77
Q

Type II Error

A

Failing to reject the null hypothesis when it is false. Related to Type II error is power (1 - β), the probability of rejecting the null hypothesis when it is false.

78
Q

Type III Error

A

Rejecting the null hypothesis, but for the wrong reason. Because of sampling error, two groups can be correctly identified as being significantly different, but the direction of the difference is the opposite of reality. These are relatively rare.

79
Q

As Type I Error becomes smaller,…

A

Type II Error becomes larger

80
Q

Power is impacted by ____?

A

Sample size (larger samples increase power); Alpha (smaller alpha levels decrease power, e.g. 0.01 or 0.001 rather than 0.05); Effect size (greater effect sizes increase power, in other words, larger difference between the two groups); and Test used (different statistical tests have more power, two-way ANOVA is more powerful than a one-way ANOVA)
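The effects of sample size and alpha can be illustrated numerically. The sketch below assumes a two-tailed one-sample z-test with a known standard deviation, chosen only because its power has a simple closed form. The function name `z_test_power` and the effect size are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-tailed one-sample z-test.

    effect_size: true mean difference in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for each tail
    shift = effect_size * n ** 0.5               # how far the true mean moves the test statistic
    # Probability of landing in either rejection region.
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

power_small_n = z_test_power(effect_size=0.5, n=16)
power_large_n = z_test_power(effect_size=0.5, n=64)
power_strict_alpha = z_test_power(effect_size=0.5, n=16, alpha=0.01)
```

With the same effect size, quadrupling the sample raises power, while tightening alpha from 0.05 to 0.01 lowers it.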

81
Q

Criterion-Referenced Test

A

Compares the test-taker’s performance to an objective standard of achievement. Can be Domain-Referenced (examines the degree to which the test taker has mastered a specific area) or Objectives-Referenced (examines the degree to which the test taker has achieved instructional objectives)

82
Q

Norm-Referenced Test

A

Compares a test-taker’s performance to other test-takers’ performance. Requires a large standardized sample that is representative of the population.

83
Q

In multitrait-multimethod matrix, convergent validity is evidenced by ____?

A

High correlations between measures of the same trait

84
Q

In multitrait-multimethod matrix, divergent validity is evidenced by ____?

A

Low correlations between measures of different traits

85
Q

Monotrait-Monomethod

A

Correlation between two tests that measure one trait using one method

86
Q

Monotrait-Heteromethod

A

Correlation between two tests that measure one trait using different methods

87
Q

Heterotrait-monomethod

A

Correlation between two tests that measure different traits using one method

88
Q

Heterotrait-heteromethod

A

Correlation between two tests that measure different traits using different methods

89
Q

Nomological Network

A

Developed by Cronbach and Meehl (1955), stating that in order to prove that a given measure has construct validity, a “lawful network” for the measure has to be developed; this network includes the theoretical framework for what the instrument is attempting to measure (the construct), an empirical framework for how the construct will be measured (observable manifestations), and the interrelationships among and between the two frameworks.

90
Q

Predictor variable

A

Synonymous with IV; the variable that is used to predict variance in the criterion; plotted on the X axis

91
Q

Criterion variable

A

Synonymous with DV; variance of the criterion is predicted by the predictor; plotted on the Y axis

92
Q

Assessment of relationship between the predictor and criterion

A

Beta (β) weights (strength of a predictor when all other predictors are held constant); R² (unique predictive strength of a predictor); Zero-order correlation (relationship between predictor and criterion ignoring all other predictors); Multicollinearity (i.e., highly correlated predictors; may not reduce predictive ability of predictors)

93
Q

Validity coefficient

A

Correlation between predictor and criterion; the squared validity coefficient indicates the proportion of variance in the criterion that is accounted for by the predictor; greater ranges of scores in both predictor and criterion increase the validity coefficient, while restricted range decreases it; few validity coefficients exceed 0.60
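A minimal sketch of the computation, assuming the validity coefficient is an ordinary Pearson correlation between predictor and criterion scores. The function name `pearson_r` and the score lists are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

predictor = [1, 2, 3, 4, 5]   # hypothetical test scores
criterion = [2, 4, 5, 4, 5]   # hypothetical job-performance ratings

validity = pearson_r(predictor, criterion)
variance_explained = validity ** 2   # proportion of criterion variance accounted for
```

Squaring the coefficient is what turns the correlation into a proportion of variance.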

94
Q

Conceptual criterion

A

Theoretical standard that researchers seek to understand

95
Q

Actual criterion

A

Operational or actual standard that researchers actually assess

96
Q

Criterion deficiency

A

Portion of the conceptual criterion that is not measured by the actual criterion

97
Q

Criterion relevance

A

Degree of overlap between the actual criterion and the conceptual criterion

98
Q

Composite criterion

A

Available criterion measure is a composite of separable attributes

99
Q

Criterion of discrimination

A

A criterion that inaccurately differentiates between groups, resulting in majority of members being overrepresented in comparison to minority groups

100
Q

True positive

A

The number of individuals in a given group who exceed cutoff on both predictor and criterion

101
Q

False positive

A

The number of individuals in a given group who exceed cutoff on predictor but fail to exceed cutoff on criterion.

102
Q

True negative

A

Number of individuals in a given group who fail to exceed cutoff on both predictor and criterion

103
Q

False negative

A

The number of individuals in a given group who fail to exceed cutoff on predictor but exceed cutoff on criterion
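The four outcomes can be tallied from paired scores. This is a minimal sketch, assuming "exceed" means strictly greater than the cutoff; the function name `classify_outcomes`, the cutoffs, and the score pairs are made up for illustration.

```python
def classify_outcomes(pairs, predictor_cutoff, criterion_cutoff):
    """Count true/false positives and negatives from (predictor, criterion) score pairs."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for predictor, criterion in pairs:
        predicted_pass = predictor > predictor_cutoff   # exceeds cutoff on the predictor
        actual_pass = criterion > criterion_cutoff      # exceeds cutoff on the criterion
        if predicted_pass and actual_pass:
            counts["TP"] += 1
        elif predicted_pass:
            counts["FP"] += 1
        elif actual_pass:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical (predictor, criterion) pairs with a cutoff of 50 on both measures.
outcomes = classify_outcomes(
    [(80, 75), (85, 40), (30, 35), (45, 90), (70, 65)],
    predictor_cutoff=50,
    criterion_cutoff=50,
)
```

Moving the predictor cutoff up trades false positives for false negatives, and moving it down does the reverse.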

104
Q

Measurement error

A

Error in the employed values of a variable due to the presence of distorting influences on the assessment, such as momentary distractions, error in recording or understanding, and influences of other variables on responses to particular items. These are uncorrelated with the “true scores” by definition and treated as “random”. A reduction of the correlation coefficient because of measurement error is known as attenuation.

105
Q

Threats to internal validity

A

History (any event between pretest and posttest), maturation (natural changes in participants), testing (practice effects), mortality (dropping out), selection, regression effects, demand characteristics

106
Q

AB research design

A

Simplest version of a single-subject design, in which a baseline (A) is tracked and then some treatment (B) is applied; if there is a change, the treatment is said to have an effect. Weak design because any change is open to many alternative explanations.

107
Q

Time Series Design

A

AKA quasi-experimental design; refers to the pretesting and posttesting of one group of subjects at different intervals. The purpose might be to determine long-term effects of treatment, and, therefore, the number of pretests and posttests can vary from one to many. Sometimes there is an interruption (follow-up test) to assess strength of treatment over time.

108
Q

Correction for Guessing Formula

A

Usually used for multiple choice exams; Corrected score = R - [W / (n - 1)]. R is the number of right answers obtained, W is the number of wrong answers, and n is the number of possible answers per question.
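A short worked example of the formula, with the division by (n - 1) applied to the wrong answers only. The function name `corrected_score` and the exam numbers are hypothetical.

```python
def corrected_score(right, wrong, options):
    """Correction for guessing: right answers minus wrong answers divided by (options - 1)."""
    return right - wrong / (options - 1)

# Hypothetical exam: 40 right, 12 wrong, 4 answer choices per question.
score = corrected_score(right=40, wrong=12, options=4)
```

With four options, every three wrong answers cost one point, so the 12 wrong answers here remove 4 points from the 40 right.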