Test Construction Flashcards

1
Q

A test's reliability, or true score variability, cannot be measured directly and must be estimated. In order to estimate a test's reliability, what must be assessed?

A

The consistency of scores over time, across different content samples, and across different scorers. Estimation is based on the assumption that variability that is consistent is true score variability, while variability that is inconsistent reflects measurement (random) error.

2
Q

Most methods that measure reliability produce a reliability coefficient (rxx, a test correlated with itself), which is a correlation coefficient. What range of values can it take? When it is 0, scores are due to? When test-retest reliability is 1, this means that?
If a reliability coefficient is .84, what % of variability in scores is due to true score differences, and what % is due to measurement error?

A

0.0 to 1.0 (0 = all measurement error; 1 = all true score differences)
Measurement error
All variability in scores reflects true score variability
Reliability coefficient = .84: 84% of variability reflects true score differences, while the remaining 16% (1.0 − .84 = .16) reflects measurement error

3
Q

What are the methods for estimating reliability, how does each work, what does each indicate, and what is measurement error due to in each case?

A

Test-retest: administer the test to the same group on two different occasions and correlate the two sets of scores. Indicates the degree of stability (consistency) of examinees' scores over time (the Coefficient of Stability). Measurement error is due to any factors that occur over time: random fluctuations in examinees and variations in the testing situation.

  • Alternate (Equivalent, Parallel) Forms Reliability: two equivalent forms of the test are administered to the same group and the two sets of scores are correlated. Indicates consistency of responding to two test forms (different item samples); when the forms are administered at different times, it also reflects consistency over time. Administered at the same time: Coefficient of Equivalence. Administered at different times: Coefficient of Equivalence and Stability. Measurement error is due to content sampling (the interaction between the examinee's knowledge and the different content sampled by each form). Considered the best estimate of reliability, but it is difficult to create truly equivalent forms.
  • Internal Consistency Reliability: Split-Half Reliability and Coefficient Alpha are the two methods. Both involve administering the test once to a single group, and both yield a coefficient of internal consistency.
  • Split-Half: the half-test correlation is corrected with the Spearman-Brown prophecy formula.
  • Cronbach's Coefficient Alpha: a formula that determines the average degree of inter-item consistency (conceptually, the average of the coefficients obtained from all possible splits; a conservative, lower-bound estimate). When items are scored dichotomously (right or wrong answers), the Kuder-Richardson Formula 20 (KR-20) can be used instead.
  • Error is due to content sampling. For split-half: the examinee's knowledge better matches one half of the test than the other. For coefficient alpha: heterogeneity of the content domain (content error). Internal consistency is most appropriate when the test measures a single characteristic, and it cannot be used for speeded tests.
  • Inter-Rater Reliability (Inter-Scorer, Inter-Observer): used when scores depend on a rater's judgment. Evaluated with a correlation coefficient or percent agreement. For correlation coefficients: the kappa statistic (kappa coefficient, Cohen's kappa) is used with nominal or ordinal scales, and the coefficient of concordance (Kendall's coefficient of concordance) is used when there are three or more raters. Percent agreement: divide the number of items the raters agreed on by the total number of items; a limitation is that agreement may be due to chance alone. Error is due to factors related to the raters, such as consensual observer drift.
4
Q

What factors affect the reliability coefficient?

A

Test length: the longer the test, the larger the reliability coefficient. In addition to its split-half use, the Spearman-Brown formula can be used to estimate the effect of lengthening or shortening a test (see the sketch below).
Range of test scores: reliability is maximized when examinees are heterogeneous and item difficulty is moderate.
Guessing: the easier items are to answer correctly by guessing, the lower the reliability.
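
A minimal sketch of the Spearman-Brown prophecy formula, assuming the standard form r_new = n·r / (1 + (n − 1)·r); the function name and example numbers are illustrative, not from the source:

```python
def spearman_brown(r_xx: float, length_factor: float) -> float:
    """Estimate the reliability of a test whose length is multiplied by
    `length_factor` (2.0 = doubled, 0.5 = halved), given reliability r_xx."""
    return (length_factor * r_xx) / (1 + (length_factor - 1) * r_xx)

# Doubling a test whose reliability is .60 raises the estimate to .75;
# halving it lowers the estimate to about .43.
print(round(spearman_brown(0.60, 2.0), 2))  # 0.75
print(round(spearman_brown(0.60, 0.5), 2))  # 0.43
```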

5
Q

When a predictor's reliability coefficient is .75, its criterion-related validity can be:

A

No greater than the square root of .75.

rxy ≤ √rxx
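
A quick numeric check of this ceiling rule (a sketch; the function name and values are mine):

```python
import math

def validity_ceiling(r_xx: float) -> float:
    """Upper bound on criterion-related validity given reliability r_xx."""
    return math.sqrt(r_xx)

# With a reliability of .75, validity can be no greater than about .87.
print(round(validity_ceiling(0.75), 2))  # 0.87
```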

6
Q

A screening test for a disorder with a low base rate has an overall accuracy rate of 98%. When using this test to identify people in the general population, what is it good to keep in mind?

A

Base rate: the proportion of people who are selected without the predictor (test) and are successful on the criterion (behavior). Positive hit rate: the proportion of people who were selected based on their predictor scores and are successful on the criterion.
Ex.: a psychologist wants to substitute a short screening tool for a lengthier one and wants to determine whether it has adequate incremental validity.
The predictor determines whether a person is a positive or a negative (on the screening tool), and the criterion determines whether he/she is a "true" or a "false" (on the lengthier tool); use a scatterplot.
When the predictor has a high accuracy rate (98%) and the base rate is less than 50%, there will always be a larger number of false positives than false negatives.

7
Q

What are the different methods for establishing validity?

A

Content validity: the degree to which the test adequately samples the content domain or behavior that it measures (e.g., achievement tests); established by experts in the field.
Construct validity: the degree to which the test measures a hypothetical trait (construct).
Criterion-related validity: the ability of the test to predict an examinee's performance on an external criterion.

8
Q

What are some ways that you can test construct validity?

A

Convergent and discriminant validity: correlate test scores with scores on other measures that do and do not assess the same trait (convergent = high correlations with related measures; discriminant = low correlations with unrelated measures).
Factor analysis: another way to obtain information on a test's convergent and discriminant validity.

9
Q

What is a Multitrait-Multimethod Matrix, and what does it assess?

A

A table of correlations that provides information about the degree of association between two or more traits that have been assessed using two or more methods.
It tests a test's convergent and discriminant validity.

10
Q

A supervisor measures two unrelated traits, assertiveness and aggression, using two different methods: a test and supervisor ratings. She calculates the scores and puts them in a ____.
What are the four types of correlation coefficients that will be created, and what do they mean?

A

Multitrait-multimethod matrix
Monotrait-monomethod: a reliability coefficient (a measure correlated with itself). Not a measure of convergent/discriminant validity, but it needs to be high in order to use the matrix.
Monotrait-heteromethod: a large correlation indicates convergent validity.
Heterotrait-monomethod: a small correlation indicates discriminant validity.
Heterotrait-heteromethod: a small correlation indicates discriminant validity.

11
Q

To ensure a work sample has adequate content validity, you would:

A

Make sure that the skills (behaviors) required by the work sample represent the skill domain required by the job.

12
Q

What do you need to construct a confidence interval around an obtained test score?

A

The examinee's test score and the standard error of measurement (which is calculated from the test's standard deviation and reliability coefficient).

13
Q

What is the difference between the standard error of measurement, the standard error of estimate, and the standard deviation?

A

Standard error of measurement: used to construct a confidence interval around an examinee's obtained score; you add and subtract one or more standard errors of measurement to and from the obtained score.
Standard error of estimate: used to construct a confidence interval around an estimated (predicted) criterion score. The 68% confidence interval is constructed by adding/subtracting 1 standard error from the predicted criterion score; the 95% interval, 2 standard errors; the 99.7% interval, 3 standard errors.
Standard deviation: describes the variability of scores in the distribution itself and is used in calculating both standard errors.

14
Q

What is internal and external validity? Is this used in the validity of tests?

A

Internal and external validity refer to the validity of research studies, not tests. A research study has adequate internal validity when its results allow a researcher to conclude that there is a cause-effect relationship between the independent and dependent variables. A research study has adequate external validity when it allows the researcher to generalize conclusions about the cause-effect relationship to other people and conditions.

15
Q

Tests of statistical significance assess?

A

How likely it is that differences between groups are due to sampling error.

16
Q

What are the steps for a factor analysis?

A

Factor analysis tests construct validity. The steps:
administer several tests to a group of examinees;
correlate the scores to convert the data into a correlation matrix;
convert the correlation matrix into a factor matrix;
simplify the interpretation and naming of the factors by rotation (two types of rotation: orthogonal, with factors at a 90-degree angle; oblique, with factors not at a 90-degree angle);
interpret and name the factors.

17
Q

When a test has high sensitivity, it means that:

What is specificity?

A

When the sensitivity is high, this means that most of the people with the disorder will be identified as having the disorder by the test (i.e., there will be few false negatives) but that there will be some people without the disorder who will also be identified as having the disorder (i.e., there will be some false positives).

Specificity is the percent of people who do NOT have the disorder who are accurately identified by the test as not having it.

18
Q

Criterion-related validity is used when?

What is the predictor and what is the criterion?

What are the two forms of criterion related validity?

A

Test scores are to be used to draw conclusions about an examinee's likely status on another measure; e.g., to ensure that an employee selection test can actually predict how well an applicant will do on a measure of job performance after she is hired.
The test is the predictor, and the other measure is the criterion.
Concurrent: criterion data are collected at the same time as, or prior to, the predictor (estimates current status).
Predictive: the criterion is measured some time after the predictor (predicts future performance).

19
Q

What is shared variability and when can you square a correlation coefficient?

A

Shared variability is the proportion of variability that two measures have in common; it is obtained by squaring the correlation between them.
You square a criterion-related validity coefficient to obtain a measure of shared variability.

20
Q

When an exam question gives you the correlation coefficient for variable X and variable Y and asks how much variability in variable Y is explained or accounted for by variable X, what will you do to answer the question correctly?

A

Square the correlation coefficient

21
Q

A psychologist conducts a criterion-related validity study by administering an assertiveness test (predictor) to 100 salespersons and determining their average monthly sales over 3 months (criterion). She correlates their test scores with sales and obtains a validity coefficient of .60. This means that __ of the variability in sales is accounted for by differences in assertiveness, while the remaining __ is due to other factors.

A

36% (.60 × .60 = .36)

64% (1.00 − .36 = .64)

22
Q

Convergent and discriminant (divergent) validity is associated with ___, and concurrent and predictive validity are associated with ___. Concurrent and predictive studies can also assess a predictor's ___, which is ____ and is evaluated by looking at the data on a _____.

A

Construct validity
Criterion-related validity
Incremental validity: the increase in correct decisions that can be expected if the predictor is used as a decision-making tool
Scatterplot

23
Q

What are these?

Communality
Factor loading
principal component
eigenvalue

A

Communality: each test included in a factor analysis has a communality, which indicates the total amount of variability in its scores that has been explained by the factor analysis (i.e., by all of the identified factors).

The factor loading is the correlation between a single test and an identified factor.

In principal components analysis (which is similar to factor analysis), the principal component is equivalent to a factor.

In principal components analysis, the eigenvalue indicates the total amount of variability accounted for by each component (factor).

24
Q

Criterion-related validity refers to the relationship between _______ and a __________ measure.

A

test scores; criterion

25
Q

______________ is a type of construct validity evidence. Construct validity refers to the degree to which a test measures a theoretical construct. Establishing this type of validity may include correlating scores on the test with scores on another test that does not purport to measure the same or a related construct. If the correlations are low, this provides evidence that the test measures the construct it purports to measure (and not the construct the other test purports to measure). (What is this referred to as?)

A

Discriminant validity

26
Q

Content validity refers to the degree to which a test samples the content domain it purports to sample. For example, an algebra test with many questions about calculus has low content validity. Content validity, a concern with most academic tests as well as with the Psychology Licensing Exam, is primarily determined subjectively by whom in the given content domain?

What is construct validity?

A

experts

Construct validity refers to the degree to which a test measures a theoretical construct.

27
Q

An item characteristic curve (ICC) indicates:

A

the relationship between the likelihood that an examinee will endorse the item and the examinee’s level on the attribute measured by the test.
As its name suggests, an item characteristic curve (ICC) provides information about an item's characteristics. It is part of item response theory and is constructed for each item. The curve shows the relationship between an examinee's level on the ability or trait and the probability that she will respond correctly to that item.

28
Q

The internal consistency of test items can be assessed with what? What are the possible sources of error? When is it useful?

The kappa statistic is used to measure what?

Alternate forms reliability provides information on what? And it is used when ___________?

What provides information on the consistency of a test over time?

A

Split-half reliability (with the Spearman-Brown correction), Cronbach's coefficient alpha (a conservative, lower-bound estimate), or the Kuder-Richardson Formula 20 (when test items are scored dichotomously). Error results from content sampling (for split-half, differences in content between the two halves of the test; for coefficient alpha, differences between individual test items, i.e., heterogeneity of test items). Useful when measuring a single characteristic.

Inter-rater reliability, when scores or ratings represent a nominal or ordinal scale of measurement.

The equivalence of the items contained in two different forms of a test; used when two equivalent forms are available.

Test-retest reliability (the coefficient of stability).

29
Q

When would you use the correction for attenuation formula?

A

To estimate what a predictor's criterion-related validity coefficient would be if the predictor and/or the criterion had a reliability coefficient of 1.0.
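
A short sketch of the correction for attenuation, assuming the standard formula r_corrected = r_xy / sqrt(r_xx · r_yy); the example values are hypothetical:

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Estimated validity if the predictor (reliability r_xx) and the
    criterion (reliability r_yy) were both perfectly reliable."""
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity .40, predictor reliability .75, criterion reliability .80:
print(round(correct_for_attenuation(0.40, 0.75, 0.80), 2))  # 0.52
```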

30
Q

Multiple regression analysis is a

A

A metric statistical technique used to analyze the relationship between a single dependent variable (DV) and several independent variables (IVs).

31
Q

In the context of factor analysis, “specificity” refers to:

A

the proportion of variability in a test that has not been explained by the factor analysis.

32
Q

When points are added to examinees' raw scores, what is important to know about the effect on their percentile ranks?

A

Remember that in a normal distribution, more scores are close to the middle of the distribution (e.g., the 52nd percentile) than at the tails (e.g., the 14th percentile). So adding five points to the raw score of an examinee near the middle of the distribution (e.g., Nancy) lets her surpass more people, because many scores cluster there. In contrast, adding five points to the raw score of an examinee at a tail (e.g., Pete) has less impact on his relative standing, because there are fewer scores at the tail. As a result, Nancy's percentile rank would increase more. (A good mnemonic for remembering this is "more movement in the middle.")

33
Q

What type of test would have the lowest reliability: true-false, multiple-choice, or free recall?

A

A test’s reliability is an index of the degree to which scores on the test are free from random error (chance factors) and indicative of examinees’ true scores. The greater the probability that an examinee can answer an item correctly by guessing (i.e., by chance), the lower a test’s reliability.

On a true-false test, the probability that the correct answer can be selected by chance is 50% (1/2). In other words, there is a higher probability that an examinee can guess a correct answer on a true-false test than on the other tests, and thus, an examinee’s score on a true-false test will reflect, to a greater degree, error rather than his/her true score.

34
Q

One way to assess a test's construct validity is to correlate test scores with scores on measures that do and do not purport to assess the same trait. High correlations with measures of the same or related traits provide evidence of what?
Low correlations with measures of unrelated characteristics provide evidence of what?

A researcher who wants to create a measure of self-esteem might also use measures of self-worth, confidence, social skills, and self-appraisal to assess what type of validity?
To do this, the researcher would use a ______ that will help him/her to systematically organize the data.

A

Convergent validity (high correlations with the same or related traits)
Divergent validity (low correlations with unrelated characteristics)

Convergent

Multitrait-multimethod matrix

35
Q

If a test being validated is a self-rating scale of self-esteem and it has a high correlation with a previously validated parent-rating scale of self-esteem, this would suggest that the test being validated is actually measuring self-esteem. If the researchers are using a multitrait-multimethod matrix, what type of coefficient is this, and what does it indicate?

Within this matrix, which coefficients are reliability coefficients, and should they be large or small?

If the test being validated is a self-rating scale of self-esteem and it has a high correlation with a self-rating scale of neuroticism, this would suggest that the measure of self-esteem may be measuring something other than self-esteem, since these two traits should not correlate. What type of coefficient is this, what does it mean, and what size indicates divergent validity?

A

A large monotrait-heteromethod coefficient provides evidence of a test's convergent validity.

Monotrait-monomethod coefficients (a test correlated with itself) are the reliability coefficients; they need to be large.

Heterotrait-monomethod coefficient: indicates the correlation between two measures that use the same method to assess different traits. A large heterotrait-monomethod coefficient indicates that a test lacks divergent validity, while a small one indicates that it has divergent validity.

36
Q

A test’s communality is interpreted directly as the amount of variability in test scores explained by the identified factors. Therefore, when the communality is .50, this means that ___% of variability in test scores is explained by the identified factors.

A

50

In factor analysis, a test’s communality indicates its “common variance,” or the amount of variability in test scores that is explained by the factors that the test shares in common with the other tests included in the analysis.

37
Q

A factor analysis is another way to obtain information about a test's ___ or ___ validity.

What are the 5 steps to conducting a factor analysis?

The numbers in the last column of the factor matrix are the ______, which indicate a test's "__________."

What indicates the total amount of variability in test scores that is accounted for by all the identified factors and is a component of true score variability (reliability)?

Recall that a reliability coefficient indicates the amount of variability in test scores that represents true score variability. In a factor analysis this is broken into two components: communality and specificity. What is specificity? Is it explained by the factor analysis?

A

Convergent or divergent validity; ultimately, construct validity.

1. Administer a number of measures of the same construct and of different constructs to several groups of examinees.
2. Correlate each test's scores with the scores on every other test to obtain a correlation matrix.
3. Convert the correlation matrix into a factor matrix of factor loadings, which indicate the degree of association between each test and each factor.
4. Simplify by rotating the factors: either orthogonal or oblique.
5. Interpret and name the factors.

Last column: communality, or "common variance": the amount of variability in test scores that is explained by the factors that the test shares in common with the other tests included in the analysis.

  • Communality
  • Specificity: variability that is due to factors unique or specific to the test and not measured by the other tests in the factor analysis. It is not explained by the factor analysis because it is not shared with the other tests.
38
Q

A criterion-related validity coefficient is interpreted like any other correlation coefficient for two variables, i.e., it is _____ to obtain a measure of shared variability.

Thus a validity coefficient of .30 indicates that 9% (.30 squared) of variability is shared by the predictor and criterion; or, put another way, that 9% of variability in criterion scores is explained by variability in predictor scores.

A

SQUARED

39
Q

Inter-rater reliability can be evaluated by correlating the scores assigned by two or more raters (e.g., with the ____) or by calculating __________.

A

kappa statistic

percent agreement

40
Q

What formula is also known as the prophecy formula and is used to estimate the effect of adding or subtracting items on a test's reliability coefficient?

What is used to construct a confidence interval around an examinee's obtained test score? (The confidence interval indicates the range within which the examinee's true score is likely to fall.)
What is the standard error of measurement when the standard deviation is 10 and the test's reliability coefficient is .84? If an applicant received a score of 80, what would the 95% confidence interval for his true score be?

What is the formula for the standard error of estimate? Given an applicant whose predictor score of 80 yields a predicted criterion score of 7,000, a criterion SD of 1,000, and a validity coefficient of .60, how would you construct a confidence interval? In this example, what is the predicted (estimated) criterion score around which the confidence interval is constructed?

A

Spearman-Brown formula

Standard error of measurement

SEM = SD × √(1 - rxx) = 10 × √(1 - .84) = 10 × √.16 = 10 × .4 = 4.0

Think bell curve: 95% corresponds to 2 standard errors from the score, so with SEM = 4 the interval is 80 ± 2(4) = 72 to 88.

Again, to interpret an examinee's true score a confidence interval needs to be constructed; the 95% confidence interval is usually used, and the SEM gives the range within which the true score is likely to fall.

SEE = SD × √(1 - r²)

SEE = 1,000 × √(1 - .60²) = 1,000 × √(1 - .36) = 1,000 × √.64 = 1,000 × .8 = 800. The 68% confidence interval is therefore 7,000 ± 800 (6,200 to 7,800).
7,000, because it is the predicted criterion score that we are estimating.
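
The card's two calculations, reproduced in a small Python sketch (the function names are mine):

```python
import math

def sem(sd: float, r_xx: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - r_xx)

def see(sd: float, r_xy: float) -> float:
    """Standard error of estimate: SD * sqrt(1 - r**2)."""
    return sd * math.sqrt(1 - r_xy ** 2)

s = sem(10, 0.84)                    # 4.0
print(s, (80 - 2 * s, 80 + 2 * s))   # 95% interval: 72.0 to 88.0

e = see(1000, 0.60)                  # 800.0
print(e, (7000 - e, 7000 + e))       # 68% interval: 6200.0 to 7800.0
```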

41
Q

In a factor analysis, what is an orthogonal rotation? And an oblique rotation?

A

An orthogonal rotation is used when the variables included in the analysis are believed to be uncorrelated. For example, if you conduct a factor analysis on 50 questionnaire items designed to measure a leader's task- or person-orientation and you believe that these two orientations are independent (uncorrelated), you would perform an orthogonal rotation.

An oblique rotation is used when the variables in the analysis are believed to be correlated. For example, if you conduct a factor analysis on three tests designed to measure verbal ability and three tests designed to measure nonverbal ability, and there's evidence that verbal and nonverbal ability are correlated, you would perform an oblique rotation.

42
Q

The relationship between the likelihood that an examinee will endorse an item and the examinee's level on the attribute measured by the test describes an item's what? What information does the curve provide?

The ICC is part of what theory, which makes up for the shortcomings of classical test theory?

A

Item Characteristic Curve As its name suggests, an item characteristic curve (ICC) provides information about an item’s characteristics.

The curve provides information on the relationship between an examinee’s level on the trait or ability and the probability that he/she will respond correctly to the item.

Item Response Theory

43
Q

In addition to concurrent and predictive validity, criterion-related data can be used to assess incremental validity, the increase in correct decisions that can be expected if the predictor is used as a decision-making tool. To determine incremental validity you first want to create a ____, and the criterion and predictor cutoff scores must be set.
How do you find incremental validity?

A

Scatterplot
Quadrants formed by the criterion and predictor cutoffs:
False Negatives (actually successful, but the predictor says no) ][ True Positives (successful, and the predictor says yes)
________________________
True Negatives (not successful, and the predictor says no) ][ False Positives (not successful, but the predictor says yes)

Incremental validity = positive hit rate − base rate
Positive hit rate = true positives ÷ total positives
Base rate = (true positives + false negatives) ÷ total number of people
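
A small sketch of these formulas using hypothetical counts from the four scatterplot quadrants:

```python
def incremental_validity(tp: int, fp: int, fn: int, tn: int) -> float:
    """Positive hit rate minus base rate, from quadrant counts."""
    total = tp + fp + fn + tn
    positive_hit_rate = tp / (tp + fp)   # of those selected, how many succeed
    base_rate = (tp + fn) / total        # success rate without the predictor
    return positive_hit_rate - base_rate

# 30 true positives, 10 false positives, 20 false negatives, 40 true negatives:
print(incremental_validity(30, 10, 20, 40))  # 0.75 - 0.50 = 0.25
```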

44
Q

Data used to evaluate a predictor's accuracy can be expressed in terms of which 6 validity indexes?

____ and ___ provide information about a predictor's accuracy when administered to a group of individuals who are known to have or not have the disorder of interest.
_____ is the percent of people who do NOT have the disorder and are accurately identified by the predictor as NOT having the disorder.
_____ is the percent of people who have the disorder and are accurately identified by the predictor as having the disorder.

What is it called when the predictor provides the information needed to estimate the probability that people have or do not have the disorder when they test positive or negative on the predictor?

What indicates the extent to which a positive or negative result on the predictor changes the probability that a person has the disorder assessed by the predictor?
The higher the LR+, the ____ the probability that the disorder is present. The closer the LR- is to zero, the ___ the probability that the disorder is present.

A
Sensitivity
Specificity
positive predictive value
negative predictive value
positive likelihood ratio
negative likelihood ratio

Specificity: do NOT have the disorder and are identified as not having it (mnemonic: the doctor specifically said that you do not have the disorder).

Sensitivity: have the disorder and are identified as having it (mnemonic: it makes sense that you have it, since everyone in your family is sick). Sensitivity refers to the proportion of people with the condition who are correctly identified by the test and is calculated by dividing the true positives by the true positives plus false negatives. When sensitivity is high, most of the people with the disorder will be identified as having it by the test (i.e., there will be few false negatives), but some people without the disorder will also be identified as having it (i.e., there will be some false positives).

  • Positive and negative predictive values
  • Positive (LR+) and negative (LR-) likelihood ratios

Greater; lower.

45
Q

In classical test theory, p represents what, and ranges from ____ to _____, with what type of values representing the easier items? What value of p indicates that no examinees answered the question correctly, and what value of p means that all examinees answered the question correctly?

What values of p are retained?

What affects the optimal difficulty level?

What is the preferred difficulty level for true and false tests?

what is the equation for p?

What is item discrimination (D) and how is it calculated? This index ranges from ____ to _____. If all examinees in the upper group and none in the lower group answered the item correctly, then D = ?

According to the Classical Test Theory
X = T + E

A reliability coefficient of .70 indicates that 70% of variability in test scores is due to true score variability, while the remaining 30% is due to measurement error.

To find the standard error of measurement (the estimate of the range within which an examinee's true score is likely to fall, given her obtained score), what formula do you use?

How do you find the communality of a test on a rotated factor matrix when the rotation is orthogonal (the factors are uncorrelated)?

Factor loadings are correlation coefficients indicating the degree of association between each test and each factor. How do you interpret a factor loading to determine the amount of variability in test scores that is accounted for (explained) by that factor?

How do you calculate the standard error of estimate? How do you determine a 95% confidence interval?

A

p is the item difficulty
ranges from 0 to 1.0

Larger values = easier items
p = 0: no examinees answered the item correctly
p = 1.0: all examinees answered it correctly (an easier question)

Items of moderate difficulty (p ≈ .50) are retained.

Guessing affects the optimal difficulty level: the optimal p is halfway between 1.0 and the probability of answering correctly by chance alone.
The goal of testing also determines the optimal level; for selection, the optimal p corresponds to the proportion of examinees to be selected.

For true-false items, the chance of guessing correctly is .50, so the optimal difficulty level is .75, halfway between .50 and 1.0.

p = total number passing the item ÷ total number of examinees

D discriminates between examinees who obtained high and low scores on the entire test. To find it:
D = (proportion of the upper-scoring group passing the item) − (proportion of the lower-scoring group passing the item)
It ranges from −1.0 to +1.0.
D = 1.0

X: examinee’s obtained test score
T: True Score
E: Error Component

SEM = SD × √(1 - r), where r is the reliability coefficient.

Communality: a test's factor loadings can be squared and summed to calculate the communality (the amount of variability in test scores explained by the identified factors).

Square each factor loading: the squared loading on Factor 1 gives the % of variability explained by Factor 1; repeat for Factor 2.

SEE = SD × √(1 - r²)
95% confidence interval: add and subtract 2 SEE around the predicted (estimated) criterion score.
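
A minimal sketch of the item difficulty and item discrimination calculations (function names and numbers are illustrative):

```python
def item_difficulty(num_passing: int, num_examinees: int) -> float:
    """p: the proportion of examinees who answered the item correctly."""
    return num_passing / num_examinees

def item_discrimination(p_upper: float, p_lower: float) -> float:
    """D: proportion correct in the upper-scoring group minus the
    proportion correct in the lower-scoring group (-1.0 to +1.0)."""
    return p_upper - p_lower

print(item_difficulty(50, 100))        # 0.5, a moderate item
print(item_discrimination(1.0, 0.0))   # 1.0, perfect discrimination
```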

46
Q

When do you use the kappa statistic?

A

The kappa statistic is used to measure the consistency of ratings assigned by two raters when data are nominal or ordinal. (Note that some authors use the term "discontinuous" to refer to nominal and other discrete data, i.e., data that represent noncontinuous categories.) Mnemonic: the name (nominal) of your ordinary (ordinal) Kap.
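
A self-contained sketch of Cohen's kappa for two raters, using the standard definition kappa = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement; the ratings below are made up:

```python
from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in c1) / (n * n)         # chance agreement
    return (p_o - p_e) / (1 - p_e)

r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
r2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(cohens_kappa(r1, r2))  # 0.5 (observed agreement .75, chance .50)
```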

47
Q

Raw scores are often transformed in order to simplify their interpretation, and these transformations can be either linear or nonlinear.

_____________ transformations alter the rank order and relative size of the distance between scores. Percentile ranks (which always have a flat distribution regardless of the distribution of the raw scores) are an example of a nonlinear transformation. Logarithms and square roots are other types of nonlinear transformations.

___________ transformations preserve the rank order and relative size of the distance between scores

A

Nonlinear

Linear

48
Q

The validity of a test may not be _____ than the reliability index.

A test's criterion-related validity coefficient cannot _____ the square root of its reliability coefficient.

A

Higher. For a test to be valid, it has to be reliable; a test that possesses poor reliability cannot be expected to yield high validity.

Exceed.
For criterion-related validity, r is squared to obtain shared variability (variability "accounted for" or "explained by").
A test's reliability always places a ceiling on its validity: when a test has low reliability it cannot have high validity; however, high reliability does not guarantee validity.

49
Q

What is the relationship between communality and a test’s reliability?

A

Communality is variability due to common test factors and is a component of true score variability.
A test's reliability will always be at least as large as its communality; the communality is the lower limit of the test's reliability coefficient.

50
Q
To determine if a student has benefited from an educational program, you would most likely want to determine how much of the information presented in the program has been retained and/or to what degree participation in the program has improved the individual's performance on a task. What score would be most useful?
______ scores (e.g., standard scores, percentile ranks) tell you how well an examinee is doing compared to other examinees.

____________ scores are a type of norm-referenced score.

______ scores indicate the relative strengths of the different characteristics measured by a test for the individual and would be less useful than criterion-referenced scores for the purpose described in the question.

A

Criterion-referenced scores tell you how well an examinee did in absolute terms (e.g., how many questions he or she answered correctly) and, therefore, would be most useful for the purpose described in the question.

Norm-referenced

Standard

Ipsative

51
Q

A percentage score is one type of criterion-referenced score and indicates the percentage of test content that the examinee answered correctly.
This method of score interpretation is often employed in?

What is another type of criterion referenced interpretation?

A

Mastery (criterion-referenced) testing, which involves specifying the terminal level of performance required of all learners and periodically administering the test to assess their mastery.

Another type involves interpreting examinees' test scores in terms of their likely status on an external criterion, using a regression equation or an expectancy table.

52
Q

What is a principal components analysis?
What are the variables called?

Are principal components oblique or orthogonal?

How do you calculate eigenvalues?

A
  • To identify a set of variables (components) that explains all (or nearly all) of the total variance in a set of test scores; i.e., to reduce the number of variables in a data set while preserving as much information as possible. Principal components analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the large set.

The variables are called principal components (eigenvectors). Each is defined as a linear combination of the set of tests that describes as much of the intercorrelation between the tests as possible. Components are extracted so that the first component accounts for the largest amount of variability in test scores, the second component accounts for the second largest amount, and so on.

Orthogonal: each component accounts for a unique amount of variability in test scores.

To calculate an eigenvalue: sum the squared correlations (loadings) between that component and each test.
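
A small numeric sketch of these two sums, using a hypothetical loading matrix (rows = tests, columns = components):

```python
# Hypothetical factor/component loadings for three tests on two components.
loadings = [
    [0.80, 0.30],   # Test A
    [0.70, 0.40],   # Test B
    [0.20, 0.90],   # Test C
]

# Eigenvalue of a component: sum of squared loadings down its column.
eigenvalues = [sum(row[j] ** 2 for row in loadings) for j in range(2)]

# Communality of a test: sum of its squared loadings across its row.
communalities = [sum(x ** 2 for x in row) for row in loadings]

print([round(v, 2) for v in eigenvalues])    # [1.17, 1.06]
print([round(h, 2) for h in communalities])  # [0.73, 0.65, 0.85]
```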

53
Q

A 200-item test that has been administered to 100 college students has a normal distribution, a mean of 145, and a standard deviation of 12. When the students’ raw scores have been converted to percentile ranks, Alex obtains a percentile rank of 49, while his twin sister Alicia obtains a percentile rank of 90. Whose score will change the most when a raw score of 5 is added?

A

Alex's score will change the most. A problem with percentile ranks is that, when the raw scores are normally distributed, raw score differences near the center of the distribution are exaggerated when they are converted to percentile ranks, while raw score differences at the extremes are reduced. (A useful mnemonic for remembering this is "more movement in the middle.")
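
A sketch of the effect using Python's standard-library normal distribution and the card's mean of 145 and SD of 12; the raw scores are back-solved from the stated percentile ranks:

```python
from statistics import NormalDist

dist = NormalDist(mu=145, sigma=12)

for name, pct in [("Alex", 0.49), ("Alicia", 0.90)]:
    raw = dist.inv_cdf(pct)            # raw score at that percentile
    new_pct = dist.cdf(raw + 5) * 100  # percentile after adding 5 points
    print(f"{name}: {pct * 100:.0f}th -> {new_pct:.0f}th percentile")

# Alex: 49th -> 65th (a 16-point jump near the middle)
# Alicia: 90th -> 96th (only 6 points out in the tail)
```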

54
Q

_______ validity (a type of criterion-related validity) refers to the extent to which test scores correlate with scores on an external criterion

_________ validity is associated with criterion-related validity and refers to the increase in decision-making accuracy that results from use of a predictor

A

Concurrent validity

Incremental: the increase in correct decisions that can be expected if the predictor is used as a decision-making tool. It uses data collected from a concurrent or predictive validity study and involves applying cutoff scores to a scatterplot to determine: true positives (predicted to succeed and successful on the criterion); false positives (predicted to succeed but not successful); true negatives (predicted to be unsuccessful and were unsuccessful); and false negatives (predicted to be unsuccessful but were successful).

55
Q

What is the difference between concurrent and predictive validity?
When are they used?

A

The difference is in timing.

Concurrent: criterion-related data are collected prior to, or at the same time as, the predictor.
Predictive: the criterion is measured some time after the predictor.

Concurrent: used to estimate current status.
Predictive: used to predict future performance.

56
Q

Do you square a criterion-related validity coefficient when determining shared variability?

A

Square

57
Q

Which has higher reliability: true-false, multiple choice with three answer options, or multiple choice with six answer options?

A

All other things being equal, tests containing items that have a low probability of being answered correctly by guessing alone are more reliable than tests containing items that have a high probability of being answered correctly by guessing alone. Of the item types listed, multiple-choice items with six answer options have the lowest probability of being answered correctly by guessing alone.
Multiple choice with six possible answers.

58
Q

An advantage of using the kappa statistic rather than percent agreement when assessing a test’s inter-rater reliability is that the former:

A

The kappa statistic (which is also known as Cohen’s kappa and the kappa coefficient) provides a more accurate estimate of reliability than percent agreement because its calculation includes removing the effects of chance agreement.

59
Q

How do you calculate incremental validity? Base rate?

How do you calculate the positive hit rate?

A

Incremental validity = positive hit rate − base rate

Base rate = (true positives + false negatives) ÷ total number of people

Positive hit rate = true positives ÷ total positives

60
Q

What is the difference between sensitivity and specificity? and what is the purpose? How are they calculated?

Sensitivity and Specificity are two of six validity indexes. What are the other four?

A

Sensitivity and specificity provide information about a predictor's accuracy when administered to a group of individuals who are known to have or not have a disorder.
Sensitivity is the percent of people who HAVE the diagnosis and are accurately identified by the predictor: true positives ÷ (true positives + false negatives), i.e., those above the cutoff. It is the probability that the test will correctly identify people with the disorder from the pool of people who have it.

Specificity is the percent of people who do NOT have the diagnosis and are accurately identified as not having it: true negatives ÷ (true negatives + false positives), i.e., those below the cutoff.

(Mnemonic: sensitive people have the disorder; specific people do not.)

Positive and negative likelihood ratios: the extent to which a positive or negative result on the predictor changes the probability that the person has the disorder assessed by the predictor; both positive and negative results affect that probability.

Positive and negative predictive values: the positive predictive value is the probability that a person identified by the test as having the disorder actually has it: true positives ÷ (true positives + false positives). The negative predictive value is the probability that a person identified by the test as not having the disorder actually doesn't have it: true negatives ÷ (true negatives + false negatives).
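
A sketch computing all six indexes from hypothetical true/false positive and negative counts; the formulas follow the card's definitions, with LR+ = sensitivity / (1 − specificity) and LR- = (1 − sensitivity) / specificity:

```python
def validity_indexes(tp: int, fp: int, fn: int, tn: int) -> dict:
    sensitivity = tp / (tp + fn)   # have the disorder, correctly identified
    specificity = tn / (tn + fp)   # don't have it, correctly identified
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": tp / (tp + fp),                   # positive predictive value
        "NPV": tn / (tn + fn),                   # negative predictive value
        "LR+": sensitivity / (1 - specificity),
        "LR-": (1 - sensitivity) / specificity,
    }

# Hypothetical screening results: 90 TP, 50 FP, 10 FN, 850 TN.
for name, value in validity_indexes(90, 50, 10, 850).items():
    print(name, round(value, 3))
```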

61
Q

What is cross validation and what causes shrinkage?

What causes Criterion Contamination?

A

Cross-validation: validating a predictor on another sample to eliminate chance factors related to the uniqueness of the original sample. Because the chance factors present in the original group are not present in the second group, the cross-validation coefficient will shrink.
Shrinkage: the smaller the original validation sample, the greater the shrinkage of the validity coefficient when it is cross-validated.
Criterion contamination occurs when a rater's knowledge of a person's predictor scores biases how he/she rates the person on the criterion measure. Criterion contamination artificially inflates the correlation between the predictor and the criterion.

62
Q

A test developer would use the Kuder-Richardson Formula (KR-20) in order to:

A

KR-20 is used to determine a test's internal consistency reliability when test items are scored dichotomously.

63
Q

Content sampling is not a potential source of measurement error for which method for evaluating a test’s reliability?

A

Content sampling refers to the extent to which test scores depend on factors specific to the particular items included in the test (i.e., to its content).
Because test-retest reliability involves administering the same test (i.e., the same content) twice, content sampling is not a source of error.

64
Q

Of the various types of test validity, ________ validity has been described as the broadest category of validity because it overlaps and encompasses all other types.

A
The classical (tripartite) view of validity distinguishes between three major types: content, criterion-related, and construct. From this perspective, construct validity refers to the degree to which a test measures the construct it was designed to measure.
Due to changes in the conceptualization of validity, methods for evaluating validity, and methods of interpreting test scores, construct validity is now described by many experts as a unifying concept of validity.
65
Q

The optimal item difficulty (p) for a true-false test is:

A

.75

66
Q

A screening test for a disorder that has a very low base rate in the population is known to have an overall accuracy rate of 98%. When using this test to identify individuals in the general population who have the disorder, it’s important to keep in mind that the test will produce:

A

This is a difficult question, but you can identify the correct answer if you know that a low base rate means very few people in the population have the disorder, which implies that the most likely predictive error will be to falsely identify those who do not have the disorder as having it.

To see why, assume that the base rate for the disorder is 1% and that you test a random sample of 10,000 people with the screening test. In this situation, 100 people will have the disease, and the test (which has a 98% accuracy rate) will correctly identify 98 of them: 98 true positives and 2 false negatives.

Of the 9,900 people who do not have the disease, the test will correctly identify 9,702: 9,702 true negatives and 198 false positives. In other words, there will be more false positives than false negatives, and this will be true whenever the predictor has a high accuracy rate and the base rate is less than 50%.
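
The card's arithmetic, reproduced as a short sketch:

```python
population = 10_000
base_rate = 0.01    # 1% of the population has the disorder
accuracy = 0.98     # the test classifies 98% of people correctly

have = round(population * base_rate)        # 100 people with the disorder
without = population - have                 # 9,900 people without it

true_positives = round(have * accuracy)     # 98
false_negatives = have - true_positives     # 2
true_negatives = round(without * accuracy)  # 9,702
false_positives = without - true_negatives  # 198

# Far more false positives than false negatives whenever accuracy is high
# and the base rate is below 50%.
print(false_positives, false_negatives)     # 198 2
```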

67
Q

Multiple regression analysis is a

A

A metric statistical technique used to analyze the relationship between a single dependent variable (DV) and several independent variables (IVs).

68
Q

To determine how well an examinee did on a test compared to other examinees, you would use:

A

When using norm-referenced interpretation, an examinee’s test performance is compared to the performance of members of the norm group (other people who have taken the test).

69
Q

The item difficulty index (p) represents which scale of measurement?

A

Ordinal

The item difficulty index (p) is calculated by dividing the number of individuals who answered the item correctly by the total number of individuals.

To understand why the item difficulty index represents an ordinal scale, assume that items 1, 2, and 3 of a test are passed by 10, 20, and 30 percent of examinees, respectively, which will result in p values for these items of .10, .20, and .30. Although these values indicate that item 1 is more difficult than item 2 which, in turn, is more difficult than item 3, it is not possible to say that item 2 is twice as difficult as item 1 or that the difference in difficulty between items 1 and 2 is equal to the difference between items 2 and 3. Moreover, an item difficulty index of 0 would not mean that the item completely lacks difficulty (which doesn't really make any sense). In other words, p values represent an ordinal scale because they do not have the property of equal intervals or an absolute zero point.

70
Q

A criterion has a mean of 100 and a standard deviation of 9. Based on this information, you can conclude that the criterion’s standard error of estimate is between:

A

The standard error of estimate equals 0 when the validity coefficient is equal to 1 (its maximum value) and equals the standard deviation when the validity coefficient is equal to 0 (its minimum value). Therefore, the standard error ranges in value from 0 to the size of the standard deviation (which, in this case, is 9).

(The standard error of estimate is calculated by multiplying the standard deviation of the criterion scores times the square root of 1 minus the validity coefficient squared.)