Test Construction Flashcards

0
Q

Standardization includes:
A. A uniform procedure for administering and scoring a test.
B. Providing details for administration.
C. Establishing norms.
D. Objectivity is a product of the standardization process.
E. A standardized test is a sample of behavior that is representative of the whole behavior.

A

All are true; A is the best answer.

Compare test scores to norms, i.e., to a representative sample of the population who took the test.
Norms must be based on a truly representative sample of the population for which the test is designed (and to which the examinee belongs).
To be truly representative, a sample must be reasonably large.
Tests often have different norms for different groups (e.g., children, males, particular ethnic groups).
Advantage of norms: you can compare an examinee to others in the norm group, which also allows comparison of performance across different tests.
Disadvantage: norms don't provide an absolute or universal standard of good or bad performance. To the degree the norm group is large and representative, this matters less, but norms are always relative, never absolute, standards.

Objective means independent of subjective judgment: uniform procedures for administering and scoring, so examinees should get the same score regardless of who scores the test.

1
Q

Match test characteristics with their definitions:
A. Level of difficulty one can attain
B. Best possible performance
C. Attaining a pre-established level of acceptable performance
D. What one usually does
E. Response rate

A

A. Level of difficulty one can attain: power test. Either there is no time limit, or the limit allows examinees to attempt all items. Items are usually arranged from least to most difficult, and some are so hard no one gets them. E.g., the Information subtest.
B. Test of maximum performance. E.g., achievement or aptitude tests.
C. Mastery test; usually an all-or-none score. E.g., tests of basic skills.
D. Test of typical performance. E.g., personality or interest tests.
E. Speed test: has time limits and consists of items that all or almost all examinees can answer correctly. E.g., Digit Symbol.

2
Q

What limits a test when the measure doesn’t include an adequate range of items at the extremes?

A

Ceiling effects: not enough adequately hard items, so all high-ability examinees get similar scores.
Floor effects: not enough easy items, so low-achieving examinees get similar scores.
These are sometimes discussed in the context of internal validity, where they represent an interaction between selection and testing.

3
Q

Ipsative vs. normative measures?

A

Ipsative: the individual is the frame of reference in score reporting. Scores reflect relative strengths within the individual examinee, who expresses a preference for one item over another rather than responding to each item individually.

Normative: measures the absolute strength of each attribute measured by the test. Allows comparison to others.

4
Q

Defining characteristic of an objective test?
A. Existence of norms
B. A standardized set of scoring and administration procedures
C. Examiner discretion in scoring and interpretation
D. Reliability and validity

A

B

5
Q
An IQ test is given to a group on Oct 1 and the same test on Nov 1. The examiner is interested in:
A. The test's reliability
B. The test's validity
C. Whether the test is vulnerable to response sets
D. Double billing
A

A

6
Q

A drawback of norm-referenced interpretation is:
A. A person's performance is compared to the performance of other examinees
B. It doesn't allow comparison of an individual's scores on different tests
C. It doesn't indicate where an examinee stands in relation to others in the same population
D. It doesn't provide absolute standards of performance

A

D. It doesn't provide absolute standards of good and poor performance; scores must be interpreted in light of the norm group as a whole.

7
Q

What are the two ways to think of reliability?
A. Yields repeatable, dependable results
B. Is free from error and yields the true score
C. Error is minimized
D. Measures what it is supposed to measure

A

A and B.

C may also be right, but it is not definitive enough.

True score: the examinee's actual status on the attribute being measured.

Error (measurement error) refers to factors that are irrelevant to what is being measured. It doesn't affect all examinees the same way and is due to many factors.

8
Q
Reliability coefficient:
A. Ranges from 0 to 1; don't square it
B. Ranges from -1 to 1; square it
C. Interpret it inversely
D. Can be any number
A

A. Correct! If the coefficient is 0, the test is not reliable: scores are due entirely to random factors. If it is 1, there is no error: perfect reliability. A coefficient of .90 means the test is 90 percent reliable, with 10 percent error.

In other words, reliability represents the proportion of the total observed variance that is true variance. Interpret it directly (no squaring).
Personality tests usually run about .70; selection tests in industrial-organizational settings about .90.
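A minimal Python sketch of how to read the coefficient (the .90 value is the card's example):

  # Classical test theory: reliability is the proportion of observed
  # variance that is true-score variance. Interpret directly, no squaring.
  reliability = 0.90
  true_proportion = reliability          # .90 -> 90% true variance
  error_proportion = 1 - reliability     # .10 -> 10% measurement error
  print(true_proportion, error_proportion)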

9
Q

Pair each method of estimating reliability with its definition:
A. Internal consistency
B. Alternate forms
C. Coefficient of stability
D. Divide the test in two and correlate the halves
E. Single administration of a test, for internal consistency, when items are dichotomously scored
F. Single administration of a test, for internal consistency, with multiply scored items
G. Interscorer reliability

A

A. Internal consistency: correlations among individual items. Three kinds: split-half, Cronbach's coefficient alpha, and Kuder-Richardson Formula 20.
B. Alternate forms (coefficient of equivalence; also called equivalent or parallel forms): administer two equivalent forms to the same group of examinees and correlate the scores.
C. Test-retest reliability: administer the same test to the same group of people and correlate scores on the first and second administrations.
D. Split-half.
E. Kuder-Richardson Formula 20.
F. Cronbach's coefficient alpha.
G. Interrater reliability: based on rater judgment. Most often, scores from two raters are correlated. Kappa is used with nominal data.
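A rough Python sketch of Cronbach's coefficient alpha, using the standard variance-ratio formula; the toy data are invented for illustration. With dichotomously scored items like these, alpha reduces to KR-20.

  # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
  def cronbach_alpha(scores):                 # scores[examinee][item]
      k = len(scores[0])                      # number of items
      def var(xs):                            # population variance, for simplicity
          m = sum(xs) / len(xs)
          return sum((x - m) ** 2 for x in xs) / len(xs)
      item_vars = [var([row[i] for row in scores]) for i in range(k)]
      total_var = var([sum(row) for row in scores])
      return k / (k - 1) * (1 - sum(item_vars) / total_var)

  # Four examinees answering three dichotomously scored items -> 0.75
  print(cronbach_alpha([[1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]))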

10
Q

What are the sources of measurement error for the test-retest coefficient?
A. Time factors (time sampling)
B. Practice effects
C. A longer interval between administrations decreases error
D. Memory for the exam

A

A. True: such as changes in exam conditions (noise, weather).
B. True: examinees do better the second time.
C. Nope: a shorter interval between administrations decreases error.
D. True: memory for the exam is a disadvantage, ESPECIALLY with a short interval between administrations; the coefficient will then be spuriously high.

Test-retest is not appropriate for instruments that measure unstable attributes like mood, since the coefficient would reflect the instability of the attribute rather than of the test, or for tests affected by repetition. So it suits few psychological tests.

11
Q
What are the sources of error for alternate-forms reliability?
A. Content
B. Time
C. Memory
D. Practice effects
A

A and B: true.
C and D: Nope! Alternate forms reduce these problems, although if content is similar they may still be a slight problem.

The two forms can't be administered at the same time; however, if they are administered in immediate succession, time is not considered a source of error.

Some say this is the best coefficient to use: if it is high, the test is consistent across both time and different content.

Disadvantages: costly and impractical, and it can't be used for unstable traits.

12
Q

Describe the three types of coefficients of internal consistency.

Give the error sources.

Describe what they are good and not good for measuring.

A

Internal consistency coefficients reflect correlations among individual items.

Split-half: correlate the two halves of the test (e.g., odd- vs. even-numbered items) as if they were two shorter tests. This lowers the estimate, because a shorter test is less reliable; overcome with the Spearman-Brown correction. Kuder-Richardson Formula 20 (dichotomous items) and Cronbach's coefficient alpha (multiply scored items) avoid splitting altogether.

Error sources: content sampling and item heterogeneity. The coefficient is lowered if items differ in content.

Good for: unstable traits (only one administration is needed).
Bad for: speed tests, where the coefficient would be spuriously high, near 1.0.

13
Q

Interrater reliability is increased if:
A. Raters are well trained
B. Raters know they are being observed
C. The rating scale is adequate

A

All.

E.g., on a behavioral rating scale, items should be mutually exclusive and exhaustive.

14
Q

Match the methods of recording with their definitions:
A. The recorder keeps a count of the number of times the target behavior occurs
B. Observe at intervals and note whether the subject is engaging in the behavior
C. Record all behavior the subject engages in during the observation period
D. Record the elapsed time during which the target behavior occurs

A

A. Frequency: good for short observation periods and when duration doesn't matter.
B. Interval: good if the behavior has no fixed beginning or end.
C. Continuous: usually recording all of the target subject's behavior during each observation session, i.e., writing a narrative description, in chronological order, of everything the subject does.
D. Duration.

15
Q

The standard error of measurement is:
A. How much error a test contains
B. How much error an individual test score can be expected to have
C. How much error is expected in a criterion score estimated from a predictor
D. A measure of internal consistency

A

A. That's the reliability coefficient.
B. Correct.
C. That's the standard error of estimate.
D. That's internal consistency (correlations among individual items).

The standard error of measurement is used to build a confidence interval giving the range within which an examinee's true score is likely to fall, given the obtained score.
Formula: SEM = SD x the square root of (1 minus the reliability coefficient).
As reliability decreases, error increases.
68% CI: obtained score +/- 1 SEM
95% CI: +/- 1.96 SEM
99% CI: +/- 2.58 SEM
E.g., a score of 100 with SEM = 4: the 68% CI is 100 +/- 4, or 96 to 104; the 95% CI is 100 +/- (1.96 x 4), or roughly 92 to 108; the 99% CI is roughly 90 to 110.
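The formula and the card's worked example, sketched in Python:

  import math

  # SEM = SD * sqrt(1 - reliability coefficient)
  def sem(sd, reliability):
      return sd * math.sqrt(1 - reliability)

  # Obtained score 100 with SEM = 4, as in the example above.
  obtained, error = 100, 4
  for z in (1.0, 1.96, 2.58):                 # 68%, 95%, 99% intervals
      print(obtained - z * error, obtained + z * error)

This prints 96-104, roughly 92-108, and roughly 90-110, matching the rounded intervals above.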

16
Q
If the reliability coefficient is 1.0, then the standard error of measurement is:
A. Less than 1
B. Zero
C. Between zero and 1
D. Needs to be calculated
A

B. Perfect reliability means no error.
Using the formula, you can also see that if the reliability coefficient is 0, the standard error of measurement equals the standard deviation of the scores.

The same logic applies to the standard error of estimate, which is used to interpret an individual's predicted score on a criterion measure: if the validity coefficient is 1, the standard error of estimate is 0 and there is no error in prediction; if the validity coefficient is 0, the standard error of estimate equals the standard deviation of the criterion scores.
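The card asks only for the concept, but the standard formula behind this behavior is SEE = SD of the criterion x the square root of (1 minus the squared validity coefficient). A small sketch with illustrative numbers:

  import math

  def see(sd_criterion, validity):
      return sd_criterion * math.sqrt(1 - validity ** 2)

  print(see(10, 1.0))    # perfect validity -> 0, no error in prediction
  print(see(10, 0.0))    # zero validity    -> equals the criterion SD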

17
Q

What factors impact the reliability coefficient?
A. A decrease in score variability
B. Anything that increases error variance
C. Longer tests
D. A homogeneous group
E. Items that are too hard or too easy
F. The type of questions asked (e.g., true/false)
G. Homogeneous items (per internal-consistency statistics)

A

A and B: true.
C. Longer tests are more reliable. (The Spearman-Brown formula is applied to estimate the effect of lengthening or shortening a test on its reliability; see the sketch after this list.)
D. The more homogeneous the group, the lower the variability, and the reliability coefficient decreases.
E. If test items are too hard or too easy, variability is decreased and so is the coefficient (floor/ceiling effects).
F. Reliability is lower when examinees can guess correctly: true/false is less reliable than multiple choice, which is less reliable than fill-in-the-blank.
G. For inter-item consistency as measured by Kuder-Richardson or coefficient alpha, reliability increases as items become more homogeneous.
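A short sketch of the Spearman-Brown formula named in item C; the .60 reliability is illustrative:

  # Estimated reliability after changing test length by a factor n
  # (n = 2 doubles the test, n = 0.5 halves it).
  def spearman_brown(r, n):
      return n * r / (1 + (n - 1) * r)

  print(spearman_brown(0.60, 2))      # doubling: .60 -> .75
  print(spearman_brown(0.60, 0.5))    # halving:  .60 -> about .43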

18
Q
You would not use Kuder-Richardson for:
A. A dichotomously scored test
B. Measuring an unstable trait
C. A speed test
D. A psychological test
A

C

19
Q
A way to improve the interrater reliability of a behavioral observation scale would be to use:
A. Mutually exclusive rating categories
B. Non-exhaustive rating categories
C. Highly valid rating categories
D. Empirically derived rating categories
A

A

20
Q

The standard error of measurement is:
A. Inversely related to the reliability coefficient and inversely related to the standard deviation
B. Positively related to the reliability coefficient and positively related to the standard deviation
C. Positively related to the reliability coefficient and inversely related to the standard deviation
D. Inversely related to the reliability coefficient and positively related to the standard deviation

A

D

21
Q
When practical, it is most advisable to use:
A. The alternate-forms reliability coefficient
B. Test-retest
C. Internal consistency
D. Interscorer reliability
A

A

22
Q

According to classical test theory, an observed test score reflects:
A. True score variance plus systematic error
B. True score variance plus random error
C. True score variance plus random and systematic error
D. True score variance only

A

B. Error is random by definition.

23
Q
Which method of recording is most useful when the target behavior has no fixed beginning or end?
A. Interval
B. Continuous
C. Frequency
D. Duration
A

A. During each interval you decide whether the behavior is occurring, not when it begins or ends.

24
Q

Match the type of validity with its definition:
A. Predicts someone's status on an external criterion measure
B. Measures a theoretical, non-observable construct or trait
C. Measures knowledge of the domain it is designed to measure
D. High correlation with another test that measures the same thing
E. Low correlation with a test that measures something different

A
A. Criterion-related validity
B. Construct validity
C. Content validity
D. Convergent validity
E. Divergent validity
25
Q
Validity means:
A. The test measures what it is supposed to measure
B. The test's usefulness
C. Consistency over time
D. One must consider what the test is for
A

A, B, and D are correct.

No test has validity per se.

26
Q

Content validity: true or false?
A. Especially useful for achievement tests
B. The extent to which test items adequately and representatively sample the content to be measured
C. Determined via statistical analysis
D. Appears valid to those who take it

A

A. Correct; it is also used in industrial settings, e.g., for work samples and licensing exams.
B. Correct!
C. False. Content validity rests on the judgment and agreement of subject-matter experts: clearly identify the domain, divide it into subcategories, and select items from each so the test is representative. You may also want the test to correlate highly with tests of the same content domain, or with success in the relevant class.
D. False! That is face validity, which is not really a type of validity, but it is desirable: without it, people may not cooperate, may lack motivation, etc.

27
Q

Criterion-related validity:
A. Scores on a predictor test are correlated with an outside criterion
B. Uses a correlation coefficient
C. Data are gathered at the same time
D. Useful for predicting an individual's behavior in specific situations

A

A. True! The criterion is whatever is being predicted: job performance, school achievement, test scores, etc.
B. True: e.g., a Pearson r, here called the criterion-related validity coefficient. It ranges from -1 to 1; 1 is perfect validity, 0 is none. Few exceed .60; even .30 may be acceptable. Square it to interpret: the square is the proportion of variability in the criterion explained by (shared with) variability in the predictor.
C. Partly true. When predictor and criterion data are gathered at the same time, it is called concurrent validity (e.g., a typing test); best for assessing current status. It is less costly and more convenient, so it is often used in place of predictive validation (e.g., for pilots: you can't hire everyone, so you test and pick the best).
Alternatively, in predictive validation the predictor is administered first and the criterion measured later (e.g., GRE scores correlated with later GPAs). Best for predicting future status.
D. True, ESPECIALLY for selecting employees, making admissions decisions, and placing students in special classes.

28
Q
What is used to determine where the actual criterion score will fall?
A. Standard error of the mean
B. Standard error of measurement
C. Standard error of estimate
D. Regression line
A

A. That's how much a sample mean can be expected to deviate from the population mean.
B. That concerns reliability: how much an obtained score deviates from the true score.
C. Correct!
D. A regression line allows prediction of the unknown value of one variable from the known value of another; it is used to get the PREDICTED score.

Use the standard error of estimate to build a confidence interval: there is a 95% chance the actual criterion score will fall within the predicted score +/- (1.96 x standard error of estimate).
E.g., an IQ of 115 put into the regression equation predicts a math score of 80. With a standard error of estimate of 5, there is a 95% chance the actual score is between about 70 and 90:

80 +/- (2 x 5), rounding 1.96 to 2.
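The same arithmetic without the rounding, as a quick check:

  # Predicted criterion score 80, standard error of estimate 5.
  predicted, see = 80, 5
  print(predicted - 1.96 * see, predicted + 1.96 * see)   # 70.2 to 89.8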

29
Q

What are the differences between the standard error of estimate and the standard error of measurement?

A

The SE of measurement concerns reliability; the SE of estimate concerns validity.

The SE of measurement estimates where the true test score is likely to fall, given the obtained score on the same test. No predictor is involved.

The SE of estimate estimates where the actual criterion score is likely to fall, given the criterion score predicted by another measure. A predictor must be involved!

Know the SE of measurement formula! For the SE of estimate, just know the concept.

30
Q

How do the following factors affect the criterion-related validity coefficient?
A. The range of scores
B. The reliability of the predictor and criterion
C. Retesting with a second sample
D. Education, sex, income, etc.
E. Predictor scores influencing criterion status

A

A. Validity is lower if the range of scores on the predictor and/or criterion is restricted: the more homogeneous the sample, the lower the validity.
B. Validity is lower if the predictor and/or criterion is unreliable, but high reliability doesn't guarantee adequate validity: an unreliable test is always invalid, but a reliable test may or may not be valid.
C. Validation of the predictor on a second sample will likely yield a lower coefficient than the first (cross-validation). Typically a test is developed and validated item by item, and the items correlating highest with the criterion are kept; when this is cross-validated, the second coefficient is lower. This is called shrinkage: the predictor was tailor-made for the original validation sample and doesn't generalize fully. So you must cross-validate, or true validity is overestimated.
D. Validity may vary among subgroups (men vs. women; high vs. low SES, etc.). When a moderator variable is present for one group and not another, there is differential validity. Rare in industrial settings.
E. Criterion contamination: knowledge of predictor scores affects criterion ratings. This artificially INFLATES the validity coefficient; no one rating the criterion should have knowledge of predictor scores.

31
Q

What is shrinkage and what factors impact it?

A

Shrinkage is the tendency of validity coefficients to decrease in magnitude upon cross-validation.

Shrinkage increases when:
the sample size is small;
the original item pool is large;
the number of items kept is small relative to the item pool, and/or items were not chosen on the basis of previously formulated hypotheses or experience with the criterion.

32
Q

What is construct validity and how is it determined?

A

Construct validity concerns whether a test measures a psychological variable that is abstract and not directly observable; it is the concern for abstract attributes.

It is worked out over time on the basis of accumulated evidence.

It is established with a choice of methods guided by the test developer's theory.

33
Q

Construct validity is determined by convergent and divergent validity.

What method is used to assess them?

A

Convergent validity: the degree to which a test correlates highly with another test designed to measure the same trait or construct.

Divergent (discriminant) validity: the degree to which a test has a low correlation with another test designed to measure a different trait or construct. This is the only case where a low coefficient provides evidence of high validity.

The method used is the multitrait-multimethod matrix: measure two or more traits, each by two or more methods.
Convergent validity is shown if tests of the same trait correlate highly even with different methods.
Divergent validity is shown if two tests measuring different traits have a low correlation even with the same method.

The other method is factor analysis.

34
Q

What is factor analysis?
A. A method of establishing a test's construct validity
B. Determines the degree to which a set of tests measures the same underlying constructs or factors
C. Administer a set of tests to the same group, correlate each test with every other test, and obtain a factor matrix
D. The factor matrix indicates how many factors can explain scores on the tests, as well as the factor loadings (each test's correlation with each factor)
E. Rotate the factors

A

All. The underlying constructs may be only a few; factor analysis finds out how many there are and to what degree they account for test scores. They are called latent variables because the tests in the analysis were not directly intended to measure them.
The purpose is sometimes described as detecting the structure in a set of variables: start with a large number and classify them into sets.
Rotation of factors is either orthogonal (uncorrelated factors) or oblique (correlated factors).

35
Q

Explain the terms in the factor matrix.

Factor loading: the correlation between a test and a factor. Ranges from -1 to 1; square it to interpret. So a vocabulary test loading .7 on Factor 1 means .7 squared, or 49%, of the variability in vocabulary is explained by Factor 1.

Communality (h squared): the proportion of a test's variance that is common variance, i.e., accounted for by the identified factors (factors that also account for variance in the other tests). Square the test's factor loadings and add them: e.g., .4 squared + .6 squared = .52, so 52% of the test's variability is explained by whatever traits the two factors represent.
u squared is the unique variance, specific to the test and not explained by the factors in the communality.

u squared = 1 - h squared

What two components make up a test's reliability?

What is an eigenvalue?

A

Communality and specificity: the part of true-score variability shared with other tests plus the part of true variability unique to the test itself.
So a test's reliability must be at least as high as its communality (the communality is a lower-limit estimate of reliability).

An eigenvalue is explained variance.
It appears at the bottom of each factor's column: the amount of variance in all the tests accounted for by that factor.
It is used to determine whether a factor accounts for a significant amount of the variability in the tests.
Convert it to a percentage:
(eigenvalue x 100) / number of tests
This tells you the percentage of total variability explained.

Factors are usually ordered by eigenvalue, so Factor 1 explains more of what is going on than Factor 2, and so on.
The sum of the eigenvalues can be no larger than the number of tests in the analysis.
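A Python sketch of this bookkeeping. The two loadings are the card's example; the three-test factor column is hypothetical.

  # Communality and unique variance for one test loading on two factors.
  loadings = [0.4, 0.6]                        # Test 1 on Factors I and II
  h2 = sum(l ** 2 for l in loadings)           # communality = .52
  u2 = 1 - h2                                  # unique variance = .48

  # Eigenvalue: squared loadings summed down a factor's column,
  # then converted to a percentage of total variance.
  column = [0.7, 0.5, 0.3]                     # hypothetical loadings of 3 tests
  eigenvalue = sum(l ** 2 for l in column)     # .83
  percent = eigenvalue * 100 / len(column)     # (eigenvalue x 100) / number of tests
  print(h2, u2, eigenvalue, percent)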

36
Q

To make sense of a factor analysis, one has to interpret and name the factors. If a factor has a large enough eigenvalue, it is presumed that the factor represents one trait being measured. Inferences are based on the factors and on theory. Factors are usually rotated first; that is, the tests' communalities are re-divided so a clearer picture of the loadings emerges.
Rotation changes the data: true or false?
Name the types of rotation and why each is used.
A

False. The interrelationships stay the same, as do the communalities; rotation just puts the factors in a new position. However, eigenvalues may change after a rotation, so use that term only with unrotated factors.

Orthogonal: the factors are independent of each other (uncorrelated).
Some say orthogonal rotation should always be used because it is easy to interpret.
It uses communality as the sum of the squared factor loadings.
Oblique: the factors are correlated with each other.
Some say oblique rotation should be used because most traits and categories are correlated.
It doesn't use communality.

A test has construct validity if it correlates highly with a factor it would be expected to correlate with.

37
Q

Principal component analysis and cluster analysis are two statistical techniques similar to factor analysis. Which of the following do they have in common with factor analysis?
A. Reduce a larger set of variables to a smaller set of underlying elements
B. Detect structure in a set of variables
C. Derive a factor matrix
D. Use eigenvalues, which indicate the explanatory power of each construct and are computed by squaring and summing the factor loadings in an unrotated factor matrix column
E. Order the underlying elements

A

All are true of principal component analysis.

Only A and B are true of cluster analysis.

38
Q

What are the differences between factor analysis and principal component analysis?
A. The mathematics
B. Terminology: factor = eigenvector or principal component
C. Variance: factor analysis distinguishes communality, specificity, and error; principal component analysis distinguishes only explained variance and error, so there is no separate specificity
D. Principal components are always uncorrelated: there is no oblique rotation

A

All

39
Q
What are the differences between factor analysis and cluster analysis?
A. The variables used
B. No use of underlying traits
C. Tests a hypothesis
D. The purpose is to create a taxonomy
A

A. True. Factor analysis requires interval or ratio variables; cluster analysis can use any kind.
B. True: cluster analysis just yields clusters, which are categories and not necessarily traits or latent variables.
C. False for cluster analysis: factor analysis tests hypotheses, but cluster analysis involves no a priori hypotheses.
D. True for cluster analysis: it develops classification systems, e.g., taxonomies of criminals, rapists, alcoholics, etc.

40
Q

For a test to be valid it must be reliable: reliability is a necessary but not sufficient condition for validity. True or false?
A. Just because a test is reliable doesn't mean it is valid.
B. The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient. The reliability coefficient sets a ceiling, or upper limit, on the validity coefficient: the validity coefficient is less than or equal to the square root of the reliability coefficient and can't be higher.
C. The reliability of the criterion also affects the criterion-related validity coefficient.
D. The correction for attenuation is used to estimate what a test's validity coefficient would be if both predictor and criterion were perfectly reliable.

A

All true.

Since reliability sets an upper limit on validity, a test with moderate or low reliability is limited in terms of how valid it can be.
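Both relationships, sketched in Python with illustrative coefficients:

  import math

  # Reliability caps validity: validity <= sqrt(predictor's reliability).
  def validity_ceiling(reliability):
      return math.sqrt(reliability)

  # Correction for attenuation: estimated validity if both predictor (r_xx)
  # and criterion (r_yy) were perfectly reliable.
  def corrected_validity(r_xy, r_xx, r_yy):
      return r_xy / math.sqrt(r_xx * r_yy)

  print(validity_ceiling(0.90))               # about .95
  print(corrected_validity(0.30, 0.70, 0.80)) # about .40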

41
Q
The percentage of examinees who answer an item correctly:
A. Item discrimination
B. Item difficulty
C. Item analysis
D. p
A

B and D.
The percentage is the item difficulty index, p.
If p = .80, then 80% of examinees passed that item.

The higher p, the less difficult the item.
Test developers choose items of moderate difficulty because they increase test score variability, which is associated with higher levels of reliability and validity, and they provide maximum differentiation between high and low scorers.

Different difficulty levels suit different purposes; the level should approximate the selection ratio.
To select children for an accelerated program that takes only 25% of applicants, use a difficulty level of .25, meaning only 25% pass.

For mastery tests, item difficulty is higher, like .80 or .90.

When it is possible to answer correctly through blind guessing, use a higher p. This is because if the p level is too low, correct responses are likely to reflect chance guessing rather than what the test is trying to measure.

Rule of thumb: the average difficulty level of test items should be about halfway between 1.0 and the level of success expected by chance.
So for true/false items, p = .75; for multiple-choice items with 5 options, halfway between .20 and 1.0, so p = .60.

Average item difficulty is affected by the nature of the sample who tried the items out; the tryout sample should be representative of the population for which the test is intended.
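The index and the rule of thumb, sketched in Python:

  # p = proportion of examinees answering the item correctly (1 = correct).
  def item_difficulty(responses):
      return sum(responses) / len(responses)

  # Optimal average p: halfway between 1.0 and chance-level success.
  def optimal_p(chance):
      return (1.0 + chance) / 2

  print(optimal_p(0.5))   # true/false, chance = .5       -> .75
  print(optimal_p(0.2))   # 5-option multiple choice, .2  -> .60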

42
Q
On what scale is item difficulty expressed?
A. Nominal
B. Ordinal
C. Ratio
D. Interval
Interval
A

Ordinal is correct.

According to Anastasi, it is not interval or ratio because equivalent differences in p level do not necessarily indicate equivalent differences in difficulty.

p indicates the rank order of item difficulty, but you can't infer that the differences between items are equal.

43
Q
The degree to which items differentiate among examinees in terms of the characteristic being measured:
A. Item discrimination
B. Item difficulty
C. Item response theory
D. Difficulty index
A

A. Item discrimination: the item discriminates between, say, high and low scorers, e.g., actually depressed people consistently answer the item differently than non-depressed people.

It is measured in many ways. One is to correlate item responses with total test score and keep the highest-correlating items; good when one attribute is measured and internal consistency is very important. Another, when predicting performance on a criterion, is to correlate each item with the criterion and pick items with high correlations with the criterion but low correlations with each other.
You can also calculate the item discrimination index (sketched below):
D = U - L
where U and L are the percentages of the upper- and lower-scoring groups who answer the item correctly. D ranges from a maximum of +100 to -100 (everyone in the lower group and no one in the upper group answers correctly); 0 means equal proportions and no discriminability.

Items of moderate difficulty are associated with maximum discriminability, and item difficulty places a ceiling on the discrimination index:
if difficulty is 1.0 (all correct) or 0 (no one correct), then D is 0. An item everyone answers the same way has no discriminating value.
The higher D, the higher the reliability.

All methods seem comparable.
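A minimal sketch of the index:

  # D = U - L, where U and L are the percentages of the upper- and
  # lower-scoring groups who answer the item correctly.
  def discrimination_index(upper_pass_pct, lower_pass_pct):
      return upper_pass_pct - lower_pass_pct

  print(discrimination_index(100, 0))    # +100: perfect discrimination
  print(discrimination_index(0, 100))    # -100: lower group does better
  print(discrimination_index(60, 60))    #    0: no discriminating value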

44
Q

What is item response theory?
A. It utilizes the discrimination index
B. p
C. A mathematical approach to item analysis using curves
D. It differentiates between the characteristic measured and the items

A

C. IRT uses graphical depictions (item characteristic curves) of the percentage of people at different ability levels who answer each analyzed item correctly.

It is based on the assumption that performance on a test is related to how much of a latent or underlying trait the respondent possesses.

The curves depict item difficulty and discrimination.
Difficulty is the point on the ability axis where the probability of a correct response is .5; this is another way of measuring item difficulty. If an item reaches that .50 point at, say, ability level 4, it is a level-4-difficulty item.
The slope indicates discrimination: the steeper the curve, the more useful the item is at discriminating between high and low scorers; a flatter curve is less useful.
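The card doesn't name a specific equation for the curves; a common parametric form (an assumption here, not from the card) is the three-parameter logistic:

  import math

  # Probability of a correct response given ability theta, with
  # a = discrimination (slope), b = difficulty (theta where p = .5
  # when c = 0), and c = guessing parameter (lower asymptote).
  def icc(theta, a, b, c=0.0):
      return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

  print(icc(0.0, a=1.0, b=0.0))   # at theta = b the probability is .5
  print(icc(2.0, a=1.0, b=0.0))   # higher ability -> higher probability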

45
Q
What are the three things in item response theory that are derived from the item characteristic curves?
A. Item difficulty
B. Probability of guessing
C. Item reliability
D. Item discrimination
A

All but C.

A. Difficulty: the .5 point on the ability axis.
B. Guessing: where the curve crosses the y-axis; if it doesn't cross, the guessing parameter is zero.
D. Discrimination: the slope tells you.

(p. 71)

46
Q

What are the two assumptions about test items?
A. If reliable, then valid
B. The results of testing are sample-free
C. How one does on an item is related to the estimated amount of the latent trait being measured
D. Adaptive testing of ability

A

A. Nope.
B. True: called invariance of item parameters.
An item should have the same parameters (difficulty and discrimination) across all samples of the population. This implies that, once analyzed, items of wide-ranging difficulty levels can be used with any individual to provide an estimate of ability. Only holds with large samples.
C. True.
So scores of individuals who took different items can be directly compared.

Total test scores of a sample can also be compared to the proportion of people who answered each item correctly.

D. Not an assumption. IRT has been applied to adaptive testing, which is giving a set of items matched to the examinee's estimated level of ability.

47
Q
The item difficulty level associated with the maximum level of differentiation among examinees is:
A. .1
B. .5
C. .75
D. 1.0
A

B

48
Q
The optimal average item difficulty level for a true/false test is:
A. .1
B. .5
C. .75
D. 1.0
A

C. Based on the probability of answering correctly by chance alone: the optimal level is halfway between 1.0 and the level of success expected by chance alone.

49
Q

A test item's difficulty level is most affected by:
A. Test length
B. The test's validity
C. The nature of the testing process
D. The characteristics of the individuals taking the test

A

D. Difficulty is measured in terms of the percentage of examinees who answer the item correctly.

50
Q

Which is least true of item response theory?
A. The items analyzed measure a latent trait
B. It allows ability levels of different groups of people to be compared, even if testers use different item sets
C. It works best with large samples
D. It is based on the notion that the characteristics of an item will differ depending on the characteristics of the sample of individuals tested

A

D. Item parameters (difficulty and discrimination) will be the same regardless of the sample.

51
Q

What is norm-referenced interpretation?
A. Provides an indication of where an examinee stands in relation to others who have taken the test
B. Developmental scores
C. Compares an examinee's score to others in the normative sample
D. Within-group norms

A

All!
Developmental norms: how far along the developmental path an individual has progressed.
Mental age scores: the examinee's score is compared to the average performance of others at different age levels (the basis of the ratio IQ score).
Grade equivalent scores: used for educational achievement tests.
Disadvantage of developmental norms: no comparison across different age levels and no standard deviation, so scores are NOT comparable.
Within-group norms: provide a comparison of the examinee's score to the most comparable standardization sample.

52
Q

What is the difference between a percentile rank and a percentage score?

A. Scored better than 90% of examinees?
B. Answered 90% of items correctly?

A

A. Percentile rank.
Advantage: easy to understand and interpret.
Disadvantage: percentile ranks represent ranks, not absolute differences between scores.
B. Percentage score.

53
Q

What is considered the most satisfactory type of norm-referenced score?

Name the types.

A

Standard scores: they express a raw score's distance from the mean in standard deviation units.
z scores: how many standard deviation units above or below the mean.
T scores: used on psychological tests, e.g., the MMPI.
Stanine scores.
Deviation IQ scores: how IQ is interpreted now.
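A sketch of the common conversions (assuming the usual scales: T has mean 50 and SD 10; deviation IQ typically has mean 100 and SD 15):

  def z_score(raw, mean, sd):
      return (raw - mean) / sd          # SD units above/below the mean

  def t_score(z):
      return 50 + 10 * z                # e.g., MMPI

  def deviation_iq(z):
      return 100 + 15 * z               # most modern IQ tests

  z = z_score(115, 100, 15)             # raw 115 on a mean-100, SD-15 test
  print(z, t_score(z), deviation_iq(z)) # 1.0, 60.0, 115.0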

54
Q
Percentile ranks and T scores have which in common?
A. Both are standard scores
B. Both are norm-referenced scores
C. Both are developmental scores
D. Both are criterion-referenced scores
A

B. Only the T score is a standard score.

55
Q

An advantage of the deviation IQ score, as compared to the ratio IQ score:
A. Gives an index of the examinee's absolute level of intelligence
B. Indicates the examinee's mental age
C. Allows scores of individuals who are the same age to be compared
D. Allows score comparisons to be made across age levels

A

D. E.g., you can compare the IQs of a 9-year-old and a 30-year-old.

56
Q
Decreasing a test's inter-item consistency makes the test:
A. Less valid
B. Less reliable
C. More valid
D. More reliable
A

B. One measure of a test's reliability is how homogeneous, or internally consistent, its items are (coefficient alpha or KR-20). Therefore, decreasing inter-item consistency makes a test less reliable.

57
Q

A factor analysis with 2 orthogonal factors gives these loadings:

Test 1: Factor I = .5, Factor II = .5

How much of the variance in Factor I is explained by variance in Factor II?
How much of the variance in Test 1 is attributable to Factors I and II combined?
A. 0
B. 50
C. 10
D. Can't determine

A

First question: A. Zero. Orthogonal factors are uncorrelated, so it is zero.

Second question: B. 50. This is the communality: .5 squared + .5 squared = .50, so 50% of the variance in Test 1 is attributable to the two factors.

58
Q
A selection test for graduate school would show the highest validity if scores were correlated with the actual grades of:
A. The lowest scorers
B. Only the middle-range scorers
C. All admitted applicants
D. Only the highest scorers
A

C. Any coefficient will be lower with a restriction in the range of scores on one or both variables.

59
Q

Which is true?
A. A valid test will always be reliable
B. A reliable test will always be valid
C. The validity coefficient sets a ceiling on reliability
D. The validity coefficient is equal to the square root of the reliability coefficient

A

A. Correct.
B. Not true.
C. Reliability sets the ceiling on validity, not the reverse.
D. The UPPER LIMIT of the validity coefficient is equal to the square root of the reliability coefficient.

60
Q
To determine the degree to which an obtained test score is likely to deviate from the true test score, use:
A. Standard error of estimate
B. Standard error of measurement
C. Standard error of the mean
D. Standard error of judgment
A

B. An obtained test score is likely to differ from the true test score to the degree that the test contains error. The standard error of measurement is used to build the range within which the true score is likely to fall, given the obtained score.

61
Q
The validity coefficient is zero. The standard error of estimate is:
A. 0
B. The validity coefficient
C. The standard deviation of the predictor scores
D. The standard deviation of the criterion scores
A

D. Answer by using the formula for the standard error of estimate: with a validity coefficient of zero, the standard error of estimate comes out equal to the standard deviation of the criterion scores.

62
Q
A test developer wants maximum reliability. The average item difficulty level should be set at ___. If the developer wants a test only for highly qualified examinees, the level should be ___.
A. .5, .15
B. .5, .8
C. .25, .8
D. .75, .25
A

A

63
Q
The multitrait-multimethod matrix is used for:
A. Concurrent and predictive validity
B. Content validity
C. Face validity
D. Discriminant and convergent validity
A

D

64
Q
A test developer does a thorough job analysis to create a work sample that will be used as a selection tool. The job analysis shows concern with:
A. Content validity
B. Criterion-related validity
C. Construct validity
D. Face validity
A

A. A job analysis finds out the tasks and what one does on the job.

65
Q
The difference between coefficient alpha and KR-20:
A. Alpha is an internal consistency coefficient
B. KR-20 indexes inter-item consistency
C. KR-20 is used with dichotomously scored items
D. Alpha involves making a Type I error
A

C. Both give an index of the test's average degree of inter-item consistency; alpha is not limited to dichotomously scored items.

(p. 86; review 13-18)

66
Q

If a test has a reliability coefficient of .90, we conclude that:
A. The highest validity coefficient the test could have is .81
B. The validity coefficient is equal to the square root of .90
C. The test is probably very valid
D. The test may or may not be valid

A

D.
The upper limit of the validity coefficient is the square root of .90 (not .81, which is the square of .90). This means the test's validity is lower than or equal to the square root of .90.

(p. 64)
Don't forget:
With orthogonal rotation there is no correlation between factors.

If a test has perfect validity, there is no error of estimate.

67
Q
Criterion contamination has the effect of:
A. Increasing the validity coefficient
B. Decreasing the validity coefficient
C. Increasing examinees' criterion scores
D. Decreasing examinees' criterion scores
A

A.
E.g., knowing an examinee scored low on the predictor may lead a rater to give a lower score on the criterion.

This results in artificially high consistency between predictor and criterion and inflates the validity coefficient.

68
Q

Communality is:
A. The proportion of variance in test scores accounted for by the identified factors
B. Used with oblique rotation
C. The degree to which scores on a number of tests can be reduced to fewer factors
D. The correlation between a test and a factor

A

A. Correct.
B. Wrong: communality is used with orthogonal rotation.
C. That describes factor analysis.
D. That is a factor loading.

69
Q

Correction for attenuation:
A. A formula to tell what the reliability would be if a test were lengthened
B. A formula to estimate how much more valid a predictor test would be if it had perfect reliability
C. Mental age, grade equivalent, ratio IQ

A

A. That's the Spearman-Brown formula.
B. Correct.
C. Those are developmental scores.