Test Construction Flashcards
Standardization includes:
A. A uniform procedure for administering and scoring a test.
B. Providing details for administration.
C. Establishing norms.
D. Objectivity is a product of the standardization process.
E. A standardized test is a sample of behavior intended to be representative of the whole behavior.
All are true; A is the best answer.
Norms: compare an examinee's test score to a representative sample of the population on that test.
Norms must be based on a truly representative sample of the population for which the test is designed (and to which the examinee belongs).
To be truly representative, a sample must be reasonably large.
Tests often have different norms for different groups (e.g., children, males, specific ethnic groups).
Advantage of norms: allows comparison to others in the norm group and comparison of performance across different tests.
Disadvantage: norms don't provide an absolute or universal standard of good or bad performance. To the degree the norm group is large and representative, this matters less, but norms are always relative, not absolute, standards.
Objective: independent of subjective judgment. Uniform procedures for administering and scoring mean examinees should get the same score regardless of who scores the test.
Match test characteristics with their definitions:
A. Level of difficulty an examinee can attain
B. Best possible performance
C. Attaining a pre-established level of acceptable performance
D. What an examinee usually does
E. Response rate
A. Power test: either no time limit, or one that allows examinees to attempt all items. Items are usually arranged from least to most difficult, with some that no one can answer. Example: the Information subtest.
B. Test of maximum performance. Examples: achievement or aptitude tests.
C. Mastery test: usually an all-or-none score. Example: tests of basic skills.
D. Test of typical performance. Examples: personality or interest tests.
E. Speed test: has time limits and consists of items that all or almost all examinees can answer correctly. Example: Digit Symbol.
What limits a test when the measure doesn’t include an adequate range of items at the extremes?
Ceiling effects: not enough adequately hard items, so all high-ability examinees get similar scores.
Floor effects: not enough easy items, so low-achieving examinees get similar scores.
Sometimes discussed in the context of internal validity, where they represent an interaction between selection and testing.
Ipsative vs. normative measures?
Ipsative: the individual is the frame of reference in score reporting.
Scores reflect relative strengths within the individual examinee.
Examinees express a preference for one item over another rather than responding to items individually.
Normative: measures the absolute strength of each attribute measured by the test, so scores can be compared to others.
Defining characteristic of an objective test?
A. Existence of norms
B. A standardized set of scoring and administration procedures
C. Examiner discretion in scoring and interpreting
D. Reliability and validity
B
An IQ test is given to a group on October 1 and the same test again on November 1. The examiner is interested in:
A. The test's reliability
B. The test's validity
C. Whether the test is vulnerable to response sets
D. Double billing
A. This assesses test-retest reliability (the coefficient of stability).
A drawback of norm-referenced interpretation is:
A. A person's performance is compared to the performance of other examinees
B. It doesn't allow comparison of an individual's scores on different tests
C. It doesn't indicate where the examinee stands in relation to others in the same population
D. It doesn't provide absolute standards of performance
D. Correct: it doesn't provide absolute standards of good and poor performance. Scores must be interpreted in light of the norm group as a whole.
What are the two ways to think of reliability?
A. Repeatable, dependable results
B. Free from error; yields the true score
C. Error is minimized
D. Measures what it is supposed to measure
A and B.
C may be right but is not definitive enough.
True score: the examinee's actual status on the attribute being measured.
Error (measurement error): factors that are irrelevant to what is being measured. It doesn't affect all examinees the same way and is due to many factors.
Reliability coefficient:
A. Ranges from 0 to 1; don't square it
B. Ranges from -1 to 1; square it
C. Interpret it inversely
D. Can be any number
A. Correct! If 0, the test is not reliable; scores are due entirely to random factors.
If 1, there is no error: perfect reliability.
A coefficient of .90 means 90% of observed score variance is true variance and 10% is error.
In other words, reliability represents the proportion of total observed variance that is true variance.
Interpret it directly (no squaring).
Personality tests are usually around .70.
Selection tests in industrial/organizational settings are around .90.
Pair each method of estimating reliability with its definition:
A. Internal consistency
B. Alternate forms
C. Coefficient of stability
D. Divide the test in two and correlate the halves
E. Single administration of a test for internal consistency when items are dichotomously scored
F. Single administration of a test for internal consistency with multi-point (non-dichotomous) items
G. Interscorer reliability
A. Correlations among individual items. Three kinds: split-half, Cronbach's coefficient alpha, and Kuder-Richardson Formula 20 (see the sketch after this list).
B. Coefficient of equivalence (equivalent or parallel forms): administer two equivalent forms to the same group of examinees and correlate the scores.
C. Test-retest reliability: give the same test to the same group of people and correlate scores from the first and second administrations.
D. Split-half
E. Kuder-Richardson Formula 20
F. Cronbach's coefficient alpha
G. Inter-rater reliability: involves rater judgment. Most often, scores from two raters are correlated; kappa is used with nominal data.
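For illustration only, a minimal Python sketch (not part of the original cards) of how coefficient alpha could be computed from an item-response matrix; with 0/1 items the same formula gives KR-20. The data and function name are hypothetical.

import numpy as np

def cronbach_alpha(items):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 6 examinees x 4 dichotomously scored items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 3))  # internal consistency estimate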
What are the sources of measurement error for the test-retest coefficient?
A. Factors of time or time sampling
B. Practice effects
C. A longer interval between administrations decreases error
D. Memory for the exam
A. True. Examples: changes in exam conditions (noise, weather).
B. True: examinees do better the second time.
C. False. A shorter interval between administrations decreases error.
D. True. Memory for the exam is a disadvantage, especially with a short interval between administrations; the coefficient will then be spuriously high.
Test-retest is not appropriate for measures of unstable attributes such as mood (the coefficient would reflect the instability of the attribute rather than the test) or for tests affected by repetition, so it suits few psychological tests.
What are the sources of error for alternate forms reliability?
A. Content
B. Time
C. Memory
D. Practice effects
A and B are true.
C and D: No; using alternate forms reduces these problems, although if the content is very similar they may still be a bit of a problem.
The two forms can't be administered at the same time, but if they are administered in immediate succession, time is not considered a source of error.
Some say this is the best method to use: if the coefficient is high, the test is consistent across both time and different content.
Disadvantages: costly and impractical.
Can't be used for unstable traits.
Describe the three types of coefficients of internal consistency:
Give the error source for each.
Describe what each is good for measuring and what it is not.
Internal consistency coefficients reflect correlations among individual items.
Split-half: correlate the two halves of the test as if they were two shorter tests (typically odd- vs. even-numbered items). This lowers the reliability estimate because a shorter test is less reliable.
This is overcome with the Spearman-Brown correction; the other internal consistency coefficients are Kuder-Richardson Formula 20 (dichotomous items) and coefficient alpha (multi-point items). See the sketch below.
Error sources: content sampling or item heterogeneity. The coefficient is lowered if the items differ in the content they sample.
Good for: unstable traits.
Bad for: speed tests, where the coefficient would be spuriously high (near 1.0).
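A minimal sketch, with hypothetical data, of the odd/even split-half estimate corrected with the Spearman-Brown formula described above; the function name and numbers are illustrative only.

import numpy as np

def split_half_reliability(items):
    # Correlate odd- vs. even-numbered item halves, then apply the
    # Spearman-Brown correction: r_full = 2*r_half / (1 + r_half).
    items = np.asarray(items, dtype=float)
    odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_half / (1 + r_half)

# Hypothetical 0/1 responses: 5 examinees x 6 items.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(responses), 3))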
Inter-rater reliability is increased if:
A. Raters are well trained
B. Raters know they are being observed
C. The rating scale is adequate
All of the above.
For example, on a behavioral rating scale, the categories should be mutually exclusive and exhaustive.
Match methods of recording with their definitions:
A. The recorder keeps a count of the number of times the target behavior occurs.
B. Observe at intervals and note whether the subject is engaging in the behavior.
C. Record all behavior the subject engages in during the observation period.
D. Record the elapsed time during which the target behavior occurs.
A. Frequency recording: good for brief behaviors and when duration doesn't matter.
B. Interval recording: good when the behavior has no fixed beginning or end.
C. Continuous recording: usually recording all behavior of the target subject during each observation session, i.e., writing a narrative description in chronological order of everything the subject does.
D. Duration recording.
The standard error of measurement is:
A. How much error a test contains
B. How much error an individual test score can be expected to contain
C. How much error is expected in a criterion score estimated from a predictor
D. A measure of internal consistency
A. That describes the reliability coefficient.
B. Correct.
C. That describes the standard error of estimate.
D. That describes correlations among individual items.
The standard error of measurement is used to build a confidence interval giving the range within which an examinee's true score is likely to fall, given the obtained score.
Formula: SEM = SD x sqrt(1 - reliability coefficient).
As reliability decreases, error increases.
68% CI: obtained score +/- 1 SEM.
95% CI: obtained score +/- 1.96 SEM.
99% CI: obtained score +/- 2.58 SEM.
Example with SEM = 4: a score of 100 +/- 4 gives a 68% interval of 96 to 104.
95% CI: 100 +/- (1.96 x 4), or about +/- 8, so 92 to 108.
99% CI: 100 +/- (2.58 x 4), or about +/- 10, so 90 to 110 (see the sketch below).
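A small sketch of the SEM formula and confidence bands described above; the SD and reliability values are hypothetical, chosen so the SEM comes out near 4 as in the example.

import math

def standard_error_of_measurement(sd, reliability):
    # SEM = SD * sqrt(1 - reliability)
    return sd * math.sqrt(1 - reliability)

def confidence_interval(obtained_score, sem, z=1.96):
    # Range within which the true score is likely to fall, given the obtained score.
    return obtained_score - z * sem, obtained_score + z * sem

# Hypothetical values: SD = 15 and reliability = .93 give SEM of about 3.97.
sem = standard_error_of_measurement(sd=15, reliability=0.93)
print(round(sem, 2))                          # ~3.97
print(confidence_interval(100, sem, z=1.0))   # ~68% interval, roughly 96 to 104
print(confidence_interval(100, sem, z=1.96))  # ~95% interval, roughly 92 to 108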
If the reliability coefficient is 1.0, then the standard error of measurement is:
A. Less than 1
B. Zero
C. Between zero and 1
D. Needs to be calculated
B. Perfect reliability means no error.
Using the formula, if the reliability coefficient is 0, the standard error of measurement equals the standard deviation of the scores.
The same logic applies to the standard error of estimate, used to interpret an individual's predicted score on a criterion measure: if the validity coefficient is 1, the standard error of estimate is 0 (no error in prediction).
If the validity coefficient is 0, the standard error of estimate equals the standard deviation of the criterion scores.
What factors impact the reliability coefficient?
A. A decrease in score variability
B. Anything that increases error variance
C. Longer tests
D. A homogeneous group
E. The type of question asked (e.g., true/false)
F. Homogeneous items (as indexed by internal consistency statistics)
A and B: True; both lower reliability.
C. Longer tests are more reliable. (The Spearman-Brown formula is applied to estimate the effect of lengthening or shortening a test on its reliability; see the sketch below.)
D. The more homogeneous the group, the lower the score variability and therefore the lower the reliability coefficient.
Also note: if test items are too hard or too easy, variability is decreased and so is the coefficient (floor/ceiling effects).
E. Reliability is lower if examinees can guess correctly, so true/false items are less reliable than multiple choice, which is less reliable than fill-in-the-blank.
F. For inter-item consistency as measured by Kuder-Richardson or coefficient alpha, reliability increases as the items become more homogeneous.
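A sketch of the Spearman-Brown prophecy formula mentioned in answer C; the reliability and length values are hypothetical.

def spearman_brown(reliability, length_factor):
    # Predicted reliability if a test is lengthened (factor > 1) or shortened
    # (factor < 1): r_new = n*r / (1 + (n - 1)*r)
    n, r = length_factor, reliability
    return n * r / (1 + (n - 1) * r)

# Hypothetical example: a test with reliability .70 doubled in length.
print(round(spearman_brown(0.70, 2), 3))    # ~0.824
# The same test cut in half:
print(round(spearman_brown(0.70, 0.5), 3))  # ~0.538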
You would not use Kuder-Richardson for:
A. A dichotomously scored test
B. Measuring an unstable trait
C. A speed test
D. A psychological test
C
A way to improve the inter-rater reliability of a behavioral observation scale would be to use:
A. Mutually exclusive rating categories
B. Non-exhaustive rating categories
C. Highly valid rating categories
D. Empirically derived rating categories
A
The standard error of measurement is:
A. Inversely related to the reliability coefficient and inversely related to the standard deviation
B. Positively related to the reliability coefficient and positively related to the standard deviation
C. Positively related to the reliability coefficient and inversely related to the standard deviation
D. Inversely related to the reliability coefficient and positively related to the standard deviation
D
When practical, it is most advisable to use:
A. The alternate forms reliability coefficient
B. Test-retest
C. Internal consistency
D. Interscorer reliability
A
According to classical test theory, an observed test score reflects:
A. True score variance plus systematic error
B. True score variance plus random error variance
C. True score variance plus random and systematic error
D. True score variance only
B. Measurement error is random by definition.
Which method of recording is most useful when the target behavior has no fixed beginning or end?
A. Interval
B. Continuous
C. Frequency
D. Duration
A. During each interval, the observer decides whether the behavior is occurring, not when it begins or ends.
Match the type of validity with its definition:
A. Predicts someone's status on an external criterion measure.
B. Measures a theoretical, non-observable construct or trait.
C. Measures knowledge of the domain the test is designed to measure.
D. High correlation with another test that measures the same thing.
E. Low correlation with a test that measures something different.
A. Criterion-related validity. B. Construct validity. C. Content validity. D. Convergent validity. E. Divergent validity.
Validity means:
A. The test measures what it is supposed to measure
B. The test's usefulness
C. The test is consistent over time
D. Validity must be considered in terms of what the test is used for
A, B, and D are correct.
No test has validity per se; validity is always for a particular purpose.
Content validity: true or false?
A. Especially useful for achievement tests
B. The extent to which test items adequately and representatively sample the content to be measured
C. Determined via statistical analysis
D. The test appears valid to those who take it
A. True; also used in industrial settings, e.g., work samples and licensing exams.
B. True!
C. False. Content validity is established through the judgment and agreement of subject matter experts: clearly identify the domain, break it into subcategories, and select items from each to make the test representative. One may also want the test to correlate highly with tests of the same content domain or with success in the relevant class.
D. False! That is face validity, which is not really a type of validity, but it is desirable; otherwise people may not cooperate, may lack motivation, etc.
Criterion-related validity:
A. Scores on a predictor test are correlated with an outside criterion
B. Uses a correlation coefficient
C. Data are gathered at the same time
D. Useful for predicting an individual's behavior in specific situations
A. True! The criterion is job performance, school achievement, test scores, or whatever is being predicted.
B. True. A correlation such as Pearson r is used and is called the criterion-related validity coefficient. It ranges from -1 to 1; 1 is perfect validity, 0 is none. Few exceed .60, and even .30 may be acceptable. Square it to interpret: the proportion of variability in the criterion explained by (shared with) variability in the predictor.
C. Partly true. Data can be gathered at the same time, which is called concurrent validity (e.g., a typing test); best for assessing current status.
Concurrent validation is less costly and more convenient, and is often used in place of predictive validation (e.g., pilot selection: you can't hire everyone, so you test and pick the best).
Alternatively, predictive validation administers the predictor first and the criterion later (e.g., GRE scores correlated with later GPAs); best for predicting future status.
D. True, especially for selecting employees, deciding admissions, and placing students in special classes.
What is used to determine where the actual criterion score is likely to fall?
A. Standard error of the mean
B. Standard error of measurement
C. Standard error of estimate
D. Regression line
A. Indicates how much a sample mean can be expected to deviate from the population mean.
B. Relates to reliability of measurement.
C. Correct!
D. Allows prediction of the unknown value of one variable from the known value of another; used to get the PREDICTED score.
C. Use the standard error of estimate to build a confidence interval.
There is a 95% chance the actual criterion score will fall within the interval around the predicted score: predicted score +/- (1.96)(standard error of estimate).
Example: an IQ of 115 put into the regression equation yields a predicted math score of 80, and the standard error of estimate is 5.
There is a 95% chance the actual score falls between roughly 70 and 90:
80 +/- (about 2 x 5).
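A sketch using the standard formula SEE = SD(criterion) x sqrt(1 - validity^2); the cards only require the concept, so treat this as background. The numbers are hypothetical, chosen to give an SEE of about 5 as in the example.

import math

def standard_error_of_estimate(sd_criterion, validity):
    # SEE = SD of criterion scores * sqrt(1 - validity coefficient squared)
    return sd_criterion * math.sqrt(1 - validity ** 2)

def prediction_interval(predicted_score, see, z=1.96):
    # Range within which the actual criterion score is likely to fall.
    return predicted_score - z * see, predicted_score + z * see

# Hypothetical values: criterion SD = 10 and validity = .87 give SEE of about 4.9.
see = standard_error_of_estimate(sd_criterion=10, validity=0.87)
print(round(see, 1))                        # ~4.9
print(prediction_interval(80, 5, z=1.96))   # ~(70.2, 89.8), i.e., roughly 70 to 90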
What are the differences between the standard error of estimate and the standard error of measurement?
The SE of measurement relates to reliability.
The SE of estimate relates to validity.
The SE of measurement estimates where the true test score is likely to fall given the obtained score on the same test; no predictor is involved.
The SE of estimate indicates where the actual criterion score is likely to fall given the criterion score predicted by another measure; a predictor must be involved!
Know the SE of measurement formula!
For the SE of estimate, just know the concept.
How do the following factors affect the criterion-related validity coefficient?
A. Range of scores
B. Reliability of the predictor and criterion
C. Retesting with a second sample
D. Subgroup differences (education, sex, income, ...)
E. Predictor scores influencing criterion status
A. Validity is lower if the range of scores on the predictor and/or criterion is restricted; the more homogeneous the group, the lower the validity.
B. Validity is lower with an unreliable predictor and/or criterion, but high reliability doesn't guarantee adequate validity. An unreliable test is always invalid, but a reliable test may or may not be valid.
C. Validation of a predictor on a second sample will likely yield a lower coefficient than the first (cross-validation). Typically a test is developed and validated on an item-by-item basis, keeping the items that correlate most highly with the criterion; when this is cross-validated, the second coefficient is lower. This drop is called shrinkage: the predictor is tailor-made for the original validation sample and doesn't generalize fully. So a test must be cross-validated, or its true validity is overestimated.
D. Validity may vary among subgroups (men vs. women; high SES vs. low SES, etc.). When moderator variables are present for one group and not another, there is differential validity. This is rare in industrial settings.
E. Criterion contamination: knowledge of predictor scores affects criterion ratings. This artificially INFLATES the validity coefficient, so no one who rates the criterion should know the predictor scores.
What is shrinkage and what factors impact it?
The tendency of validity coefficients to decrease in magnitude upon cross-validation.
Shrinkage is increased by:
a small sample size;
a large original item pool;
a small number of items kept relative to the item pool, and/or items not chosen on the basis of a previously formulated hypothesis or experience with the criterion.
What is construct validity and how is it determined?
A construct is a psychological variable that is abstract and not directly observable.
Construct validity concerns such abstract attributes.
It is worked out over time on the basis of accumulated evidence.
It is established with a choice of methods guided by the test developer's theory.
Construct validity is determined by convergent and divergent validity.
What method is used to assess it?
Convergent validity: the degree to which the test correlates highly with another test designed to measure the same trait or construct.
Divergent (discriminant) validity: the degree to which the test has a low correlation with a test designed to measure a different trait or construct.
This is the only case where a low coefficient provides evidence of high validity.
The method used is the multitrait-multimethod matrix: measure two or more traits by two or more methods.
Convergent validity is shown if tests of the same trait correlate highly even with different methods.
Divergent validity is shown if tests of different traits correlate poorly even with the same method.
The other method is factor analysis.
What is factor analysis?
A. A method of establishing a test's construct validity
B. Determines the degree to which a set of tests measures the same underlying constructs or factors
C. Administer a set of tests to the same group, correlate each test with every other test, and obtain a factor matrix
D. The factor matrix indicates how many factors can explain scores on the tests, as well as the factor loadings (correlations with each factor)
E. Rotate the factors
All of the above. The underlying constructs may be few; factor analysis finds out how many there are and to what degree they account for the test scores. They are called latent variables because the tests in the analysis were not directly intended to measure them.
It is sometimes said that the purpose is to detect the structure in a number of variables: start with a large number and classify them into sets.
Rotation of factors is either orthogonal (uncorrelated factors) or oblique (correlated factors).
Explain the terms in the factor matrix.
Factor loading: the correlation between a test and a factor. Ranges from -1 to 1; square it to interpret. For example, a loading of .7 squared means 49% of the variability in a vocabulary test is explained by Factor 1.
Communality (h squared): the proportion of a test's variance that is common variance, i.e., accounted for by the factors (which also account for variance in other tests). It is the sum of the squared factor loadings. For example, loadings of .4 and .6 give .4 squared + .6 squared = .52, so 52% of the variability is explained by whatever traits are represented by the two factors.
U squared is unique variance, which is specific to the test and not explained by the factors that make up the communality.
u squared = 1 - h squared
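A tiny sketch of communality and unique variance computed from a test's loadings; the loadings .4 and .6 are the hypothetical values from the example above.

def communality(loadings):
    # h^2: sum of squared factor loadings for one test
    return sum(l ** 2 for l in loadings)

def unique_variance(loadings):
    # u^2 = 1 - h^2: variance specific to the test, not explained by the factors
    return 1 - communality(loadings)

loadings = [0.4, 0.6]                       # hypothetical loadings on two factors
print(round(communality(loadings), 2))      # 0.52
print(round(unique_variance(loadings), 2))  # 0.48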
What two components make up a test's reliability?
What is an eigenvalue?
Communality and specificity: the part of true-score variability shared with other tests and the part of true variability unique to the test itself.
So a test's reliability must be at least as high as its communality (communality is a lower-limit estimate of reliability).
An eigenvalue is explained variance.
It appears at the bottom of each factor's column.
It is the amount of variance in all the tests accounted for by that factor.
It is used to determine whether a factor accounts for a significant amount of variability in the tests.
Convert it to a percentage:
(eigenvalue x 100) / number of tests
This gives the percentage of total variability explained (see the sketch below).
Factors are usually ordered by their eigenvalues,
so Factor 1 explains more of what is going on than Factor 2, and so on.
The sum of the eigenvalues can be no larger than the number of tests in the analysis.
To make sense of a factor analysis, one has to interpret and name the factors. If a factor has a large enough eigenvalue, it is presumed to be because the factor represents a trait being measured. Inferences are based on the factors and on theory. The factors are usually rotated first, that is, the tests' communalities are re-divided so a clearer picture of the loadings emerges.
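A sketch of the eigenvalue and percent-of-variance calculation described above, using hypothetical loadings.

def eigenvalue(column_loadings):
    # Eigenvalue of an unrotated factor: sum of squared loadings down its column.
    return sum(l ** 2 for l in column_loadings)

def percent_variance_explained(eig, n_tests):
    # (eigenvalue * 100) / number of tests
    return eig * 100 / n_tests

# Hypothetical loadings of four tests on Factor 1.
factor1_loadings = [0.7, 0.6, 0.5, 0.4]
eig = eigenvalue(factor1_loadings)
print(round(eig, 2))                                 # 1.26
print(round(percent_variance_explained(eig, 4), 1))  # 31.5 (% of total test variance)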
Rotation changes the data? True or false?
Name the types of rotation and why each is used.
False. The interrelationships stay the same, as do the communalities; rotation just puts the factors in a new position. However, eigenvalues may change after a rotation, so use that term only with unrotated factors.
Orthogonal: factors are independent of (uncorrelated with) each other.
Some say orthogonal rotation should always be used because it is easy to interpret.
Communality is still the sum of the squared factor loadings.
Oblique: factors are correlated with each other.
Some say oblique rotation should be used because most traits and categories are correlated.
Communality is not used (with correlated factors it no longer equals the simple sum of squared loadings).
A test has construct validity if it correlates highly with a factor it would be expected to correlate with.
Principal component analysis and cluster analysis are two statistical techniques similar to factor analysis. They have the following in common with factor analysis:
A. Reduce a larger set of variables into a smaller set of underlying elements
B. Detect structure in a set of variables
C. Derive a factor matrix
D. Use eigenvalues, which indicate the explanatory power of each construct and are computed by squaring and summing the factor loadings in an unrotated factor matrix column
E. Order the underlying elements
All are true for principal component analysis.
Only A and B are true for cluster analysis.
What are the differences between factor analysis and principal component analysis?
A. They differ mathematically.
B. Terminology: factor = eigenvector or principal component.
C. Variance: factor analysis partitions variance into communality, specificity, and error; principal component analysis partitions it into explained and error, so there is no distinction for specificity.
D. Principal components are always uncorrelated; there is no oblique rotation.
All of the above.
What are the differences between factor analysis and cluster analysis?
A. Variables used
B. No use of underlying traits
C. Tests a hypothesis
D. Purpose is to create a taxonomy
A. True. Factor analysis requires interval or ratio variables; cluster analysis can use any kind.
B. True. Cluster analysis just produces clusters, which are categories and not necessarily traits or latent variables.
C. Factor analysis tests a hypothesis, but cluster analysis does not (there is no a priori hypothesis).
D. True for cluster analysis: it develops a classification system (e.g., for criminals, rapists, alcoholics).
For a test to be valid it must be reliable; reliability is a necessary but not sufficient condition for validity.
A. Just because a test is reliable doesn't mean it is valid.
B. The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient; the reliability coefficient sets a ceiling (upper limit) on the validity coefficient.
So the validity coefficient is less than or equal to the square root of the reliability coefficient; it can't be higher.
C. The reliability of the criterion also affects the criterion-related validity coefficient.
D. The correction for attenuation is used to estimate what a test's validity coefficient would be if both predictor and criterion were perfectly reliable (see the sketch below).
All true.
Since reliability sets an upper limit on validity, a test with moderate or low reliability is limited in how valid it can be.
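A sketch of the correction for attenuation using its standard formula, r_corrected = r_xy / sqrt(r_xx * r_yy); the card names only the concept, and the values here are hypothetical.

import math

def correction_for_attenuation(r_xy, r_xx, r_yy):
    # Estimated validity if both predictor and criterion were perfectly reliable.
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: observed validity .40, predictor reliability .80,
# criterion reliability .70.
print(round(correction_for_attenuation(0.40, 0.80, 0.70), 3))  # ~0.535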
The percentage of examinees who answer an item correctly:
A. Item discrimination
B. Item difficulty
C. Item analysis
D. p
B and D.
The percentage is the item difficulty index, p.
If p = .80, then 80% of examinees passed that item.
The higher the p, the less difficult the item.
Test developers choose items of moderate difficulty because they increase test score variability, which is associated with high levels of reliability and validity, and they provide maximum differentiation between high and low scorers.
Different difficulty levels suit different purposes.
The difficulty level should approximate the selection ratio.
To select children for an accelerated program when only 25% will be chosen, use a .25 difficulty level, meaning only 25% pass.
Mastery tests use higher item difficulty, such as .80 or .90.
If an item can be answered correctly through blind guessing, use a higher p. This is because if the p level is too low, correct responses are likely to reflect chance guessing rather than what the test is trying to measure.
Rule of thumb: the average difficulty level of test items should be about halfway between 1.0 and the level of success expected by chance (see the sketch below).
So for true/false items, p is .75.
For multiple choice with five options, halfway between .20 and 1.0 is .60.
Average item difficulty is affected by the nature of the sample that tried the items out; the tryout sample should be representative of the population for which the test is intended.
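A sketch of the item difficulty index and the halfway rule of thumb; the response pattern is hypothetical.

def item_difficulty(responses):
    # p: proportion of examinees answering the item correctly (1 = correct, 0 = incorrect)
    return sum(responses) / len(responses)

def optimal_difficulty(chance_level):
    # Rule of thumb: halfway between 1.0 and the level of success expected by chance.
    return (1.0 + chance_level) / 2

print(item_difficulty([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))  # 0.8 (8 of 10 correct)
print(optimal_difficulty(0.5))   # true/false items -> 0.75
print(optimal_difficulty(0.2))   # five-option multiple choice -> 0.6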
On what scale is item difficulty expressed?
A. Nominal
B. Ordinal
C. Ratio
D. Interval
Ordinal is correct.
According to Anastasi, it is not interval or ratio because equivalent differences in p level do not necessarily indicate equivalent differences in difficulty.
The p value indicates the rank ordering of items by difficulty, but one can't infer that the differences between items are equal.
The degree to which items differentiate among examinees in terms of the characteristic being measured:
A. Item discrimination
B. Item difficulty
C. Item response theory
D. Difficulty index
A. Item discrimination.
The item discriminates between, say, high and low scorers, e.g., if actually depressed people consistently answer differently from non-depressed people.
It is measured in many ways. One is to correlate item responses with total test score and keep the items with the highest correlations; this is good when the test measures one attribute and internal consistency is very important.
If the test is meant to predict performance on a criterion, each item is correlated with the criterion; pick items with high correlations with the criterion but low correlations with each other.
An item discrimination index can also be calculated:
D = U - L
D ranges from +100 (everyone in the upper group and no one in the lower group answers correctly) to -100 (everyone in the lower group and no one in the upper group answers correctly). D = 0 means equal proportions and no discriminability (see the sketch below).
Items of moderate difficulty are associated with maximum discriminability.
Item difficulty places a ceiling on the discrimination index:
if difficulty is 1.0 (everyone correct) or 0 (no one correct), then D is 0; an item everyone answers the same way has no discriminating value.
The higher the D values, the higher the reliability.
All of these methods seem about equal in practice.
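A sketch of the D = U - L discrimination index using hypothetical upper- and lower-group percentages.

def discrimination_index(upper_pct_correct, lower_pct_correct):
    # D = U - L: percent of the upper-scoring group minus percent of the
    # lower-scoring group answering the item correctly (range +100 to -100).
    return upper_pct_correct - lower_pct_correct

print(discrimination_index(90, 30))   # D = 60, item favors high scorers
print(discrimination_index(50, 50))   # D = 0, no discriminability
print(discrimination_index(0, 100))   # D = -100, reversed discrimination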
What is item response theory?
A. Utilizes the discrimination index
B. Uses p
C. A mathematical approach to item analysis using curves
D. Relates the characteristic being measured to performance on the items
C. IRT uses graphical depictions (item characteristic curves) of the percentage of people at different ability levels who answer each analyzed item correctly.
It is based on the assumption that performance on a test is related to how much of a latent or underlying trait the respondent possesses.
The curves depict item difficulty and discrimination.
Difficulty is the point on the ability axis where the probability of a correct response is .5; this is another way of measuring item difficulty. Whatever ability level corresponds to that .50 point becomes the item's difficulty (e.g., difficulty level 4).
The slope indicates discrimination: the steeper the curve, the better the item discriminates between high and low scorers; a flatter curve is less useful for discriminating.
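A sketch of an item characteristic curve using a common logistic form (the cards do not specify a particular model); the parameter values are hypothetical.

import math

def icc(theta, a, b, c=0.0):
    # Logistic item characteristic curve: a = discrimination (slope),
    # b = difficulty, c = guessing (where the curve crosses the y-axis).
    # With c = 0, b is the ability level at which P(correct) = .5, as described above.
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Hypothetical item: discrimination 1.5, difficulty at ability level 1.0, no guessing.
for theta in [-2, -1, 0, 1, 2]:
    print(theta, round(icc(theta, a=1.5, b=1.0), 2))  # probability passes .5 at theta = 1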
What three things in item response theory are derived from the item characteristic curves?
A. Item difficulty
B. Probability of guessing
C. Item reliability
D. Item discrimination
All but C.
A. The point on the ability axis where the curve reaches a probability of .5.
B. Where the curve crosses the y-axis; if it doesn't cross above zero, guessing is zero.
D. The slope tells you.
See p. 71.
What are the two assumptions about test items (in IRT)?
A. If reliable, then valid
B. The results of testing are sample-free
C. Performance on an item is related to the estimated amount of the latent trait being measured
D. Adaptive testing of ability
A. No.
B. True; called invariance of item parameters.
An item should have the same parameters (difficulty and discrimination) across all samples of the population. This implies that once analyzed, items of wide-ranging difficulty levels can be used with any individual to provide an estimate of ability. This holds only with large samples.
C. True.
So the scores of individuals who took different items can be directly compared.
The total test scores of a sample can also be compared to the proportion of people who answered each item correctly.
D. Not an assumption. IRT has been applied to adaptive testing, which means giving a set of items matched to the examinee's estimated level of ability.
The item difficulty level associated with the maximum level of differentiation among examinees is:
A. .10
B. .50
C. .75
D. 1.0
B
The optimal average item difficulty level for a true/false test is:
A. .10
B. .50
C. .75
D. 1.0
C. The probability of answering by chance alone is .50, and the optimal level should be halfway between 1.0 and the level of success expected by chance alone.
A test item's difficulty level is most affected by:
A. Test length
B. The test's validity
C. The nature of the testing process
D. The characteristics of the individuals taking the test
D. Difficulty is measured in terms of the percentage of examinees who answer the item correctly, so it depends on who takes the item.
Which is least true of item response theory?
A. The items analyzed measure a latent trait
B. It allows the ability levels of different groups of people to be compared, even if the tester uses different item sets
C. It works best with large samples
D. It is based on the notion that the characteristics of an item will differ depending on the characteristics of the sample of individuals tested
D is least true: item parameters (difficulty and discrimination) are assumed to be the same regardless of the sample.
What is norm-referenced interpretation?
A. Provides an indication of where an examinee stands in relation to others who have taken the test.
B. Developmental scores
C. Compares the examinee's score to others in the normative sample.
D. Within-group norms
All of the above!
Developmental norms indicate how far along the developmental path an individual has progressed.
Mental age scores: the examinee's score is compared to the average performance of others at different age levels; used for the ratio IQ score.
Grade equivalent scores: used for educational achievement tests.
Disadvantage of developmental norms: the standard deviation is not constant across age levels, so scores are not comparable across ages.
Within-group norms provide a comparison of the examinee's score to the most comparable standardization sample.
What is the difference between a percentile rank and a percentage score?
A. Scored better than 90% of examinees?
B. Answered 90% of the items correctly?
A. Percentile rank.
Advantage: easy to understand and interpret.
Disadvantage: percentile ranks are ordinal ranks, so equal differences between ranks do not represent equal differences between scores.
B. Percentage score.
What is considered the most satisfactory type of norm-referenced score?
Name the types.
Standard scores: express a raw score's distance from the mean in terms of standard deviation units.
z scores: the number of standard deviation units above or below the mean.
T scores: used on psychological tests such as the MMPI (mean 50, SD 10).
Stanine scores.
Deviation IQ scores: how IQ is interpreted now (mean 100, SD typically 15).
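A sketch converting a raw score to z, T, and deviation IQ scores; the raw-score mean and SD are hypothetical.

def z_score(raw, mean, sd):
    # Standard deviation units above or below the mean.
    return (raw - mean) / sd

def t_score(z):
    # T score: mean 50, SD 10 (used on tests such as the MMPI).
    return 50 + 10 * z

def deviation_iq(z, mean=100, sd=15):
    # Deviation IQ: mean 100, SD typically 15.
    return mean + sd * z

# Hypothetical raw score one SD above the mean on a test with mean 60 and SD 8.
z = z_score(68, mean=60, sd=8)
print(z, t_score(z), deviation_iq(z))  # 1.0  60.0  115.0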
Percentile ranks and T scores have which of the following in common?
A. Both are standard scores
B. Both are norm-referenced scores
C. Both are developmental scores
D. Both are criterion-referenced scores
B. Only the T score is a standard score.
Advantage of the deviation IQ score compared to the ratio IQ score:
A. Gives an index of the examinee's absolute level of intelligence
B. Indicates the examinee's mental age
C. Allows the scores of individuals who are the same age to be compared
D. Allows score comparisons to be made across age levels
D. For example, the IQ of a 9-year-old and a 30-year-old can be compared.
Decreasing a test's inter-item consistency makes the test:
A. Less valid
B. Less reliable
C. More valid
D. More reliable
B. One measure of a test's reliability is how homogeneous or internally consistent its items are (coefficient alpha or KR-20). Therefore, decreasing inter-item consistency makes a test less reliable.
Factor analysis with two orthogonal factors. The factor matrix shows:
Test 1: Factor I loading = .5, Factor II loading = .5
(a) How much of the variance in Factor I is explained by variance in Factor II?
(b) The amount of variance in Test 1 attributed to Factors I and II is:
A. 0
B. 50%
C. 10%
D. Can't be determined
(a) Zero. Orthogonal factors are uncorrelated, so it is zero.
(b) B. 50%: this is the communality (.5 squared + .5 squared = .50).
A selection test for graduate school would show the highest validity if scores were correlated with the actual grades of:
A. Only the lowest scorers
B. Only the middle-range scorers
C. All admitted students
D. Only the highest scorers
C. Any coefficient will be lower with a restriction in the range of scores on one or both variables.
Which is true?
A. A valid test will always be reliable
B. A reliable test will always be valid
C. The validity coefficient sets a ceiling on reliability
D. The validity coefficient is equal to the square root of the reliability coefficient
A. Correct.
B. Not true.
C. No; reliability sets the ceiling on validity, not the reverse.
D. No; the upper limit of the validity coefficient is equal to the square root of the reliability coefficient, but the validity coefficient is not necessarily equal to it.
To determine the degree to which an obtained test score is likely to deviate from the true test score, use:
A. Standard error of estimate
B. Standard error of measurement
C. Standard error of the mean
D. Standard error of judgment
B. An obtained test score is likely to differ from the true test score to the degree that the test contains error. The standard error of measurement is used to build a range within which the true score is likely to fall, given the obtained score.
If the validity coefficient is zero, the standard error of estimate equals:
A. 0
B. The validity coefficient
C. The standard deviation of the predictor scores
D. The standard deviation of the criterion scores
D. Answer by using the formula for the standard error of estimate:
the standard error of estimate comes out equal to the standard deviation of the criterion scores.
A test developer wants maximum reliability; at what value should the average item difficulty level be set? And if the developer wants a test that identifies only highly qualified examinees, what should the difficulty level be?
A. .50, .15
B. .50, .80
C. .25, .80
D. .75, .25
A. Moderate difficulty (.50) maximizes score variability; a low p (.15) means only about the top 15% pass.
The multitrait-multimethod matrix is used to assess:
A. Concurrent and predictive validity
B. Content validity
C. Face validity
D. Discriminant and convergent validity
D
A test developer does a thorough job analysis to make a work sample that will be used as a selection tool. The job analysis shows concern with:
A. Content validity
B. Criterion-related validity
C. Construct validity
D. Face validity
A. The job analysis identifies the tasks performed on the job so the work sample can representatively cover them.
The difference between coefficient alpha and KR-20:
A. Alpha is an internal consistency coefficient
B. KR-20 is an index of inter-item consistency
C. KR-20 is used with dichotomously scored items
D. Alpha involves making a Type I error
C. Both give an index of the test's average degree of inter-item consistency;
alpha is used when items are not dichotomously scored.
See p. 86, review items 13-18.
If a test has a reliability coefficient of .90, we conclude that:
A. The highest validity coefficient the test could have is .81
B. The validity coefficient is equal to the square root of .90
C. The test is probably very valid
D. The test may or may not be valid
D
The upper limit of the validity coefficient is the square root of .90 (not .81, which is the square of .90). This means the test's validity is lower than or equal to the square root of .90, i.e., about .95.
See p. 64.
Don't forget:
With an orthogonal rotation there is no correlation between factors.
If a test has perfect validity, there is no standard error of estimate.
Criterion contamination has the effect of:
A. Increasing the validity coefficient
B. Decreasing the validity coefficient
C. Increasing examinees' criterion scores
D. Decreasing examinees' criterion scores
A
If the rater knows an examinee's predictor score is low, the rater may give a lower criterion score.
This results in artificially high consistency between predictor and criterion and inflates the validity coefficient.
Communality is:
A. The proportion of variance in test scores that is accounted for by the identified factors
B. Used with an oblique rotation
C. The degree to which scores on a number of tests can be reduced to fewer factors
D. The correlation between a test and a factor
A. Correct.
B. Wrong; communality is used with orthogonal rotation.
C. That describes factor analysis itself.
D. That is a factor loading.
Correction for attenuation:
A. Formula to tell what the reliability would be if a test were lengthened
B. Formula to estimate how much more valid a predictor test would be if it had perfect reliability
C. Mental age, grade equivalent, ratio IQ
A. That is the Spearman-Brown formula.
B. Correct.
C. Those are developmental norms, not the correction for attenuation.