Reliability and Validity Flashcards
interrater reliability
This is where two or more judges mark the same completed test. By correlating each judge's scores with the other's, you can see to what degree they are in agreement. For example, this may be done on curriculum-based tests to make sure test takers are being fairly and appropriately marked.
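A minimal sketch of this correlation in Python, using made-up marks from two hypothetical judges (all figures are illustrative, not from the source):

```python
import numpy as np

# Marks awarded by two judges to the same ten completed tests
# (hypothetical illustration data).
judge_a = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])
judge_b = np.array([13, 14, 10, 19, 15, 10, 17, 12, 11, 18])

# The Pearson correlation between the two sets of marks serves as the
# interrater reliability coefficient: values near 1 indicate the
# judges are in close agreement.
r = np.corrcoef(judge_a, judge_b)[0, 1]
print(f"Interrater reliability: r = {r:.2f}")
```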
internal consistency (and limitations)
This is the more classical type of reliability that is talked about most often. It is a measure of whether each item in a test measures the same thing, and is generally calculated using Cronbach's alpha. It can also be estimated by splitting the questions into two or more groupings and correlating them to see if they agree with one another. However, it is important to note that, for a number of reasons, tests don't give the exact same reading from day to day. For example, educational qualities can change in a relatively short time frame, so it is sometimes impossible to tell whether score changes are due to error or reflect actual change.
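Cronbach's alpha is computed as α = k/(k−1) × (1 − Σ item variances / variance of the total scores). A minimal sketch in Python, assuming a small made-up matrix of item scores:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]                         # number of test items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical scores: 6 test takers answering 4 items.
scores = np.array([
    [3, 4, 3, 4],
    [5, 5, 4, 5],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 1],
    [3, 3, 4, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```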
test-retest reliability
Useful when aiming to measure something that should remain relatively stable over time, e.g. IQ. To calculate a coefficient of test-retest reliability, a group of test takers must sit the same test twice. By correlating the first and second test scores we generate a coefficient of reliability over time. While the time frame between the first and second sitting depends on the nature and purpose of the test, and may range from days to months, there is still a risk of practice effects, especially if the retest is close to the original test. This risk is often countered by the use of a parallel test: essentially the same test with adjusted test items.
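The coefficient itself is the same correlation used for interrater reliability, just applied to two sittings of the test. A brief sketch with hypothetical scores:

```python
import numpy as np

# Hypothetical scores for the same eight test takers on the first
# sitting and on a retest several weeks later.
first_sitting  = np.array([98, 110, 104, 121, 89, 115, 102, 108])
second_sitting = np.array([101, 108, 107, 119, 92, 117, 100, 111])

# The test-retest reliability coefficient: high values suggest the
# measure is stable over time.
r = np.corrcoef(first_sitting, second_sitting)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")
```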
what is reliability
1. Provides the consistency that makes validity possible.
2. Indicates how much confidence we can place in our results.
(Gronlund & Linn, 1990)
- Test scores can be reliable across time, samples, situations and contexts.
- Represented by a reliability coefficient.
The samples and situations in which a test is used may generate different reliability coefficients, so it is more appropriate to talk about the 'measure' being reliable rather than the test itself. Test scores aren't just generally reliable either: they can be reliable over time, samples, situations and contexts, and the intended purpose of a test indicates which of these matters. For example, a test which claims to predict an individual's future ability in mathematics needs to be reliable across time but may need adjustment for different samples, while other tests, e.g. an IQ test, may need to be reliable across samples. Unlike physical measurement tools, educational tests will always possess an error component, which is reflected in the reliability coefficient. A coefficient of .7 or above is usually seen as acceptable for an educational test. Because this coefficient can be calculated, the potential for error can be estimated and accounted for during data analysis (Boyle & Fisher, 2007).
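One standard way to turn the reliability coefficient into an error estimate (not spelled out in the card, but a common textbook step) is the standard error of measurement, SEM = SD × √(1 − r). A sketch with assumed figures:

```python
import math

# Assumed values for illustration: a test with a score standard
# deviation of 15 and a reliability coefficient of .85.
sd = 15.0
reliability = 0.85

# Standard error of measurement: the typical size of the error
# component in an individual's observed score.
sem = sd * math.sqrt(1 - reliability)

# A rough 95% confidence band around an observed score of 100.
observed = 100
print(f"SEM = {sem:.1f}")
print(f"95% band: {observed - 1.96 * sem:.0f} to {observed + 1.96 * sem:.0f}")
```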
what is validity
A test is valid if it measures what it claims to measure. For example, a test of intelligence should measure intelligence and not something else (such as memory).
main types: internal and external validity; face validity; construct validity; content validity; criterion-related validity
factors which affect the reliability of a test
In comparing the reliability coefficients of two or more tests it is important to consider factors which may inflate or distort these figures. For example, while you may initially favour the test with the higher coefficient, you may wish to reconsider once you recognise inflating factors that are irrelevant to consistency of measurement.
Test Length
Longer tests = higher reliability
Spread of scores
Larger spread = higher reliability
Difficulty of tests
Too difficult or too easy restricts the spread of scores
Objectivity
High objectivity means scoring has no adverse effect on reliability
how does test length affect the reliability of the test
Longer tests tend to be more reliable because they provide a larger sample of the behaviour being measured and lessen distortion by minimising the potential for guesswork. For example, a spelling test containing only one word would classify students as either perfect spellers or complete failures. Just as it is obvious that a one-word spelling test would produce unreliable results, it is equally apparent that the more words we add to the test, the more likely we are to generate a reliable estimate of a child's spelling ability. In constructing classroom tests it is important to keep in mind the influence of test length on reliability and strive for longer tests. If short tests are needed because of time limits or pupil age, more frequent testing may be used to obtain a dependable measure.
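The usual tool for estimating this effect is the Spearman-Brown prophecy formula, r′ = n·r / (1 + (n − 1)·r), which the card does not name but which predicts the reliability of a test lengthened n-fold. A sketch with an assumed starting reliability:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Assumed starting point: a short test with reliability .60.
r = 0.60
for n in (1, 2, 3):  # same length, doubled, tripled
    print(f"{n}x length: predicted reliability = {spearman_brown(r, n):.2f}")
```

With these assumed figures, doubling the test raises the predicted reliability from .60 to .75 and tripling it to about .82, which illustrates the "longer tests = higher reliability" rule above.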
how does the spread of scores affect the reliability of a test
The larger the spread of scores, the higher the estimate of reliability will be. Large spreads tend to be obtained when individuals stay in the same relative position in the group from one testing to another, so anything that reduces the possibility of position change contributes to higher reliability. Similarly, errors of measurement have less influence on the relative position of individuals when the differences among group members are large, that is, when there is a wide spread of scores.
how does the difficulty of a test affect its reliability
Norm-referenced tests that are too easy or too difficult for a group tend to produce low estimates of reliability because of the restrictive effect they have on the spread of scores. In both cases the differences among individuals' scores are small and tend to be unreliable. A norm-referenced test of ideal difficulty will let the scores spread out over the full range of the scale. In evaluating the difficulty of a standardized test, the teacher must also take into account the pupils' achievement level. For example, a test devised for 10-year-olds of average achievement may be inappropriate for low-achieving or high-achieving 10-year-olds, as this could restrict the spread of scores.
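A small simulation can make the restriction-of-spread point concrete (it illustrates the previous card too): the same noisy test, scored twice, shows a lower reliability coefficient when only a narrow ability band is examined. All numbers are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate true ability and two noisy administrations of the same test.
true_score = rng.normal(100, 15, size=2000)
sitting_1 = true_score + rng.normal(0, 5, size=2000)
sitting_2 = true_score + rng.normal(0, 5, size=2000)

def reliability(a, b):
    return np.corrcoef(a, b)[0, 1]

# Full group: wide spread of scores, high test-retest coefficient.
print(f"full range:       r = {reliability(sitting_1, sitting_2):.2f}")

# Restricted group, as produced by a test that is too easy or too hard
# for most takers: only a narrow ability band is distinguished, and the
# coefficient drops even though the test and its error are unchanged.
narrow = (true_score > 95) & (true_score < 105)
print(f"restricted range: r = {reliability(sitting_1[narrow], sitting_2[narrow]):.2f}")
```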
how does test objectivity affect test reliability
Objectivity refers to the degree to which equally capable scorers obtain the same results; most standardized tests of aptitude and ability are high in objectivity. If the test items are of the objective type, the resulting scores aren't influenced by a scorer's judgement, and such tests can be accurately scored by trained individuals or machines. When such highly objective procedures are used, the reliability of the test results is not affected by the scoring procedures. In observation-based assessments, however, results vary between scorers, and inconsistencies in scoring may lower the reliability of the measurement obtained, as error now includes scorer bias as well as differences between pupils. The solution to this problem lies in selecting the evaluation procedure most appropriate for the behaviour being assessed and then making that procedure as objective as possible.
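For judgement-based ratings, scorer agreement is often quantified with Cohen's kappa, a chance-corrected agreement statistic that the card does not name but which fits here. A sketch with hypothetical pass/merit/fail ratings:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    observed = np.mean(a == b)  # proportion of exact agreement
    # Chance agreement expected from each rater's marginal proportions.
    expected = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (observed - expected) / (1 - expected)

# Hypothetical judgements by two scorers on ten observed performances.
rater_a = ["pass", "merit", "fail", "pass", "pass",
           "merit", "fail", "pass", "merit", "pass"]
rater_b = ["pass", "merit", "fail", "merit", "pass",
           "merit", "fail", "pass", "pass", "pass"]
print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```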
internal and external validity
A distinction can be made between internal and external validity; the two are relevant to evaluating different parts of a study or procedure.
Internal validity refers to whether the effects observed in a study are due to manipulation of the independent variable (the variable whose variation does not depend on that of another) and not to any other factor. This means a causal relationship between the independent and dependent variables can be established.
Internal validity can be improved by controlling extraneous variables, using standardized instructions, counterbalancing, and eliminating demand characteristics and investigator effects.
External validity refers to the extent to which test results can be generalized to other settings (ecological validity), other people (population validity) and over time (historical validity).
External validity can be improved by setting experiments in more natural settings and using random sampling to select participants.
face validity
Face validity is whether the test appears, at face value, to measure what it claims to measure.
It is suggested (Nevo, 1985) that tests with a clear purpose have higher face validity.
Face validity does not mean that the test truly measures what it intends to measure, only that raters judge it to appear to do so.
It is a very basic measure of validity.
construct validity
The degree to which the test measures the underlying concept it set out to measure.
Often arrived at by correlating the scores of people on the test whose construct validity is being sought with those on a test that is taken as a 'benchmark'.
Unlike face validity, construct validity is not about whether the test superficially appears to measure the attribute; instead it asks whether the test score interpretations are consistent with the theoretical concept.
content validity
The degree to which the test questions fairly represent what the test is intended to measure
criterion-related validity
There are two types; in both, a relationship is established between test scores and a criterion: a less convenient but more direct assessment of the thing we are trying to measure.
Concurrent
Measure the criterion at the same time as the test is taken
Predictive
Correlate the test with a criterion that is measured at a later date, trying to predict something in the future.
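A sketch of a predictive validity check, assuming hypothetical entrance-test scores and a criterion (final exam marks) collected a year later:

```python
import numpy as np

# Hypothetical data: an entrance test taken now, and the criterion
# (final exam percentage) collected a year later.
entrance_test = np.array([55, 72, 48, 90, 63, 81, 58, 69])
final_exam    = np.array([52, 75, 50, 85, 60, 84, 61, 66])

# The predictive validity coefficient is the correlation between the
# test and the later criterion measure; the same code, run on a
# criterion collected at the same time, gives concurrent validity.
r = np.corrcoef(entrance_test, final_exam)[0, 1]
print(f"Predictive validity: r = {r:.2f}")
```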