Measuring Learning Flashcards
Explain current enrollment trends in developing countries (and their relationship with GDP)
Enrollment in today’s poor countries is far higher than enrollment was in rich countries when those rich countries were poor. GDP today is a strong predictor of learning levels, given that most countries fall pretty close to the prediction line.
When it comes to testing validity, in an ideal world, policymakers should care most about tests that:
Predict longer term outcomes that we care about
predictive validity of a test
The predictive validity of a test is its ability to predict longer-term outcomes such as income, crime, etc.
concurrent validity of a test
The concurrent validity of a test is how it correlates with other validated tests.
Convergent-discriminant validity
Convergent-discriminant validity is whether a test is correlated more with tests that measure similar concepts, and less with tests that measure different concepts.
Test-retest reliability
The test-retest reliability of a test is the consistency with which it measures any given skill across repeated administrations.
Administrative data are often used to determine the resources that schools, students and households receive. Because of these policies:
Enrollment and attendance both tend to be inflated
Policies tend to give schools more resources when they have higher enrollment and provide incentives to parents based on attendance. Therefore, schools have incentives to inflate enrollment and attendance.
What is the most basic measure of teacher effort?
Teacher attendance is “the most basic measure of teacher effort”. It can be measured through principal and student surveys. However, unless teacher behavior is an explicit step in the theory of change, collecting it may not be necessary, given how costly effort data can be to collect.
Barriers to school participation?
Convenience and access; out-of-pocket costs; health issues; parents and children may underestimate the long-term benefits of education or heavily discount the future
How common is absenteeism of teachers/service providers in developing countries?
Absenteeism is widespread and unpredictable
Even when present, often not teaching
Few service providers face a serious threat of being fired for excessive absences
With almost 100% primary enrollment, why are students still struggling to learn?
Enrollment itself doesn’t mean that students are regularly attending school
Being in school does not mean that children are learning
What is the main approach we could use to improve learning in developing countries?
There are many options, but our focus is to pivot expenditure from less to more cost-effective policies, improving outcomes at any given level of per capita income.
Attendance conditional on enrollment
The fraction of those enrolled who are present on a given day
School attendance in the population
The percentage of school days the average child in a given population is in school
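The two measures relate arithmetically: if we assume children who are not enrolled never attend, population attendance is the product of the enrollment rate and attendance conditional on enrollment. A minimal sketch with hypothetical numbers:

```python
# Hypothetical example relating the two attendance measures.
# Assumption: children who are not enrolled never attend school.

enrollment_rate = 0.90             # fraction of children enrolled
attendance_given_enrolled = 0.80   # fraction of enrolled children present on a given day

# School attendance in the population:
population_attendance = enrollment_rate * attendance_given_enrolled
print(round(population_attendance, 2))  # 0.72: 72% of all children are in school on a given day
```

This is why near-universal enrollment can coexist with much lower day-to-day attendance in the population.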
Why is it important to collect both enrollment and attendance data for a study?
– If we only collect enrollment data, we must assume the program did not change the attendance rate of those enrolled
– If that assumption fails, our measures of impact and cost-effectiveness may be biased
– It is therefore good to supplement enrollment data with direct attendance data
2 ways to measure teacher attendance
– Teacher attendance records (but often fudged)
– Direct observation during surprise visits (need to do this early in the visit)
4 ways to measure teaching efforts
– Classroom observations
– Student surveys
– Principal surveys
– Teacher knowledge tests
2 ways to measure teacher knowledge
– Subject matter knowledge
– Subject-specific pedagogical knowledge
Purpose of classroom observations
systematize observers’ perceptions of teacher quality
Classroom observations - should they be short or long?
Short observations are efficient: multiple short observations offer more information than a single observation of the same total length.
Classroom observations - should teachers be able to choose their own lessons?
Teachers can choose their lessons. It doesn’t make it harder to identify effective teachers; in fact, it makes it easier.
Classroom observations - How should you incorporate principals?
Principals are useful observers. They rate their own teachers higher, but their ratings are highly correlated with those of other observers.
Classroom observations - should you add another observer?
Adding an observer pays off more than adding another lesson.
Classroom observations - predictive validity
Classroom observation ratings in a given year predict teachers’ value-added in the following year, even after random assignment.
However, predictive validity varies across instruments and along the performance distribution (some instruments are better at identifying low- or high-performing teachers).
It also varies across subjects (all instruments are better at predicting value-added in math) and according to the types of skills assessed.
Student surveys - purpose
measure students’ perceptions of teacher quality
Tripod survey - length & structure?
Length:
– Full: 67 questions (elementary) or 92 questions (secondary)
– Lite: 36 questions (elementary and secondary)
• Structure: 7 “Cs”
1. Care (does the teacher care about the student?)
2. Control (is the teacher in control of the classroom?)
3. Clarify (does the teacher clarify difficult concepts?)
4. Challenge (does the teacher challenge students?)
5. Captivate (does the teacher keep students’ attention?)
6. Confer (does the teacher engage students in discussions?)
7. Consolidate (does the teacher recap/review material?)
Tripod survey - main elements
Not organized by “Cs” to avoid “priming” students
Short, age-appropriate statements that children can understand
Some statements are reverse-coded to contribute to the score for each “C”
Likert scale for children to indicate the extent to which they agree with the statement (sometimes expressed in terms of frequency)
Principal surveys - purpose
systematize principals’ perceptions of teacher quality
Principal surveys - domains assessed
– Overall teaching effectiveness
– Dedication and work ethic
– Organization
– Classroom management
– Raising student achievement (in math and reading)
– Role model for students
– Student satisfaction with teacher
– Parent satisfaction with teacher
– Positive relationship with colleagues
– Positive relationship with administrators
Principal surveys - predictive validity
Principals can predict teacher effectiveness with a single question on their overall effectiveness.
Principals are reluctant to identify poor performers, even when there are no stakes.
Teacher knowledge tests - purpose
measure teachers’ content knowledge, subject-specific pedagogical knowledge, or understanding of student errors
School management quality surveys - purpose
measure quality of school management (usually in the context of interventions to improve governance)
World Management Surveys adapted for education
developed by Bloom, Lemos, Sadun, Van Reenen (2015)
• Management quality measured on:
– Operations
– Monitoring
– Target Setting
– People Management
Instrument recently adapted for developing countries by creating finer gradations in the 5-point scale
What is the main lesson regarding the use of principal surveys to measure teacher effort?
They produce measures of effort that effectively predict student achievement
Principal surveys of teacher effort are remarkably predictive of teacher value added, even though there is reluctance to identify weak performers. It is management indices (not teacher effort) that tend to cluster at low levels of the scale. We have no evidence of whether or not they predict teacher knowledge.
In an impact evaluation of an intervention that gives 4th-grade teachers incentives for improvements in reading, what might we be worried about if we measured learning only using an oral test that measures basic literacy (can the student read a sentence)?
It may be subject to ceiling effects where the distribution is censored at higher levels of achievement
Our literacy test doesn’t measure potential negative side effects of the incentive program
The test may not be able to pick up any differences among the highly literate students. Efforts to improve reading may increase at the expense of time teaching another subject and we are not measuring that. Oral tests can indeed be adaptive (unlike paper-pencil tests).
What is a good test score?
• Appropriate to the context
– Major need for piloting, adaptation of instruments
• Measures what we think it measures
– We want to measure learning, not test-taking skills or speed
• Focused on dimensions that we think the intervention might improve
– Requires thinking carefully about what kind of test domains we want to focus on
– Also requires thinking about how the assessment might be ‘gamed’
What is a good test score? - Distribution
• Continuous well-distributed measure of student achievement
– No ceiling or floor effects
– Not “too easy”, “too hard”, or “too short”
• This Goldilocks zone can often be very hard to achieve!
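One simple way to screen for ceiling or floor effects in pilot data is to check how much of the score distribution piles up at the extremes. A minimal sketch; the function name and the 5% threshold are illustrative assumptions, not standard cutoffs:

```python
# Sketch: flag possible ceiling/floor effects in pilot test data by
# checking the share of students at the maximum or minimum score.
# The 5% threshold is an illustrative assumption.

def check_ceiling_floor(scores, max_score, min_score=0, threshold=0.05):
    n = len(scores)
    share_at_ceiling = sum(s == max_score for s in scores) / n
    share_at_floor = sum(s == min_score for s in scores) / n
    return {
        "ceiling_effect": share_at_ceiling > threshold,
        "floor_effect": share_at_floor > threshold,
        "share_at_ceiling": share_at_ceiling,
        "share_at_floor": share_at_floor,
    }

# Hypothetical pilot: a 20-item test where several students score perfectly
pilot_scores = [20, 20, 20, 19, 18, 15, 12, 10, 8, 5]
result = check_ceiling_floor(pilot_scores, max_score=20)
print(result["ceiling_effect"], result["share_at_ceiling"])  # True 0.3
```

A large mass at the maximum score means the test cannot distinguish among the strongest students, which is exactly the censoring problem described above.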
What is a good test score? - Discrimination
• Tests should be discriminating i.e. informative at all levels of ability
– should be able to distinguish differences in absolute achievement around 10th percentile as well as around median ability
– This is often hard to do:
• PISA, TIMSS etc. not informative at very low achievement levels
• ASER not informative at high achievement levels
What is a good test score? Dynamic comparability
Dynamic comparability means the test allows you to measure the progress of student learning over time.
What is a good test score? - Cross-sectional comparability
Cross-sectional comparability means the test allows you to place a student within a wider distribution of contemporaries: a peer group in the same state, in the same country, or an international peer group.
What is a good test score? - Benchmarking
Benchmarking asks: given an absolute standard of what is considered grade-appropriate competence, how are your students doing relative to that benchmark?
The main purpose of using a common subset of questions that are repeated across tests is to ensure that:
Achievement can be compared across time and samples
When designing a test, how should we think about grade-appropriate tests?
Grade-appropriate tests are particularly inappropriate for many developing country contexts (kids are so far behind in learning)
Try to design a test that contains items targeting a wide distribution of achievement
When designing a test, how should we think about choosing each item?
Each item should map to a concrete skill that we want to test; a subset of items should be repeated across rounds for comparability; and a subset of items should be drawn from other assessments.
When designing a test, how should we think about language?
It should not be assumed that item properties are maintained in translation
3 common ways a test is administered
Individually, group-oral, written
Advantages vs. disadvantages between ways tests are administered
– Individual oral tests are much better for assessing children at young ages, but very burdensome in the field
– Group-oral tests attempt to replicate the above at scale, but classroom management is not easy and answers are less precise
– Written tests are ideal for later grades, but carry a strong possibility of floor effects in primary grades
Cognition test type?
Raven’s matrices
Early Grade Learning test type?
EGRA, ASER
Higher Level Learning test type?
SAT, GMAT
Learning outcomes are often reported in terms of standard deviations rather than raw test scores primarily because:
Doing so allows us to compare results across studies that use different tests
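Standardization works by expressing each score in units of the control group’s standard deviation. A minimal sketch of this normalization, with made-up scores:

```python
# Sketch: convert raw scores to standard deviations of the control-group
# distribution, so effects are comparable across studies with different tests.

from statistics import mean, stdev

def standardize(scores, control_scores):
    mu = mean(control_scores)
    sigma = stdev(control_scores)  # sample standard deviation of the control group
    return [(s - mu) / sigma for s in scores]

# Hypothetical endline scores
control = [40, 50, 60, 50, 50]
treatment = [55, 65, 45, 60, 50]

z_treatment = standardize(treatment, control)
effect_sd = mean(z_treatment)  # control-group mean is normalized to zero
```

Here `effect_sd` is the treatment effect in control-group standard deviations, the quantity typically reported in impact evaluations.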
Item Response Theory
Item Response Theory (IRT) models the probability that an individual with a given ability will answer an item correctly.
It allows you to place students within a common distribution, enabling better cross-sectional and over-time comparisons even as the content of the test changes.
The most important advantage of IRT is the ability to link across tests and over time.
Item characteristic curve
Maps the trait (ability or knowledge) to the proportion correct
guessing parameter
The probability that an examinee with no ability or knowledge will answer a question correctly (i.e., guesses and gets it right by chance).
It is where the curve intersects the y-axis.
Difficulty parameter
How difficult the question is: the level of ability an examinee needs to answer the question correctly with probability (1 + c)/2.
It is the midpoint of the curve on the x-axis; moving the curve to the right increases the difficulty.
Discrimination parameter
A measure of how well the question distinguishes between examinees of different ability/knowledge: how steep the curve is.
If the ICC is much flatter, even students who don’t know much could get the item right, and students who know a lot could get it wrong.
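The three parameters above combine in the standard three-parameter logistic (3PL) item characteristic curve. A minimal sketch; the parameter values are made up for illustration:

```python
# Sketch of a three-parameter logistic (3PL) item characteristic curve:
#   P(correct | ability theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
# a = discrimination (slope), b = difficulty (location), c = guessing (lower asymptote).

import math

def icc(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b (the curve's midpoint), the probability is (1 + c) / 2:
p_mid = icc(theta=0.0, a=1.5, b=0.0, c=0.2)   # 0.6 = (1 + 0.2) / 2

# A very low-ability examinee still answers correctly with probability near c:
p_low = icc(theta=-6.0, a=1.5, b=0.0, c=0.2)  # close to 0.2
```

Increasing `b` shifts the curve right (harder item); increasing `a` steepens it (more discriminating item); `c` sets the floor where the curve meets the y-axis.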
What are we able to do with a test designed with IRT, that we are unable to do with a test that was not designed with IRT?
Report treatment effects in standard deviations relative to the absolute progress made in the control group
Using a simpler non-IRT test, we can report…
Using a simpler non-IRT test, we can report: total scores; totals relative to the control group (simple difference) or relative to the baseline (pre-post); “improvements” relative to the control group (difference-in-difference); improvements as a percentage, with either the baseline, the control-group total, the control-group improvement, or the control-group percentage gain in the denominator; or results as standard deviations (always normalizing the control group to equal zero).
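These non-IRT reporting options can be sketched with a few lines of arithmetic; the scores below are hypothetical:

```python
# Sketch of the non-IRT reporting options, with made-up baseline/endline scores.

from statistics import mean

baseline_treat, endline_treat = [30.0, 40.0], [50.0, 60.0]
baseline_ctrl, endline_ctrl = [30.0, 40.0], [40.0, 50.0]

# Total relative to the control group (simple difference):
simple_difference = mean(endline_treat) - mean(endline_ctrl)          # 10.0

# Improvement relative to the baseline (pre-post):
pre_post = mean(endline_treat) - mean(baseline_treat)                 # 20.0

# Improvement relative to the control group's improvement (difference-in-difference):
diff_in_diff = pre_post - (mean(endline_ctrl) - mean(baseline_ctrl))  # 10.0

# Improvement as a percentage, with the baseline in the denominator:
pct_of_baseline = pre_post / mean(baseline_treat) * 100
```

Each choice of denominator or comparison changes the headline number, which is why studies must state exactly which of these quantities they report.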
With an IRT test, we can report…
With an IRT test, we can do anything a non-IRT test allows AND report results as standard deviations relative to the baseline and the control group, so that the control group’s own progress appears as a positive value.