Measuring Learning Flashcards

1
Q

Explain current enrollment trends in developing countries (and their relationship with GDP)

A

Enrollment in today’s poor countries is far higher than enrollment was in today’s rich countries when they were poor. GDP is also a strong predictor of learning levels: most countries fall fairly close to the prediction line.

2
Q

When it comes to testing validity, in an ideal world, policymakers should care most about tests that:

A

Predict longer term outcomes that we care about

3
Q

predictive validity of a test

A

The predictive validity of a test is its ability to predict longer-term outcomes such as income, crime, etc.

4
Q

concurrent validity of a test

A

The concurrent validity of a test is how it correlates with other validated tests.

5
Q

Convergent-discriminant validity

A

Convergent-discriminant validity is whether a test is correlated more with tests that measure similar concepts, and less with tests that measure different concepts.

6
Q

Test-retest reliability

A

The consistency with which a test measures a given skill across repeated administrations

7
Q

Administrative data are often used to determine the resources that schools, students and households receive. Because of these policies:

A

Enrollment and attendance both tend to be inflated

Policies tend to give schools more resources when they have higher enrollment and provide incentives to parents based on attendance. Therefore, schools have incentives to inflate enrollment and attendance.

8
Q

What is the most basic measure of teacher effort?

A

Teacher attendance is “the most basic measure of teacher effort”. It can be measured through principal and student surveys. However, unless teacher behavior is an explicit step in the theory of change, measuring it may not be necessary, given how costly effort data can be to collect.

9
Q

Barriers to school participation?

A

Convenience and access, out-of-pocket costs, health issues, underestimation of the long-term benefits of education, and discounting of the future

10
Q

How common is absenteeism of teachers/service providers in developing countries?

A

Absenteeism is widespread and unpredictable

Even when present, teachers are often not teaching

Few service providers face a serious threat of being fired for excessive absences

11
Q

With almost 100% primary enrollment, why are students still struggling to learn?

A

Enrollment itself doesn’t mean that students are regularly attending school

Being in school does not mean that children are learning

12
Q

What is the main approach we could use to improve learning in developing countries?

A

Many options here, but our focus is to pivot expenditure from less to more cost-effective policies to improve outcomes at any given level of per capita income

13
Q

Attendance conditional on enrollment

A

fraction of those enrolled who are present on a given day

14
Q

School attendance in the population

A

Percentage of school days the average child in a given population is in school

15
Q

Why is it important to collect both enrollment and attendance data for a study?

A

– With enrollment data alone, you need to assume the program did not change the attendance rate of those enrolled
– If that assumption fails, measures of impact and cost-effectiveness may be biased
– So it is good to supplement enrollment data with direct attendance data as well

16
Q

2 ways to measure teacher attendance

A

– Teacher attendance records (but often fudged)

– Direct observation during surprise visits (need to do this early in the visit)

17
Q

4 ways to measure teaching efforts

A

– Classroom observations
– Student surveys
– Principal surveys
– Teacher knowledge tests

18
Q

2 ways to measure teacher knowledge

A

– Subject matter knowledge

– Subject-specific pedagogical knowledge

19
Q

Purpose of classroom observations

A

systematize observers’ perceptions of teacher quality

20
Q

Classroom observations - should they be short or long?

A

Short observations are efficient: several short observations offer more information than a single observation of the same total length.

21
Q

Classroom observations - should teachers be able to choose their own lessons?

A

Letting teachers choose their own lessons does not make it harder to identify effective teachers; in fact, it makes it easier.

22
Q

Classroom observations - How should you incorporate principals?

A

Principals are useful observers. They rate their own teachers higher, but their ratings are highly correlated with those of other observers.

23
Q

Classroom observations - should you add another observer?

A

Adding an observer pays off more than adding another lesson.

24
Q

Classroom observations - predictive validity

A

Classroom observation ratings in a given year predict teachers’ value-added in the following year, even after random assignment

However, predictive validity varies across instruments and along the performance distribution (some instruments are better at identifying low- or high-performing teachers)

Predictive validity also varies across subjects (all instruments are better at predicting value-added in math) and according to the types of skills assessed

25
Q

Student surveys - purpose

A

measure students’ perceptions of teacher quality

26
Q

Tripod survey - length & structure?

A

Length:
– Full: 67 questions (elementary) or 92 questions (secondary)
– Lite: 36 questions (elementary and secondary)

Structure: 7 “Cs”
1. Care (does the teacher care about the student?)
2. Control (is the teacher in control of the classroom?)
3. Clarify (does the teacher clarify difficult concepts?)
4. Challenge (does the teacher challenge students?)
5. Captivate (does the teacher keep students’ attention?)
6. Confer (does the teacher engage students in discussions?)
7. Consolidate (does the teacher recap/review material?)

27
Q

Tripod survey - main elements

A

Not organized by “Cs” to avoid “priming” students

Short, age-appropriate statements that children can understand

Some statements are reverse-coded to contribute to the score for each “C”

Likert scale for children to indicate the extent to which they agree with the statement (sometimes expressed in terms of frequency)

28
Q

Principal surveys - purpose

A

systematize principals’ perceptions of teacher quality

29
Q

Principal surveys - domains assessed

A

– Overall teaching effectiveness
– Dedication and work ethic
– Organization
– Classroom management
– Raising student achievement (in math and reading)
– Role model for students
– Student satisfaction with teacher
– Parent satisfaction with teacher
– Positive relationship with colleagues
– Positive relationship with administrators

30
Q

Principal surveys - predictive validity

A

Principals can predict teacher effectiveness with a single question about overall effectiveness.

However, principals are reluctant to identify poor performers, even when there are no stakes.

31
Q

Teacher knowledge tests - purpose

A

measure teachers’ content knowledge, subject-specific pedagogical knowledge, or understanding of student errors

32
Q

School management quality surveys - purpose

A

measure quality of school management (usually in the context of interventions to improve governance)

33
Q

World Management Surveys adapted for education

A

developed by Bloom, Lemos, Sadun, Van Reenen (2015)

• Management quality measured on:
– Operations
– Monitoring
– Target Setting
– People Management

Instrument recently adapted for developing countries by creating finer gradations in the 5-point scale

34
Q

What is the main lesson regarding the use of principal surveys to measure teacher effort?

A

They produce measures of effort that effectively predict student achievement

Principal surveys of teacher effort are remarkably predictive of teacher value-added, even though principals are reluctant to identify weak performers. It is management indices (not teacher-effort measures) that tend to cluster at low levels of the scale. We have no evidence on whether they predict teacher knowledge.

35
Q

In an impact evaluation of an intervention that gives 4th-grade teachers incentives for improvements in reading, what might we be worried about if we measured learning only using an oral test that measures basic literacy (can the student read a sentence)?

A

It may be subject to ceiling effects where the distribution is censored at higher levels of achievement

Our literacy test doesn’t measure potential negative side effects of the incentive program

The test may not pick up differences among highly literate students. Time spent on reading may also increase at the expense of time teaching other subjects, and we are not measuring that. (Oral tests can indeed be adaptive, unlike paper-and-pencil tests.)

36
Q

What is a good test score?

A

• Appropriate to the context
– Major need for piloting, adaptation of instruments
• Measures what we think it measures
– We want to measure learning, not test-taking skills or speed
• Focused on dimensions that we think the intervention might improve
– Requires thinking carefully about what kind of test domains we want to focus on
– Also requires thinking about how the assessment might be ‘gamed’

37
Q

What is a good test score? - Distribution

A

• Continuous, well-distributed measure of student achievement
– No ceiling or floor effects (a quick distribution check is sketched below)
– Not “too easy”, “too hard”, or “too short”
• This Goldilocks zone can often be very hard to achieve!
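
A minimal sketch of the kind of distribution check this implies, in Python; the pilot scores, the function name, and the 5% cutoff are all illustrative assumptions, not a standard:

```python
# Hypothetical pilot check: how much of the distribution piles up at the
# extremes? A large share at the max suggests a ceiling; at the min, a floor.

def ceiling_floor_check(scores, max_score, min_score=0, threshold=0.05):
    """Share of test-takers at the floor and ceiling (threshold is arbitrary)."""
    n = len(scores)
    floor_share = sum(s == min_score for s in scores) / n
    ceiling_share = sum(s == max_score for s in scores) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "possible_floor": floor_share > threshold,      # test may be "too hard"
        "possible_ceiling": ceiling_share > threshold,  # test may be "too easy"
    }

# Made-up pilot scores on a 20-item test: most students max out.
print(ceiling_floor_check([20, 20, 19, 20, 18, 20, 17, 20, 20, 15], max_score=20))
```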

38
Q

What is a good test score? - Discrimination

A

• Tests should be discriminating, i.e., informative at all levels of ability
– Should be able to distinguish differences in absolute achievement around the 10th percentile as well as around median ability
– This is often hard to do:
• PISA, TIMSS, etc. are not informative at very low achievement levels
• ASER is not informative at high achievement levels

39
Q

What is a good test score? - Dynamic comparability

A

Dynamic comparability means the test allows you to measure the progress of student learning over time.

40
Q

What is a good test score? - Cross-sectional comparability

A

Cross-sectional comparability means the test allows you to place a student within a wider distribution of contemporaries: a peer group in the same state, in the same country, or an international peer group.

41
Q

What is a good test score? - Benchmarking

A

Benchmarking asks: given an absolute standard of what counts as grade-appropriate competence, how are your kids doing relative to that benchmark?

42
Q

The main purpose of using a common subset of questions that are repeated across tests is to ensure that:

A

Achievement can be compared across time and samples

43
Q

When designing a test, how should we think about grade-appropriate tests?

A

Grade-appropriate tests are particularly inappropriate for many developing-country contexts (kids are so far behind in learning)

Try to design a test that contains items targeting a wide distribution of achievement

44
Q

When designing a test, how should we think about choosing each item?

A

Each item should map to a concrete skill that we want to test; a subset of items should be repeated across rounds for comparability; and a subset of items should be drawn from other assessments

45
Q

When designing a test, how should we think about language?

A

It should not be assumed that item properties are maintained in translation

46
Q

3 common ways a test is administered

A

Individual oral, group oral, written

47
Q

Advantages vs. disadvantages of the ways tests are administered

A

– Individual oral is much better for assessing children at young ages but very burdensome in the field
– Group oral attempts to replicate the above at scale, but classroom management is not easy and answers are less precise
– Written tests are ideal for later grades but carry a strong possibility of floor effects in primary grades

48
Q

Cognition test type?

A

Raven’s matrices

49
Q

Early Grade Learning test type?

A

EGRA, ASER

50
Q

Higher Level Learning test type?

A

SAT, GMAT

51
Q

Learning outcomes are often reported in terms of standard deviations rather than raw test scores primarily because:

A

Doing so allows us to compare results across studies that use different tests
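
A minimal sketch of what this standardization looks like, with made-up raw scores; dividing by the control-group SD is one common convention (and matches the “control group normalized to zero” reporting in the cards below):

```python
# Standardize the treatment-control difference by the control-group SD,
# giving a unit-free effect size comparable across different tests.
from statistics import mean, stdev

def effect_size_sd(treatment_scores, control_scores):
    """Treatment effect in control-group standard deviations."""
    return (mean(treatment_scores) - mean(control_scores)) / stdev(control_scores)

control = [12, 15, 14, 10, 13, 16, 11, 14]  # hypothetical raw scores
treated = [14, 17, 15, 12, 15, 18, 13, 16]
print(round(effect_size_sd(treated, control), 2))  # ~0.98 SD with these numbers
```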

52
Q

Item Response Theory

A

IRT allows you to compare kids within a common distribution, so you can make better cross-sectional comparisons and better over-time comparisons, even though the content of the test itself might be changing.

It models the probability that an individual with a given ability will get an item right.

The most important advantage of IRT is the ability to link across tests and over time.

53
Q

item characteristic curve

A

Maps the trait (ability or knowledge) to the proportion correct
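
One standard functional form for this curve is the three-parameter logistic (3PL) model; a minimal sketch in Python, where a, b, and c are the discrimination, difficulty, and guessing parameters defined in the next three cards:

```python
import math

def icc_3pl(theta, a, b, c):
    """P(correct | ability theta) under the 3PL item characteristic curve.

    c: guessing parameter       (y-intercept: P as theta -> -infinity)
    b: difficulty parameter     (the theta at which P = (1 + c) / 2)
    a: discrimination parameter (how steep the curve is around b)
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# At theta = b the curve passes through (1 + c) / 2:
print(icc_3pl(theta=0.0, a=1.5, b=0.0, c=0.2))  # ≈ 0.6 = (1 + 0.2) / 2
```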

54
Q

guessing parameter

A

The probability that an examinee with no ability or knowledge will answer a question correctly (i.e., guesses and gets it right by chance)

Graphically, it is where the curve intersects the y-axis

55
Q

Difficulty parameter

A

How difficult the question is: the level of ability an examinee needs to get the question right with probability (1 + c)/2, where c is the guessing parameter.

Graphically, it is the mid-point of the curve, read off the x-axis. Moving the curve to the right increases the difficulty.

56
Q

Discrimination parameter

A

A measure of how well the question distinguishes between examinees of different ability/knowledge; graphically, how steep the curve is.

If the ICC is much flatter, even kids who don’t know much can get the item right, and kids who know a lot can get it wrong.
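
A quick numeric illustration, reusing the 3PL curve from the sketch above with hypothetical parameter values:

```python
import math

# 3PL curve with fixed difficulty b = 0 and guessing c = 0.2 (illustrative)
p = lambda theta, a, b=0.0, c=0.2: c + (1 - c) / (1 + math.exp(-a * (theta - b)))

for a in (2.0, 0.5):  # high vs. low discrimination
    low, high = p(-1.0, a), p(1.0, a)
    print(f"a={a}: P(low)={low:.2f}, P(high)={high:.2f}, gap={high - low:.2f}")
# a=2.0 -> gap of ~0.61: the item separates low and high performers well.
# a=0.5 -> gap of ~0.20: a flat curve barely distinguishes them.
```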

57
Q

What are we able to do with a test designed with IRT, that we are unable to do with a test that was not designed with IRT?

A

Report treatment effects in standard deviations relative to the absolute progress made in the control group

58
Q

Using a simpler non-IRT test, we can report…

A

Using a simpler non-IRT test, we can report:
– Total scores, relative to the control group (simple difference) or relative to the baseline (pre-post)
– “Improvements” relative to the control group (difference-in-difference)
– Improvements as percentages, with the baseline, the control-group total, the control-group improvement, or even the control-group percentage gain in the denominator
– Results as standard deviations (always normalizing the control group to equal zero)
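
A worked example with made-up means, covering most of these reporting options:

```python
base_c, end_c = 40.0, 50.0  # control group: baseline and endline mean scores
base_t, end_t = 40.0, 55.0  # treatment group (all numbers hypothetical)
sd_c = 10.0                 # control-group SD

simple_diff  = end_t - end_c                        # 5.0 points
pre_post     = end_t - base_t                       # 15.0 points
diff_in_diff = (end_t - base_t) - (end_c - base_c)  # 5.0 points

pct_of_baseline     = 100 * diff_in_diff / base_t            # 12.5%
pct_of_control      = 100 * diff_in_diff / end_c             # 10.0%
pct_of_control_gain = 100 * diff_in_diff / (end_c - base_c)  # 50.0%

effect_in_sd = simple_diff / sd_c  # 0.5 SD, with the control group at 0
print(simple_diff, diff_in_diff, effect_in_sd)  # 5.0 5.0 0.5
```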

59
Q

With an IRT test, we can report…

A

With IRT, we can do any of the things we can do with a non-IRT test, AND we can report results as standard deviations relative to the baseline and the control group, where the control group has a positive value.
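
A small arithmetic sketch of the difference, with hypothetical IRT-linked scores standardized against the baseline distribution:

```python
# Because baseline and endline tests are linked to one common scale, the
# control group's absolute progress is visible instead of being normalized
# to zero (all values below are made up).
baseline_mean     = 0.0  # baseline distribution standardized to mean 0
control_endline   = 0.3  # control group gained +0.3 SD in absolute terms
treatment_endline = 0.5  # treatment group gained +0.5 SD

print(f"Control absolute gain: {control_endline - baseline_mean:+.1f} SD")
print(f"Treatment effect:      {treatment_endline - control_endline:+.1f} SD")
```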