Introduction to Measurement Flashcards
Data collected in an RCT are typically used to measure
We collect data because it is part of our theory of change, it can help improve power, it can help us better measure the CACE, or it supports generalizability. We collect personally identifiable information and unique IDs only for operational purposes, i.e. to track individuals; we do not use these data, per se, in our analysis.
Hours spent helping child with homework would be an indicator for which part of the LogFrame?
“Hours spent helping your child with homework” is an indicator for “Parents get more involved in their children’s education at home,” which in this LogFrame is an outcome.
In a log frame, what would be considered a “source of verification”?
A source of verification is where the data come from.
Ex: arrest records are the source of data.
Which of the following reasons (consistent with our theory of change and the results) explain why we see significantly more road improvements in West Bengal reservation villages than in Rajasthan reservation villages?
A final step in our theory of change was that public investments would better reflect women’s priorities in reservation villages. Indeed, in West Bengal, road construction was a higher priority (relative to men), and we saw an increase in road construction. In Rajasthan, this wasn’t the case. Note, moreover, that we do not know how the priorities of women in West Bengal compare to those of women in Rajasthan in absolute terms. According to our theory of change, what matters is the relative priorities between men and women, not the absolute levels of priority among women. For example, if women and men in Rajasthan both place a very high level of priority on road improvement, then we would not expect to see any changes in response to greater women’s representation. Meanwhile, in West Bengal, women rank road improvement as a moderate priority (lower than their counterparts in Rajasthan), but men rank road improvement as a low priority. Here, we would expect to see positive effects on road improvement from greater women’s representation.
Which of the following is most likely to be considered primary data in the evaluation of a social program?
- Tax records to measure income
- National Oceanic and Atmospheric Administration satellite data to measure rainfall for weather insurance programs.
- Hospital records to measure health status
- Census records to measure occupation
- Online survey to measure approval ratings for a politician
Online survey to measure approval ratings for a politician
Primary data are collected principally for the purpose of the research or evaluation at hand. An online survey is likely fielded as part of the evaluation itself. Tax, hospital, and census records are collected for administration, policy, or other purposes. NOAA satellite data are likely collected by climate scientists for climate research, not by social scientists for social programs; they are collected for research, but not for this evaluation.
Our primary research asks whether our school-feeding program leads to better learning outcomes. Our secondary question is whether the impact is larger for those who are malnourished.
To answer our secondary question, when is the best time to collect indicators measuring nourishment?
Baseline
Due to randomization, the proportion of malnourished children should be similar across groups at baseline. As soon as the intervention begins, however, the composition of this group may start to change because of the intervention. To answer our secondary question, we want to compare statistically similar groups (from the beginning).
Kelsey suggests that some accusations claiming that researchers are “experimenting on people,” are unjustified because …
The program is not being implemented by the researcher and would happen anyway
Kelsey gives the example that if the government is rolling out a program to provide computers in classrooms, it will not necessarily send out forms asking parents for permission. The computers are the “experiment”. Researchers may come in after this decision has already been made to measure outcomes, which itself isn’t the “experiment”.
Empowerment is…
Data
An indicator
A response
A construct
A construct. Empowerment is a concept that has to be distilled into an indicator or question.
Blood Pressure = 110/71 mm Hg is:
Data
An indicator
A response
A construct
Data
110/71 mm Hg is a specific measurement, the number for a specific individual; in other words, it is a piece of data. Since it is a physical measurement rather than a survey item, there is no question or response process involved.
Discrimination is:
Data
An indicator
A response
A construct
A construct. Discrimination is a concept that has to be distilled into an indicator or question.
Kilograms of rice per hectare is:
Data
An indicator
A response
A construct
This is an indicator, probably meant to measure the construct of rice yields. The data or response would be a specific number of kilograms per hectare
Outcome: annual consumption, Indicator: food expenditure in last week
This example may have problems with:
Validity
Reliability
Both
Validity
Outcome: annual consumption, Indicator: food expenditure in last month
This example may have problems with:
Validity
Reliability
Both
Both (validity and reliability)
Validity is to Reliability as:
- Noise is to Bias
- Precision is to Noise
- Bias is to Accuracy
- Accuracy is to Precision
- Precision is to Accuracy

Accuracy is to Precision. Validity, like accuracy, is the idea that we’re not systematically missing our target (the truth) in a particular direction; in measurement, our target is our construct. Reliability, like precision, is the idea that each subsequent attempt at measurement (or estimate) is consistently close to prior attempts.
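To make the analogy concrete, here is a small illustrative simulation (not from the deck; all numbers are invented): a scale that always reads 2 kg high is reliable but not valid, while one that is right on average but jumps around is valid but not reliable.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 60.0  # hypothetical true value of the construct (weight in kg)

# Biased but reliable: reads ~2 kg high every time, with little noise
biased_scale = true_value + 2.0 + rng.normal(0, 0.1, 1000)
# Valid but unreliable: centered on the truth, but very noisy
noisy_scale = true_value + rng.normal(0, 5.0, 1000)

print(f"biased scale: mean={biased_scale.mean():.2f}, sd={biased_scale.std():.2f}")
print(f"noisy scale:  mean={noisy_scale.mean():.2f}, sd={noisy_scale.std():.2f}")
# The biased scale is precise (low SD) but inaccurate (mean off by ~2 kg);
# the noisy scale is accurate on average (mean ~60) but imprecise (high SD).
```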
What are the four stages of the response process?
Comprehension, Retrieval, Estimation, Answer
Comprehension: whether the respondent understands what is being asked; Retrieval of the necessary information from memory; Estimation: using judgement to synthesize memories into an answer; Reporting the answer based on the response options given
Measurement error can be introduced at which stage(s)?
- Indicator selection
- Respondent’s comprehension of the question
- Retrieval of information
- Estimation or judgment
- Reporting an answer
Measurement error can be introduced at all of these stages, whether it’s a problem with an indicator’s construct validity, or confusion at any stage in the response process
The response to the question, “Do you plan to marry your daughters before they are 18 years old?” should be considered:
A fact, because the respondent knows what their plan is today, even if the plan never materializes
A quasi-fact, because marriage plans are a question about identity that typical categories do not capture
Subjective, because it has to do with an expectation, and at the moment of responding, is known only to the respondent and cannot be verified
Any expectation is known only to the person responding and cannot be directly observed. It is therefore subjective.
A person’s occupation would be considered:
A permanent state of being
A fluctuating state of being
A habitual action or behavior
An episodic action or behavior
A fluctuating state of being
A person’s occupation is a state of being in that it is unlikely to change from day to day (it is not a behavior or action); however, it can change at any time.
Which of the following questions is meant to measure an “attitude”?
Do you want your daughter to become a doctor?
Do you think your daughter has the ability to become a doctor?
Do you believe your daughter will become a doctor?
Do you think women make good doctors?
An attitude is like a belief, but one that also implies a normative judgment. Stating whether someone makes a good doctor is a normative judgment. The others are an aspiration, a perception, and an expectation (respectively).
Exclusive proxy indicator
One that is correlated with a specific construct, and not with other competing constructs.
An exclusive proxy indicator is one that measures the construct we care about, and likely cannot be explained by other factors. For example, pregnancy is an indicator of having been sexually active.
What is true about the Kling, Liebman, Katz method of creating a standardized index?
- Each individual component is weighted equally (correct)
- The unit of measurement for response options will not affect the relative weight of a component, e.g. using kilometers versus miles (correct)
- To increase the relative weight of a particular category within an index (e.g. mobility), one can add components to that category
By standardizing, almost by definition, roughly half of the responses will be negative. The only concern with negatives is that they are consistent with respect to the index. Less of something bad should have the same sign as more of something good.
Field-coded question
In a field-coded question, surveyors ask an open-ended question, and then record the response using specific response options, similar to a close-ended question
Open-ended questions - pros and cons
Pros: the researchers may not anticipate all of the possible response options, and it might take too long for the surveyor to list all of the response options if presented as a close-ended question.
Cons: to convert open-ended responses into usable data, one must code each individual’s response into possible response options, which can be subjective and can take a lot of time. Because ex-post coding often relies on the judgment of the coder, it can increase the potential for error.
Why might we want to use a close-ended categorical response option (i.e. ask people to select the option that reflects the appropriate range in which their response would fall) rather than an open-ended numerical response option (to get a precise number)?
If the respondent does not know a precise answer, selecting a category may be more accurate than responding with a precise number
Range-options for certain demographic characteristics (e.g. age range) can provide a bit more anonymity than precise numbers (e.g. birth date)
However, categorical responses can be more difficult to analyze linearly, because linear relationships are usually estimated from single numbers.
What is the difference between a Likert scale and a numerical rating scale?
Likert and numerical rating scales are nearly identical. What makes a numerical rating scale unique is that each response option (and sometimes points between un-labeled options) corresponds to a number.
In the past month, how many times have you skipped a meal?
A. 0 times
B. 1-5 times
C. 6-10 times
D. More than 10 times
What problem has been introduced in this survey question?
Vagueness: What is the definition of skipping a meal? Anything less than 3 meals per day? What is the definition of a meal?
A surveyor asks whether the household has made any large purchases in the past 30 days. The respondent happened to purchase a bicycle 40 days ago, so the respondent replies “a bicycle”.
What is the bias that has been introduced in this example?
Telescoping bias occurs when a respondent includes a behavior, action, or event from outside the reference period; it is particularly common with “lumpy” purchases.
In Country A, there was a study of a government agricultural extension program, where farmers are trained by government agronomists on the benefits of using fertilizer. A number of farmers in the treatment group report in the endline that they used more fertilizer than they actually did because after receiving the extension program, they recognized that using more fertilizer was “the correct answer”
What problem was introduced in this scenario?
Social desirability bias occurs when respondents give an answer that they believe is “socially acceptable or desirable”. In this case, it is not a measurement effect, as a measurement effect changes behavior itself, not only the response to the question.
We are studying the randomized rollout of a government program to provide electricity to villages, and its impact on learning outcomes. In the endline, we use mobile devices to collect data on literacy levels. However, when the endline is complete, we analyze the data and discover that in the control group, a large proportion of that data is missing. We call in our survey team and learn that the mobile devices would often run out of battery, and in some villages there was no place to recharge the device. This may have led to lost data.
Which of the following methods is least likely to introduce bias?
- Use the data as is since it was collected using the same method in both groups
- Return to both treatment and control villages and re-conduct the endline with paper surveys
- Return to the control villages to conduct the endline using paper surveys
- Return to the control villages with back up mobile chargers and conduct the endline using the same mobile devices
Return to both treatment and control villages and re-conduct the endline with paper surveys
Using the original electronic data would likely introduce attrition bias, since villages in the control group are more likely to have missing data because they are less likely to have electricity. Conducting the endline in only the control group with a different method or at a different time might introduce systematic error that biases our results. Only re-surveying both groups at the same time with the same method avoids introducing a systematic difference between the groups.
Intermediate outcomes
Changes necessary to achieve the final outcomes. Usually changes in: • Knowledge & beliefs • Attitudes & aspirations • Capacity & ability • Decisions, behaviors & actions
Purpose of measurement
To measure outcomes (long-term, intermediate, first order, second order, inputs, outputs, etc.); covariates (provide background on respondents, classify respondent behaviors, reduce standard errors); treatment compliance (individual and group level; predictors of compliance); heterogeneous treatment effects; and context for external validity
What are the four rows common in a Log Frame?
Impact (goal/overall objective)
Outcome (project objective)
Outputs
Inputs (activities)
What are the four columns common in a Log Frame?
Objectives/Hierarchy
Indicators
Sources of Verification
Assumptions/Threats
First-order questions in measurement
- What data do you collect?
- Where do you get it?
- When do you get it?
Where can we get data?
• Obtained from other sources
– Publicly available
– Administrative data
– Other secondary data
• Collected by researchers
– Primary data
Types and Sources of Data
• Information provided by a respondent
○ Could be through a survey, exam results, etc.
○ Information about a person, household, possessions
• Automatically generated
○ Automatic tollbooths – detailed individual data
○ Or, a sensor picking up data all the time (not about a single person)
• Information NOT about a person/household/possessions
○ Pollution monitors, etc.
○ Still an active data collection process most likely, but not based on a person
Ways used to collect data on people
- Surveys
- Exams, tests, etc.
- Games
- Vignettes
- Direct Observation
- Diaries/Logs
- Focus groups
- Interviews
Main types of surveys
• Interviewer administered
– Paper-based
– Computer-assisted/ Digital
– Telephone-based
• Self-administered
– Paper
– Computer/Digital
When to collect data during the evaluation process
• Baseline
• During the intervention – process data, monitoring of the intervention (M&E)
• Endline
• Follow-up
• Scale-up
Concept of measurement (from construct to data)
Construct –> Indicators –> Data Collection (“Response”) –> Data
Goals of measurement
Accuracy
Unbiasedness
Validity
Precision
Reliability
Validity (in theory)
How well does the indicator map to the construct?
(e.g. do IQ tests capture intelligence?)
Construct –> (Validity) –> Indicators
Reliability (in theory)
The measure is consistent and precise vs. “noisy”
Construct –> (reliability) –> Indicators –> (reliability) –> Data Collection (“Response”)
4 Steps of the Response Process
- Comprehension of the question
- Retrieval of information
- Judgement and estimation
- Reporting an answer
Response Process - Comprehension
How well the respondent understands the question
e.g. How many times did you consume rice this month?
Does this mean just pure rice, or any product that contained rice? Rice flour/rice crackers?
Response Process - Retrieval
When the respondent thinks about the question, and retrieves the information required to answer
Question: When you received your first measles vaccination, on a scale of 1–5, with 1 being painless and 5 being unbearably painful, what was the level of pain?
You probably received this vaccination as a child; that is too long ago to retrieve accurate information, so any data you do collect are likely very inaccurate.
Response Process - Estimation/Judgement
When the respondent has to estimate/judge the answer (this should be minimized)
For example, I did “X” thing twice last week, so over the past month…that’s 2x’s a week over four weeks…so probably about 8 times.
Response Process - Response
Respondent actually gives a verbalized response at this point.
Even after the respondent has gone through comprehension/ retrieval/judgement - there may be some breakdown between the enumerator and respondent.
For example - if you ask about illegal drug use, the respondent might have an accurate answer but give an inaccurate one for fear of implications.
Objective vs. subjective facts
Objective - facts
Subjective - an opinion, attitude, perception, aspiration, or expectation (not verifiable by external observation or records)
Quasi-facts - race, religion, ethnicity, gender
Here, the response could be motivated by objective factors (biology, etc.) or by subjective factors (personal identity)
Two ‘branches’ of facts
state of being
actions and behaviors
Permanent state of being
Permanent facts
e.g. date and district of birth
Fluctuating state of being
Can change at any time (e.g. age, district of residence)
…but sometimes in a predictable way (age, years of education, years of experience); could change due to outside circumstances (household size, marital status, number of children); others fluctuate but at some point become static (highest education level; number of children a woman has had)
Habitual behaviors
e.g. regularly attending school
Useful when asking about frequency
Episodic behaviors
One time and/or infrequent behaviors
e.g. the purchase of a TV
Key things to consider when trying to measure ‘facts’
• Clearly define variables
– What is a household?
– Who can be considered household members?
– What is considered a room in a household? etc.
• Determine the level of precision you need for your study
• Useful to look at standard questionnaires for framing these questions
• Be aware of local context and culture
Subjective questions - what to measure?
Beliefs (Cognitive)
Expectations (behavioral intentions)
Attitudes (Evaluative)
Subjective questions - Beliefs
Beliefs (Cognitive) - a set of beliefs about the attitudinal object (e.g. How healthy are cigarettes on the following dimensions?)
Subjective questions - Expectations
Expectations (behavioral intentions) - Respondents’ attitudes in relation to their future actions (e.g. Do you plan to quit smoking in the next year?)
Subjective questions - Attitudes
Attitudes (Evaluative) - Evaluation of the object (e.g. Do you associate smoking with being cool?)
How can we think about subjective questions in terms of the outcomes we are looking to measure?
Subjective questions are rarely the ultimate impact outcome we intend to measure. However, they are very useful as intermediate outcomes. If individuals are unaware of the dangers of cigarette smoke, we probably need to change their beliefs before expecting them to stop smoking.
Subjective measures can also help us understand the context, and possibly the assumptions, under which our theory of change will work. For example, an information campaign on the dangers of smoking may not be effective if people are already aware of the dangers but think they are worth the cost of looking cool.
Which constructs are particularly hard to measure? What is a common solution?
Sensitive: respondents will often not answer a sensitive question honestly, even if they know the answer.
Unknown: sometimes the respondent simply does not know the answer.
Use proxy indicators! Rather than asking the sensitive question directly, we can ask about correlated measures. The proxy must be correlated with the construct, and we should make sure the correlation holds across the range of the outcome we are measuring.
Exclusive indicator
A single proxy indicator that is correlated with our construct and not with competing constructs, so that one proxy alone can suffice to measure it
For example, as rice yields increase, a larger proportion could be used to replant for the following season, or the household may now be able to sell some of that rice in the market or trade it for other goods. Therefore, unless we are certain that all rice grown is consumed, rice yield is not an exclusive indicator of nutrition.
Exhaustive indicator
We sometimes want an exhaustive list of indicators for our construct. And we’re not always confident that the indicator always moves in the same direction as our construct. We may not want to rely only on one proxy alone.
So, for example, if we want to know total calorie consumption, we may need a full accounting of all the food our respondents consume. For each indicator or food item, we ask about the quantity consumed in terms of weight or servings, convert it into calories, and add up all the calories at the end to get total caloric intake.
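A minimal sketch of that accounting, with invented quantities and rough, illustrative calorie-conversion factors:

```python
# Hypothetical reported quantities (kg consumed last week) and
# illustrative conversion factors (kcal per kg of cooked food)
consumed_kg = {"rice": 3.5, "lentils": 1.0, "cooking oil": 0.25}
kcal_per_kg = {"rice": 1300, "lentils": 1160, "cooking oil": 8840}

# Exhaustive accounting: convert each item to calories, then sum
total_kcal = sum(qty * kcal_per_kg[item] for item, qty in consumed_kg.items())
print(f"Total caloric intake last week: {total_kcal:,.0f} kcal")
```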
Index to measure construct
The middle ground between a single indicator and a comprehensive accounting of all things consumed would be an index.
For example, we create a sample of food items (a consumption basket) and weight each item’s contribution to the value of the index in proportion to how much of that item a typical household consumes.
Component
An item in an index
Equal weighting in indices
Sum of all components, no weighting
If we believe our index components are comprehensive, or at least representative
Thematic clustering in indices
Weight components by themes
How to think about positive/negative signs when creating indices?
Need to take into account components that might be negatively correlated
We don’t want to naively add together components that are positive and negative signs of empowerment and have them cancel each other out in the end. We just need to make sure we adjust the signs so that less of something bad carries the same sign as more of something good.
Opinion-based weighting indices
Sometimes, expert judgment is used. For many exams, the teacher may weight a section by how much time was spent on the topic during the semester, or by how important he or she feels it is to getting to the next level of difficulty.
Principal Components Analysis
Unobserved Components Model
Seemingly Unrelated Regressions
There are statistical methods that weight index components by their actual or potential explanatory power.
What these methods all have in common is that they remove any correlation between components so that their latent attributes are not double- or triple-counted when contributing to the construct we care about.
Standardized weighting (Kling, Liebman, Katz)
a method that standardizes individual components within the index before compiling them. This is also called a z-score index.
So, for example, if one component is ranked on a scale from 1 to 100 and another from 1 to 10, the former’s contribution will not be 10 times that of the latter. On the other hand, the more components that are included within a particular theme, the more weight that theme has in the overall index.
Standardized weighting (Kling, Liebman, Katz) - 4 steps
• Determine the comparison group against which you will standardize
– Baseline
– Control group of the same round
We can use the entire sample from the baseline, or we can just use the control-group observations from the endline. Obviously, if you don’t have a baseline measure, you’ll need to use the control group.
• Standardize individual components of the index
– Standardized variable = (variable – variable mean)/ variable standard deviation
We demean each observation: we take the average value of that component in the comparison group (either the baseline sample or the endline sample of the control group) and subtract it from each observation. Then we divide the demeaned value by the standard deviation of the comparison group.
• Average components together
• Standardize the final index
– Standardized index = (index – index mean)/ index standard deviation
We subtract the mean of the entire index (the entire index, not just the comparison sample) from each observation’s index value, and divide by the standard deviation of the index.
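A minimal pandas sketch of these four steps, assuming a DataFrame `df` with one row per observation and hypothetical component columns; the `negative` argument handles the sign adjustment discussed above (components where more of something is bad):

```python
import pandas as pd

def klk_index(df, components, negative=(), comparison=None):
    """Kling-Liebman-Katz style z-score index (illustrative sketch).

    components : columns to combine into the index
    negative   : columns where more is 'bad'; their sign is flipped
    comparison : boolean mask for the comparison group (baseline sample
                 or endline control group); defaults to the full sample
    """
    if comparison is None:
        comparison = pd.Series(True, index=df.index)

    z = pd.DataFrame(index=df.index)
    for col in components:
        x = -df[col] if col in negative else df[col]
        ref = x[comparison]
        # Step 2: demean by the comparison-group mean, divide by its SD
        z[col] = (x - ref.mean()) / ref.std()

    # Step 3: average the standardized components (pandas skips missing
    # values by default, so an observation with one missing component
    # is not dropped entirely)
    index = z.mean(axis=1)

    # Step 4: standardize the final index over the entire sample
    return (index - index.mean()) / index.std()

# e.g. klk_index(df, ["mobility", "decisions"], negative=["days_restricted"],
#                comparison=df["treatment"].eq(0))
```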
Common mistakes when creating indices
- Forgetting to recode variables that measure something bad (flipping their signs)
- Forgetting to recode missing values
- Forgetting to account for missing values
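A small companion sketch for the first two mistakes, assuming hypothetical missing-value codes of -99 (refused) and -98 (don’t know):

```python
import numpy as np
import pandas as pd

MISSING_CODES = [-99, -98]  # hypothetical survey codes for refused / don't know

def clean_component(series, bad=False):
    """Recode missing codes to NaN; flip the sign of 'bad' components."""
    out = series.replace(MISSING_CODES, np.nan)
    return -out if bad else out

df = pd.DataFrame({"days_restricted": [2, -99, 5, 0]})  # toy data
# 'days restricted' measures less empowerment, so flip its sign
df["days_restricted"] = clean_component(df["days_restricted"], bad=True)
print(df)  # the -99 becomes NaN instead of being averaged into the index
```

Averaging with pandas’ default skipna behavior then accounts for the remaining missing values rather than silently treating them as zeros.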
Alternative index methods
• Principal component analysis (PCA)
– Reduces multidimensionality of data
• Seemingly Unrelated Regression Estimation (SURE)
With a principal components analysis or other similar methods, we ensure components are combined into themes so as to maximize variation. In other words, we adjust for inter-component correlation within each theme, and then we adjust for the inter-theme correlation at the end.
The SURE method (or seemingly unrelated regression estimation) does basically the same thing without creating themes first.
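A minimal scikit-learn sketch of the PCA approach, using random stand-in data for the components; the first principal component serves as the index:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # stand-in for 3 index components

X_std = StandardScaler().fit_transform(X)     # PCA is sensitive to scale
pca = PCA(n_components=1)
index_pca = pca.fit_transform(X_std).ravel()  # first principal component

# Share of the total variance across components that the index captures
print(pca.explained_variance_ratio_)
```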
Types of questions & response types
Open-ended questions; Responses: Verbatim, Numeric
Field-coding questions
Close-ended questions: Single response, Multiple responses, response filters, ratings, rankings
Open-ended questions
Respondents are allowed to talk through responses, rather than respond to pre-selected codes
Issues with coding open-ended questions
– Coding free-response material is:
• Time consuming
• Costly
• Induces coding error
– Requires good interviewer skills in recognizing ambiguity of responses and probing (if required)
Open-ended verbatim questions
• Verbatim responses
– E.g.: What are the most significant health concerns faced by you and your family?
– Best to use when you don’t know too much about the likely responses
– Requires good interviewer skills in recognizing ambiguity of responses and probing (if required)
Open-ended numeric questions
Numeric responses
– E.g.: What is your age?
– Often associated with demographic variables
– E.g. How many times did you visit the hospital in the past 30 days?
– Units must be explicit
– E.g. How much rice did you consume in the past 7 days?
The fact that we’re asking for a numeric response is restricting enough that it’s easy to convert into data.
However, we may run into problems if there are implied units, but those units are not made explicit.
Close-ended questions
Respondents are presented with a set of pre-coded responses to choose from
• Response categories usually generated through cognitive interviews, focus group discussions and pretesting
• Respondents are given both the topic and the dimensions on which answers are wanted
Often field-tested first, to ensure there isn’t a lot of misinterpretation
• Single choice responses
yes/no or true/false question
Did you work as a hired laborer in the autumn season?
A. Yes
B. No
• Multiple choice responses
Another example is a typical standardized exam where you fill in the bubble A, B, C, D, or E. Sometimes these are called multiple choice questions. However, they should not be confused with multiple choice responses. Because of the potential for confusion, we usually refer to multiple choice responses as “choose all that apply” or “select all that apply” questions.
In which seasons did you work as a hired laborer during this year? (SELECT ALL THAT APPLY)
A. Autumn
B. Spring
C. Summer
D. None
Response scales - what 3 types?
– Likert
– Numeric
– Frequency
benefits to using ranges for response options
First, it could mitigate privacy concerns the respondent may have. Or if it’s a question that requires estimation or recall–for example, how much rice did you consume last month–it could help the respondent if they’re unsure of the precise answer.
Forcing them to give a numeric answer rather than choosing a range may lead them to round up or round down, which could bias the results.
Sometimes we want to know the reasoning behind people’s opinions, actions, behaviors, or decisions. Even if we see a pattern amongst a majority of the population, that pattern may not hold for each and every respondent. It would be presumptuous to assume it did.
Likert Scale
Do you strongly agree, agree, neither agree nor disagree, disagree, or strongly disagree?
Likert - Bipolar scale
Sometimes it’s set up so that the first option is very positive, like strongly agree, and the last option is very negative, like strongly disagree. If it were numeric, it might range from, say, plus 2 to minus 2. This would be called a bipolar scale.
Likert - Unipolar scale
Other times, it’s set up so that the last option is equivalent to a zero: “do not agree” rather than “strongly disagree”. The numeric equivalent might be zero to four.
Central tendency bias
One thing to note here is the inclusion of a middle alternative, in this case the neutral option. A middle alternative offers an indifference point between being for or against a particular view.
Providing a middle alternative has pros and cons. It may be the best option for those who are truly indifferent. However, sometimes respondents will subconsciously default to a middle option because cognitively it is least taxing.
This is known as central tendency bias. Central tendency bias may lead to a disproportionate number of respondents taking the middle option, meaning the responses won’t reflect the true underlying distribution of our population.
Numeric scales
If we want a larger spread than a typical five point Likert provides us, or if we want to quantify the responses as a continuous variable, we can use numeric scales.
Numeric scales help us extract a bit more granularity out of our sample if we believe there is true underlying variance in opinions.
So for example, on a scale from 0 to 10, how much do you agree with the following statement?
Guttman scale
Only two options given - agree or disagree
Frequency scale
Similar to a Likert scale
For example, how often do you visit your child’s school? One set of responses could be similar to a Likert scale: never, rarely, sometimes, often, always. The responses represent some implied quantity, but the magnitude of those quantities is subjective.
Alternatively, we can use frequency scales that are closer to reflecting true numbers. For example, daily, weekly, yearly.
Again, this doesn’t give us a precise answer and makes analysis slightly more difficult if we hope to estimate a linear relationship between this variable and some other. But it may make it easier for our respondent to translate a vague estimate into a response.
Rankings
This is particularly useful when respondents do not have an absolute sense of quantity, but they do have a relative sense. They can sort the response options in some way or the other.
Which teaching-learning materials do you use most often? (Rank the top 3)
Options: Workbooks, flipcharts, textbooks, maps, flash cards, games etc.
Field coding
A version of an open-ended question in which the response options are not given to the respondent. The respondent answers in his or her own words, and the surveyor then codes the response using pre-determined response categories.
This is useful when we don’t want to prompt the respondent with possible options, because, perhaps out of convenience, they may just select one, several, or all of the responses given to them without even thinking.
This is especially true if the response options might signal to the respondent which answers the researcher considers “acceptable”. However, field coding requires a bit more faith in our surveyors, or at least a bit more skill.
Measurement error: Vagueness
Vague concepts where respondents may interpret the question in a different way
Make sure to define vague concepts
Measurement Error: Completeness
The response categories do not include all categories that can be expected as a response
Pilot the question to make sure that the categories are exhaustive
Measurement Error: Negatives
Questions that include negatives can be confusing to the respondent and lead to misinterpretations.
Avoid unnecessary negatives
Measurement Error: Overlapping Categories
The categories overlap each other.
Make sure that all categories are mutually exclusive
Measurement Error: Presumptions
The question assumes certain things about the respondent
Use filters and skip patterns
Measurement Error: Framing effect
People react to a particular choice in different ways depending on how it is presented (e.g. preferring gains over losses)
Try to be neutral when framing questions
Measurement Error: Recall Bias
People may retrieve recollections regarding events or experiences differently
You can ask respondents to keep a diary or save their receipts
Measurement Error: Anchoring Bias
People tend to rely too heavily on the first piece of information seen
Avoid adding anchors to your questions
Measurement Error: Telescoping Bias
People perceive recent events as being more remote than they are (backward telescoping) and distant events as being more recent than they are (forward telescoping)
Visit once at the beginning of the reference period. Then ask, “since the last time I visited you, have you…?”
Measurement Error: Social Desirability Bias
Tendency of respondents to answer questions in a manner that will be viewed favorably by others, e.g. emphasizing strengths, hiding flaws, or avoiding stigma
Ask indirectly, ensure privacy
Measurement (Survey) effects
- Act of being surveyed changes subsequent behavior
- Particularly relevant for panel surveys where there are multiple interactions with respondents
More data are better for analysis, but measurement effects could change the interpretation of subject behavior. Consider which questions could be asked at endline only, or obtained through non-survey methods.
Bias is uncorrelated with treatment
Imagine we have a measurement instrument that systematically overestimates the value of our outcome, but it does so the same amount in both the treatment and control groups
If we look at either the difference in our outcome between baseline and endline, or between the treatment and control groups, we’ll estimate the true difference and the true impact without bias.
This is an example of where the magnitude of bias is totally uncorrelated with the treatment assignment.
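A toy simulation (all numbers invented) makes the contrast with the next card concrete: a constant measurement bias cancels out of the treatment-control difference, while a treatment-correlated bias does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
treat = rng.integers(0, 2, n)                  # random assignment
true_y = 10 + 2 * treat + rng.normal(0, 1, n)  # true impact = 2

constant_bias = true_y + 3              # reads 3 units high for everyone
differential_bias = true_y + 3 * treat  # over-reporting only in treatment

def diff_in_means(y):
    return y[treat == 1].mean() - y[treat == 0].mean()

print(f"constant bias:     {diff_in_means(constant_bias):.2f}")      # ~2, unbiased
print(f"differential bias: {diff_in_means(differential_bias):.2f}")  # ~5, biased
```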
Bias is correlated with treatment
One group (treatment or control) is over-reporting positive or negative results, leading to an inaccurate assessment of whether or not there was an impact
(i.e. exaggerated reports of positive behavior in the treatment group would make it seem like there was a large impact when, in fact, the estimate is wrong and biased)
To reduce the chance of bias, for all treatment groups we should collect data with:
– Same enumerators (blinded to the treatment assignment)
– Same time period
– Same methods
– Same incentives
We want to make sure that the differences we measure between the two groups are due to the treatment and the treatment alone, not to surveyor characteristics or biases.
A biased measure will bias our impact estimates. True/False
It depends. If the bias differs systematically between the treatment and control groups, it will introduce error. If the bias is identical for both groups, then it may not introduce error.
In education policy circles, there is a contentious debate about the role of standardized exams. Opponents argue that standardized exams incentivize teachers to “teach to the test”. In other words, opponents take issue with the _____ of exams as a measure of learning levels
Validity.
Opponents take issue with whether standardized exams are a valid measure of learning levels, since one can do well on a test without having mastered the concepts, or vice versa.