Lecture 7 (Experimentation) Flashcards
Outline the 4 simple steps from Gravert’s TED talk on how to get started using experiments.
- Choose a question that is meaningful to you. It doesn’t have to be a big problem; it can be an everyday matter.
- Make use of the data you already collect, whether it is sales, clicks, or book donations. Start with what you already have.
- Randomize. Flip a coin or use a random number generator. As long as you have a control group, you are good.
- Test whether it worked. If it did, hurray; if it did not, now you know, and we do not have to do this again.
Without testing, we would in theory throw away a lot of money on ineffective policies.
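The "randomize" step can be sketched in a few lines of Python. This is a minimal illustration, not something from the talk: the function name `assign_groups`, the seed, and the unit IDs 1 to 20 are all hypothetical.

```python
import random

def assign_groups(units, seed=None):
    """Randomly split units into a treatment and a control group."""
    rng = random.Random(seed)      # seeded so the split can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)          # the "coin flip": a random order
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical example: 20 units identified by the numbers 1..20.
groups = assign_groups(range(1, 21), seed=7)
print(len(groups["treatment"]), len(groups["control"]))
```

Shuffling and cutting in half gives two groups of equal size; flipping a separate coin per unit would also be random but could give unequal groups.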
Define features of experiments
Independent variable: The thing you change across conditions (e.g., XYZ or
ABC), i.e. the nudge (treatment).
Dependent variable: The thing you are trying to change (e.g., employment)
Random assignment: Randomly assign units to treatments
* Random assignment helps ensure that the treatment(s) vs. control differ only in the
treatment.
* All else equal, changing the independent variable in this way changes the dependent
variable this much.
* Without random assignment, you are ultimately observing a correlation (correlation ≠
causation).
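With random assignment, "changes the dependent variable this much" can be estimated as a simple difference in means between treatment and control. The sketch below uses made-up outcome data (1 = employed, 0 = not employed) purely for illustration; the name `difference_in_means` is also hypothetical.

```python
from statistics import mean

def difference_in_means(treated_outcomes, control_outcomes):
    """Estimated treatment effect: mean outcome (treated) minus mean outcome (control)."""
    return mean(treated_outcomes) - mean(control_outcomes)

# Hypothetical data, not from any study: 1 = employed, 0 = not employed.
treated = [1, 1, 0, 1, 0, 1, 1, 0]
control = [0, 1, 0, 1, 0, 0, 1, 0]

effect = difference_in_means(treated, control)
print(effect)  # share employed in treatment minus share employed in control
```

The number is only interpretable as a causal effect because the groups were randomly assigned; computed on self-selected groups, the same difference is just a correlation.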
Why is it so important that we randomize?
- Random assignment helps ensure that the treatment(s) vs. control differ only in the treatment.
- If we did not randomize, we could give an intervention to a sample of people and then watch the outcome. Some people change their behavior. Good, let’s roll it out to the entire population tomorrow. But something other than our intervention might have caused the change. Because we did not randomize, we have no control group to show whether the same behavior would have occurred without the intervention. This could be an expensive mistake if the intervention were costly or if we did it instead of doing something else. That is why we need a randomized experiment.
- Without random assignment, you are ultimately observing a correlation (correlation ≠
causation).
But as we will see, even with a random sample there can still be things that bias the result, depending on the context and the sample.
Also if we after the intervention is done see that
So it would be naive to trust an experiment completely without discussing how random the sample is.
Why are experiments better than our intuition and gut feeling?
Example: in a randomized controlled trial (people renewing their driver’s license were given an option to sign up for organ donation), the audience is shown 8 different nudges and has to guess which one worked best at getting people to sign up for organ donation. In this case everyone chose a nudge with a social-norm picture of a group of people. Everyone was wrong: this was not the most effective nudge, and it actually performed worse than the control message.
The example leads to the conclusion that it is not always optimal to trust your intuition or gut feeling when you have to choose. Often you would be wrong, and a lot of money is spent without any good coming out of it. Instead you should experiment and test which solution is best. That way we know, on a more scientific basis, which solution works best and which did not work.
Linking this to policy making, it can seem surprising that most choices are not tested. Imagine how much money governments could save and how much more effective they could be. Compared to the medicine and tech sectors, where everything is tested all the time, there seems to be a big gap in public policy making.
Why is context king?
Because a treatment (nudge) can change behavior in one context but have no effect in another. Therefore we need to consider context when designing experiments.
What happens when you do not test?
Your intervention can cause a backlash that is bigger than the positive effect you want to nudge people toward.
Example: Google’s mini pink cupcakes. On Google Maps, they tried to get fewer people to drive, to lower CO2 emissions, by showing how many calories you would burn by walking the route instead of driving it. The calories were shown as mini cupcakes. But it upset people: some felt judged by the cupcake, and some were angry that it seemed to target women because of the pink color. Luckily for Google, they tested this in a small area before rolling it out on the whole platform. The feature was never put on Google Maps because of the many negative responses to the test. A good example of why it is important to test.
Outcomes - Exercise
* Imagine a mental health app that wants to measure its effectiveness
* They might measure “number of log-ins per day” as an outcome
* What might be problematic with that?
* What does “number of log-ins per day” tell us about effectiveness?
* Do you have a better idea for an outcome?
It’s not a good measure of mental health effectiveness that people log in a lot. It could mean something good: if people log in every day on, e.g., Headspace, they may be meditating, which is good for their mental health. But it could also just be that the app was behaving badly and people had to log in several times for it to work.
Another idea could be something in the app that tracks improvements: a daily log before people go to sleep, where they go through their day, reflect, and give a score for how they felt. But this is based on feelings, not on concrete, tested information.
In conclusion, when companies use the number of log-ins or downloads to check for an effect on, in this case, mental health, it is a bad indicator.
What types of outcomes can we measure?
Some outcomes can be hard to measure, e.g. how do we measure “women’s empowerment” or “fairness”?
So ask: is the behavior you track the same as what you are actually interested in?
Self-selection
Self-selection example: women writing on Facebook, “my husband is …”. Here nice things would most likely follow, because people on Facebook want to brag and show the good side of their lives. From this we could have concluded that marriage is a very nice thing. But typing the same line into Google, where people want guidance instead of self-promotion, we get a completely different picture, e.g. mean, addicted to porn, etc. In this context we would conclude something different.
If we ran an experiment on our Facebook followers vs. on Google, we would get very different results. Context is king.
Study by the University of Illinois about the effectiveness of employee wellness programs (self
A) Outline the research question and the design
B) Based on figure 14, what are the results? Discuss
Question a)
Research question:
Do employee wellness programs really work?
Design:
Participants are randomly assigned to two groups.
Treatment: The group that is offered the program.
- Subgroup: those who chose to participate
- Subgroup: those who didn’t
Control: The group that is not offered the program and so cannot participate.
Question b)
As far as I understand the figure, the people who chose to participate were already, from the beginning, more likely to be active, have low medical bills, etc. They are more likely to participate and will bias the result. So we could have concluded that the wellness program has a very positive effect, when that result is mainly driven by who chose to participate.
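The self-selection problem can be made concrete with a small simulation. Everything here is an assumption for illustration, not data from the Illinois study: the program is given a true effect of zero, health is a standard normal score, and healthier people are assumed to join with probability 0.8 versus 0.2 for the rest.

```python
import random
from statistics import mean

rng = random.Random(0)
TRUE_EFFECT = 0.0  # assumption: the program does nothing at all

people = []
for _ in range(20_000):
    baseline_health = rng.gauss(0, 1)      # hypothetical health score
    offered = rng.random() < 0.5           # random assignment to the offer
    # Assumption: healthier people are more likely to take up the offer.
    join_prob = 0.8 if baseline_health > 0 else 0.2
    joins = offered and rng.random() < join_prob
    outcome = baseline_health + (TRUE_EFFECT if joins else 0.0)
    people.append((offered, joins, outcome))

# Naive comparison: participants vs. non-participants (self-selected groups).
naive = (mean(o for off, j, o in people if j)
         - mean(o for off, j, o in people if off and not j))

# Experimental comparison: offered vs. not offered (randomized groups).
randomized = (mean(o for off, j, o in people if off)
              - mean(o for off, j, o in people if not off))

print(f"naive: {naive:.2f}, randomized: {randomized:.2f}")
```

Under these assumptions the naive comparison shows a large "effect" even though the program does nothing, while the randomized comparison stays close to zero.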
What is “Selection into the experiment”?
Here we look at the context and at what kind of people choose to participate in studies.
Level of randomization, what are the problems and spillover effects?
Individual level: Online experiments can use IP addresses: based on the number of the address, people are randomly put into different groups. But individual-level randomization can upset people: in the context of the wellness program, for example, the people who are not offered the program can feel left out.
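One possible way to turn an IP address into a group is sketched below. Hashing the address is my own choice for the illustration (the notes only say the number of the address is used); the function name `assign_by_ip`, the salt, and the example addresses (reserved documentation ranges) are hypothetical.

```python
import hashlib

def assign_by_ip(ip_address, salt="experiment-1"):
    """Map an IP address to 'treatment' or 'control' in a repeatable way."""
    # Hash the address together with a salt so the split looks random
    # but the same visitor always lands in the same group.
    digest = hashlib.sha256(f"{salt}:{ip_address}".encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Hypothetical visitors.
ips = ["192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.42"]
assignment = {ip: assign_by_ip(ip) for ip in ips}
print(assignment)
```

A deterministic mapping means a returning visitor sees the same condition every time; changing the salt gives a fresh split for a new experiment.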
Instead of the individual level we could randomize at team level, but this might create spillover effects from the treatment group to the control group. Example: in a class, some students get extra knowledge, but they might choose to share it with other students who are in the control group. Then we would have to randomize at a higher level, maybe across different year groups or even different schools, to stop spillover effects.
We could then choose even bigger groups to avoid spillover effects, e.g. different schools or firms, but that gives us few independent observations relative to the individual case. Imagine we have 10 firms with 100 employees each. In the individual case we have 1000 “independent obs”, in the case of smaller groups we might get 250 “independent obs”, and in the case of even bigger groups, with the entire firm as one unit, we get only 10 “independent obs”. That creates a problem when we want to do statistics on the data.
There is no right or wrong here, but a trade-off between spillover effects and statistical power. The main takeaway is to ask: do people talk to each other, and can I do something to get individual observations instead of classes, e.g. using Absalon to target one individual instead of the entire class to avoid spillovers?
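The arithmetic behind the 10-firm example can be written out as a short sketch. The team size of 4 is an assumption chosen so that the "smaller groups" case gives the 250 mentioned above; the function name is hypothetical.

```python
def independent_observations(n_firms, employees_per_firm, cluster_size):
    """Number of randomization units when people are grouped into clusters."""
    total_people = n_firms * employees_per_firm
    return total_people // cluster_size

# 10 firms with 100 employees each, as in the example above.
individual = independent_observations(10, 100, cluster_size=1)    # each person is a unit
teams = independent_observations(10, 100, cluster_size=4)         # assumed teams of 4
firms = independent_observations(10, 100, cluster_size=100)       # each firm is a unit

print(individual, teams, firms)
```

The same 1000 people give fewer and fewer units to randomize as the clusters grow, which is the statistical-power side of the trade-off.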
Access to treatment?
You should think about whether access to your treatment is fair, or whether the way the treatment is accessed keeps some people from participating.
Treatment difference. Are you testing a theory or are you testing a policy?
Overall, the smaller the treatment difference, the closer you are to identifying a true mechanism. A small treatment difference means zooming in on specific mechanisms, specific nudges.
If you work in the private sector or in policy, you might just want to test how well the intervention performs, without looking into how its smaller parts work (mechanisms).
If you are in a research setting, like we are, you want to test different mechanisms (how different nudges work to change biases) in different situations to see how well they work. The more we know about mechanisms, the easier it is to transfer knowledge.
Case Study: Get out the Vote
Do Phone Calls to Encourage Voting Work? Why Randomize? by Imai K. (2005).
a) What is the research question and please outline the design used.
b) What are the results
Question a)
Research Question:
The primary research question of the article is: “Do phone calls encouraging voting increase voter turnout?”
The second research question concerns the effect of randomization: does it make sense to randomize?
This study aims to determine whether the Vote 2002 Campaign’s get-out-the-vote (GOTV) initiative, which involved making phone calls to potential voters before the 2002 U.S. congressional elections, had a measurable impact on voter turnout.
Research Design:
The study employs a randomized experimental design to assess the impact of phone calls on voter turnout. The key elements of the design are:
Randomization:
- The study randomly selected 60,000 individuals from a larger population of approximately 2 million eligible voters.
These individuals were assigned to the treatment group (those who received phone calls) and the control group (those who did not receive phone calls). The randomization ensures that any observed differences in voting behavior between these groups can be attributed to the intervention (phone calls) rather than pre-existing differences.
Treatment Group:
Volunteers made phone calls to encourage voting.
Among the 60,000 individuals called, only 25,000 answered the phone.
Control Group:
The comparison group consisted of randomly selected individuals who were not called.
Outcome Measurement:
Official voter records were used to determine whether each individual actually voted in the election.
Different statistical methods were employed to analyze the impact of the phone calls on voter turnout.
Comparison of Methods:
The study compares several evaluation methods, to be able to compare randomization against no randomization.
Question b)
Results:
Key Findings:
The randomized evaluation method produced the most reliable estimate, showing that phone calls had a small but statistically significant impact on voter turnout.
The estimated impact using different methods varied significantly, with some methods overestimating the effectiveness of the phone calls due to selection bias.
The most reliable estimate (adjusted for the fact that only 25,000 of 60,000 were actually reached) suggested an impact of only 0.4 percentage points.
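To see what "adjusting" for the share actually reached can mean, here is a sketch of one standard adjustment: divide the effect of being assigned to a call by the contact rate. This is an assumption about the kind of adjustment involved, not a confirmed description of the article's method, and the assignment effect of 1.0 percentage point is a made-up input, not a result from the study. Only the 25,000 and 60,000 come from the notes above.

```python
def contact_rate(reached, assigned):
    """Share of the people assigned to a call who actually answered."""
    return reached / assigned

def effect_on_reached(effect_of_assignment, rate):
    """Scale the effect of being assigned a call up to those actually reached.

    Assumes calls can only affect people who answered and that nobody
    in the control group was called.
    """
    return effect_of_assignment / rate

rate = contact_rate(25_000, 60_000)   # numbers from the notes above
hypothetical_assignment_effect = 1.0  # percentage points, made up for illustration
adjusted = effect_on_reached(hypothetical_assignment_effect, rate)

print(f"contact rate: {rate:.3f}, effect on those reached: {adjusted:.2f} pp")
```

Because well under half of the people were reached, the effect among those reached is larger than the effect of merely being on the call list, so it matters which of the two an estimate refers to.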
Conclusion:
The study highlights the importance of using randomized experiments in policy evaluation to avoid biases and incorrect inferences. While phone calls had some effect on voter turnout, the impact was much smaller than initially reported by non-randomized methods.
*MILKMAN, Katherine L., et al. Megastudies improve the impact of applied behavioural science
a) Define the research question and outline the design
b) Describe the results from figure 15
Question a)
Research question:
To accelerate the pace of discovery, the authors introduce the megastudy: a massive field experiment in which the effects of many different interventions are compared in the same population on the same objectively measured outcome for the same duration. Here, the question is how to increase gym attendance.
Design:
Because it is a megastudy, the different scientists who participated used different interventions and designs, but these are compared in the same population on the same objectively measured outcome for the same duration of time.
Question b)
Results:
In the figure, the green lines show what the scientists predicted and the red lines show the real results. So the predictions were far off.
Most of the interventions have no effect: they sit at 0.0, the level of the placebo control. There is no clear pattern in which interventions do well and which do badly.
So even experts in behavioral science can be extremely wrong when guessing what works. That is why it is so important that we test our interventions on a randomized sample before communicating them further.