A/B Testing Flashcards
What is the goal of A/B testing?
To determine which version of a product performs better with users: to objectively compare aspects of product choices so we can make data-driven decisions.
What are the different types of A/B tests?
A/B test; A/B/N test, where N stands for the number of versions being tested (used by Indeed), which is best for major layout or design decisions; and multivariate testing.
What are some typical metrics Indeed would test?
Click-through rate, applications per week, job listings per week, bounce rate.
How long should you run an AB test?
Given by Duration (weeks) = (Nv · Nu ) / (p · M/4 )
Nv = number of variants
Nu = number of users needed per variant (from a sample-size calculation or lookup table)
p = fraction of users in this test (e.g., if this test runs on 5% of users, p = 0.05)
M = MAU (monthly active users)
Looking at the above formula: as the fraction of users in the test and the number of monthly active users go up, the shorter the duration you need to run your test. Conversely, as the number of variants and the number of users needed per variant increase, the longer you need to run your test.
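A minimal sketch of this formula in Python (the example numbers are made up; Nu would come from your own sample-size calculation):

```python
def ab_test_duration_weeks(n_variants, users_per_variant, traffic_fraction, mau):
    """Duration (weeks) = (Nv * Nu) / (p * M/4); M/4 approximates weekly active users."""
    weekly_eligible_users = traffic_fraction * (mau / 4)
    return (n_variants * users_per_variant) / weekly_eligible_users

# Example: 2 variants, 10,000 users needed per variant, test running on 5% of 1,000,000 MAU.
print(ab_test_duration_weeks(2, 10_000, 0.05, 1_000_000))  # 1.6 weeks
```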
What is a customer funnel?
Users flow down the funnel, with more users at the top: homepage visits -> exploring the site -> creating an account -> submitting an application.
When can we use binomial distribution?
Two possible outcomes per trial (e.g., clicked or didn't click)
Independent events (the outcome of one coin flip doesn't affect another)
Identically distributed: p is the same for all events
The usefulness of the binomial distribution is that we can then estimate the standard error of the observed data and calculate confidence intervals.
If the sample is large enough, we can use the normal approximation to the binomial. A good rule of thumb is N·p(hat) > 5, where N is the number of samples and p(hat) is the estimated probability (# of users who clicked / # of users).
When do we know we can use the normal distribution approximation?
Check that N·p(hat) > 5.
Equation for margin of error?
Margin of error = Z-score * sqrt(p(hat) * (1 − p(hat)) / N). The more samples, the smaller the margin; and the closer p(hat) is to 0.5 (i.e., the more uncertain which class an observation falls in), the bigger the margin. Example at 99% confidence with p(hat) = 0.15 and N = 2000: margin of error = 2.58 * sqrt(0.15 * 0.85 / 2000) ≈ 0.021.
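A quick sketch reproducing the example above (Z ≈ 2.58 for 99% confidence, p(hat) = 0.15, N = 2000), assuming scipy is available for the z-score:

```python
import math
from scipy import stats

def margin_of_error(p_hat, n, confidence=0.99):
    """Margin of error = z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ~2.58 for 99% confidence
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(margin_of_error(0.15, 2000))  # ~0.021
```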
When to use two tailed vs one tailed test?
The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows you to distinguish between three cases:
A statistically significant positive result
A statistically significant negative result
No statistically significant difference.
Sometimes when people run A/B tests, they will use a one-tailed test, which only allows you to distinguish between two cases:
A statistically significant positive result
No statistically significant result
Which one you should use depends on what action you will take based on the results. If you’re going to launch the experiment for a statistically significant positive change, and otherwise not, then you don’t need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.
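A minimal sketch of the difference, assuming a z-statistic has already been computed from the experiment (the value 2.1 is made up):

```python
from scipy import stats

z = 2.1  # hypothetical z-statistic from a two-proportion test

p_one_tailed = stats.norm.sf(z)           # P(Z > z): only detects a positive effect
p_two_tailed = 2 * stats.norm.sf(abs(z))  # detects an effect in either direction

print(p_one_tailed, p_two_tailed)  # ~0.018 vs ~0.036
```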
How do you assess the statistical significance of an insight?
You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
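A small sketch of those three steps on simulated metric data (the means, spread, and sample sizes are made up), using a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=500)  # simulated control metric
variant = rng.normal(loc=10.3, scale=2.0, size=500)  # simulated variant metric

# H0: the two means are equal; H1: they differ.
t_stat, p_value = stats.ttest_ind(control, variant)

alpha = 0.05
print(p_value, p_value < alpha)  # reject H0 if p-value < alpha
```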
Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.
3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
It's important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values collectively make up the majority of the population. This can change the way you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.
What is the Central Limit Theorem? Explain it. Why is it important?
“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.” [1]
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
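A simulation sketch of the theorem: means of samples drawn from a heavily skewed (exponential) population still cluster tightly and symmetrically around the population mean (all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # heavily skewed population

# Sampling distribution of the mean for samples of size 50.
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2_000)])

# The sample means concentrate around the population mean (~2.0),
# with spread roughly sigma / sqrt(n).
print(population.mean(), sample_means.mean(), sample_means.std())
```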
What is the statistical power?
'Statistical power' refers to the power of a binary hypothesis test: the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. [2]
Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.
Types of selection bias include:
sampling bias: a biased sample caused by non-random sampling
time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
exposure: includes clinical susceptibility bias, protopathic bias, and indication bias.
data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
attrition: attrition bias is similar to survivorship bias, where only those that 'survived' a long process are included in an analysis, or failure bias, where only those that 'failed' are included.
observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. [3]
Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias by assuming the data is not as spread out as it might actually be.
Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?
Observational data comes from observational studies which are when you observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
Is mean imputation of missing data acceptable practice? Why or why not?
Mean imputation is the practice of replacing null values in a data set with the mean of the data.
Mean imputation is generally bad practice because it doesn't take into account feature correlation. For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score over an age range of 15 to 80, then the eighty-year-old would appear to have a much higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.
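A small sketch of the variance-reduction problem with made-up fitness scores:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(loc=70, scale=15, size=1_000)  # "true" fitness scores

# Knock out 30% of the values, then mean-impute them.
mask = rng.random(scores.size) < 0.3
observed = scores.copy()
observed[mask] = np.nan
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

# The imputed column has a noticeably smaller standard deviation than the true data.
print(scores.std(), imputed.std())
```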
What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on their cause, outliers can be bad from a machine learning perspective because they can worsen the accuracy of a model. If an outlier is caused by a measurement error, it's important to remove it from the dataset. There are a couple of ways to identify outliers:
Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed data set lies within three standard deviations, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if its magnitude reaches +/- 3, then it's an outlier.
Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it is not applicable to small data sets, and the presence of too many outliers can throw off the z-scores.
Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 − 1.5*IQR or greater than Q3 + 1.5*IQR; for normal data these fences sit at approximately 2.698 standard deviations from the mean.
Other methods include DBSCAN clustering, Isolation Forests, and Robust Random Cut Forests.
An inlier is a data observation that lies within the rest of the dataset but is nonetheless unusual or an error. Because it sits inside the bulk of the data, it is typically harder to identify than an outlier and often requires external data to detect. Should you identify any inliers, you can simply remove them from the dataset.
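A sketch of the z-score and IQR rules described above on toy data (one injected extreme value):

```python
import numpy as np

rng = np.random.default_rng(7)
data = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # one extreme point

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules flag the injected point (120.0)
```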
How can you test if your data is normally distributed?
You can perform the Shapiro-Wilk test, e.g. with scipy.stats.shapiro(x).
The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
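A minimal usage sketch on simulated data (sample sizes and distributions are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# shapiro returns (statistic, p-value); a small p-value (< 0.05) is evidence
# against the null hypothesis that the data came from a normal distribution.
print(stats.shapiro(normal_sample))
print(stats.shapiro(skewed_sample))
```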
How do you handle missing data? What imputation techniques do you recommend?
There are several ways to handle missing data:
Delete rows with missing data
Mean/Median/Mode imputation
Assigning a unique value
Predicting the missing values
Using an algorithm which supports missing values, like random forests
The best method is to delete rows with missing data as it ensures that no bias or variance is added or removed, and ultimately results in a robust and accurate model. However, this is only recommended if there’s a lot of data to start with and the percentage of missing values is low.
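A sketch of a few of these approaches with pandas on a toy DataFrame (columns and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47, 51, np.nan, 38],
                   "score": [88, np.nan, 72, 65, 70, np.nan]})

dropped = df.dropna()                # delete rows with missing data
mean_imputed = df.fillna(df.mean())  # mean imputation, column by column
flagged = df.fillna(-1)              # assign a unique sentinel value

print(dropped.shape, mean_imputed.isna().sum().sum(), flagged.min().min())
```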
You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
First I would conduct EDA — Exploratory Data Analysis to clean, explore, and understand my data. See my article on EDA here. As part of my EDA, I could compose a histogram of the duration of calls to see the underlying distribution.
My guess is that the duration of calls would follow a lognormal distribution. The reason I believe it's positively skewed is that the lower end is bounded at 0, since a call can't last a negative number of seconds, while on the upper end there is likely to be a small proportion of calls that are extremely long relative to the rest.
Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?
Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.
Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming that the organization associated with the administrative dataset is active and functioning. At the same time, administrative datasets may not capture all of the data that one may want, and may not be in the desired format either. They are also prone to quality issues and missing entries. Experimental methods help alleviate these problems because the researcher controls randomization and what is measured, but they bring problems of their own, notably higher cost, smaller samples, and results that may not generalize beyond the experimental setting.
You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
You can tell that this question is related to Bayes' theorem because of the last statement, which essentially follows the structure, "What is the probability A is true given B is true?" Therefore we need to know the prior probability of it raining in Seattle on a given day. Let's assume it's 25%.
P(A) = probability of it raining = 25%
P(B) = probability of all 3 friends say that it’s raining
P(A|B) = probability that it's raining given all 3 friends say that it's raining
P(B|A) probability that all 3 friends say that it’s raining given it’s raining = (2/3)³ = 8/27
Step 1: Solve for P(B)
P(A|B) = P(B|A) * P(A) / P(B), can be rewritten as
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25 * (8/27) + 0.75 * (1/27)
Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / (0.25 * (8/27) + 0.75 * (1/27))
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
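A quick numeric check of the calculation, keeping the assumed 25% prior:

```python
p_rain = 0.25    # assumed prior probability of rain in Seattle
p_truth = 2 / 3  # each friend tells the truth with probability 2/3

p_all_yes_given_rain = p_truth ** 3        # (2/3)^3 = 8/27
p_all_yes_given_dry = (1 - p_truth) ** 3   # (1/3)^3 = 1/27

p_all_yes = p_all_yes_given_rain * p_rain + p_all_yes_given_dry * (1 - p_rain)
print(p_all_yes_given_rain * p_rain / p_all_yes)  # 8/11 ~ 0.727
```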
One box has 12 black and 12 red cards; a second box has 24 black and 24 red. If you draw 2 cards at random from one of the two boxes, which box has the higher probability of giving two cards of the same color? Can you tell intuitively why the 2nd box has a higher probability?
The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.
Let's say the first card you draw from each deck is red.
This means that in the deck with 12 reds and 12 blacks, there’s now 11 reds and 12 blacks. Therefore your odds of drawing another red are equal to 11/(11+12) or 11/23.
In the deck with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore your odds of drawing another red are equal to 23/(23+24) or 23/47.
Since 23/47 > 11/23, the second deck, with more cards, has a higher probability of giving two cards of the same color.
What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.
Robustness: generally robustness refers to a system’s ability to handle variability and remain effective.
Model fitting: refers to how well a model fits a set of observations.
Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variable. [4] In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).
80/20 rule: also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.
Give examples of data that does not have a Gaussian distribution, nor log-normal.
Any type of categorical data won’t have a gaussian distribution or lognormal distribution.
Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.
What is root cause analysis? How to identify a cause vs. a correlation? Give examples
Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem [5]
Correlation measures the relationship between two variables and ranges from -1 to 1. Causation is when a first event causes a second event. Causation essentially looks at direct relationships, while correlation can look at both direct and indirect relationships.
Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.
You can test for causation using hypothesis testing or A/B testing.
Give an example where the median is a better measure than the mean
When there are a number of outliers that positively or negatively skew the data.
What is the Law of Large Numbers?
The Law of Large Numbers states that as the number of trials increases, the average of the results gets closer to the expected value.
Eg. the proportion of heads from flipping a fair coin 100,000 times should be closer to 0.5 than the proportion from flipping it 100 times.
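A coin-flip simulation sketch of the example above (the seed and flip counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

for n_flips in (100, 100_000):
    flips = rng.integers(0, 2, size=n_flips)  # fair coin: 0 = tails, 1 = heads
    # The proportion of heads tends toward 0.5 as the number of flips grows.
    print(n_flips, flips.mean())
```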
How do you calculate the needed sample size?
You can use the margin of error (ME) formula to determine the desired sample size: solving ME = (t/z) · S / sqrt(n) for n gives n = ((t/z) · S / ME)².
t/z = t- or z-score used to calculate the confidence interval
ME = the desired margin of error
S = sample standard deviation
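A minimal sketch that solves ME = z·S/sqrt(n) for n, assuming a z-based interval and made-up inputs:

```python
from scipy import stats

def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Solve ME = z * S / sqrt(n) for n, i.e. n = (z * S / ME)^2."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return (z * std_dev / margin_of_error) ** 2

# Example: S = 10, desired ME = 2, 95% confidence -> ~96, so round up to 97.
print(required_sample_size(10, 2))
```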
When you sample, what bias are you inflicting?
Potential biases include the following:
Sampling bias: a biased sample caused by non-random sampling
Undercoverage bias: when some members of the population are inadequately represented in the sample
Survivorship bias: error of overlooking observations that did not make it past a form of selection process.
How do you control for biases?
There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.
What are confounding variables?
A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.
What is A/B testing?
A/B testing is a form of two-sample hypothesis testing used to compare two versions, the control and the variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.
Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.
Since we are looking at the number of events (# of infections) occurring within a given amount of time, this is a Poisson distribution question.
The probability of observing k events in an interval is given by the Poisson PMF: P(X = k) = λ^k · e^(−λ) / k!
Null (H0): the rate is 1 infection per 100 person-days
Alternative (H1): the rate is less than 1 infection per 100 person-days (the hospital is below the standard)
k (actual) = 10 infections
lambda (theoretical) = (1/100)*1787
p = 0.032372, or 3.2372%, calculated using POISSON.DIST() in Excel or ppois() in R
Since p-value < alpha (assuming 5% level of significance), we reject the null and conclude that the hospital is below the standard.
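The same p-value can be reproduced with scipy's Poisson CDF:

```python
from scipy import stats

lam = (1 / 100) * 1787  # expected infections under the 1-per-100 standard (17.87)
k = 10                  # observed infections

# One-sided p-value: P(X <= 10) under H0, i.e. the Poisson CDF at k.
print(stats.poisson.cdf(k, lam))  # ~0.032
```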
How to determine if an idea is worth testing?
- Conduct quantitative analysis using historical data to estimate the opportunity size of each idea.
Duration of an AB test?
A rule of thumb for the required sample size per group is 16 * (sample variance) / delta², where delta is the difference between treatment and control.
For delta, we use the minimum detectable effect (determined by business partners)
Sample variance can be determined from existing data.
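A minimal sketch of this rule of thumb, assuming a conversion-style metric (the numbers are made up):

```python
def samples_per_group(sample_variance, min_detectable_effect):
    """Rule of thumb: n per group ~ 16 * variance / delta^2."""
    return 16 * sample_variance / min_detectable_effect ** 2

# Example: variance 0.25 (a ~50% conversion rate), minimum detectable effect of
# 2 percentage points -> 10,000 users per group.
print(samples_per_group(0.25, 0.02))
```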
How to deal with interference between control and treatment groups?
For two-sided markets:
- Geo-based randomization: split by geo-location (e.g., NY to the control group, SF to the treatment group). This has its own challenges, such as higher variance, because each market is unique and has its own local competitors.
- Time-based randomization: select random time windows and assign all users to either the control or the treatment group. This works best if the treatment effect only lasts a short amount of time.