A/B Testing Flashcards

1
Q

What is the goal of A/B testing?

A

To determine which version of a product performs better with users: to objectively compare aspects of product choices so we can make data-driven decisions.

2
Q

What are the different types of A/B tests?

A

A/B test; A/B/N test, where N stands for the number of versions being tested (used by Indeed), which is best for major layout or design decisions; and lastly multivariate testing.

3
Q

What are some typical metrics Indeed would test?

A

Click-through rate, applications per week, job listings per week, bounce rate.

4
Q

How long should you run an AB test?

A

Duration (weeks) = (Nv · Nu) / (p · M/4)
Nv = number of variants
Nu = number of users needed per variant (from the above table)
p = fraction of users in this test (e.g., if this test runs on 5% of users, p = 0.05)
M = MAU (monthly active users); M/4 approximates weekly active users

Looking at the above formula, as the fraction of users in the test and the number of monthly active users go up, the shorter the test needs to run. Conversely, as the number of variants and the number of users per variant increase, the longer you need to run your test.

5
Q

What is a customer funnel?

A

Users flow down the funnel, with more users at the top: homepage visits -> exploring the site -> creating an account -> submitting an application.

6
Q

When can we use binomial distribution?

A

Two types of outcomes (success/failure)
Independent events (the outcome of one coin flip doesn't affect another)
Identical distribution: p is the same for all events
The usefulness of the binomial distribution is that we can then estimate the standard error of the observed data and calculate confidence intervals.
If the sample is large enough, we can use the normal approximation. A good rule of thumb is N·p(hat) > 5, where N is the number of samples and p(hat) is the calculated probability (# of users who clicked / # of users).

7
Q

When do we know we can use normal distribution estimate?

A

check N*p(hat) > 5

8
Q

Equation for margin of error?

A
Margin of error = Z-score * sqrt(p(hat) * (1 - p(hat)) / N). The more samples, the smaller the margin; and the more uncertain the probability (closer to 0.5), the bigger the range, since the observed value is more likely to fall in either class.
Example: margin of error = 2.58 * sqrt(0.15 * 0.85 / 2000)
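A minimal sketch of the calculation, using the example numbers above (z = 2.58 for 99% confidence, p(hat) = 0.15, N = 2000):

```python
import math

def margin_of_error(z, p_hat, n):
    # MoE = z * sqrt(p_hat * (1 - p_hat) / n)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

moe = margin_of_error(2.58, 0.15, 2000)
print(round(moe, 4))  # 0.0206
```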
9
Q

When to use two tailed vs one tailed test?

A

The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows you to distinguish between three cases:
A statistically significant positive result
A statistically significant negative result
No statistically significant difference.
Sometimes when people run A/B tests, they will use a one-tailed test, which only allows you to distinguish between two cases:
A statistically significant positive result
No statistically significant result
Which one you should use depends on what action you will take based on the results. If you’re going to launch the experiment for a statistically significant positive change, and otherwise not, then you don’t need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.

10
Q

How do you assess the statistical significance of an insight?

A

You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.

11
Q

Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

A

A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.
3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
It’s important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can ultimately change the way that you deal with outliers, and it also conflicts with some machine learning techniques that assume the data is normally distributed.

12
Q

What is the Central Limit Theorem? Explain it. Why is it important?

A

“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.” [1]
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.

13
Q

What is the statistical power?

A

‘Statistical power’ refers to the power of a binary hypothesis test: the probability that the test rejects the null hypothesis given that the alternative hypothesis is true. [2]

14
Q

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

A

Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.
Types of selection bias include:
sampling bias: a biased sample caused by non-random sampling
time interval: selecting a specific time frame that supports the desired conclusion, e.g. conducting a sales analysis near Christmas.
exposure: includes clinical susceptibility bias, protopathic bias, and indication bias.
data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
attrition: attrition bias is similar to survivorship bias, where only those that ‘survived’ a long process are included in an analysis, or failure bias, where only those that ‘failed’ are included.
observer selection: related to the Anthropic principle, a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. [3]
Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias in the sense that you’re assuming the data is not as spread out as it might actually be.

15
Q

Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?

A

Observational data comes from observational studies which are when you observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.

16
Q

Is mean imputation of missing data acceptable practice? Why or why not?

A

Mean imputation is the practice of replacing null values in a data set with the mean of the data.
Mean imputation is generally bad practice because it doesn’t take feature correlation into account. For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old will appear to have a much higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.

17
Q

What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.

A

An outlier is a data point that differs significantly from other observations.
Depending on the cause of the outlier, they can be bad from a machine learning perspective because they can worsen the accuracy of a model. If the outlier is caused by a measurement error, it’s important to remove them from the dataset. There are a couple of ways to identify outliers:
Z-score/standard deviations: if we know that 99.7% of data in a normally distributed data set lies within three standard deviations of the mean, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if it’s beyond +/- 3, it’s an outlier.
Note: there are a few contingencies to consider when using this method; the data must be normally distributed, it is not applicable for small data sets, and the presence of too many outliers can throw off the z-scores.

Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR. This comes to approximately 2.698 standard deviations.
Other methods include DBScan clustering, Isolation Forests, and Robust Random Cut Forests.
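The two screens above can be sketched without any libraries; the sample data and the simple index-based quartile estimate are illustrative (real implementations usually use numpy or pandas quantiles):

```python
def zscore_outliers(data):
    """Flag points more than 3 standard deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return [x for x in data if abs(x - mean) > 3 * std]

def iqr_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = sorted(data)
    # crude quartile estimate via indices (other definitions exist)
    q1 = s[len(s) // 4]
    q3 = s[(3 * len(s)) // 4]
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 95]
print(iqr_outliers(data))  # [95]
```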

An inlier is an erroneous data observation that nonetheless lies within the general distribution of the dataset. Because it blends in, it is typically harder to identify than an outlier and often requires external data to detect. Should you identify any inliers, you can simply remove them from the dataset.

18
Q

How can you test if your data is normally distributed?

A

You can perform the Shapiro-Wilk test:

scipy.stats.shapiro(x)
Perform the Shapiro-Wilk test for normality.

The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

19
Q

How do you handle missing data? What imputation techniques do you recommend?

A

There are several ways to handle missing data:
Delete rows with missing data
Mean/Median/Mode imputation
Assigning a unique value
Predicting the missing values
Using an algorithm which supports missing values, like random forests
The best method is to delete rows with missing data as it ensures that no bias or variance is added or removed, and ultimately results in a robust and accurate model. However, this is only recommended if there’s a lot of data to start with and the percentage of missing values is low.

20
Q

You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

A

First I would conduct EDA (Exploratory Data Analysis) to clean, explore, and understand my data. As part of my EDA, I could compose a histogram of the duration of calls to see the underlying distribution.
My guess is that the duration of calls would follow a lognormal distribution. The reason I believe it’s positively skewed is that the lower end is limited to 0, since a call can’t last negative seconds, while on the upper end a small proportion of calls is likely to be extremely long relative to the rest.

21
Q

Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?

A

Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.
Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming the organization associated with the administrative dataset is active and functioning. At the same time, administrative datasets may not capture all of the data one may want, and may not be in the desired format either. They are also prone to quality issues and missing entries.

22
Q

You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?

A

You can tell that this question is related to Bayes’ theorem because of the last statement, which follows the structure, “What is the probability A is true given B is true?” Therefore we need the prior probability of it raining in Seattle on a given day. Let’s assume it’s 25%.
P(A) = probability of it raining = 25%
P(B) = probability of all 3 friends saying that it’s raining
P(A|B) = probability that it’s raining given all 3 friends say it’s raining
P(B|A) = probability that all 3 friends say it’s raining given it’s raining = (2/3)³ = 8/27
Step 1: Solve for P(B)
P(A|B) = P(B|A) * P(A) / P(B), where
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25 * 8/27 + 0.75 * 1/27
Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / (0.25 * 8/27 + 0.75 * 1/27)
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
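The calculation can be checked with exact fractions; the 25% prior is the assumption made above:

```python
from fractions import Fraction

p_rain = Fraction(1, 4)            # assumed prior that it's raining
p_truth = Fraction(2, 3)           # each friend tells the truth with prob 2/3

p_yes_given_rain = p_truth ** 3          # all three say "yes" truthfully
p_yes_given_dry = (1 - p_truth) ** 3     # all three lie
p_yes = p_yes_given_rain * p_rain + p_yes_given_dry * (1 - p_rain)

posterior = p_yes_given_rain * p_rain / p_yes   # Bayes' theorem
print(posterior)  # 8/11
```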

23
Q

One box has 12 black and 12 red cards; a 2nd box has 24 black and 24 red. If you draw 2 cards at random from one of the 2 boxes, which box has the higher probability of getting the same color? Can you tell intuitively why the 2nd box has a higher probability?

A

The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.
Let’s say the first card you draw from each deck is a red Ace.
This means that in the deck with 12 reds and 12 blacks, there’s now 11 reds and 12 blacks. Therefore your odds of drawing another red are equal to 11/(11+12) or 11/23.
In the deck with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore your odds of drawing another red are equal to 23/(23+24) or 23/47.
Since 23/47 > 11/23, the second box, with more cards, has a higher probability of drawing two cards of the same color.
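A quick sketch of the same reasoning: whichever color comes first, the chance the second card matches depends only on the remaining cards:

```python
from fractions import Fraction

def p_same_color(per_color):
    """P(two random draws are the same color) from a box with
    `per_color` red and `per_color` black cards, drawn without replacement."""
    total = 2 * per_color
    # Whatever the first card is, per_color - 1 matching cards remain
    # out of total - 1 cards.
    return Fraction(per_color - 1, total - 1)

print(p_same_color(12))  # 11/23
print(p_same_color(24))  # 23/47
```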

24
Q

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

A

Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.
Robustness: generally robustness refers to a system’s ability to handle variability and remain effective.
Model fitting: refers to how well a model fits a set of observations.
Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variable. [4] In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).
80/20 rule: also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.

25
Q

Give examples of data that does not have a Gaussian distribution, nor log-normal.

A

Any type of categorical data won’t have a gaussian distribution or lognormal distribution.
Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.

26
Q

What is root cause analysis? How to identify a cause vs. a correlation? Give examples

A

Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem [5]
Correlation measures the relationship between two variables, ranging from -1 to 1. Causation is when a first event appears to have caused a second event. Causation essentially looks at direct relationships, while correlation can look at both direct and indirect relationships.
Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.
You can test for causation using hypothesis testing or A/B testing.

27
Q

Give an example where the median is a better measure than the mean

A

When there are a number of outliers that positively or negatively skew the data.

28
Q

What is the Law of Large Numbers?

A

The Law of Large Numbers states that as the number of trials increases, the average of the results approaches the expected value.
E.g. the proportion of heads in 100,000 flips of a fair coin should be closer to 0.5 than in 100 flips.

29
Q

How do you calculate the needed sample size?

A

You can rearrange the margin of error (ME) formula to determine the desired sample size: n = (z * S / ME)², where
t/z = t/z-score used to calculate the confidence interval
ME = the desired margin of error
S = sample standard deviation
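Rearranged for n, the margin-of-error formula gives n = (z * S / ME)²; the numbers below are hypothetical:

```python
import math

def needed_sample_size(z, s, me):
    # n = (z * S / ME)^2, rounded up to a whole number of subjects
    return math.ceil((z * s / me) ** 2)

# Hypothetical: 95% confidence (z = 1.96), sample std dev 10, margin of error 2
print(needed_sample_size(1.96, 10, 2))  # 97
```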

30
Q

When you sample, what bias are you inflicting?

A

Potential biases include the following:
Sampling bias: a biased sample caused by non-random sampling
Under coverage bias: sampling too few observations
Survivorship bias: error of overlooking observations that did not make it past a form of selection process.

31
Q

How do you control for biases?

A

There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.

32
Q

What are confounding variables?

A

A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.

33
Q

What is A/B testing?

A

A/B testing is a form of two-sample hypothesis testing used to compare two versions of a single variable, the control and the variant. It is commonly used to improve and optimize user experience and marketing.

34
Q

Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.

A

Since we are looking at the number of events (# of infections) occurring within a given timeframe, this is a Poisson distribution question.

The Poisson pmf gives the probability of observing k events in an interval: P(k) = λ^k * e^(-λ) / k!
Null (H0): 1 infection per 100 person-days
Alternative (H1): fewer than 1 infection per 100 person-days
k (observed) = 10 infections
lambda (expected under the null) = (1/100) * 1787 = 17.87
p = P(X <= 10) = 0.032372, or 3.2372%, calculated using POISSON.DIST in Excel or ppois in R
Since the p-value < alpha (assuming a 5% level of significance), we reject the null and conclude that the hospital is below the standard.
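The p-value can also be reproduced by summing the Poisson pmf directly, without Excel or R:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed directly from the pmf."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))

lam = (1 / 100) * 1787        # expected infections under the null rate
p_value = poisson_cdf(10, lam)   # P(10 or fewer infections)
print(round(p_value, 4))  # 0.0324
```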

35
Q

How to determine if an idea is worth testing?

A
  • conduct quantitative analysis using historical data to obtain the opportunity sizing of each idea.
36
Q

Duration of an AB test?

A

A rule of thumb for the required sample size per group is: n = 16 * (sample variance) / delta², where delta is the difference between treatment and control.

For delta, we use the minimum detectable effect (determined by business partners).

The sample variance can be determined from existing data.
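A minimal sketch of the rule of thumb; the variance and minimum detectable effect below are hypothetical, and the constant 16 corresponds roughly to 5% significance with 80% power:

```python
def sample_size_per_group(variance, mde):
    """Rule of thumb: n per group ~= 16 * sigma^2 / delta^2."""
    return 16 * variance / mde ** 2

# Hypothetical: metric variance 0.25 (a proportion near 0.5),
# minimum detectable effect of 0.02
print(round(sample_size_per_group(0.25, 0.02)))  # 10000
```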

37
Q

How to deal with interference between control and treatment groups?

A

For two-sided markets,

  1. Geo-based randomization: split by geo-location, e.g. NYC to the control group and SF to the treatment group. This has its own challenges: variance increases because each market is unique, has local competitors, etc.
  2. Time-based randomization: select a random time and assign all users to either the control or treatment group. Works best if the treatment effect only lasts a short amount of time.
38
Q

What is interference b/w control and treatment group?

A

Normally, we split the control and treatment groups by randomly assigning each user to one of the two groups. But sometimes this independence assumption doesn’t hold -> for social networks where you have network effects, or two-sided markets such as Uber, Lyft, or Airbnb.

For example, for Uber, if a new product attracts more drivers to the treatment group, fewer drivers will be available to the control group.

39
Q

What are novelty and primacy effects?

A

When there’s a product change, people react to it differently. People who are used to how the product works may be reluctant to change - this is called the primacy effect. People who welcome change and are attracted by a new feature exhibit the novelty effect. After some time, these effects stabilize.

40
Q

We are launching a new feature that provides coupons to our riders. The goal is to increase the number of rides by decreasing the price for each ride. Outline a testing strategy to evaluate the effect of the new feature.

A

Isolate users to deal with interference between the control and treatment groups (e.g. via geo-based or time-based randomization).

41
Q

We ran an A/B test on a new feature and the test won, so we launched the change to all users. However, after launching the feature for a week, we found that the treatment effect quickly declined. What is happening?

A

The answer is the novelty effect. Over time, as the novelty wears off, repeat usage will be decreased so we observe a declining treatment effect.

42
Q

How do we address issues from novelty effect?

A

One way to deal with such effects is to completely rule out the possibility of those effects. We could run tests only on first-time users because the novelty effect and primacy effect obviously doesn’t affect such users. If we already have a test running and we want to analyze if there’s a novelty or primacy effect, we could 1) compare new users’ results in the control group to those in the treatment group to evaluate novelty effect 2) compare first-time users’ results with existing users’ results in the treatment group to get an actual estimate of the impact of the novelty or primacy effect.

43
Q

If we have 3 treatment groups to compare with the control group (3 variations), what is the chance of observing at least 1 false positive (assume our significance level is 0.05)?

A

We can first get the probability that there are no false positives (assuming the tests are independent):
Pr(FP = 0) = 0.95 * 0.95 * 0.95 = 0.857
then obtain the probability that there’s at least 1 false positive:
Pr(FP >= 1) = 1 - Pr(FP = 0) = 0.143
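The same arithmetic in code:

```python
alpha = 0.05
n_tests = 3  # three treatment groups, each compared to the control

p_no_fp = (1 - alpha) ** n_tests       # no false positive in any test
p_at_least_one_fp = 1 - p_no_fp        # at least one false positive
print(round(p_no_fp, 3), round(p_at_least_one_fp, 3))  # 0.857 0.143
```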

44
Q

We are running a test with 10 variants, trying different versions of our landing page. One treatment wins and the p-value is less than .05. Would you make the change?

A

The answer is no, because of the multiple testing problem. There are several ways to approach it. One commonly used method is the Bonferroni correction, which divides the significance level 0.05 by the number of tests. Since we are measuring 10 tests, the significance level for each test should be 0.05 divided by 10, which is 0.005. Basically, we only claim a test is significant if it shows a p-value of less than 0.005. The drawback of the Bonferroni correction is that it tends to be too conservative.

Another method is to control the false discovery rate (FDR):
FDR = E[# of false positive / # of rejections]
It measures, out of all rejections of the null hypothesis (all the metrics you declare to have a statistically significant difference), how many had a real difference as opposed to being false positives. This only makes sense if you have a huge number of metrics, say hundreds. Suppose we have 200 metrics and cap FDR at 0.05. This means we’re okay with false positives 5% of the time, so we should expect around 10 false positives among those 200 metrics.

45
Q

After running a test, you see the desired metric, such as the click-through rate is going up while the number of impressions is decreasing. How would you make a decision?

A

During interviews, we could provide a simplified version of the solution, focusing on the current objective of the experiment. Is it to maximize engagement, retention, revenue, or something else? Also, we want to quantify the negative impact, i.e. the negative shift in a non-goal metric, to help us make the decision. For instance, if revenue is the goal, we could choose it over maximizing engagement assuming the negative impact is acceptable.

46
Q

When to use a z-test versus a t-test?

A

If the metric follows a Bernoulli distribution: if np > 10 and n(1-p) > 10, use a Z-test; otherwise use a binomial test.

If the metric doesn’t follow a Bernoulli distribution: if the sample size > 30 and the population variance is known, use a Z-test. If the sample size > 30 and the population variance is NOT known, use a t-test.
If the sample size < 30, we need to verify that the population distribution is normal.
Note: we don’t need to verify that the population distribution is normal for n > 30 because of the Central Limit Theorem, which says the sampling distribution of the mean will be approximately normal for large n.


If the population is not normal, we can use the Mann-Whitney U test to compare the distributions.

The Z-test is less common because it requires the population variance, which we usually don’t have.

51
Q

what is Bernoulli distribution?

A

A Bernoulli random variable takes one of two outcomes:
P(success) = p
P(failure) = 1 - p

The Bernoulli distribution applies to something like click-through probability (CTP) -> a user either clicks or doesn’t click (success vs failure).
Ask: is the metric a proportion, e.g. a percentage of users or pages?

52
Q

What is Central Limit Theorem?

A

For large n, the sampling distribution of the sample mean approaches a normal distribution, whatever the shape of the population distribution.

54
Q

What is different between student’s T vs normal?

A

The t-test is used when the test statistic follows a Student’s t-distribution under the null hypothesis.
The t-distribution is more spread out and is used when the population standard deviation is unknown.
The t-distribution produces a wider confidence interval than the z-distribution. With a t-test and small n, it is more likely that you won’t find practically and statistically significant results because your margin of error will be quite large (Z * SE), where SE = sqrt(p(hat) * (1 - p(hat)) * (1/n,c + 1/n,t)).
We are less certain about our estimate.

Degrees of freedom: the number of pieces of information that can freely vary without violating restrictions.

As n increases, the t-distribution approximates the normal. For large n, the t-test gives almost the same p-values and confidence intervals as the z-test.

55
Q

Why do we not use t-test for proportions (i.e. distributions that follow Bernoulli distribution)?

A

This is because the test statistic doesn’t have a t-distribution - it instead approximately follows a normal distribution.

t, test statistic = d/s, where d is the difference between means and s is the estimated standard error.
When the sample size is large, d is asymptotically normal.

Summary:

  • using z-tests is correct
  • using t-tests is wrong (academically)
  • results are similar when the sample size is large
56
Q

Example: test color of button
Click through probability: N(users who clicked) / N(total users)
1000 users in both control & treatment groups

Results:
Control group: 1.1% CTP
Treatment group: 2.3% CTP

Significant difference? Launch the feature?

A

alpha, significance level = 0.05
practical significance boundary, d,min = 0.01

To determine if we should launch, we need to test whether the results are statistically and practically significant.

Answer:
Which hypothesis test should we use?
We know it’s a Bernoulli distribution -> two outcomes (success or failure), a proportion; and np > 10 and n(1-p) > 10 because 1000 * 0.011 = 11, which is greater than 10, so we can do a Z-test.
d, difference between groups = 1.2% = 0.012

Null hypothesis: p(c) = p(t), d=0
alternate hypothesis: p(c)!=p(t), d!=0

p(c) = 11/1000
p(t) = 23/1000

T, test statistic = (p(t) - p(c)) / pooled standard error (SE)
to calculate pooled standard error, we need pooled probability:
po, pooled prob = (11+23)/(1000+1000) = 0.017
SE = sqrt(po(1-po)(1/nc + 1/nt))
SE = sqrt(0.017(1-0.017)(1/1000 + 1/1000))

T, test statistic = (0.012)/(SE) = 2.076

The critical z-value at the 0.05 significance level is 1.96.
We reject the null hypothesis if |T| > 1.96. Here T = 2.076 > 1.96, so we reject the null hypothesis -> the result is statistically significant.

but is it practically significant?
we know that d,min = 0.010 and the observed d = 0.012
What is the confidence interval?
CI = {d - margin of error : d + margin of error}
margin of error = z * SE = 1.96 * SE = 0.0113
CI = {0.012 - 0.0113 : 0.012 + 0.0113} = {0.0007 : 0.0233}
The lower CI boundary (0.0007) < d,min (0.010), so the result is not practically significant and the feature shouldn’t be launched.

How to interpret the results: the point estimate (0.012) is above the practical significance boundary, but the confidence interval includes effects smaller than d,min, so we don’t have confidence that the change is practically significant -> no launch.
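The arithmetic above can be reproduced with a short script (a sketch using only the flashcard’s numbers):

```python
# Two-proportion z-test with pooled standard error,
# reproducing the button-color example.
import math

n_c, n_t = 1000, 1000
x_c, x_t = 11, 23                     # clicks in control / treatment
p_c, p_t = x_c / n_c, x_t / n_t
d = p_t - p_c                         # observed difference, 0.012

p_pool = (x_c + x_t) / (n_c + n_t)    # pooled probability, 0.017
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

z = d / se                            # test statistic, ~2.076
moe = 1.96 * se                       # margin of error, ~0.0113
ci = (d - moe, d + moe)               # ~(0.0007, 0.0233)

d_min = 0.010
print(f"z={z:.3f}, CI=({ci[0]:.4f}, {ci[1]:.4f})")
print("statistically significant:", abs(z) > 1.96)   # True
print("practically significant:", ci[0] > d_min)     # False
```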

57
Q

What test to use when two samples have different variances and/or sample sizes?

A

We can use Welch’s t-test for which we would need to calculate the unpooled standard error.
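A minimal sketch with SciPy (the samples below are synthetic, generated just to show the call; `equal_var=False` is what selects Welch’s test):

```python
# Welch's t-test: unpaired samples with unequal variances and sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=2.0, size=80)   # different variance and size

t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
print(f"t={t_stat:.3f}, p={p_val:.4f}")
```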

58
Q

What is Power?

A

used in a binary hypothesis test. It is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true (a true positive); power = 1 - beta. It is the likelihood that a test will detect an effect when an effect is present. The higher the statistical power, the better the test is.

59
Q

What is type I error?

A

also known as a false positive; one of the two ways to categorize errors in a binary hypothesis test. It occurs when we mistakenly reject a true null hypothesis. The larger its rate (alpha), the less reliable a test is. Commonly controlled in AB testing.

60
Q

What is type II error?

A

also known as a false negative. It occurs when we fail to reject a false null hypothesis: we conclude there is no significant effect when there actually is. The larger its rate (beta), the less reliable a test is.

61
Q

What is a confidence interval?

A

when we want to know how variable a sample result is. A CI estimates the true value with a range of values. The wider the interval, the more uncertainty about the sample result. The less data, the wider the CI. The higher the confidence level, the wider the CI.

62
Q

What is a p-value?

A

The p-value is the probability of obtaining results at least as extreme as the observed results, given that the null hypothesis is true. The lower the p-value, the less support for the null hypothesis. Often we use 0.05 as the threshold in hypothesis testing. It’s used in AB testing to compare a metric between the treatment and control groups. The smaller the p-value, the more certain we are that there is a difference between the two groups.

p-value = Pr(results at least this extreme | null hypothesis is true)

NOT Pr(null hypothesis is true | results at least this extreme)

63
Q

What are the assumptions with linear regression?

A

There are four assumptions associated with a linear regression model:

Linearity: The relationship between X and the mean of Y is linear.
Homoscedasticity: The variance of residual is the same for any value of X.
Independence: Observations are independent of each other.
Normality: For any fixed value of X, Y is normally distributed.

64
Q

What is chi-square statistic for hypothesis testing?

A

Definition: a way of testing whether there is a relationship between observed results and expected results (e.g., the counts expected under independence). A t-test, in contrast, tests whether a metric differs enough between two distributions.
How do we use this?
We compare the chi-square statistic against the chi-square distribution with the appropriate degrees of freedom; the p-value is the tail probability of results at least as extreme.
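A sketch using SciPy’s `chi2_contingency` on a 2x2 table of (clicked, not clicked) counts, reusing the earlier button example; note that SciPy applies Yates’ continuity correction to 2x2 tables by default:

```python
# Chi-square test of independence: does click behavior depend on group?
from scipy import stats

observed = [[11, 989],    # control: clicked, not clicked
            [23, 977]]    # treatment: clicked, not clicked

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi2={chi2:.3f}, dof={dof}, p={p:.4f}")
```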

65
Q

What is Bayes Theorem?

A

P(h|D) = P(D|h) * P(h) / P(D)
Breaking this down, it says that the probability of a given hypothesis holding or being true given some observed data can be calculated as the probability of observing the data given the hypothesis multiplied by the probability of the hypothesis being true regardless of the data, divided by the probability of observing the data regardless of the hypothesis.

Bayes theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
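A small numeric sketch of the formula; the prior and likelihood values below are made up for illustration (a disease-test style example):

```python
# Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h = 0.01              # P(h): prior, 1% have the condition
p_d_given_h = 0.95      # P(D|h): likelihood of the data given h
p_d_given_not_h = 0.05  # P(D|not h): false positive rate

# P(D) via the law of total probability, then Bayes theorem
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(h|D) = {p_h_given_d:.3f}")  # ~0.161
```

Even with a 95%-sensitive test, the posterior stays low because the prior is small - the point of weighting the likelihood by P(h).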

66
Q

Absolute vs Relative differences in metrics?

A

The simple way is to take the absolute difference between the control and experiment metrics, but often the relative percent change is better, e.g. when there is seasonality or system change over time (fewer users in June than in December for a shopping site). With a relative difference you can keep one practical significance boundary across such conditions.
The disadvantage is variability - relative metrics can vary a lot more than absolute differences.
Example: if your control click-through probability was 5% and your experiment click-through probability was 7%, your relative difference would be 40 percent and your absolute difference would be 2 percentage points.

67
Q

Explain t-test

A

It is a type of inferential statistic used to study whether there is a statistical difference between two groups. Mathematically, it sets up the problem by assuming that the means of the two distributions are equal (H₀: µ₁=µ₂). If the t-test rejects the null hypothesis, it indicates that the groups are very likely different.
This test is typically used when the groups have small samples (roughly 20-30 each). For larger sample sizes or more groups, other tests such as the z-test, chi-square test or F-test are more appropriate.

68
Q

what are t-test types?

A

unpaired -> parametric -> equal variance -> Student’s t-test
unpaired -> parametric -> unequal variance -> Welch’s t-test
unpaired -> non-parametric -> Mann-Whitney U test
paired -> parametric -> paired t-test
paired -> non-parametric -> Wilcoxon signed-rank test
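The decision list above maps directly onto SciPy calls. A sketch on synthetic data (the paired tests assume matched, equal-length samples):

```python
# Each branch of the t-test decision tree as a SciPy function.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 40)
b = rng.normal(0.3, 1.0, 40)

student = stats.ttest_ind(a, b, equal_var=True)   # unpaired, equal variance
welch = stats.ttest_ind(a, b, equal_var=False)    # unpaired, unequal variance
mann = stats.mannwhitneyu(a, b)                   # unpaired, non-parametric
paired = stats.ttest_rel(a, b)                    # paired, parametric
wilcox = stats.wilcoxon(a, b)                     # paired, non-parametric

for name, res in [("student", student), ("welch", welch),
                  ("mann-whitney", mann), ("paired", paired),
                  ("wilcoxon", wilcox)]:
    print(f"{name}: p={res.pvalue:.4f}")
```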

69
Q

what are degrees of freedom in t-test?

A

refers to the maximum number of logically independent values in the data sample

70
Q

How to check whether there is bias in your samples?

A

Split your data by different segments - by region/by type of member/type of applicants, etc -> see if some of these partitions have higher likelihood of effect that you are trying to measure and if there is uneven representation in your sampling strategy.

71
Q

How to measure sensitivity and robustness of metrics?

A

Run A/A tests: with no real change between groups, see which summary metrics still prove to vary a lot -> those metrics are too sensitive and not robust.

72
Q

How to reduce size of an experiment?

A

Suppose you don’t want to spend that long on the experiment (e.g., 300k pageviews per group are needed). Besides changing d,min, alpha (false positive rate), and beta (false negative rate), you can also do the following:
Change the unit of diversion from cookie to page view
This would increase the number of units. Variability of the metric will decrease because the unit of diversion is the same as the unit of analysis. Negative: a less consistent experience for the user.
Target the experiment to specific traffic
Restricting to English traffic only will prevent diluting the effect.
This could also impact the choice of practical significance boundary.
Change the metric to a cookie-based click-through probability
This will reduce the required size, but not by much.

73
Q

P(A or B) =?

A

P(A or B) = P(A) + P(B) - P(A and B)

74
Q

P(A and B) = ?

A

P(A and B) = P(A) * P(B) [only for independent events]

75
Q

Equation for min-max standardization?

A

z = (x_i - min(x)) / (max(x) - min(x))
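A minimal sketch of the equation:

```python
# Min-max standardisation: rescale a feature into [0, 1].
def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

print(min_max_scale([18, 30, 60]))  # [0.0, ~0.286, 1.0]
```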

76
Q

What is logistic regression? What is the loss function in logistic regression?

A

Answer : Logistic Regression is a model for binary classification. It is a statistical model that passes a linear combination of the features through the logistic (sigmoid) function to produce a probability, which is then thresholded to give 0 or 1 as a result.

The loss function in LR is known as the Log Loss function.
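A minimal sketch of the log loss (binary cross-entropy); the clipping value `eps` is a common implementation detail to avoid log(0), not part of the definition:

```python
# Log loss: average of -[y*log(p) + (1-y)*log(1-p)] over predictions.
import math

def log_loss(y_true, y_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # low: confident and correct
print(log_loss([1, 0, 1], [0.1, 0.9, 0.2]))  # high: confident and wrong
```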

77
Q

Different b/w regression and classification?

A

Answer : The major difference between Regression and Classification is that Regression results in a continuous quantitative value while Classification is predicting the discrete labels.

However, there is no clear line that draws the difference between the two. We have a few properties of both Regression and Classification. These are as follows:

Regression

Regression predicts the quantity.
We can have discrete as well as continuous values as input for regression.
If input data are ordered with respect to the time it becomes time series forecasting.
Classification

The Classification problem for two classes is known as Binary Classification.
Classification with more classes can be split into Multi-Class Classification or Multi-Label Classification.
We focus more on accuracy in Classification while we focus more on the error term in Regression.

78
Q

Why do we need Evaluation Metrics. What do you understand by Confusion Matrix ?

A

Evaluation Metrics are statistical measures of model performance. They are very important because to determine the performance of any model it is very significant to use various Evaluation Metrics. Few of the evaluation Metrics are – Accuracy, Log Loss, Confusion Matrix.

Confusion Matrix is a matrix used to find the performance of a Classification model. For binary classification it is a 2×2 matrix with predictions on one side and actual values on the other.

79
Q

How does Confusion Matrix help in evaluating model performance?

A

accuracy: overall performance of model: (TP+TN) / (TP+FP+TN+FN)
precision: how accurate the positive predictions are: TP/(TP+FP)
recall: coverage of actual positive samples: TP/(TP+FN)

f1 score: hybrid metric useful for unbalanced classes: (2TP)/(2TP + FP + FN)
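These formulas as a small script (the counts are made-up illustration values):

```python
# Confusion-matrix metrics from raw TP/FP/TN/FN counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=40, fp=10, tn=45, fn=5)
print(acc, prec, rec, f1)  # 0.85, 0.8, ~0.889, ~0.842
```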

80
Q

What is the significance of Sampling? Name some techniques for Sampling?

A

For large datasets we cannot analyze the whole volume at once. We need to take samples from the data which can represent the whole population; when making a sample out of complete data, it should be a true representative of the whole data set.

There are mainly two types of Sampling techniques based on Statistics.

Probability Sampling and Non Probability Sampling

Probability Sampling – Simple Random, Clustered Sampling, Stratified Sampling.

Non Probability Sampling – Convenience Sampling, Quota Sampling, Snowball Sampling.

81
Q

What are Type 1 and Type 2 errors? In which scenarios the Type 1 and Type 2 errors become significant?

A

Rejection of True Null Hypothesis is known as a Type 1 error. In simple terms, False Positive are known as a Type 1 Error.

Not rejecting the False Null Hypothesis is known as a Type 2 error. False Negatives are known as a Type 2 error.

Type 1 Error is significant where a false positive is costly. For example, if a man who is not suffering from a particular disease is marked as positive for that infection, the medications given to him might damage his organs.

Type 2 Error is significant where a false negative is costly. For example, an alarm has to be raised in case of burglary in a bank; if the system identifies it as a false case and doesn’t raise the alarm on time, it results in a heavy loss.

82
Q

What are the conditions for Overfitting and Underfitting?

A

In Overfitting the model performs well on the training data, but fails to generalise to any new data. In Underfitting the model is too simple and not able to identify the correct relationship. The bias and variance conditions are as follows.

Overfitting – Low bias and high variance result in an overfitted model. Decision trees are more prone to Overfitting.

Underfitting – High bias and low variance. Such a model doesn’t perform well even on the training data. For example, Linear Regression is more prone to Underfitting.

83
Q

What do you mean by Normalisation? Difference between Normalisation and Standardization?

A

Normalisation is a process of bringing the features into a similar range, so that the model can perform well and does not get biased towards any particular feature.

For example – If we have a dataset where one feature is Age in the range 18-60 and another is Salary ranging from 20000 – 2000000, the values differ greatly in scale: age is a two-digit integer while salary is in a significantly higher range. To bring the features into a comparable range we need Normalisation.

Both Normalisation and Standardization are methods of feature conversion, but the conversions differ. After Normalisation the data is scaled into the range 0-1, while after Standardization the data is scaled so that its mean comes out to be 0 (and typically its standard deviation to be 1).

84
Q

What do you mean by Regularisation? What are L1 and L2 Regularisation?

A

Regularisation is a method to improve an Overfitted model by introducing extra penalty terms in the loss function. This helps in making the model performance better for unseen data.

There are two types of Regularisation :

L1 Regularisation – In L1 we add lambda times the absolute weight terms to the loss function. In this the feature weights are penalised on the basis of absolute value.

L2 Regularisation – In L2 we add lambda times the squared weight terms to the loss function. In this the feature weights are penalised on the basis of squared values.
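A sketch contrasting the two penalties with scikit-learn (assumed installed; `alpha` plays the role of lambda, and the data is synthetic with only one informative feature):

```python
# L1 (Lasso) drives irrelevant weights to exactly zero;
# L2 (Ridge) shrinks weights but keeps them nonzero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=100)  # only feature 0 matters

lasso = Lasso(alpha=0.5).fit(X, y)   # absolute-value penalty
ridge = Ridge(alpha=0.5).fit(X, y)   # squared penalty

print("lasso coefs:", lasso.coef_.round(3))
print("ridge coefs:", ridge.coef_.round(3))
```

This sparsity is why L1 is often used for feature selection, while L2 is preferred for generic shrinkage.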

85
Q

Describe Decision tree Algorithm and what are entropy and information gain?

A

Decision tree is a Supervised Machine Learning approach. It uses labelled past decision data to prepare a model, identifying patterns to predict the classes or output variable for new inputs.

The Decision tree works in the following manner –

It takes the complete data set and tries to identify the split point with the highest information gain (i.e., the greatest reduction in entropy) to mark as a decision node, and proceeds recursively in this manner. Entropy and information gain are the deciding factors for identifying each node in a Decision Tree.
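Entropy and information gain for a candidate split can be sketched as:

```python
# Entropy of a label set, and the information gain of splitting a
# parent node into left/right children.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = [1, 1, 1, 0, 0, 0]
# A split that separates the classes perfectly has maximal gain:
print(information_gain(parent, [1, 1, 1], [0, 0, 0]))  # 1.0
# A split that doesn't separate them at all gains nothing:
print(information_gain(parent, [1, 0], [1, 1, 0, 0]))  # 0.0
```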

86
Q

What is Ensemble Learning. Give an important example of Ensemble Learning?

A

Ensemble Learning is a process of accumulating multiple models to form a better prediction model. In Ensemble Learning the performance of the individual model contributes to the overall development in every step. There are two common techniques in this – Bagging and Boosting.

Bagging – In this the data set is split to perform parallel processing of models and results are accumulated based on performance to achieve better accuracy.

Boosting – This is a sequential technique in which a result from one model is passed to another model to reduce error at every step making it a better performance model.

The most important example of Ensemble Learning is the Random Forest Classifier. It combines multiple Decision Trees to form a better-performing Random Forest model.

87
Q

Explain Naive Bayes Classifier and the principle on which it works?

A

Naive Bayes Classifier is a probabilistic model that works on the Bayes Theorem principle. It is called “naive” because it assumes the features are conditionally independent given the class. For continuous features, its accuracy can often be improved by modelling them with kernel density estimates.

Bayes Theorem – This theorem explains conditional probability: identifying the probability of occurrence of Event A provided that Event B has already occurred.

88
Q

What is Imbalanced Data? How do you manage to balance the data?

A

If data is distributed across different categories and the distribution is highly imbalanced, it is known as Imbalanced Data. Such datasets cause errors in model performance by making the categories with large counts dominate, resulting in an inaccurate model.

There are various techniques to handle imbalanced data. We can increase the number of samples for minority classes (oversampling), decrease the number of samples for classes with extremely high numbers of data points (undersampling), or use a cluster-based technique to generate data points for the minority categories.

89
Q

Explain Unsupervised Clustering approach?

A

Grouping the data into different clusters based on the distribution of data is known as Clustering technique.

There are various Clustering Techniques –

  1. Density Based Clustering – DBSCAN , HDBSCAN
  2. Hierarchical Clustering.
  3. Partition Based Clustering
  4. Distribution Based Clustering.
90
Q

What do you mean by Cross Validation. Name some common cross Validation techniques?

A

Cross Validation is a model performance improvement technique. It is a statistics-based approach in which the model is trained and validated in rotation on splits of the training dataset, so that it can perform well on unknown or testing data.

In this the training data are split into different groups and in rotation those groups are used for validation of model performance.

The common Cross Validation techniques are –

K- Fold Cross Validation

Leave p-out Cross Validation

Leave-one-out cross-validation.

Holdout method
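The rotation in K-fold cross-validation can be sketched by hand with indices only (no model), 5 folds over 10 samples:

```python
# K-fold CV: partition n samples into k folds; each fold takes one
# turn as the validation set while the rest form the training set.
def k_fold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]

for train_idx, val_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "val:", val_idx)
```

In practice a library helper (e.g., scikit-learn’s `KFold`) would also shuffle the data first.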

91
Q

What is deep learning?

A

Deep Learning is the branch of Machine Learning and AI which uses multi-layer neural networks to learn complex models and achieve better accuracy. Deep Learning models have a human-brain-inspired structure, with an input layer, hidden layers, activation functions and an output layer.

Deep Learning have so many real time applications –

Self Driving Cars

Computer Vision and Image Processing

Real Time Chat bots

Home Automation Systems

92
Q

What is unsupervised learning?

A

Unsupervised learning is a machine learning technique, where you do not need to supervise the model. Instead, you need to allow the model to work on its own to discover information. It mainly deals with the unlabelled data.

Unsupervised learning algorithms allow you to perform more complex processing tasks compared to supervised learning, although the results can be more unpredictable than those of supervised or reinforcement learning methods.

examples: clustering, association

93
Q

What is supervised learning?

A

In Supervised learning, you train the machine using data which is well “labeled.” It means some data is already tagged with the correct answer. It can be compared to learning which takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights remain true as the data changes.

examples: regression, classification