A/B Testing Flashcards
What is the goal of A/B testing?
To determine which version of a product performs better with users: to objectively compare aspects of product choices so we can make data-driven decisions.
What are the different types of A/B tests?
A/B test; A/B/N test, where N stands for the number of versions being tested (used by Indeed), which is best for major layout or design decisions; and multivariate testing.
What are some typical metrics Indeed would test?
Click-through rate, applications per week, job listings per week, bounce rate.
How long should you run an AB test?
Given by Duration (weeks) = (Nv · Nu ) / (p · M/4 )
Nv = number of variants
Nu = number of users needed per variant (from a sample-size calculation or lookup table)
p = fraction of users in this test (e.g., if this test runs on 5% of users, p = 0.05)
M = MAU (monthly active users)
Looking at the above formula: as the fraction of users in the test and the number of monthly active users go up, the shorter the duration you need to run your test. Conversely, as the number of variants and the number of users needed per variant increase, the longer you need to run your test.
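A minimal sketch of this formula in Python (the example numbers are made up; Nu would come from your own sample-size calculation):

```python
def ab_test_duration_weeks(n_variants, users_per_variant, traffic_fraction, mau):
    """Duration (weeks) = (Nv * Nu) / (p * M/4); M/4 approximates weekly active users."""
    weekly_eligible_users = traffic_fraction * (mau / 4)
    return (n_variants * users_per_variant) / weekly_eligible_users

# Example: 2 variants, 10,000 users needed per variant, test running on 5% of 1,000,000 MAU.
print(ab_test_duration_weeks(2, 10_000, 0.05, 1_000_000))  # 1.6 weeks
```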
What is a customer funnel?
Users flow down the funnel, with more users at the top: homepage visits -> exploring the site -> creating an account -> submitting an application.
When can we use binomial distribution?
Two possible outcomes per trial (e.g., clicked or didn't click)
Independent events (the outcome of one coin flip doesn't affect another)
Identically distributed: p is the same for all events
The usefulness of the binomial distribution is that we can then estimate the standard error of the observed data and calculate confidence intervals.
If the sample is large enough, we can use the normal approximation to the binomial. A good rule of thumb is N·p(hat) > 5, where N is the number of samples and p(hat) is the estimated probability (# of users who clicked / # of users).
When do we know we can use the normal distribution approximation?
Check that N·p(hat) > 5.
Equation for margin of error?
Margin of error = Z-score * sqrt(p(hat) * (1 − p(hat)) / N). The more samples, the smaller the margin; and the closer p(hat) is to 0.5 (i.e., the more uncertain which class an observation falls in), the bigger the margin. Example at 99% confidence with p(hat) = 0.15 and N = 2000: margin of error = 2.58 * sqrt(0.15 * 0.85 / 2000) ≈ 0.021.
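A quick sketch reproducing the example above (Z ≈ 2.58 for 99% confidence, p(hat) = 0.15, N = 2000), assuming scipy is available for the z-score:

```python
import math
from scipy import stats

def margin_of_error(p_hat, n, confidence=0.99):
    """Margin of error = z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # ~2.58 for 99% confidence
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(margin_of_error(0.15, 2000))  # ~0.021
```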
When to use two tailed vs one tailed test?
The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows you to distinguish between three cases:
A statistically significant positive result
A statistically significant negative result
No statistically significant difference.
Sometimes when people run A/B tests, they will use a one-tailed test, which only allows you to distinguish between two cases:
A statistically significant positive result
No statistically significant result
Which one you should use depends on what action you will take based on the results. If you’re going to launch the experiment for a statistically significant positive change, and otherwise not, then you don’t need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.
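A minimal sketch of the difference, assuming a z-statistic has already been computed from the experiment (the value 2.1 is made up):

```python
from scipy import stats

z = 2.1  # hypothetical z-statistic from a two-proportion test

p_one_tailed = stats.norm.sf(z)           # P(Z > z): only detects a positive effect
p_two_tailed = 2 * stats.norm.sf(abs(z))  # detects an effect in either direction

print(p_one_tailed, p_two_tailed)  # ~0.018 vs ~0.036
```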
How do you assess the statistical significance of an insight?
You would perform hypothesis testing to determine statistical significance. First, you would state the null hypothesis and alternative hypothesis. Second, you would calculate the p-value, the probability of obtaining the observed results of a test assuming that the null hypothesis is true. Last, you would set the level of the significance (alpha) and if the p-value is less than the alpha, you would reject the null — in other words, the result is statistically significant.
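A small sketch of those three steps on simulated metric data (the means, spread, and sample sizes are made up), using a two-sample t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=500)  # simulated control metric
variant = rng.normal(loc=10.3, scale=2.0, size=500)  # simulated variant metric

# H0: the two means are equal; H1: they differ.
t_stat, p_value = stats.ttest_ind(control, variant)

alpha = 0.05
print(p_value, p_value < alpha)  # reject H0 if p-value < alpha
```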
Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?
A long-tailed distribution is a type of heavy-tailed distribution that has a tail (or tails) that drop off gradually and asymptotically.
3 practical examples include the power law, the Pareto principle (more commonly known as the 80–20 rule), and product sales (i.e. best selling products vs others).
It's important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values collectively make up the majority of the population. This can change the way you deal with outliers, and it also conflicts with machine learning techniques that assume the data is normally distributed.
What is the Central Limit Theorem? Explain it. Why is it important?
“The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger no matter what the shape of the population distribution.” [1]
The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals.
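A simulation sketch of the theorem: means of samples drawn from a heavily skewed (exponential) population still cluster tightly and symmetrically around the population mean (all numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # heavily skewed population

# Sampling distribution of the mean for samples of size 50.
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2_000)])

# The sample means concentrate around the population mean (~2.0),
# with spread roughly sigma / sqrt(n).
print(population.mean(), sample_means.mean(), sample_means.std())
```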
What is the statistical power?
'Statistical power' refers to the power of a binary hypothesis test: the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. [2]
Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?
Selection bias is the phenomenon of selecting individuals, groups or data for analysis in such a way that proper randomization is not achieved, ultimately resulting in a sample that is not representative of the population.
Understanding and identifying selection bias is important because it can significantly skew results and provide false insights about a particular population group.
Types of selection bias include:
sampling bias: a biased sample caused by non-random sampling
time interval: selecting a specific time frame that supports the desired conclusion. e.g. conducting a sales analysis near Christmas.
exposure: includes clinical susceptibility bias, protopathic bias, and indication bias.
data: includes cherry-picking, suppressing evidence, and the fallacy of incomplete evidence.
attrition: attrition bias is similar to survivorship bias, where only those that 'survived' a long process are included in an analysis, or failure bias, where only those that 'failed' are included.
observer selection: related to the Anthropic principle, which is a philosophical consideration that any data we collect about the universe is filtered by the fact that, in order for it to be observable, it must be compatible with the conscious and sapient life that observes it. [3]
Handling missing data can make selection bias worse because different methods impact the data in different ways. For example, if you replace null values with the mean of the data, you are adding bias by assuming the data is not as spread out as it might actually be.
Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?
Observational data comes from observational studies which are when you observe certain variables and try to determine if there is any correlation.
Experimental data comes from experimental studies which are when you control certain variables and hold them constant to determine if there is any causality.
An example of experimental design is the following: split a group up into two. The control group lives their lives normally. The test group is told to drink a glass of wine every night for 30 days. Then research can be conducted to see how wine affects sleep.
Is mean imputation of missing data acceptable practice? Why or why not?
Mean imputation is the practice of replacing null values in a data set with the mean of the data.
Mean imputation is generally bad practice because it doesn't take into account feature correlation. For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score over an age range of 15 to 80, then the eighty-year-old would appear to have a much higher fitness score than he actually should.
Second, mean imputation reduces the variance of the data and increases bias in our data. This leads to a less accurate model and a narrower confidence interval due to a smaller variance.
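A small sketch of the variance-reduction problem with made-up fitness scores:

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.normal(loc=70, scale=15, size=1_000)  # "true" fitness scores

# Knock out 30% of the values, then mean-impute them.
mask = rng.random(scores.size) < 0.3
observed = scores.copy()
observed[mask] = np.nan
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

# The imputed column has a noticeably smaller standard deviation than the true data.
print(scores.std(), imputed.std())
```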
What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset. Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset.
An outlier is a data point that differs significantly from other observations.
Depending on their cause, outliers can be bad from a machine learning perspective because they can worsen the accuracy of a model. If an outlier is caused by a measurement error, it's important to remove it from the dataset. There are a couple of ways to identify outliers:
Z-score/standard deviations: if we know that 99.7% of the data in a normally distributed data set lies within three standard deviations, then we can calculate the size of one standard deviation, multiply it by 3, and identify the data points that are outside of this range. Likewise, we can calculate the z-score of a given point, and if its magnitude reaches +/- 3, then it's an outlier.
Note that there are a few contingencies to consider when using this method: the data must be normally distributed, it is not applicable to small data sets, and the presence of too many outliers can throw off the z-scores.
Interquartile Range (IQR): the IQR, the concept used to build boxplots, can also be used to identify outliers. The IQR is equal to the difference between the 3rd quartile and the 1st quartile. A point is an outlier if it is less than Q1 − 1.5*IQR or greater than Q3 + 1.5*IQR; for normal data these fences sit at approximately 2.698 standard deviations from the mean.
Other methods include DBSCAN clustering, Isolation Forests, and Robust Random Cut Forests.
An inlier is a data observation that lies within the rest of the dataset but is nonetheless unusual or an error. Because it sits inside the bulk of the data, it is typically harder to identify than an outlier and often requires external data to detect. Should you identify any inliers, you can simply remove them from the dataset.
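A sketch of the z-score and IQR rules described above on toy data (one injected extreme value):

```python
import numpy as np

rng = np.random.default_rng(7)
data = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)  # one extreme point

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(z_outliers, iqr_outliers)  # both rules flag the injected point (120.0)
```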
How can you test if your data is normally distributed?
You can perform the Shapiro-Wilk test, e.g. with scipy.stats.shapiro(x).
The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
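A minimal usage sketch on simulated data (sample sizes and distributions are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)

# shapiro returns (statistic, p-value); a small p-value (< 0.05) is evidence
# against the null hypothesis that the data came from a normal distribution.
print(stats.shapiro(normal_sample))
print(stats.shapiro(skewed_sample))
```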
How do you handle missing data? What imputation techniques do you recommend?
There are several ways to handle missing data:
Delete rows with missing data
Mean/Median/Mode imputation
Assigning a unique value
Predicting the missing values
Using an algorithm which supports missing values, like random forests
The best method is to delete rows with missing data as it ensures that no bias or variance is added or removed, and ultimately results in a robust and accurate model. However, this is only recommended if there’s a lot of data to start with and the percentage of missing values is low.
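A sketch of a few of these approaches with pandas on a toy DataFrame (columns and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, 47, 51, np.nan, 38],
                   "score": [88, np.nan, 72, 65, 70, np.nan]})

dropped = df.dropna()                # delete rows with missing data
mean_imputed = df.fillna(df.mean())  # mean imputation, column by column
flagged = df.fillna(-1)              # assign a unique sentinel value

print(dropped.shape, mean_imputed.isna().sum().sum(), flagged.min().min())
```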
You have data on the duration of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?
First I would conduct EDA — Exploratory Data Analysis to clean, explore, and understand my data. See my article on EDA here. As part of my EDA, I could compose a histogram of the duration of calls to see the underlying distribution.
My guess is that the duration of calls would follow a lognormal distribution. The reason I believe it's positively skewed is that the lower end is bounded at 0, since a call can't last a negative number of seconds, while on the upper end there is likely to be a small proportion of calls that are extremely long relative to the rest.
Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?
Administrative datasets are typically datasets used by governments or other organizations for non-statistical reasons.
Administrative datasets are usually larger and more cost-efficient than experimental studies. They are also regularly updated, assuming that the organization associated with the administrative dataset is active and functioning. At the same time, administrative datasets may not capture all of the data that one may want, and may not be in the desired format either. They are also prone to quality issues and missing entries. Experimental methods help alleviate these problems because the researcher controls randomization and what is measured, but they bring problems of their own, notably higher cost, smaller samples, and results that may not generalize beyond the experimental setting.
You’re about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it’s raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that “Yes” it is raining. What is the probability that it’s actually raining in Seattle?
You can tell that this question is related to Bayes' theorem because of the last statement, which essentially follows the structure, "What is the probability A is true given B is true?" Therefore we need to know the prior probability of it raining in Seattle on a given day. Let's assume it's 25%.
P(A) = probability of it raining = 25%
P(B) = probability of all 3 friends say that it’s raining
P(A|B) = probability that it's raining given all 3 friends say that it's raining
P(B|A) probability that all 3 friends say that it’s raining given it’s raining = (2/3)³ = 8/27
Step 1: Solve for P(B)
P(A|B) = P(B|A) * P(A) / P(B), can be rewritten as
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
P(B) = (2/3)³ * 0.25 + (1/3)³ * 0.75 = 0.25 * (8/27) + 0.75 * (1/27)
Step 2: Solve for P(A|B)
P(A|B) = 0.25 * (8/27) / (0.25 * (8/27) + 0.75 * (1/27))
P(A|B) = 8 / (8 + 3) = 8/11
Therefore, if all three friends say that it’s raining, then there’s an 8/11 chance that it’s actually raining.
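A quick numeric check of the calculation, keeping the assumed 25% prior:

```python
p_rain = 0.25    # assumed prior probability of rain in Seattle
p_truth = 2 / 3  # each friend tells the truth with probability 2/3

p_all_yes_given_rain = p_truth ** 3        # (2/3)^3 = 8/27
p_all_yes_given_dry = (1 - p_truth) ** 3   # (1/3)^3 = 1/27

p_all_yes = p_all_yes_given_rain * p_rain + p_all_yes_given_dry * (1 - p_rain)
print(p_all_yes_given_rain * p_rain / p_all_yes)  # 8/11 ~ 0.727
```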
One box has 12 black and 12 red cards; a second box has 24 black and 24 red. If you draw 2 cards at random from one of the two boxes, which box has the higher probability of giving two cards of the same color? Can you tell intuitively why the 2nd box has a higher probability?
The box with 24 red cards and 24 black cards has a higher probability of getting two cards of the same color. Let’s walk through each step.
Let's say the first card you draw from each deck is red.
This means that in the deck with 12 reds and 12 blacks, there’s now 11 reds and 12 blacks. Therefore your odds of drawing another red are equal to 11/(11+12) or 11/23.
In the deck with 24 reds and 24 blacks, there would then be 23 reds and 24 blacks. Therefore your odds of drawing another red are equal to 23/(23+24) or 23/47.
Since 23/47 > 11/23, the second deck, with more cards, has a higher probability of giving two cards of the same color.
What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
Lift: lift is a measure of the performance of a targeting model measured against a random choice targeting model; in other words, lift tells you how much better your model is at predicting things than if you had no model.
KPI: stands for Key Performance Indicator, which is a measurable metric used to determine how well a company is achieving its business objectives. Eg. error rate.
Robustness: generally robustness refers to a system’s ability to handle variability and remain effective.
Model fitting: refers to how well a model fits a set of observations.
Design of experiments: also known as DOE, it is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variable. [4] In essence, an experiment aims to predict an outcome based on a change in one or more inputs (independent variables).
80/20 rule: also known as the Pareto principle; states that 80% of the effects come from 20% of the causes. Eg. 80% of sales come from 20% of customers.
Give examples of data that does not have a Gaussian distribution, nor log-normal.
Any type of categorical data won’t have a gaussian distribution or lognormal distribution.
Exponential distributions — eg. the amount of time that a car battery lasts or the amount of time until an earthquake occurs.
What is root cause analysis? How to identify a cause vs. a correlation? Give examples
Root cause analysis: a method of problem-solving used for identifying the root cause(s) of a problem [5]
Correlation measures the relationship between two variables and ranges from -1 to 1. Causation is when a first event causes a second event. Causation essentially looks at direct relationships, while correlation can look at both direct and indirect relationships.
Example: a higher crime rate is associated with higher sales in ice cream in Canada, aka they are positively correlated. However, this doesn’t mean that one causes another. Instead, it’s because both occur more when it’s warmer outside.
You can test for causation using hypothesis testing or A/B testing.
Give an example where the median is a better measure than the mean
When there are a number of outliers that positively or negatively skew the data.
What is the Law of Large Numbers?
The Law of Large Numbers states that as the number of trials increases, the average of the results gets closer to the expected value.
Eg. the proportion of heads from flipping a fair coin 100,000 times should be closer to 0.5 than the proportion from flipping it 100 times.
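A coin-flip simulation sketch of the example above (the seed and flip counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

for n_flips in (100, 100_000):
    flips = rng.integers(0, 2, size=n_flips)  # fair coin: 0 = tails, 1 = heads
    # The proportion of heads tends toward 0.5 as the number of flips grows.
    print(n_flips, flips.mean())
```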
How do you calculate the needed sample size?
You can use the margin of error (ME) formula to determine the desired sample size: solving ME = (t/z) · S / sqrt(n) for n gives n = ((t/z) · S / ME)².
t/z = t- or z-score used to calculate the confidence interval
ME = the desired margin of error
S = sample standard deviation
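A minimal sketch that solves ME = z·S/sqrt(n) for n, assuming a z-based interval and made-up inputs:

```python
from scipy import stats

def required_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Solve ME = z * S / sqrt(n) for n, i.e. n = (z * S / ME)^2."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)
    return (z * std_dev / margin_of_error) ** 2

# Example: S = 10, desired ME = 2, 95% confidence -> ~96, so round up to 97.
print(required_sample_size(10, 2))
```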
When you sample, what bias are you inflicting?
Potential biases include the following:
Sampling bias: a biased sample caused by non-random sampling
Undercoverage bias: when some members of the population are inadequately represented in the sample
Survivorship bias: error of overlooking observations that did not make it past a form of selection process.
How do you control for biases?
There are many things that you can do to control and minimize bias. Two common things include randomization, where participants are assigned by chance, and random sampling, sampling in which each member has an equal probability of being chosen.
What are confounding variables?
A confounding variable, or a confounder, is a variable that influences both the dependent variable and the independent variable, causing a spurious association, a mathematical relationship in which two or more variables are associated but not causally related.
What is A/B testing?
A/B testing is a form of two-sample hypothesis testing used to compare two versions, the control and the variant, of a single variable. It is commonly used to improve and optimize user experience and marketing.
Infection rates at a hospital above a 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard.
Since we are looking at the number of events (# of infections) occurring within a given amount of time, this is a Poisson distribution question.
The probability of observing k events in an interval is given by the Poisson PMF: P(X = k) = λ^k · e^(−λ) / k!
Null (H0): the rate is 1 infection per 100 person-days
Alternative (H1): the rate is less than 1 infection per 100 person-days (the hospital is below the standard)
k (actual) = 10 infections
lambda (theoretical) = (1/100)*1787
p = 0.032372, or 3.2372%, calculated using POISSON.DIST() in Excel or ppois() in R
Since p-value < alpha (assuming 5% level of significance), we reject the null and conclude that the hospital is below the standard.
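The same p-value can be reproduced with scipy's Poisson CDF:

```python
from scipy import stats

lam = (1 / 100) * 1787  # expected infections under the 1-per-100 standard (17.87)
k = 10                  # observed infections

# One-sided p-value: P(X <= 10) under H0, i.e. the Poisson CDF at k.
print(stats.poisson.cdf(k, lam))  # ~0.032
```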
How to determine if an idea is worth testing?
- Conduct quantitative analysis using historical data to estimate the opportunity size of each idea.
Duration of an AB test?
A rule of thumb for the required sample size per group is 16 * (sample variance) / delta², where delta is the difference between treatment and control.
For delta, we use the minimum detectable effect (determined by business partners)
Sample variance can be determined from existing data.
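A minimal sketch of this rule of thumb, assuming a conversion-style metric (the numbers are made up):

```python
def samples_per_group(sample_variance, min_detectable_effect):
    """Rule of thumb: n per group ~ 16 * variance / delta^2."""
    return 16 * sample_variance / min_detectable_effect ** 2

# Example: variance 0.25 (a ~50% conversion rate), minimum detectable effect of
# 2 percentage points -> 10,000 users per group.
print(samples_per_group(0.25, 0.02))
```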
How to deal with interference between control and treatment groups?
For two-sided markets:
- Geo-based randomization: split by geo-location (e.g., NY to the control group, SF to the treatment group). This has its own challenges, such as higher variance, because each market is unique and has its own local competitors.
- Time-based randomization: select random time windows and assign all users to either the control or the treatment group. This works best if the treatment effect only lasts a short amount of time.