Data Science Interview Flashcards
Give an example of how you would use experimental design to answer a question about user behavior.
step 1: Formulate the Research Question.
e.g. what are the effects of page load times on user satisfaction?
step 2: Identify Variables.
We identify the cause and the effect. The independent variable is page load time; the dependent variable is the user-satisfaction rating.
step 3: Generate Hypothesis.
e.g. a shorter page load time leads to a higher user satisfaction rating for a web page. Here the factor we analyze is page load time.
step 4: Determine Experimental Design.
We consider experimental complexity, i.e. whether to vary one factor at a time or several factors at once, in which case we use a factorial design (e.g. a 2^k design). The design is also selected based on the type of objective (comparative, screening, response surface) and the number of factors.
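A 2^k full factorial design can be enumerated mechanically. The sketch below assumes three two-level factors; the factor names and levels other than page load time are hypothetical and only illustrate the idea.

```python
from itertools import product

# Hypothetical factors, each at two levels; names and levels are illustrative.
factors = {
    "load_time": ("fast", "slow"),
    "button_side": ("left", "right"),
    "banner": ("off", "on"),
}

def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]

runs = full_factorial(factors)
for i, run in enumerate(runs, start=1):
    print(i, run)
```

With k = 3 factors this yields 2^3 = 8 runs, one per combination of levels.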
Here we also choose between within-participants, between-participants, and mixed designs. e.g. there are two versions of a page, one with the “buy” button (call to action) on the left and the other with the button on the right.
Within-participants design: every participant sees both versions.
Between-participants design: each group sees a different version.
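The two designs differ mainly in how users are assigned to versions. A minimal sketch, assuming versions labeled "A" and "B" and made-up user ids (the function names are hypothetical):

```python
import random

def assign_between(user_ids, versions=("A", "B"), seed=0):
    """Between-participants: each user is randomly assigned to exactly one version."""
    rng = random.Random(seed)
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    # Alternate through the shuffled users so the groups stay balanced.
    return {uid: versions[i % len(versions)] for i, uid in enumerate(shuffled)}

def assign_within(user_ids, versions=("A", "B"), seed=0):
    """Within-participants: each user sees every version, in a random order."""
    rng = random.Random(seed)
    orders = {}
    for uid in user_ids:
        order = list(versions)
        rng.shuffle(order)  # randomizing the order helps guard against order effects
        orders[uid] = order
    return orders

users = [f"user{i}" for i in range(6)]
between = assign_between(users)
within = assign_within(users)
print(between)
print(within)
```

A mixed design would combine the two, e.g. button position varied within participants and page load time varied between groups.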
step 5: Develop Experimental Task and Procedure.
Describe in detail the tasks and procedure involved in the experiment, and define the tools used to measure behavior, the goals, and the success metrics. Collect quantitative data about user engagement to allow statistical analysis.
step 6: Determine Manipulation and Measurements.
Manipulation: one level of the factor is held fixed as the control and the other is manipulated. We also identify the behavioral measures:
- Latency: time between prompt and occurrence of behavior (how long it takes user to click buy after being presented with products).
- Frequency: number of times a behavior occurs (number of times a user clicks on a given page within a given time).
- Duration: length of time a specific behavior lasts (time taken to add all products).
- Intensity: the force with which a behavior occurs (how quickly a user purchased a product).
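Latency, frequency, and duration can all be read off a timestamped event log. The sketch below uses a made-up log for one session; the event names and helper functions are hypothetical.

```python
# Hypothetical event log for one session: (seconds since page load, event name).
events = [
    (0.0, "products_shown"),
    (2.5, "add_to_cart"),
    (4.0, "click"),
    (6.5, "add_to_cart"),
    (7.0, "click"),
    (9.0, "buy"),
]

def latency(events, prompt, response):
    """Time from the first `prompt` event to the first later `response` event."""
    t_prompt = next(t for t, name in events if name == prompt)
    t_response = next(t for t, name in events if name == response and t >= t_prompt)
    return t_response - t_prompt

def frequency(events, name, start, end):
    """Number of `name` events with start <= time < end."""
    return sum(1 for t, n in events if n == name and start <= t < end)

def duration(events, name):
    """Time between the first and last occurrence of `name`."""
    times = [t for t, n in events if n == name]
    return max(times) - min(times)

print(latency(events, "products_shown", "buy"))  # 9.0
print(frequency(events, "click", 0, 10))         # 2
print(duration(events, "add_to_cart"))           # 4.0
```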
step 7: Analyze Results.
Analyze the collected user behavior data and state whether the observations support or contradict the hypothesis, e.g. compare users’ satisfaction ratings against the page load times they experienced.
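One way to check whether an observed difference between two groups could be due to chance is a permutation test. The sketch below is only an illustration: the ratings are made up, and the choice of a permutation test (rather than, say, a t-test) is an assumption, not something the steps above prescribe.

```python
import random

# Made-up satisfaction ratings (1-5), for illustration only; not real data.
fast_page = [5, 4, 5, 4, 4, 5, 3, 5]
slow_page = [3, 4, 2, 3, 4, 3, 2, 3]

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)

observed, p_value = permutation_test(fast_page, slow_page)
print(f"mean difference = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise if load time had no effect, which supports the hypothesis; a large one means the data do not contradict "no effect".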
Why is Naive Bayes “naive”?
Naive Bayes is “naive” because it assumes that all features in the data set are equally important and conditionally independent of one another given the class.
In the real world, these assumptions are rarely true.
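The independence assumption shows up directly in the code: the class score is the prior times a product of per-feature probabilities. A minimal from-scratch sketch for categorical features, using made-up data and add-one smoothing (both are assumptions for illustration):

```python
from collections import Counter, defaultdict
from math import log

# Toy, made-up training data: (features, label).
train = [
    ({"contains_offer": "yes", "has_link": "yes"}, "spam"),
    ({"contains_offer": "yes", "has_link": "no"}, "spam"),
    ({"contains_offer": "no", "has_link": "yes"}, "spam"),
    ({"contains_offer": "no", "has_link": "no"}, "ham"),
    ({"contains_offer": "no", "has_link": "yes"}, "ham"),
    ({"contains_offer": "no", "has_link": "no"}, "ham"),
]

def fit(rows):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    values_seen = defaultdict(set)       # feature -> set of observed values
    for features, label in rows:
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen

def predict(model, features):
    """Pick the class maximizing log P(class) + sum of log P(feature | class)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = log(count / total)
        for name, value in features.items():
            # The "naive" step: each feature contributes independently given the class.
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = value_counts[(label, name)][value] + 1
            denominator = count + len(values_seen[name])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, {"contains_offer": "yes", "has_link": "yes"}))  # spam
print(predict(model, {"contains_offer": "no", "has_link": "no"}))    # ham
```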
You find your model suffers from low bias, high variance. Which algo can you utilize to remedy it?
Low bias occurs when the model’s predicted values are NEAR actual values, i.e., the model becomes flexible enough to mimic the training data distribution.
However, a flexible model is not generalizable and tends to overfit the training data.
We can use the BAGGING algo (e.g. random forest) to address the high variance problem.
Bagging algos build multiple training sets from the data set by REPEATED RANDOM SAMPLING with replacement (bootstrap samples). Then, these samples are used to generate a SET of MODELS using a single learning algo. Finally, the model predictions are COMBINED using majority vote (classification) or averaging (regression).
We can also use REGULARIZATION, where models with MANY or LARGE COEFs get penalized, thus lowering model complexity.
We can also keep only the top n features ranked by random forest feature importances.
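The bagging recipe (bootstrap samples, one model per sample, combined predictions) can be sketched from scratch. The base learner here, 1-nearest-neighbour regression, and the noisy data are assumptions chosen for brevity; a random forest would use decision trees and also subsample features at each split, which this sketch does not do.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]

def fit_1nn(sample):
    """A deliberately flexible (high-variance) base learner: 1-nearest-neighbour regression."""
    def predict(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def fit_bagged(data, n_models=25, seed=0):
    """Fit one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_1nn(bootstrap_sample(data, rng)) for _ in range(n_models)]

def predict_bagged(models, x):
    """Combine by averaging (regression); classification would use a majority vote."""
    return mean([m(x) for m in models])

# Made-up noisy data roughly following y = 2x.
noise = random.Random(1)
data = [(x, 2 * x + noise.gauss(0, 1)) for x in range(20)]

models = fit_bagged(data)
print(predict_bagged(models, 10.5))
```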
Explain how to select important variables in models.
The following are variable selection methods:
- Remove correlated variables prior to selecting important features.
- Use linear regression and select variables based on p-values.
- Use forward, backward, or stepwise selection.
- Use random forest or XGBoost and obtain variable importances.
- Use Lasso regression to do feature selection automatically.
- Measure information gain for the available set of features and select up to the top n features accordingly.
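As an illustration of the last method in the list, information gain for categorical features can be computed with the standard library alone. The data and feature names below are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def top_n_features(rows, labels, n):
    """Rank features by information gain and keep the best n."""
    gains = {f: information_gain(rows, labels, f) for f in rows[0]}
    return sorted(gains, key=gains.get, reverse=True)[:n]

# Made-up categorical data, for illustration only.
rows = [
    {"device": "mobile", "returning": "yes", "region": "eu"},
    {"device": "mobile", "returning": "no", "region": "eu"},
    {"device": "mobile", "returning": "yes", "region": "us"},
    {"device": "desktop", "returning": "no", "region": "us"},
]
labels = ["buy", "no_buy", "buy", "no_buy"]

print(top_n_features(rows, labels, 2))  # ['returning', 'device']
```

Here `returning` separates the labels perfectly (gain of 1 bit), `device` helps a little, and `region` carries no information about the label.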
What is the expected number of flips of a fair coin to obtain 3 heads in a row?
The geometric dist models the number of independent Bernoulli trials X needed to get the first success.
It gives the proba that the kth trial is the first success when ea trial succeeds with proba p, i.e. P(X=k) = (1-p)^(k-1) * p.
The mean number of trials, A1, needed for the first success is 1/p = 2 for a fair coin (p = .5).
To extend a streak of n-1 heads to n heads we flip once more; with proba 1-p the streak breaks and we start over, so An = A(n-1) + 1 + (1-p)*An, i.e. An = (A(n-1) + 1) / p.
A2 = (1+p) / p^2 = 6
A3 = (1+p+p^2) / p^3 = 14
In general, the expected number of trials until a streak of n successes is E[X] = (p^-n - 1) / (1-p).
E[X] = (.5^-3 - 1) / (1 - .5) = (8 - 1) / .5 = 7 / .5 = 14
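The closed form can be sanity-checked against a quick simulation. This is a rough sketch; the function names, trial count, and seed are arbitrary choices.

```python
import random
from fractions import Fraction

def expected_flips(n, p=Fraction(1, 2)):
    """Closed form E[X] = (p^-n - 1) / (1 - p) for a streak of n successes."""
    return (p ** -n - 1) / (1 - p)

def simulate_flips(n, p=0.5, trials=100_000, seed=0):
    """Average number of flips until n heads in a row, by Monte Carlo simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        streak = flips = 0
        while streak < n:
            flips += 1
            # A head extends the streak; a tail resets it to zero.
            streak = streak + 1 if rng.random() < p else 0
        total += flips
    return total / trials

print(expected_flips(1), expected_flips(2), expected_flips(3))  # 2 6 14
print(simulate_flips(3))  # close to 14
```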