Data Science Flashcards
We’re given two tables: a table of notification deliveries, and a table of users with created and purchase conversion dates. If the user hasn’t purchased, then the conversion_date column is NULL.
`notification_deliveries` table:
column type
notification varchar
user_id int
created_at datetime
`users` table:
column type
user_id int
created_at datetime
conversion_date datetime
- Write a query to get the distribution of total push notifications before a user converts.
- Write a query to get the conversion rate for each notification.
Example Output:
notification conversion_rate
activate_premium 0.05
try_premium 0.03
free_trial 0.11
Problem 1 Possible Solutions:
select ct, count(*)
from (
    select n.user_id, count(notification) as ct
    from notification_deliveries n
    join users u on n.user_id = u.user_id
    where u.conversion_date is not null
      and n.created_at < u.conversion_date
    group by n.user_id
) temp
group by ct
select pushes, count(*)
from (
    select t1.user_id, count(t2.notification) as pushes
    from users t1
    left join notification_deliveries t2
        on t1.user_id = t2.user_id
        and t1.conversion_date >= t2.created_at
    where t1.conversion_date is not null
    group by 1
) tmp2
group by 1
Problem 2 Possible Solutions:
select notification, avg(converted) as conversion_rate
from (
    select *,
        (case when conversion_date is not null then 1 else 0 end) as converted
    from notification_deliveries n
    join users u on n.user_id = u.user_id
) temp
group by notification
select notification,
    sum(case when conversion_date is not null then 1 else 0 end) / count(*) as conversion_rate
from users t1
left join notification_deliveries t2
    on t1.user_id = t2.user_id
    and t1.conversion_date >= t2.created_at
group by 1
A dating website’s schema is represented by a table of people that like other people. The table has three columns: user_id, liker_id (the user_id of the user doing the liking), and the datetime the like occurred.
Write a query to count the number of each liker’s likers (the users that like the likers), if the liker has any.
likes table:
column type
user_id int
created_at datetime
liker_id int
input:
user liker
A B
B C
B D
D E
output:
user count
B 2
D 1
select user_id, count(liker_id) as count
from likes
where user_id in (
    select liker_id from likes group by liker_id
)
group by user_id
select user_id, count(liker_id)
from likes
where user_id in (select distinct liker_id from likes)
group by user_id
order by user_id
Suppose we have a binary classification model that classifies whether or not an applicant should qualify for a loan. Because we are a financial company, we have to provide each rejected applicant with a reason why.
Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?
Given we do not have access to the feature weights, we are unable to tell each applicant which were the highest contributing factors to their application rejection. However, if we have enough results, we can start to build a sample distribution of application outcomes, and then map them to the particular characteristics of each rejection.
For example, if a rejected applicant had a recurring outstanding credit card balance of 10% of their monthly take-home income: if we know that the percentile of this data point falls within the middle of the distribution of rejected applicants, we can be fairly certain it is at least correlated with their rejection outcome. With this methodology, we can outline a few standard factors that may have led to the decision.
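A minimal sketch of that idea in Python (the feature names, distributions, and values below are all hypothetical, assuming we have historical feature values for previously rejected applicants):

```python
import numpy as np

# Hypothetical historical data: feature values for past rejected applicants.
rng = np.random.default_rng(0)
rejected = {
    "credit_utilization": rng.beta(2, 5, 10_000),  # stand-in samples
    "debt_to_income": rng.beta(2, 4, 10_000),
}

# One newly rejected applicant (values made up for illustration).
applicant = {"credit_utilization": 0.10, "debt_to_income": 0.35}

for feature, value in applicant.items():
    # Percentile of this applicant's value among past rejected applicants;
    # values near the middle of the rejected distribution are likely
    # correlated with the rejection outcome.
    pct = (rejected[feature] < value).mean() * 100
    print(f"{feature}: {pct:.0f}th percentile among rejected applicants")
```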
Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.
Example:
ts = [ '2019-01-01', '2019-01-02', '2019-01-08', '2019-02-01', '2019-02-02', '2019-02-05', ]
output = [ ['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05'], ]
from datetime import datetime as dt
from itertools import groupby

ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(ts[0], "%Y-%m-%d")

out = []
# bucket each date by the number of whole 7-day periods since the first date
for k, g in groupby(ts, key=lambda d: (dt.strptime(d, "%Y-%m-%d") - first).days // 7):
    out.append(list(g))
print(out)
from collections import defaultdict
from datetime import datetime as dt, timedelta

curr = dt.strptime(ts[0], '%Y-%m-%d')  # start of the current 7-day window
idx = 0
dic = defaultdict(list)
for i in ts:
    d = dt.strptime(i, '%Y-%m-%d')
    # advance the window start in 7-day steps (keeping the windows anchored
    # to the first timestamp) until this date falls inside the current window
    while (d - curr).days >= 7:
        curr += timedelta(days=7)
        idx += 1
    dic[idx].append(i)
print(list(dic.values()))
Explain what regularization is and why it is useful
- process of adding a penalty to the cost function of a model to shrink the coefficient estimates
- useful by helping to prevent overfitting
- most common forms are L1 (Lasso) and L2 (Ridge)
- advantage of lasso is that it can force coefficients to zero and act as a feature selector (see the sketch below)
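For instance, a minimal scikit-learn sketch on toy data (the alpha values are arbitrary choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# toy data: 10 features, only 3 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

# L1 tends to drive uninformative coefficients exactly to zero
# (feature selection); L2 only shrinks them toward zero
print("lasso zero coefficients:", (lasso.coef_ == 0).sum())
print("ridge zero coefficients:", (ridge.coef_ == 0).sum())
```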
How do you solve for multicollinearity?
-multicollinearity occurs when independent variables in a model are correlated
- problem: makes model difficult to interpret
- standard errors become overinflated and makes some variables statistically insignificant when they should be significant
- solution:
1. remove highly correlated features prior to training the model (using forward or backward selection)
2. use lasso regularization to force coefficients to zero
3. use PCA to reduce the number of features and end up with uncorrelated components
- example:
a. if we have a linear regression model with correlated features X and Z as inputs and Y as output
b. true effect of X on Y is hard to differentiate from the true effect of Z on Y
c. this is because X and Z move together: when we increase X, Z will also increase/decrease
d. coefficient of X can be interpreted as increase in Y for every unit we increase for X while holding Z constant
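To see this numerically, here is a sketch using variance inflation factors (VIF) from statsmodels; the data is synthetic, with Z built as a near-copy of X:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# synthetic data: Z is nearly a linear copy of X, W is independent
rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "X": x,
    "Z": x + rng.normal(scale=0.1, size=500),
    "W": rng.normal(size=500),
})

exog = sm.add_constant(df)  # include an intercept, as a regression would
# rule of thumb: a VIF above ~5-10 signals problematic multicollinearity
for i, col in enumerate(exog.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(exog.values, i), 1))
```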
- what is overfitting?
- why is it a problem in machine learning models?
- what steps can you take to avoid it?
- overfitting occurs when a model fits too closely to the training data and just memorizes it
- Problem: generalizes poorly on future, unseen data
- model hasn’t actually learned the signal (just the noise) and will have near-zero predictive capability
- Methods to reduce overfitting:
- cross validation to estimate the model’s performance on unseen data (see the sketch after this list)
- ensembling techniques that reduce variance (bagging, stacking, blending)
- regularization techniques that add a penalty to the cost function and make the model less flexible
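A minimal cross-validation sketch with scikit-learn (the dataset and depth values are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# an unconstrained tree can memorize the training data; limiting its depth
# makes it less flexible, which often generalizes better under CV
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```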
Explain the difference between generative and discriminative algorithms
Suppose we have a dataset with training input x and labels y.
Generative model: explicitly models the actual distribution of each class.
- It learns the joint probability distribution, p(x,y) and then uses Bayes’ Theorem to calculate p(y|x)
- then picks the most likely label y
- examples: Naive Bayes, Bayesian Networks and Markov Random Fields
Discriminative Model: learns the conditional probability distribution p(y|x) or a direct mapping from inputs x to the class labels y
- models the decision boundary between the classes
- examples: logistic regression, neural networks and nearest neighbors
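A small sketch contrasting the two families on the same synthetic data (the model and data choices are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# generative: Naive Bayes models p(x|y) and p(y), then applies Bayes' theorem
nb = GaussianNB().fit(X_tr, y_tr)
# discriminative: logistic regression models p(y|x) directly
lr = LogisticRegression().fit(X_tr, y_tr)

print("naive bayes accuracy:", nb.score(X_te, y_te))
print("logistic regression accuracy:", lr.score(X_te, y_te))
```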
Explain the bias-variance tradeoff
bias: error caused by oversimplifying your model (underfitting)
variance: error caused by an overly complex model (overfitting)
- there exists a tradeoff because models with low bias will usually have higher variance and vice versa
- key is to minimize both bias and variance
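For squared loss, the tradeoff is captured by the standard decomposition of expected prediction error (with irreducible noise variance \sigma^2):

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{underfitting}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{overfitting}}
  + \sigma^2
```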
is more data always better?
no. this is related to "Big Data hubris": the idea that big data is a substitute for, rather than a supplement to, traditional data collection and analysis
- depends on the problem and quality of data
- if the data you keep collecting is constantly biased in some way, then obtaining more data is not helpful
- keep in mind the tradeoff between having more data and the additional storage, memory, and computational power needed to handle it
what are feature vectors?
feature vector: an n-dimensional vector of numerical features that represents some object and can be viewed as a point in n-dimensional space
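For example (a toy illustration; the object and its features are made up):

```python
import numpy as np

# a house represented as a point in 3-dimensional feature space:
# [square footage, number of bedrooms, age in years]
house = np.array([1500.0, 3.0, 20.0])
print(house.shape)  # (3,)
```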
how do you know if one algorithm is better than others?
better can mean a lot of different things:
- better on training set?
- several training sets?
- faster?
- more space efficient?
the answer depends on the problem, goal and constraints
explain the difference between supervised and unsupervised machine learning
supervised machine learning algorithms: we provide labeled data (e.g. spam or not spam, cat or not cat) so the model can learn the mapping from inputs to labeled outputs.
unsupervised learning: we don’t need labeled data; the goal is to detect patterns or learn representations of the data
-example: detecting anomalies or finding similar groupings of customers
what is the difference between convex and non-convex functions?
convex: has a single minimum
- important: an optimization algorithm (like gradient descent) won’t get stuck in a local minimum
non-convex: has multiple valleys (local minima) that are not as low as the lowest point overall (the global minimum)
- optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens
- explain gradient descent
- what is the difference between a local and a global optimum?
- gradient descent: optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient
- by moving in the direction of the negative gradient, we slowly make our way down to a lower point until we reach the bottom (a local minimum)
- we use gradient descent to update the parameters of our model
- local optimum: solution that is the best among neighboring solutions, but not the best solution overall
global optimum: best solution overall
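A minimal numerical sketch of gradient descent on the convex function f(x) = (x - 3)^2 (the learning rate and starting point are arbitrary choices):

```python
# gradient descent on f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 10.0      # arbitrary starting point
lr = 0.1      # learning rate (step size)
for step in range(100):
    grad = 2 * (x - 3)   # gradient of the cost at the current point
    x -= lr * grad       # step in the direction of steepest descent
print(x)  # converges to the global minimum at x = 3 (f is convex)
```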
suppose you have the following two lists:

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

What does the following piece of code do? How can you make it run faster?

>>> total = 0
>>> for idx, val in enumerate(a):
>>>     total += a[idx] * b[idx]
>>> return total
This is essentially the dot product between two one-dimensional vectors. You can make it faster by using np.dot(np.array(a), np.array(b)) instead.
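For example (assuming NumPy is installed):

```python
import numpy as np

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

# vectorized dot product: NumPy runs the multiply-accumulate loop in C
total = np.dot(np.array(a), np.array(b))
print(total)  # 227556
```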