Data Science Flashcards
We’re given two tables, a table of notification deliveries and a table of users with created and purchase conversion dates. If the user hasn’t purchased then the conversion_date
column is NULL.
`notification_deliveries` table:
column type
notification varchar
user_id int
created_at datetime

`users` table:
column type
user_id int
created_at datetime
conversion_date datetime
- Write a query to get the distribution of total push notifications before a user converts.
- Write a query to get the conversion rate for each notification.
Example Output:
notification conversion_rate
activate_premium 0.05
try_premium 0.03
free_trial 0.11
Problem 1 Possible Solutions:
select ct, count(*)
from (
    select n.user_id, count(notification) as ct
    from notification_deliveries n
    join users u on n.user_id = u.user_id
    where conversion_date is not null
      and n.created_at < u.conversion_date
    group by n.user_id
) temp
group by ct
select pushes, count(*)
from (
    select t1.user_id, count(t2.user_id) as pushes
    from users t1
    left join notification_deliveries t2
      on t1.user_id = t2.user_id
      and t1.conversion_date >= t2.created_at
    where conversion_date is not null
    group by 1
) tmp2
group by 1
Problem 2 Possible Solutions:
select notification, avg(converted)
from (
    select *,
        (case when conversion_date is not null then 1 else 0 end) as converted
    from notification_deliveries n
    join users u on n.user_id = u.user_id
) temp
group by notification
select notification,
    sum(case when conversion_date is not null then 1 else 0 end) * 1.0 / count(*) as conversion_rate
from users t1
left join notification_deliveries t2
  on t1.user_id = t2.user_id
  and t1.conversion_date >= t2.created_at
group by 1
A dating website’s schema is represented by a table of people that like other people. The table has three columns: user_id, liker_id (the user_id of the user doing the liking), and the datetime that the like occurred.
Write a query to count the number of liker’s likers (the users that like the likers) if the liker has one.
likes table:
column type
user_id int
created_at datetime
liker_id int
input:
user liker
A B
B C
B D
D E
output:
user count
B 2
D 1
select user_id, count(liker_id) as count
from likes
where user_id in (select liker_id from likes group by liker_id)
group by user_id
select user_id, count(liker_id)
from likes
where user_id in (select distinct liker_id from likes)
group by user_id
order by user_id
Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company we have to provide each rejected applicant with a reason why.
Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?
Given we do not have access to the feature weights, we are unable to tell each applicant which were the highest contributing factors to their application rejection. However, if we have enough results, we can start to build a sample distribution of application outcomes, and then map them to the particular characteristics of each rejection.
For example, if a rejected applicant had a recurring outstanding credit card balance of 10% of their monthly take-home income: if we know that the percentile of this data point falls within the middle of the distribution of rejected applicants, we can be fairly certain it is at least correlated with their rejection outcome. With this methodology, we can outline a few standard factors that may have led to the decision.
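This percentile comparison can be sketched in numpy. The feature name, the distribution parameters, and the 25th–75th percentile band below are all illustrative assumptions, not from the original:

```python
import numpy as np

# Hypothetical data: outstanding credit-card balance as a fraction of monthly
# take-home income, for a sample of previously rejected applicants.
rng = np.random.default_rng(0)
rejected_balance_ratio = rng.normal(loc=0.10, scale=0.03, size=1000)

applicant_ratio = 0.10  # the rejected applicant in question

# Empirical percentile of this applicant among rejected applicants.
percentile = (rejected_balance_ratio < applicant_ratio).mean() * 100

# If the applicant sits near the middle of the rejected distribution,
# this feature is a plausible (correlated) factor to surface as a reason.
if 25 <= percentile <= 75:
    print("balance ratio is typical of rejected applicants")
```

Repeating this check across each candidate feature would yield a ranked list of standard factors to report.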
Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.
Example:
ts = [ '2019-01-01', '2019-01-02', '2019-01-08', '2019-02-01', '2019-02-02', '2019-02-05', ]
output = [ ['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05'], ]
from datetime import datetime as dt
from itertools import groupby
# full example input from above
inp = ['2019-01-01', '2019-01-02', '2019-01-08',
       '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(inp[0], "%Y-%m-%d")
out = []
for k, g in groupby(inp, key=lambda d: (dt.strptime(d, "%Y-%m-%d") - first).days // 7):
    out.append(list(g))
print(out)
from collections import defaultdict
from datetime import datetime as dt

# note: comparing against the start of the previous group (a rolling window)
# would group '2019-02-05' with '2019-02-01'; weeks must be anchored to the
# first timestamp to match the expected output
ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(ts[0], '%Y-%m-%d')
dic = defaultdict(list)
for i in ts:
    idx = (dt.strptime(i, '%Y-%m-%d') - first).days // 7
    dic[idx].append(i)
print(list(dic.values()))
Explain what regularization is and why it is useful
- process of adding a penalty to the cost function of a model to shrink the coefficient estimates
- useful by helping to prevent overfitting
- most common forms are L1 (Lasso) and L2 (Ridge)
- advantage of lasso is that it can force coefficients to be zero and act as a feature selector
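The shrinkage effect can be sketched with a closed-form ridge (L2) fit in numpy. The synthetic data and penalty values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    # closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=1000.0)

# a larger penalty shrinks the coefficient estimates toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```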
How do you solve for multicollinearity?
-multicollinearity occurs when independent variables in a model are correlated
- problem: makes model difficult to interpret
- standard errors become overinflated and makes some variables statistically insignificant when they should be significant
- solution:
1. remove highly correlated features prior to training the model (using forward or backward selection)
2. use lasso regularization to force coefficients to zero
3. use PCA to reduce the number of features and end up with uncorrelated components
- example:
a. if we have a linear regression model with correlated features X and Z as inputs and Y as output
b. true effect of X on Y is hard to differentiate from the true effect of Z on Y
c. this is because if we increase X, Z will also increase or decrease
d. coefficient of X can be interpreted as the increase in Y for every unit increase in X while holding Z constant
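The problem can be quantified with the variance inflation factor (VIF). A numpy sketch on synthetic data (the correlation structure is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
z = x + rng.normal(scale=0.1, size=n)   # z is highly correlated with x

# VIF for x: regress x on z, then VIF = 1 / (1 - R^2)
coef, *_ = np.linalg.lstsq(z.reshape(-1, 1), x, rcond=None)
residuals = x - z.reshape(-1, 1) @ coef
r_squared = 1 - residuals.var() / x.var()
vif = 1 / (1 - r_squared)

# a VIF well above 10 is a common rule-of-thumb signal of severe multicollinearity
print(round(vif, 1))
```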
- what is overfitting?
- why is it a problem in machine learning models?
- what steps can you take to avoid it?
- overfitting occurs when a model fits too closely to the training data and just memorizes it
- Problem: generalizes poorly on future, unseen data
- model hasn’t actually learned the signal (just the noise) and will have near zero predictive capability
- Methods to reduce:
- cross validation to estimate the model’s performance on unseen data
- ensembling techniques to reduce variance (bagging, stacking, blending)
- regularization techniques that add a penalty to the cost function and make the model less flexible
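A minimal overfitting sketch with numpy's polyfit (synthetic data; the function, noise level, and degrees are illustrative assumptions): the high-degree polynomial nearly memorizes the training points but does worse on held-out points than the low-degree one.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0.03, 0.97, 15)          # held-out points between the training grid

def true_fn(x):
    return np.sin(2 * np.pi * x)

y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.size)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.size)

def errors(deg):
    # fit a polynomial of the given degree and report train/test MSE
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = errors(3)
complex_train, complex_test = errors(14)

# degree 14 interpolates the 15 noisy training points (tiny train error)
# yet generalizes worse than degree 3
print(complex_train < simple_train, complex_test > simple_test)
```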
Explain the difference between generative and discriminative algorithms
Suppose we have a dataset with training input x and labels y.
Generative model: explicitly models the actual distribution of each class.
- It learns the joint probability distribution, p(x,y) and then uses Bayes’ Theorem to calculate p(y|x)
- then picks the most likely label y
- examples: Naive Bayes, Bayesian Networks and Markov Random Fields
Discriminative Model: learns the conditional probability distribution p(y|x) or a direct mapping from inputs x to the class labels y
- models the decision boundary between the classes
- examples: logistic regression, neural networks and nearest neighbors
Explain the bias-variance tradeoff
bias: error caused by oversimplification of your model (underfitting)
variance: error caused by having too complex a model (overfitting)
- there exists a tradeoff because models with low bias will usually have higher variance and vice versa
- key is to find the level of complexity that minimizes total error by balancing bias and variance
is more data always better?
no. related to Big Data hubris, or the idea that big data is a substitute for, rather than a supplement to, traditional data collection and analysis
- depends on the problem and quality of data
- if the data you keep collecting is constantly biased in some way, then obtaining more data is not helpful
- keep in mind the tradeoff between having more data vs dealing with additional storage, increased memory and needing more computational power
what are feature vectors?
feature vector: n-dimensional vector of numerical features that represents some object and can be represented as a point in n-dimensional space
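For instance (the features below are a hypothetical example, chosen for illustration), a house could be represented as a 3-dimensional feature vector:

```python
import numpy as np

# hypothetical features: [square feet, number of bedrooms, age in years]
house = np.array([1500.0, 3.0, 42.0])

# a single point in 3-dimensional space
print(house.shape)
```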
how do you know if one algorithm is better than others?
better can mean a lot of different things:
- better on training set?
- several training sets?
- faster?
- more space efficient?
this answer depends on the problem, goal and constraints
explain the difference between supervised and unsupervised machine learning
supervised machine learning algorithms: we provide labeled data (e.g. spam or not spam, cats or not cats) so the model can learn the mapping from inputs to labeled outputs.
unsupervised learning: we don’t need labeled data and the goal is to detect patterns or learn representations of the data
-example: detecting anomalies or finding similar groupings of customers
what is the difference between convex and non-convex functions?
convex: one minimum
- important: an optimization algorithm (like gradient descent) won’t get stuck in a local minimum
non-convex: has multiple valleys (local minima) that are not as low as the lowest point (global minimum)
-optimization algorithms can get stuck in local minimum and it can be hard to tell when this happens
- explain gradient descent
2. what is the difference between local and global optimum
- gradient descent: optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient
- by moving in the direction of the negative gradient, we slowly make our way down to a lower point until we reach the bottom (a local minimum)
- we use gradient descent to update the parameters of our model
- local optimum: solution that is the best among neighboring solutions, not the best solution overall
- global optimum: best solution overall
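A minimal gradient descent sketch on a simple convex function, f(x) = (x - 3)^2 (the function, learning rate, and step count are illustrative choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # repeatedly step in the direction of the negative gradient
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3); its minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

Because this f is convex, the iterates converge to the global minimum; on a non-convex f the same procedure could settle in a local minimum instead.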
suppose you have the following two lists: a = [42,84,3528,1764], b = [42,42,42,42]. What does the following piece of code do? How can you make it run faster?

total = 0
for idx, val in enumerate(a):
    total += a[idx] * b[idx]
return total
Essentially the dot product between two 1-dimensional vectors. Can use np.dot(np.array(a), np.array(b)) instead.
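A quick check that the loop and the vectorized version agree (a sketch; numpy assumed available):

```python
import numpy as np

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

# the loop from the question: element-wise products, summed
total = 0
for idx, val in enumerate(a):
    total += a[idx] * b[idx]

# the vectorized equivalent
vectorized = np.dot(np.array(a), np.array(b))

print(total, int(vectorized))  # both are 227556
```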
Define the Central Limit Theorem and its importance
CLT: if we repeatedly take independent random samples of size n from a population (for both normal and nonnormal data)
- when n is large, the distribution of the sample means will approach a normal distribution
Importance
- allows us to make inferences from a sample about a population, without needing the characteristics of the whole population
- confidence intervals, hypothesis testing and p-value analysis are all based on the CLT
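A quick simulation sketch (the exponential population and sample size are illustrative choices): even though the population is skewed, the means of many samples concentrate around the population mean.

```python
import numpy as np

rng = np.random.default_rng(4)

# a skewed, non-normal population: exponential with mean 1.0
# take 10,000 independent samples of size n = 50 and compute each sample's mean
sample_means = rng.exponential(scale=1.0, size=(10000, 50)).mean(axis=1)

# the distribution of sample means clusters around the population mean (1.0)
print(round(sample_means.mean(), 2))
```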
Define Law of large numbers and its importance
LLN states that if an experiment is repeated independently a large number of times and you take the average of the results
- average should be close to the expected value (the mathematically derived result)
example:
- toss a coin 42x vs 420000000x: expect the percentage of heads/tails to be closer to 50% for the latter
- implies that large sample sizes are more reflective of reality than small sample sizes
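A coin-toss simulation sketch of this (seed and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

few = rng.integers(0, 2, size=42).mean()          # 42 tosses
many = rng.integers(0, 2, size=1_000_000).mean()  # 1,000,000 tosses

# the million-toss estimate lands within a small distance of 0.5,
# while the 42-toss estimate can wander much further
print(abs(few - 0.5), abs(many - 0.5))
```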
what is the normal distribution? What are some examples of data that follow it?
- also known as the Gaussian distribution
- allows us to perform parametric hypothesis testing
most of the observations cluster around the mean:
- 68% within one standard deviation
- 95.4% within two standard deviations
- 99.7% within three standard deviations
examples: height, weight, shoe size, test scores, blood pressure, daily return of stocks
how do you check if a distribution is close to normal?
- QQ plots: plot two sets of quantiles against one another; if both sets came from the same distribution, the points should form a roughly straight line
- Kolmogorov-Smirnov test
- what is a long tailed distribution?
2. what are some examples of data that follow it? Why is it important in machine learning?
long tailed distribution (or Pareto): when data is clustered around the head and gradually levels off to zero
- large number of occurrences is accounted for by a small number of items
- known as the 80-20 rule: 80% of effects come from 20% of causes
examples: frequency of earthquakes (large number of small magnitude earthquakes, few large magnitude ones), search engines (few keywords that are commonly searched for)
2. important in ML: applied by saying that 20% of the data might be useful, or that 80% of your time will be spent on one part of the data science project (usually data cleaning)
- what is Ax = b?
2. how does one solve it?
- Ax = b is one way to specify a system of linear equations: A is an (m,n) matrix, b is a vector with m entries, x is an unknown vector with n entries (which we are trying to solve for)
- Ax: we are multiplying matrix A and vector x
- Ax = b has a solution iff b is a linear combination of the columns of A
- solution:
- if A is square and invertible, find x by taking the inverse of A: x = A^-1 b
- more generally, we can solve Ax = b by creating an augmented matrix [A b], attaching b to A as a column on the right
- reduce [A b] to reduced row echelon form
- if the system is solvable, any of its solutions will be a solution to your original equation
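A sketch with numpy (the 2x2 system below is an arbitrary invertible example):

```python
import numpy as np

# the system: 2x + y = 5, x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# np.linalg.solve is preferred over explicitly inverting A (more numerically stable)
x = np.linalg.solve(A, b)

print(x)                       # the solution vector
print(np.allclose(A @ x, b))   # verify Ax = b
```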
how does one multiply matrices?
- scalar multiplication: every entry is multiplied by a number(scalar)
- matrix multiplication (dot product): multiply 2 matrices A and B together
- can only be done if the number of columns in matrix A equals the number of rows in matrix B
- if the size of A is a x b and the size of B is b x c, the resulting matrix is a x c
- each entry of the result is the dot product of the corresponding row of A with the corresponding column of B
- matrix multiplication is not commutative (AB != BA)
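The shape rules can be illustrated with numpy (arbitrary small matrices):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])      # shape (3, 2)
B = np.array([[1, 0, 1],
              [0, 1, 1]])   # shape (2, 3)

C = A @ B                   # (3, 2) x (2, 3) -> shape (3, 3)
print(C.shape)

# non-commutative: B @ A has shape (2, 2), a different matrix entirely
print((B @ A).shape)
```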
what are:
a. eigenvalues
b. eigenvectors
A scalar λ is called an eigenvalue of an n x n matrix A if there is a nontrivial solution x of Ax = λx; x is the eigenvector corresponding to the eigenvalue λ
Eigenvectors tell you the directions in which the linear transformation represented by A acts like scalar multiplication
Eigenvalues are the amounts by which those eigenvectors are scaled
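This can be checked with numpy on a small matrix (the diagonal example below is chosen for illustration, since its eigenvalues are easy to read off):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# verify A v = lambda v for each eigenpair (columns of `eigenvectors`)
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(eigenvalues))  # the diagonal entries, 2.0 and 3.0
```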