Data Science Flashcards

1
Q

We’re given two tables: a table of notification deliveries and a table of users with created-at and purchase-conversion dates. If the user hasn’t purchased, then the conversion_date column is NULL.

`notification_deliveries` table:
column	type
notification	varchar
user_id	int
created_at	datetime
`users` table:
column	type
user_id	int
created_at	datetime
conversion_date	datetime
  1. Write a query to get the distribution of total push notifications before a user converts.
  2. Write a query to get the conversion rate for each notification.

Example Output:

notification conversion_rate
activate_premium 0.05
try_premium 0.03
free_trial 0.11

A

Problem 1 Possible Solutions:

select ct, count(*)
from (
    select n.user_id, count(notification) as ct
    from notification_deliveries n
    join users u on n.user_id = u.user_id
    where u.conversion_date is not null
      and n.created_at < u.conversion_date
    group by n.user_id
) temp
group by ct

select pushes, count(*)
from (
    select t1.user_id, count(*) as pushes
    from users t1
    left join notification_deliveries t2
      on t1.user_id = t2.user_id
     and t1.conversion_date >= t2.created_at
    where t1.conversion_date is not null
    group by 1
) tmp2
group by 1

Problem 2 Possible Solutions:

select notification, avg(converted) as conversion_rate
from (
    select *,
           case when conversion_date is not null then 1 else 0 end as converted
    from notification_deliveries n
    join users u on n.user_id = u.user_id
) temp
group by notification

select notification,
       sum(case when conversion_date is not null then 1 else 0 end) * 1.0 / count(*) as conversion_rate
from users t1
left join notification_deliveries t2
  on t1.user_id = t2.user_id
 and t1.conversion_date >= t2.created_at
group by 1

2
Q

A dating website’s schema is represented by a table of people who like other people. The table has three columns: user_id, liker_id (the user_id of the user doing the liking), and the datetime the like occurred.

Write a query to count the number of each liker’s likers (the users that like the likers), if the liker has any.

likes table:

column type
user_id int
created_at datetime
liker_id int

input:

user	liker
A	B
B	C
B	D
D	E

output:

user count
B 2
D 1

A
select user_id, count(liker_id) as count
from likes
where user_id in (
    select liker_id
    from likes
    group by liker_id
)
group by user_id

select user_id, count(liker_id)
from likes
where user_id in (select distinct liker_id from likes)
group by user_id
order by user_id
3
Q

Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company we have to provide each rejected applicant with a reason why.

Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?

A

Given we do not have access to the feature weights, we are unable to tell each applicant which were the highest contributing factors to their application rejection. However, if we have enough results, we can start to build a sample distribution of application outcomes, and then map them to the particular characteristics of each rejection.

For example, suppose a rejected applicant had a recurring outstanding credit card balance equal to 10% of their monthly take-home income. If that data point falls near the middle of the distribution of rejected applicants, we can be fairly certain it is at least correlated with the rejection outcome. With this methodology, we can outline a few standard factors that may have led to the decision (see the sketch below).
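
A minimal sketch of that percentile comparison, assuming we already have a table of features for past rejected applicants; the feature names and values below are hypothetical, not from the card:

import pandas as pd
from scipy.stats import percentileofscore

# Hypothetical features of previously rejected applicants.
rejected = pd.DataFrame({
    "credit_utilization": [0.05, 0.10, 0.12, 0.30, 0.45],
    "debt_to_income":     [0.20, 0.35, 0.40, 0.55, 0.60],
})
# Hypothetical features of the applicant we need to explain.
applicant = {"credit_utilization": 0.11, "debt_to_income": 0.58}

# Features where the applicant looks typical of the rejected population
# (near its median) are candidates for the stated rejection reason.
for feature, value in applicant.items():
    pct = percentileofscore(rejected[feature], value)
    print(f"{feature}: {pct:.0f}th percentile among rejected applicants")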

4
Q

Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.

Example:

ts = [
    '2019-01-01', 
    '2019-01-02',
    '2019-01-08', 
    '2019-02-01', 
    '2019-02-02',
    '2019-02-05',
]
output = [
    ['2019-01-01', '2019-01-02'], 
    ['2019-01-08'], 
    ['2019-02-01', '2019-02-02'],
    ['2019-02-05'],
]
A

from datetime import datetime as dt
from itertools import groupby

ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(ts[0], "%Y-%m-%d")
out = []

# group consecutive timestamps by week index relative to the first timestamp
for k, g in groupby(ts, key=lambda d: (dt.strptime(d, "%Y-%m-%d") - first).days // 7):
    out.append(list(g))

print(out)

from collections import defaultdict
from datetime import datetime as dt

first = dt.strptime(ts[0], '%Y-%m-%d')
dic = defaultdict(list)
for i in ts:
    # week index relative to the first timestamp (7-day buckets)
    idx = (dt.strptime(i, '%Y-%m-%d') - first).days // 7
    dic[idx].append(i)
print(list(dic.values()))
5
Q

Explain what regularization is and why it is useful

A
  • process of adding a penalty to the cost function of a model to shrink the coefficient estimates
  • useful because it helps prevent overfitting
  • most common forms are L1 (Lasso) and L2 (Ridge)
  • advantage of Lasso is that it can force coefficients to zero and act as a feature selector (see the sketch below)
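
A minimal sketch of that contrast with scikit-learn; the synthetic data is an illustrative assumption, not part of the card:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1) tends to set uninformative coefficients exactly to zero
# (feature selection); Ridge (L2) only shrinks them toward zero.
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))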
6
Q

How do you solve for multicollinearity?

A

-multicollinearity occurs when independent variables in a model are correlated

  • problem: makes model difficult to interpret
  • standard errors become overinflated and makes some variables statistically insignificant when they should be significant
  • solution (see the sketch after this list):
    1. remove highly correlated features prior to training the model (using forward or backward selection)
    2. use Lasso regularization to force coefficients to zero
    3. use PCA to reduce the number of features and end up with uncorrelated features
  • example:
    a. if we have a linear regression model with correlated features X and Z as inputs and Y as output
    b. true effect of X on Y is hard to differentiate from the true effect of Z on Y
    c. this is because if we increase X, Z will also increase/decrease
    d. coefficient of X can be interpreted as increase in Y for every unit we increase for X while holding Z constant
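
A minimal sketch of a variance inflation factor (VIF) check with statsmodels; VIF is an added diagnostic not mentioned in the card, and the synthetic data is an illustrative assumption:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    "x": x,
    "z": x + rng.normal(scale=0.1, size=500),  # strongly correlated with x
    "w": rng.normal(size=500),                 # independent feature
})

# VIF for each feature; large values flag collinearity.
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))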
7
Q
  1. what is overfitting?
  2. why is it a problem in machine learning models?
  3. what steps can you take to avoid it?
A
  1. overfitting occurs when a model fits too closely to the training data and just memorizes it
    - generalizes poorly on future, unseen data
  2. Problem: generalizes poorly on future, unseen data
    - model hasn’t actually learned the signal (just the noise) and will have near zero predictive capabilities
  3. Methods to reduce
    - cross validation to estimate the model’s performance on unseen data (see the sketch below)
    - ensembling techniques to reduce variance (bagging, stacking, blending)
    - regularization techniques that add a penalty to the cost function and makes models less flexible
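
A minimal sketch of k-fold cross-validation with scikit-learn; the dataset and model choice are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: each fold is held out once as a validation set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())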
8
Q

Explain the difference between generative and discriminative algorithms

A

Suppose we have a dataset with training input x and labels y.

Generative model: explicitly models the actual distribution of each class.

  • It learns the joint probability distribution, p(x,y) and then uses Bayes’ Theorem to calculate p(y|x)
  • then picks the most likely label y
  • examples: Naive Bayes, Bayesian Networks and Markov Random Fields

Discriminative Model: learns the conditional probability distribution p(y|x) or a direct mapping from inputs x to the class labels y

  • models the decision boundary between the classes
  • examples: logistic regression, neural networks and nearest neighbors (see the sketch below)
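
A minimal sketch (an added illustration, not from the card) fitting one generative and one discriminative classifier on the same synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: models p(x|y) per class, then applies Bayes' theorem for p(y|x).
gnb = GaussianNB().fit(X_tr, y_tr)
# Discriminative: models p(y|x) directly via the decision boundary.
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("Gaussian Naive Bayes accuracy:", gnb.score(X_te, y_te))
print("Logistic regression accuracy:", lr.score(X_te, y_te))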
9
Q

Explain the bias-variance tradeoff

A

bias: error caused by oversimplifying your model (underfitting)
variance: error caused by having an overly complex model (overfitting)

  • there exists a tradeoff because models with low bias will usually have higher variance, and vice versa
  • the key is to minimize both bias and variance
10
Q

is more data always better?

A

no. This relates to Big Data hubris, the idea that big data is a substitute for, rather than a supplement to, traditional data collection and analysis

  • depends on the problem and quality of data
  • if the data you keep collecting is constantly biased in some way, then obtaining more data is not helpful
  • keep in mind the tradeoff between having more data vs dealing with additional storage, increased memory and needing more computational power
11
Q

what are feature vectors?

A

feature vector: an n-dimensional vector of numerical features that represents some object and can be treated as a point in n-dimensional space

12
Q

how do you know if one algorithm is better than others?

A

better can mean a lot of different things:

  • better on training set?
  • several training sets?
  • faster?
  • more space efficient?

this answer depends on the problem, goal and constraints

13
Q

explain the difference between supervised and unsupervised machine learning

A

supervised machine learning algorithms: we provide labeled data (e.g., spam or not spam, cats or not cats) so the model can learn the mapping from inputs to labeled outputs.

unsupervised learning: we don’t need labeled data; the goal is to detect patterns or learn representations of the data
-example: detecting anomalies or finding similar groupings of customers

14
Q

what is the difference between convex and non-convex functions?

A

convex: one minimum
- important: an optimization algorithm (like gradient descent) won’t get stuck in a local minimum

non-convex: has valleys (local minima) that are not as low as the overall lowest point (global minimum)
-optimization algorithms can get stuck in a local minimum, and it can be hard to tell when this happens

15
Q
  1. explain gradient descent

2. what is the difference between local and global optimum

A
  1. gradient descent: optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient (see the sketch below)
    - by moving in the direction of the negative gradient, we slowly make our way down to a lower point until we reach the bottom (possibly a local minimum)
    - we use gradient descent to update the parameters of our model
  2. local optimum: a solution that is the best among neighboring solutions, but not the best solution overall
    global optimum: the best solution overall
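
A minimal sketch (an added illustration) of gradient descent on the one-dimensional function f(x) = (x - 3)^2, whose gradient is 2(x - 3):

def gradient_descent(lr=0.1, steps=100):
    x = 0.0  # starting point
    for _ in range(steps):
        grad = 2 * (x - 3)   # gradient of (x - 3)^2
        x -= lr * grad       # step in the direction of the negative gradient
    return x

print(gradient_descent())  # converges toward the global minimum at x = 3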
16
Q
suppose you have the following two lists: 
a = [42,84,3528,1764]
b=[42,42,42,42]
What does the following piece of code do?
How can you make it run faster?
>>>total = 0
>>>for idx, val in enumerate(a):
>>>     total += a[idx]*b[idx]
>>>return total
A

Essentially the dot product between two 1-dimensional vectors. Can use np.dot(np.array(a), np.array(b)) instead (see the check below).
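A quick check (added for illustration) that the loop and the vectorized version agree:

import numpy as np

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

total = 0
for idx, val in enumerate(a):
    total += a[idx] * b[idx]

print(total)                              # loop version
print(np.dot(np.array(a), np.array(b)))   # vectorized dot product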

17
Q

Define the Central Limit Theorem and its importance

A

CLT: if we repeatedly take independent random samples of size n from a population (whether normal or non-normal),
-then when n is large, the distribution of the sample means approaches a normal distribution

Importance

  • allows us to make inferences from a sample about a population, without needing the characteristics of the whole population.
  • confidence intervals, hypothesis testing and p-value analysis are all based on the CLT (a quick simulation is sketched below)
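
A quick simulation (an added illustration) of the CLT using a clearly non-normal population; the exponential distribution and sample sizes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# 5,000 samples of size n = 50 from a skewed (exponential) population;
# the distribution of the sample means is approximately normal.
sample_means = rng.exponential(scale=2.0, size=(5_000, 50)).mean(axis=1)
print(sample_means.mean(), sample_means.std())
# A histogram of sample_means would look roughly bell-shaped around 2.0.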
18
Q

Define Law of large numbers and its importance

A

LLN states that if an experiment is repeated independently a large number of times and you take the average of the results,
-the average should be close to the expected value (the mathematically expected result)

example:

  • toss a coin 42x vs. 420,000,000x: expect the percentage of heads/tails to be closer to 50% for the latter (see the simulation below)
  • implies that large sample sizes are more reflective of reality than small sample sizes
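
A quick simulation (an added illustration) of the coin-toss example; the toss counts are illustrative:

import numpy as np

rng = np.random.default_rng(0)
for n in (42, 4_200, 420_000):
    heads = rng.integers(0, 2, size=n).sum()
    print(n, heads / n)  # fraction of heads gets closer to 0.5 as n grows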
19
Q

what is the normal distribution? What are some examples of data that follow it?

A

-also known as the Gaussian distribution
-allows us to perform parametric hypothesis testing
-most of the observations cluster around the mean:
-68% within one standard deviation
-95.4% within two standard deviations
-99.7% within three standard deviations

examples: height, weight, shoe size, test scores, blood pressure, daily returns of stocks

20
Q

how do you check if a distribution is close to normal

A
  1. QQ plot: plots two sets of quantiles against one another; if both sets of quantiles came from the same distribution, the points should form a roughly straight line
  2. Kolmogorov-Smirnov test (both checks are sketched below)
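
A minimal sketch of both checks with scipy and matplotlib; the synthetic sample is an illustrative assumption:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=500)  # synthetic sample to check

# QQ plot against the normal distribution: points near the reference line
# suggest approximate normality.
stats.probplot(data, dist="norm", plot=plt)
plt.show()

# Kolmogorov-Smirnov test against a standard normal (in general the sample
# should be standardized first); a small p-value suggests non-normality.
stat, p_value = stats.kstest(data, "norm")
print(stat, p_value)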
21
Q
  1. what is a long tailed distribution?

2. what are some examples of data that follow it? Why is it important in machine learning?

A

long-tailed distribution (e.g., Pareto): data is clustered around the head and gradually levels off toward zero

  • a large number of occurrences is accounted for by a small number of items
  • known as the 80-20 rule: roughly 80% of effects come from 20% of causes

examples: frequency of earthquakes (a large number of small-magnitude earthquakes, few large-magnitude ones), search engines (a few keywords account for most searches)
2. importance in ML: for example, only about 20% of the data might be useful, or 80% of your time will be spent on one part of the data science project (usually data cleaning)

22
Q
  1. what is Ax = b?

2. how does one solve it?

A
  1. Ax = b is one way to specify a system of linear equations: A is an (m, n) matrix, b is a vector with m entries, and x is an unknown vector with n entries (which we are trying to solve for)
    - Ax means multiplying the matrix A by the vector x
    - Ax = b has a solution iff b is a linear combination of the columns of A
  2. solution (see the sketch below)
    - if A is square and invertible, find x by taking the inverse of A: x = A^-1 b
    - more generally, we can solve Ax = b by creating an augmented matrix [A b], attaching b to A as a column on the right
    - reduce [A b] to reduced row echelon form
    - if the system is solvable, any of its solutions will be a solution to your original equation
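
A minimal sketch (an added illustration) of solving Ax = b numerically with NumPy for a small invertible A:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)     # preferred over explicitly inverting A
print(x)
print(np.allclose(A @ x, b))  # verify the solution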
23
Q

how does one multiply matrices?

A
  1. scalar multiplication: every entry is multiplied by a number (scalar)
  2. matrix multiplication (dot product): multiply two matrices A and B together
    - only possible if the number of columns in matrix A equals the number of rows in matrix B
    - if the size of A is a x b and the size of B is b x c, the result has size a x c
    - each entry of the result is computed by multiplying corresponding members of a row of A and a column of B and summing them up
    - matrix multiplication is not commutative (AB != BA) (see the sketch below)
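
A quick check (an added illustration) of the shape rule and non-commutativity with NumPy:

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # shape (3, 2)

print((A @ B).shape)   # (2, 2): (a x b)(b x c) -> (a x c)
print((B @ A).shape)   # (3, 3): a different shape, so AB != BA here
print(3 * A)           # scalar multiplication: every entry times 3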
24
Q

what are:

a. eigenvalues
b. eigenvectors

A

A scalar λ is called an eigenvalue of an n x n matrix A if there is a nontrivial solution x of Ax = λx; x is the eigenvector corresponding to the eigenvalue λ

Eigenvectors tell you the directions in which the linear transformation represented by A acts like scalar multiplication

Eigenvalues are the amounts by which those eigenvectors are scaled (see the sketch below)
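A minimal sketch (an added illustration) of computing eigenvalues and eigenvectors with NumPy and checking that Av = λv:

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):  # columns of the result are eigenvectors
    print(lam, np.allclose(A @ v, lam * v))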

25
Q

Given 2 fair dice, what’s the probability of getting scores that sum to 4? To 7?

A

When you roll 2 dice, there are 36 possible combinations:
-combinations that sum to 4:
(1,3), (2,2), (3,1) = 3/36 = 1/12
-combinations that sum to 7:
(1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6/36 = 1/6
(a brute-force check is sketched below)
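
A brute-force check (an added illustration) that enumerates all 36 outcomes:

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))
for target in (4, 7):
    count = sum(1 for a, b in outcomes if a + b == target)
    print(target, Fraction(count, len(outcomes)))  # 1/12 and 1/6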

26
Q

A jar has 1000 coins. 999 are fair and 1 is double-headed. You pick a coin at random and toss it 10 times and they all come up heads. What’s the probability that the next toss is also heads?

A

Bayes Theorem
-P(double-headed coin | 10 heads) =
(P(10 heads | double-headed coin) * P(double-headed coin)) /
(P(10 heads | double-headed coin) * P(double-headed coin) + P(10 heads | fair coin) * P(fair coin))

  • P(10 heads | double-headed coin) = 1
  • P(double-headed coin) = 1/1000 = .001
  • P(10 heads | fair coin) = (1/2)^10
  • P(fair coin) = 999/1000 = .999

-plug into the formula above:
(1 * .001) / (1 * .001 + .999 * .5^10)
= .5062

-so after 10 heads the coin is double-headed with probability ≈ .5062 and fair with probability ≈ .4938; the probability that the next toss is heads is therefore
.5062 * 1 + .4938 * .5 ≈ .7531
(see the numeric check below)
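
A quick numeric check (an added illustration) of the computation above:

p_dh = 0.001                      # prior: double-headed coin
p_fair = 0.999                    # prior: fair coin
p_10h_dh = 1.0                    # P(10 heads | double-headed)
p_10h_fair = 0.5 ** 10            # P(10 heads | fair)

posterior_dh = (p_10h_dh * p_dh) / (p_10h_dh * p_dh + p_10h_fair * p_fair)
p_next_heads = posterior_dh * 1.0 + (1 - posterior_dh) * 0.5
print(posterior_dh, p_next_heads)  # ~0.5062 and ~0.7531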

27
Q

You are offered a contract on a piece of land that is worth $800,000 50% of the time, $300,000 30% of the time and $100,000 20% of the time. The contract allows you to pay X dollars for a land appraisal and then you can decide whether or not to pay $200,000 for the land. How much is the contract worth? And what is X?

A

50% of the time you will make:
$600,000 ($800,000 - $200,000)

30% of the time you will make:
$100,000 ($300,000 - $200,000)

20% of the time you will lose:
-$100,000 ($100,000 - $200,000)

The value of the contract without an appraisal is
.5 * 600,000 + .3 * 100,000 - .2 * 100,000
= 310,000

If you do pay X to determine the land’s value, you don’t buy the land if it’s worth less than $200,000 (so 20% of the time you don’t buy)

Average profit will be .5 * 600,000 + .3 * 100,000
= 330,000

You will not be willing to pay more than $20,000 for an appraisal (330,000 - 310,000) or else you will be at a loss

28
Q

Suppose a life insurance company sells a $240,000 policy with a one year term to a 24 year old woman for $240. The estimated probability that she survives the year is .999562. What is the expected value of this policy for the insurance company?

A

P(company gains money) = .999562
Amount of money the company gains = $240 (the premium)

P(company loses money) = .000438
Amount of money the company loses = $240,000 - $240 = $239,760

Expected value of the policy = ($240 * .999562) - ($239,760 * .000438)
= $134.88

29
Q

Suppose a disease has a 42% death rate. What’s the probability that exactly 4 out of 12 randomly selected patients survive?

A

Binomial distribution with n = 12 and survival probability p = 1 - .42 = .58:
P(X = x) = C(n, x) p^x (1-p)^(n-x)
P(X = 4) = C(12, 4) (.58)^4 (.42)^8 ≈ .054
(see the check below)
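
A quick check (an added illustration) of that probability with scipy:

from scipy.stats import binom

# exactly 4 of 12 survive, where the survival probability is 1 - 0.42 = 0.58
print(binom.pmf(4, 12, 0.58))  # ~0.054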

30
Q

A roulette wheel has 38 slots, 18 are red, 18 are black and 2 are green. You play 42 games and always bet on red. What is the probability that you win all 42 games?

A

(18/38)^42

31
Q

Walk me through the steps on how you would set up an A/B test

A

Define the Objective:
-choose one metric to focus on and state hypothesis

Create the Control and Test

  • control: the existing feature or website you want to test against
  • e.g., the control might be the website you have right now, and the test is another version that has something different; you want to see whether that difference is significant

Collect the Data:

  • split sample size equally and randomly
  • then record outcomes

-usually the number of users participating in an A/B test is a small portion of the total users; the sample size you decide on determines how long you will have to wait until you have collected enough data

Analyze the Results:
-accept or reject the null hypothesis and determine whether the results were significant (see the sketch below)
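
A minimal sketch (an added illustration) of analyzing a conversion-rate A/B test with a two-proportion z-test; the counts below are made-up assumptions:

from statsmodels.stats.proportion import proportions_ztest

conversions = [420, 480]      # control, variant
visitors = [10_000, 10_000]

stat, p_value = proportions_ztest(conversions, visitors)
print(stat, p_value)  # reject the null at alpha = 0.05 if p_value < 0.05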

32
Q

Compare R and Python

A

R:

  • focuses on user-friendly data analysis, statistics and graphical models
  • large number of packages
  • mainly used when analysis is performed on a single server

Python

  • interpreted and object oriented language
  • general purpose
  • huge ecosystem and community support
  • simple and easy to understand
33
Q

What libraries for data analysis do you know in Python/R and what are they used for?

A

R:

  • dplyr: data manipulation, wrangling
  • plyr: data manipulation, great for splitting data apart
  • ggplot2: creating data visualizations
  • tidyverse: a collection of packages for data science, includes ggplot2, dplyr, tidyr, etc
  • caret: classification and regression training

Python:

  • pandas: data analysis, manipulation tool, helps organize data in tabular form
  • numpy: working with arrays
  • Scikit-learn: machine learning
  • matplotlib: creating graphs and charts
34
Q

What are constraints in SQL

A

-constraints: used to specify rules concerning data in the table; they can be applied to single or multiple fields in a table.

constraints include:
-NOT NULL
-CHECK [verifies all values in a field satisfy a condition]
-DEFAULT [automatically assigns a default value if no value has been specified for the field]
-UNIQUE [ensures only unique values are inserted into the field]
-PRIMARY KEY [uniquely identifies each record in a table]
-FOREIGN KEY [ensures referential integrity for a record in another table]

35
Q

What is a primary key?

A
  • uniquely identifies each row in a table
  • unique value and not null
  • a table is restricted to having only one primary key, which may be comprised of a single field or multiple fields
36
Q

What is a foreign key?

A
  • consists of a single field or a collection of fields in a table that refers to the primary key in another table and is used to link the two tables together
  • the table containing the foreign key is called the child table
  • the table containing the candidate key is called the parent (referenced) table
37
Q

Super Key

A

-a group of one or more keys that identifies rows in a table
-may have additional attributes that are not needed for unique identification
-ex: EmpSSN and EmpNum are super keys

38
Q

What is a join?

A

A join combines records (rows) from two or more tables in a SQL database based on a related column between the tables:

  • Inner Join: retrieves records that have matching values in both tables involved in the join
  • Left Outer Join: retrieves all records from the left table and the matched records from the right table
  • Right Outer Join: retrieves all records from the right table and the matched records from the left table
  • Full Outer Join: retrieves all records where there is a match in either the left or right table

39
Q

what is a self join?

A

a special case of a regular join where a table is joined to itself based on some relation between its own column(s); it uses an inner join or left join and a table alias (to assign a different name to the table within the query)

40
Q

What is a query?

A

-request for data or information from a database table or multiple tables

41
Q

What is a subquery?

A

-a query within another query (nested query), used to return data to the main query as a condition to restrict the data to be retrieved

42
Q

Let’s say that your company is running a standard control and variant AB test on a feature to increase conversion rates on the landing page. The PM checks the results and finds a .04 p-value.

How would you assess the validity of the result?

A

It is always important to clarify assumptions about the question upfront. In this particular question, clarifying the context of how the AB test was set up and measured will specifically draw out the solutions that the interviewer wants to hear.

If we have an AB test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an AB test correctly?

Let’s start out by answering the first part of figuring out the validity of the set up of the AB test.

  1. How were the user groups separated?

Can we determine that the control and variant groups were sampled accordingly to the test conditions? If we’re testing changes to a landing page to increase conversion, can we compare the two different users in the groups to see different metrics in which the distributions should look the same?

For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar, or is the variant A traffic coming primarily from Facebook ads and the variant B traffic from email? If testing group B has more traffic coming from email, then that could bias the test.

  2. Were the variants equal in all other aspects?

The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.

If variant A’s landing page has a picture of the Eiffel Tower and the submit button at the top of the page, and variant B’s landing page has a large picture of an ugly man and the submit button at the bottom of the page, then we could get conflicting results based on the change to multiple features.

Measurement

Looking at the actual measurement of the p-value, the industry-standard threshold is .05, which means that if there were truly no difference between the populations, we would see a result this extreme only about 1 time in 20. However, we have to note a couple of things about the test in the measurement process.

What was the sample size of the test?
Additionally, how long did it take before the product manager measured the p-value?
Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?
If the product manager ran a T-test with a small sample size, they could easily get a p-value under 0.05. Many times, the source of confusion in AB testing is how much time you need before you can draw a conclusion about the results of an experiment.

The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.

The most important reason is that we are performing a statistical test every time we compute a p-value, and the more often we do it, the more likely we are to find a spurious effect.

How long should we recommend an experiment to run for, then? To prevent a false negative (a Type II error), the best practice is to determine the minimum effect size that we care about and to compute, based on the sample size (the number of new samples that come in every day) and the certainty you want, how long to run the experiment for, before starting the experiment (a power-analysis sketch is below).
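
A minimal sketch (an added illustration) of sizing an A/B test up front with a power analysis via statsmodels; the baseline rate, minimum detectable effect, and daily traffic figures are made-up assumptions:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10            # assumed current conversion rate
minimum_detectable = 0.11  # smallest lift we care about detecting
effect_size = abs(proportion_effectsize(baseline, minimum_detectable))

# Required sample size per group for 80% power at alpha = 0.05.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
daily_users_per_group = 500  # assumed traffic routed into each variant per day
print(round(n_per_group), "users per group,",
      round(n_per_group / daily_users_per_group), "days")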