Application-Based Data Science Questions Flashcards
Q: You randomly draw a coin from 100 coins: 1 unfair coin (heads on both sides) and 99 fair coins (heads and tails). You flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?
Let A be the event of picking the unfair coin and B the event of flipping 10 heads in a row. Then P(A) = 0.01, P(¬A) = 0.99, P(B|A) = 1, and P(B|¬A) = 0.5^10.
By Bayes' theorem, P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|¬A)P(¬A)] = 0.01 / (0.01 + 0.5^10 × 0.99) ≈ 0.9118, or 91.18%.
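The Bayes' theorem calculation above can be checked with a short sketch. The function name `posterior_unfair` and its parameters are illustrative choices, not part of the original card; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def posterior_unfair(n_coins=100, n_unfair=1, n_heads=10):
    """P(unfair coin | n_heads heads in a row), by Bayes' theorem."""
    p_a = Fraction(n_unfair, n_coins)             # P(A): drew the double-headed coin
    p_not_a = 1 - p_a                             # P(not A): drew a fair coin
    p_b_given_a = Fraction(1)                     # double-headed coin always shows heads
    p_b_given_not_a = Fraction(1, 2) ** n_heads   # fair coin: 0.5^n_heads
    evidence = p_b_given_a * p_a + p_b_given_not_a * p_not_a  # P(B)
    return p_b_given_a * p_a / evidence


posterior = posterior_unfair()
print(posterior, f"{float(posterior):.4f}")  # 1024/1123 0.9118
```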
Q: There is a building with 100 floors. You are given 2 identical eggs. How do you use the 2 eggs to find the threshold floor N, where an egg will definitely break when dropped from floor N or any floor above it?
More specifically, the question is asking for the optimal method of finding the threshold floor given two eggs, meaning the one that needs the fewest drops in the worst case.
To get a better understanding of the question, let’s assume that you only have one egg. To find the threshold floor, you would simply start at floor one, drop the egg, and go one floor higher at a time until the egg cracks.
Now imagine that we have unlimited eggs. The optimal method of finding the threshold floor is a binary search. First, you would start on the 50th floor. If the egg cracks, you would drop an egg on the 25th floor; if it doesn’t crack, you would drop an egg on the 75th floor. You would repeat this process until you find the threshold floor.
With two eggs, the optimal method of finding the threshold floor is a hybrid of the two solutions above.
For example, you could drop the first egg every 5 floors until it breaks and then use the second egg to find out which floor within that increment of 5 is the threshold floor. In the worst-case scenario, this would take 24 drops: 20 drops with the first egg (floors 5, 10, …, 100) plus 4 with the second.
If you dropped the first egg every 10 floors until it breaks, it would take 19 drops in the worst-case scenario (10 with the first egg plus 9 with the second), which is much better than dropping the first egg every 5 floors. But what if you wanted to do better?
This is where the concept of minimizing the maximum regret comes into play. The idea is that each time the first egg survives, you decrease the increment (how many floors you skip) by one, since there are fewer possible floors that the threshold floor can be. Each drop used on the first egg is then offset by one fewer floor left for the second egg to check, so the worst case stays the same wherever the first egg breaks. This means that if your first drop is on floor n, then your second drop should be on floor n + (n-1), assuming that the egg doesn’t break. Since the drops have to cover all 100 floors, this can be written as the following inequality:
n + (n-1) + (n-2) + … + 1 ≥ 100
To take it a step further, this can be simplified to:
n(n+1)/2 ≥ 100
Solving for n gives n ≥ 13.65, so the smallest whole number that works is n = 14. Therefore, your strategy would be to start at floor 14, then 14+13, then 14+13+12, and so on until the first egg breaks, and then use the second egg to find the threshold floor one floor at a time. This takes at most 14 drops.
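The drop counts quoted in this answer can be reproduced with a small sketch. The helper names (`worst_case_fixed_step`, `smallest_first_drop`, `drop_schedule`) are hypothetical, and the fixed-step helper assumes the step divides the number of floors evenly, as 5 and 10 do.

```python
def worst_case_fixed_step(step, floors=100):
    """Worst-case drops when the first egg is dropped every `step` floors.

    Assumes `step` divides `floors` evenly (true for 5 and 10 with 100 floors).
    """
    first_egg_drops = floors // step   # first egg breaks only on its last drop
    second_egg_drops = step - 1        # then check the floors in the last gap one by one
    return first_egg_drops + second_egg_drops


def smallest_first_drop(floors=100):
    """Smallest n with n + (n-1) + ... + 1 >= floors, i.e. n(n+1)/2 >= floors."""
    n = 1
    while n * (n + 1) // 2 < floors:
        n += 1
    return n


def drop_schedule(floors=100):
    """Floors for the first egg: start at n, then shrink the step by one each time."""
    step = smallest_first_drop(floors)
    schedule, floor = [], 0
    while floor < floors and step > 0:
        floor = min(floor + step, floors)
        schedule.append(floor)
        step -= 1
    return schedule


print(worst_case_fixed_step(5))    # 24
print(worst_case_fixed_step(10))   # 19
print(smallest_first_drop())       # 14
print(drop_schedule()[:4])         # [14, 27, 39, 50]
```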
Q: We have two options for serving ads within Newsfeed. Option 1: out of every 25 stories, one will be an ad. Option 2: every story has a 4% chance of being an ad. For each option, what is the expected number of ads shown in 100 news stories?
The expected number of ads for both options is 4 out of 100.
For Option 1, 1/25 is equivalent to 4/100.
For Option 2, 4% of 100 stories is 4 ads.
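A quick sketch of the arithmetic, with one extra observation that is not in the original answer: the two options share the same mean but not the same spread. It assumes Option 1 places exactly one ad in each block of 25 stories and that Option 2 treats stories as independent, so the ad count is binomial.

```python
from fractions import Fraction

n_stories = 100
p_ad = Fraction(4, 100)   # 4% chance per story under Option 2

# Option 1 (assumed: exactly one ad per block of 25 stories), so the count is fixed.
expected_option_1 = n_stories // 25
variance_option_1 = 0

# Option 2 (assumed: independent stories), so the count is Binomial(n, p).
expected_option_2 = n_stories * p_ad               # mean n * p
variance_option_2 = n_stories * p_ad * (1 - p_ad)  # variance n * p * (1 - p)

print(expected_option_1, expected_option_2)        # 4 4
print(float(variance_option_2))                    # 3.84
```

Under these assumptions, Option 2 shows 4 ads on average but can show more or fewer in any given run of 100 stories.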
Q: How do you prove that males are on average taller than females, knowing only gender and height?
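You can use hypothesis testing to assess whether males are taller on average than females. Strictly speaking, a test does not prove the claim; it tells you whether the data provide strong evidence for it.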
The null hypothesis would state that males and females are the same height on average, while the alternative hypothesis would state that the average height of males is greater than the average height of females.
Then you would collect a random sample of heights of males and females and use a one-sided two-sample t-test to determine whether to reject the null hypothesis.
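The test described above might be sketched as follows, using only the standard library. The heights are synthetic (drawn from normal distributions with made-up means and spreads, not real measurements), and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples. With SciPy available, `scipy.stats.ttest_ind` would be the more usual choice.

```python
import random
from statistics import NormalDist, mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the alternative mean(sample_a) > mean(sample_b)."""
    standard_error = (
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    ) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Synthetic data for illustration only: the means (175, 162) and spread (7) are assumptions.
rng = random.Random(0)
male_heights = [rng.gauss(175, 7) for _ in range(200)]
female_heights = [rng.gauss(162, 7) for _ in range(200)]

t_stat = welch_t(male_heights, female_heights)
# One-sided p-value, approximating the t distribution by a standard normal (large n).
p_value = 1 - NormalDist().cdf(t_stat)
reject_null = p_value < 0.05

print(f"t = {t_stat:.2f}, reject null at 5%: {reject_null}")
```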
Q: If 70% of Facebook users on iOS use Instagram, but only 35% of Facebook users on Android use Instagram, how would you investigate the discrepancy?
There are a number of possible variables that could cause such a discrepancy, which I would check:
* The demographics of iOS and Android users might differ significantly. For example, according to Hootsuite, 43% of females use Instagram as opposed to 31% of men. If the proportion of female users for iOS is significantly larger than for Android, then this can explain the discrepancy (or at least a part of it). The same can be said for age, race, ethnicity, location, etc.
* Behavioral factors can also have an impact on the discrepancy. If iOS users use their phones more heavily than Android users, it’s more likely that they’ll use Instagram and other apps than someone who spends significantly less time on their phone.
* Another possible factor to consider is how Google Play and the App Store differ. For example, if Android users have significantly more apps (and social media apps) to choose from, that may cause greater dilution of users.
* Lastly, any differences in the user experience can deter Android users from using Instagram compared to iOS users. If the app is more buggy for Android users than iOS users, they’ll be less likely to be active on the app.
Q: Likes/user and minutes spent on a platform are increasing but the total number of users is decreasing. What could be the root cause of it?
Generally, you would want to probe the interviewer for more information but let’s assume that this is the only information that he/she is willing to give.
Focusing on likes per user, there are two reasons why this would have gone up. The first reason is that the engagement of users has generally increased on average over time. This makes sense because as time passes, active users are more likely to be loyal users as using the platform becomes a habit. The other reason why likes per user would increase is that the denominator, the total number of users, is decreasing. Assuming that the users who stop using the platform are inactive users, that is, users with little engagement and fewer likes than average, their departure would increase the average number of likes per user.
The explanation above can also be applied to minutes spent on the platform. Active users are becoming more engaged over time, while users with little usage are becoming inactive. Overall, the gain in engagement among the remaining users outweighs the loss of the minutes contributed by users with little engagement.
To take it a step further, it’s possible that the ‘users with little engagement’ are bots. Over time, Facebook has been able to develop algorithms to spot and remove bots. If there were a significant number of bots before, their removal could potentially be the root cause of this phenomenon.
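The denominator effect in this answer can be illustrated with a tiny worked example. The user counts and like counts below are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def likes_per_user(groups):
    """Average likes per user for a list of (user_count, likes_per_user) groups."""
    total_users = sum(count for count, _ in groups)
    total_likes = sum(count * likes for count, likes in groups)
    return total_likes / total_users


# Hypothetical platform: 60 engaged users (10 likes each), 40 barely active users (1 like each).
before = likes_per_user([(60, 10), (40, 1)])
# The 40 low-engagement users (or bots) leave; nobody's behavior changes.
after = likes_per_user([(60, 10)])

print(before, after)  # 6.4 10.0
```

Total users fall from 100 to 60 while likes per user rise, even though no individual user became more engaged.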
Q: Facebook sees that likes are up 10% year over year. Why could this be?
The total number of likes in a given year is a function of the total number of users and the average number of likes per user (which I’ll refer to as engagement).
Some potential reasons for an increase in the total number of users are the following: users acquired due to international expansion and younger age groups signing up for Facebook as they get older.
Some potential reasons for an increase in engagement are an increase in usage of the app from users that are becoming more and more loyal, new features and functionality, and an improved user experience.
Q: If a PM says that they want to double the number of ads in Newsfeed, how would you figure out if this is a good idea or not?
You can perform an A/B test by randomly splitting the users into two groups: a control group with the normal number of ads and a test group with double the number of ads. Then you would choose the metric that defines what a “good idea” is. For example, we can say that the null hypothesis is that doubling the number of ads has no impact on the time spent on Facebook, and the alternative hypothesis is that doubling the number of ads reduces the time spent on Facebook. However, you can choose a different metric like the number of active users or the churn rate. Then you would run the test and check whether the observed difference is statistically significant to decide whether to reject the null.
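One way to carry out the significance check is a permutation test, sketched below with the standard library. This is a swapped-in illustration, not necessarily the test the answer has in mind; the function name `permutation_p_value`, the metric (minutes per user), and the synthetic data are all assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(control, treatment, n_permutations=2000, seed=0):
    """One-sided p-value for the alternative mean(treatment) < mean(control)."""
    rng = random.Random(seed)
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_low = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel users at random, as if the null were true
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff <= observed:
            at_least_as_low += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_low + 1) / (n_permutations + 1)


# Synthetic minutes-per-user data for illustration only (means and spread are made up).
data_rng = random.Random(1)
control_minutes = [data_rng.gauss(30, 10) for _ in range(300)]
treatment_minutes = [data_rng.gauss(28, 10) for _ in range(300)]

p_value = permutation_p_value(control_minutes, treatment_minutes)
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value would suggest that doubling the ads reduced time spent; a large one means the test found no evidence of a drop, which is not the same as proving there is none.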
Q: There’s a game where you are given two fair six-sided dice and asked to roll. If the sum of the values on the dice equals seven, then you win $21. However, you must pay $5 to play each time you roll both dice. Do you play this game?
The probability of rolling a 7 is 1/6 (6 of the 36 equally likely outcomes).
This means that on average you play six times, paying $30 (5*6), for each $21 win.
Take these two numbers and the expected payout is -$9 (21-30) per six plays, or -$1.50 per play.
Since the expected payout is negative, you would not want to play this game.
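The expected payout can be verified by enumerating all 36 outcomes. The variable names are illustrative; exact fractions keep the result free of rounding.

```python
from fractions import Fraction
from itertools import product

PRIZE = 21   # dollars won when the dice sum to seven
COST = 5     # dollars paid per play

rolls = list(product(range(1, 7), repeat=2))   # all 36 equally likely outcomes
p_seven = Fraction(sum(1 for a, b in rolls if a + b == 7), len(rolls))

expected_net_per_play = p_seven * PRIZE - COST
expected_net_per_six_plays = 6 * expected_net_per_play

print(p_seven)                       # 1/6
print(expected_net_per_play)         # -3/2
print(expected_net_per_six_plays)    # -9
```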
Q: If there are 8 marbles of equal weight and 1 marble that weighs a little bit more (for a total of 9 marbles), how many weighings on a balance scale are required to determine which marble is the heaviest?
Two weighings would be required:
1. You would split the nine marbles into three groups of three and weigh two of the groups against each other. If the scale balances (alternative 1), you know that the heavy marble is in the third group of marbles. Otherwise, you’ll take the group that weighs more (alternative 2).
2. Then you would repeat the same step with the chosen group, but you’d have three groups of one marble instead of three groups of three.
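The two steps above can be simulated with a short sketch. The function `find_heavy` is a hypothetical helper; it assumes exactly one heavier marble and a marble count that is a power of three, so each split produces three equal groups.

```python
def find_heavy(weights):
    """Return (index of the heavy marble, number of weighings used).

    Assumes exactly one marble is heavier and len(weights) is a power of 3.
    """
    candidates = list(range(len(weights)))
    weighings = 0
    while len(candidates) > 1:
        third = len(candidates) // 3
        left = candidates[:third]
        right = candidates[third:2 * third]
        rest = candidates[2 * third:]
        weighings += 1  # one use of the balance scale: left group vs right group
        left_weight = sum(weights[i] for i in left)
        right_weight = sum(weights[i] for i in right)
        if left_weight > right_weight:
            candidates = left
        elif right_weight > left_weight:
            candidates = right
        else:
            candidates = rest  # scale balances, so the heavy marble was set aside
    return candidates[0], weighings


marbles = [1] * 9
marbles[5] = 2   # marble at index 5 is the heavy one in this example
print(find_heavy(marbles))  # (5, 2)
```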
Q: What is the difference between a convex and a non-convex cost function? What does it mean when a cost function is non-convex?
A convex function is one where the line segment drawn between any two points on the graph lies on or above the graph. Any local minimum of a convex function is also a global minimum.
A non-convex function is one where the line segment drawn between some pair of points on the graph dips below the graph. It is often characterized as “wavy”.
When a cost function is non-convex, it means that an optimizer such as gradient descent may converge to a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.
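As a minimal sketch of the local-minimum problem, the example below runs plain gradient descent on a hand-picked non-convex function, f(x) = x^4 - 3x^2 + x. The function, learning rate, and starting points are assumptions chosen for illustration, not taken from any particular model.

```python
def f(x):
    """A simple non-convex cost with two minima (illustrative choice)."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x0, learning_rate=0.01, steps=1000):
    """Plain gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x


x_from_left = gradient_descent(-2.0)   # settles in the left (deeper) minimum
x_from_right = gradient_descent(2.0)   # settles in the right (shallower) minimum

print(f"start -2.0 -> x = {x_from_left:.3f}, f(x) = {f(x_from_left):.3f}")
print(f"start  2.0 -> x = {x_from_right:.3f}, f(x) = {f(x_from_right):.3f}")
```

The same algorithm ends up at different minima depending only on where it starts, and the run started at 2.0 never reaches the lower cost found by the run started at -2.0.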
Q: What is overfitting?
Overfitting is an error where the model ‘fits’ the training data too well, capturing noise along with the underlying pattern and resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.
Q: How would a change in the Prime membership fee affect the market?
Let’s take the instance where there’s an increase in the Prime membership fee. There are two parties involved: the buyers and the sellers.
For the buyers, the impact of an increase in the Prime membership fee ultimately depends on their price elasticity of demand. If the price elasticity is high, then a given increase in price will result in a large drop in demand, and vice versa. Buyers that continue to purchase a membership are likely Amazon’s most loyal and active customers, and they are also likely to place a higher emphasis on products with Prime.
Sellers will take a hit, as there is now a higher cost of purchasing Amazon’s basket of products. That being said, some products will take a harder hit while others may not be impacted. It is likely that premium products that Amazon’s most loyal customers purchase would not be affected as much, like electronics.
Q: Describe decision trees, SVMs, and random forests. Talk about their advantages and disadvantages.
Decision Trees: a tree-like model used to model decisions based on one or more conditions.
• Pros: easy to implement, intuitive, handles missing values
• Cons: high variance (prone to overfitting), often less accurate than ensemble methods
Support Vector Machines: a classification technique that finds a hyperplane or a boundary between the two classes of data that maximizes the margin between the two classes. There are many planes that can separate the two classes, but only one plane can maximize the margin or distance between the classes.
• Pros: accurate in high dimensionality
• Cons: prone to over-fitting, does not directly provide probability estimates
Random Forests: an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. For classification, the model then selects the mode (the majority vote) of the predictions of all the decision trees.
• Pros: can achieve higher accuracy, handle missing values, feature scaling not required, can determine feature importance
• Cons: black box, computationally intensive
Q: Why is dimensionality reduction important?
Dimensionality reduction is the process of reducing the number of features in a dataset. This is important mainly in the case when you want to reduce variance in your model (overfitting).
Wikipedia states four advantages of dimensionality reduction:
1. It reduces the time and storage space required
2. Removal of multi-collinearity improves the interpretation of the parameters of the machine learning model
3. It becomes easier to visualize the data when reduced to very low dimensions such as 2D or 3D
4. It avoids the curse of dimensionality