Data Science Flashcards
We’re given two tables, a table of notification deliveries and a table of users with created and purchase conversion dates. If the user hasn’t purchased then the conversion_date
column is NULL.
`notification_deliveries` table:
column type
notification varchar
user_id int
created_at datetime

`users` table:
column type
user_id int
created_at datetime
conversion_date datetime
- Write a query to get the distribution of total push notifications before a user converts.
- Write a query to get the conversion rate for each notification.
Example Output:
notification conversion_rate
activate_premium 0.05
try_premium 0.03
free_trial 0.11
Problem 1 Possible Solutions:
select ct, count(*)
from (
    select n.user_id, count(notification) as ct
    from notification_deliveries n
    join users u on n.user_id = u.user_id
    where conversion_date is not null
      and n.created_at < u.conversion_date
    group by n.user_id
) temp
group by ct
select pushes, count(*)
from (
    select t1.user_id, count(t2.user_id) as pushes
    from users t1
    left join notification_deliveries t2
      on t1.user_id = t2.user_id
      and t1.conversion_date >= t2.created_at
    where conversion_date is not null
    group by 1
) tmp2
group by 1
Problem 2 Possible Solutions:
select notification, avg(converted)
from (
    select *,
        (case when conversion_date is not null then 1 else 0 end) as converted
    from notification_deliveries n
    join users u on n.user_id = u.user_id
) temp
group by notification
select notification,
    sum(case when conversion_date is not null then 1 else 0 end) * 1.0 / count(*) as conversion_rate
from users t1
left join notification_deliveries t2
  on t1.user_id = t2.user_id
  and t1.conversion_date >= t2.created_at
group by 1
A dating website’s schema is represented by a table of people that like other people. The table has three columns: user_id, liker_id (the user_id of the user doing the liking), and the datetime that the like occurred.
Write a query to count the number of liker’s likers (the users that like the likers) if the liker has one.
likes table:
column type
user_id int
created_at datetime
liker_id int
input:
user liker
A B
B C
B D
D E
output:
user count
B 2
D 1
select user_id, count(liker_id) as count
from likes
where user_id in (select liker_id from likes group by liker_id)
group by user_id
select user_id, count(liker_id)
from likes
where user_id in (select distinct liker_id from likes)
group by user_id
order by user_id
Suppose we have a binary classification model that classifies whether or not an applicant should be qualified to get a loan. Because we are a financial company we have to provide each rejected applicant with a reason why.
Given we don’t have access to the feature weights, how would we give each rejected applicant a reason why they got rejected?
Given we do not have access to the feature weights, we are unable to tell each applicant which were the highest contributing factors to their application rejection. However, if we have enough results, we can start to build a sample distribution of application outcomes, and then map them to the particular characteristics of each rejection.
For example, if a rejected applicant had a recurring outstanding credit card balance of 10% of their monthly take-home income: if we know that the percentile of this data point falls within the middle of the distribution of rejected applicants, we can be fairly certain it is at least correlated with their rejection outcome. With this methodology, we can outline a few standard factors that may have led to the decision.
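This percentile comparison can be sketched in numpy. The feature name, the distribution parameters, and the 25th–75th percentile band below are all illustrative assumptions, not from the original:

```python
import numpy as np

# Hypothetical data: outstanding credit-card balance as a fraction of monthly
# take-home income, for a sample of previously rejected applicants.
rng = np.random.default_rng(0)
rejected_balance_ratio = rng.normal(loc=0.10, scale=0.03, size=1000)

applicant_ratio = 0.10  # the rejected applicant in question

# Empirical percentile of this applicant among rejected applicants.
percentile = (rejected_balance_ratio < applicant_ratio).mean() * 100

# If the applicant sits near the middle of the rejected distribution,
# this feature is a plausible (correlated) factor to surface as a reason.
if 25 <= percentile <= 75:
    print("balance ratio is typical of rejected applicants")
```

Repeating this check across each candidate feature would yield a ranked list of standard factors to report.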
Given a list of timestamps in sequential order, return a list of lists grouped by week (7 days) using the first timestamp as the starting point.
Example:
ts = [ '2019-01-01', '2019-01-02', '2019-01-08', '2019-02-01', '2019-02-02', '2019-02-05', ]
output = [ ['2019-01-01', '2019-01-02'], ['2019-01-08'], ['2019-02-01', '2019-02-02'], ['2019-02-05'], ]
from datetime import datetime as dt
from itertools import groupby
# full example input from above
inp = ['2019-01-01', '2019-01-02', '2019-01-08',
       '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(inp[0], "%Y-%m-%d")
out = []
for k, g in groupby(inp, key=lambda d: (dt.strptime(d, "%Y-%m-%d") - first).days // 7):
    out.append(list(g))
print(out)
from collections import defaultdict
from datetime import datetime as dt

# note: comparing against the start of the previous group (a rolling window)
# would group '2019-02-05' with '2019-02-01'; weeks must be anchored to the
# first timestamp to match the expected output
ts = ['2019-01-01', '2019-01-02', '2019-01-08',
      '2019-02-01', '2019-02-02', '2019-02-05']
first = dt.strptime(ts[0], '%Y-%m-%d')
dic = defaultdict(list)
for i in ts:
    idx = (dt.strptime(i, '%Y-%m-%d') - first).days // 7
    dic[idx].append(i)
print(list(dic.values()))
Explain what regularization is and why it is useful
- process of adding a penalty to the cost function of a model to shrink the coefficient estimates
- useful by helping to prevent overfitting
- most common forms are L1 (Lasso) and L2 (Ridge)
- advantage of lasso is that it can force coefficients to be zero and act as a feature selector
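The shrinkage effect can be sketched with a closed-form ridge (L2) fit in numpy. The synthetic data and penalty values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    # closed-form ridge solution: (X^T X + lam * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, lam=0.01)
w_large = ridge(X, y, lam=1000.0)

# a larger penalty shrinks the coefficient estimates toward zero
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```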
How do you solve for multicollinearity?
-multicollinearity occurs when independent variables in a model are correlated
- problem: makes model difficult to interpret
- standard errors become overinflated and makes some variables statistically insignificant when they should be significant
- solution:
1. remove highly correlated features prior to training the model (using forward or backward selection)
2. use lasso regularization to force coefficients to zero
3. use PCA to reduce the number of features and end up with uncorrelated components
- example:
a. if we have a linear regression model with correlated features X and Z as inputs and Y as output
b. true effect of X on Y is hard to differentiate from the true effect of Z on Y
c. this is because if we increase X, Z will also increase or decrease
d. coefficient of X can be interpreted as the increase in Y for every unit increase in X while holding Z constant
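The problem can be quantified with the variance inflation factor (VIF). A numpy sketch on synthetic data (the correlation structure is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
z = x + rng.normal(scale=0.1, size=n)   # z is highly correlated with x

# VIF for x: regress x on z, then VIF = 1 / (1 - R^2)
coef, *_ = np.linalg.lstsq(z.reshape(-1, 1), x, rcond=None)
residuals = x - z.reshape(-1, 1) @ coef
r_squared = 1 - residuals.var() / x.var()
vif = 1 / (1 - r_squared)

# a VIF well above 10 is a common rule-of-thumb signal of severe multicollinearity
print(round(vif, 1))
```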
- what is overfitting?
- why is it a problem in machine learning models?
- what steps can you take to avoid it?
- overfitting occurs when a model fits too closely to the training data and just memorizes it
- Problem: generalizes poorly on future, unseen data
- model hasn’t actually learned the signal (just the noise) and will have near zero predictive capability
- Methods to reduce:
- cross validation to estimate the model’s performance on unseen data
- ensembling techniques to reduce variance (bagging, stacking, blending)
- regularization techniques that add a penalty to the cost function and make the model less flexible
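A minimal overfitting sketch with numpy's polyfit (synthetic data; the function, noise level, and degrees are illustrative assumptions): the high-degree polynomial nearly memorizes the training points but does worse on held-out points than the low-degree one.

```python
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0.03, 0.97, 15)          # held-out points between the training grid

def true_fn(x):
    return np.sin(2 * np.pi * x)

y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.size)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.size)

def errors(deg):
    # fit a polynomial of the given degree and report train/test MSE
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

simple_train, simple_test = errors(3)
complex_train, complex_test = errors(14)

# degree 14 interpolates the 15 noisy training points (tiny train error)
# yet generalizes worse than degree 3
print(complex_train < simple_train, complex_test > simple_test)
```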
Explain the difference between generative and discriminative algorithms
Suppose we have a dataset with training input x and labels y.
Generative model: explicitly models the actual distribution of each class.
- It learns the joint probability distribution, p(x,y) and then uses Bayes’ Theorem to calculate p(y|x)
- then picks the most likely label y
- examples: Naive Bayes, Bayesian Networks and Markov Random Fields
Discriminative Model: learns the conditional probability distribution p(y|x) or a direct mapping from inputs x to the class labels y
- models the decision boundary between the classes
- examples: logistic regression, neural networks and nearest neighbors
Explain the bias-variance tradeoff
bias: error caused by oversimplification of your model (underfitting)
variance: error caused by having too complex a model (overfitting)
- there exists a tradeoff because models with low bias will usually have higher variance and vice versa
- key is to find the level of complexity that minimizes total error by balancing bias and variance
is more data always better?
no. related to Big Data hubris, or the idea that big data is a substitute for, rather than a supplement to, traditional data collection and analysis
- depends on the problem and quality of data
- if the data you keep collecting is constantly biased in some way, then obtaining more data is not helpful
- keep in mind the tradeoff between having more data vs dealing with additional storage, increased memory and needing more computational power
what are feature vectors?
feature vector: n-dimensional vector of numerical features that represents some object and can be represented as a point in n-dimensional space
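For instance (the features below are a hypothetical example, chosen for illustration), a house could be represented as a 3-dimensional feature vector:

```python
import numpy as np

# hypothetical features: [square feet, number of bedrooms, age in years]
house = np.array([1500.0, 3.0, 42.0])

# a single point in 3-dimensional space
print(house.shape)
```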
how do you know if one algorithm is better than others?
better can mean a lot of different things:
- better on training set?
- several training sets?
- faster?
- more space efficient?
this answer depends on the problem, goal and constraints
explain the difference between supervised and unsupervised machine learning
supervised machine learning algorithms: we provide labeled data (e.g. spam or not spam, cats or not cats) so the model can learn the mapping from inputs to labeled outputs.
unsupervised learning: we don’t need labeled data and the goal is to detect patterns or learn representations of the data
-example: detecting anomalies or finding similar groupings of customers
what is the difference between convex and non-convex functions?
convex: one minimum
- important: an optimization algorithm (like gradient descent) won’t get stuck in a local minimum
non-convex: has multiple valleys (local minima) that are not as low as the lowest point (global minimum)
-optimization algorithms can get stuck in local minimum and it can be hard to tell when this happens
- explain gradient descent
2. what is the difference between local and global optimum
- gradient descent: optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient
- by moving in the direction of the negative gradient, we slowly make our way down to a lower point until we reach the bottom (a local minimum)
- we use gradient descent to update the parameters of our model
- local optimum: solution that is the best among neighboring solutions, not the best solution overall
- global optimum: best solution overall
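A minimal gradient descent sketch on a simple convex function, f(x) = (x - 3)^2 (the function, learning rate, and step count are illustrative choices):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # repeatedly step in the direction of the negative gradient
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3); its minimum is at x = 3
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))
```

Because this f is convex, the iterates converge to the global minimum; on a non-convex f the same procedure could settle in a local minimum instead.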
suppose you have the following two lists: a = [42,84,3528,1764], b = [42,42,42,42]. What does the following piece of code do? How can you make it run faster?

total = 0
for idx, val in enumerate(a):
    total += a[idx] * b[idx]
return total
Essentially the dot product between two 1-dimensional vectors. Can use np.dot(np.array(a), np.array(b)) instead.
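A quick check that the loop and the vectorized version agree (a sketch; numpy assumed available):

```python
import numpy as np

a = [42, 84, 3528, 1764]
b = [42, 42, 42, 42]

# the loop from the question: element-wise products, summed
total = 0
for idx, val in enumerate(a):
    total += a[idx] * b[idx]

# the vectorized equivalent
vectorized = np.dot(np.array(a), np.array(b))

print(total, int(vectorized))  # both are 227556
```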
Define the Central Limit Theorem and its importance
CLT: if we repeatedly take independent random samples of size n from a population (for both normal and nonnormal data)
- when n is large, the distribution of the sample means will approach a normal distribution
Importance
- allows us to make inferences from a sample about a population, without needing the characteristics of the whole population
- confidence intervals, hypothesis testing and p-value analysis are all based on the CLT
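A quick simulation sketch (the exponential population and sample size are illustrative choices): even though the population is skewed, the means of many samples concentrate around the population mean.

```python
import numpy as np

rng = np.random.default_rng(4)

# a skewed, non-normal population: exponential with mean 1.0
# take 10,000 independent samples of size n = 50 and compute each sample's mean
sample_means = rng.exponential(scale=1.0, size=(10000, 50)).mean(axis=1)

# the distribution of sample means clusters around the population mean (1.0)
print(round(sample_means.mean(), 2))
```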
Define Law of large numbers and its importance
LLN states that if an experiment is repeated independently a large number of times and you take the average of the results
- average should be close to the expected value (the mathematically derived result)
example:
- toss a coin 42x vs 420000000x: expect the percentage of heads/tails to be closer to 50% for the latter
- implies that large sample sizes are more reflective of reality than small sample sizes
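A coin-toss simulation sketch of this (seed and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

few = rng.integers(0, 2, size=42).mean()          # 42 tosses
many = rng.integers(0, 2, size=1_000_000).mean()  # 1,000,000 tosses

# the million-toss estimate lands within a small distance of 0.5,
# while the 42-toss estimate can wander much further
print(abs(few - 0.5), abs(many - 0.5))
```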
what is the normal distribution? What are some examples of data that follow it?
- also known as the Gaussian distribution
- allows us to perform parametric hypothesis testing
most of the observations cluster around the mean:
- 68% within one standard deviation
- 95.4% within two standard deviations
- 99.7% within three standard deviations
examples: height, weight, shoe size, test scores, blood pressure, daily return of stocks
how do you check if a distribution is close to normal?
- QQ plots: plot two sets of quantiles against one another; if both sets came from the same distribution, the points should form a roughly straight line
- Kolmogorov-Smirnov test
- what is a long tailed distribution?
2. what are some examples of data that follow it? Why is it important in machine learning?
long tailed distribution (or Pareto): when data is clustered around the head and gradually levels off to zero
- large number of occurrences is accounted for by a small number of items
- known as the 80-20 rule: 80% of effects come from 20% of causes
examples: frequency of earthquakes (large number of small magnitude earthquakes, few large magnitude ones), search engines (few keywords that are commonly searched for)
2. important in ML: applied by saying that 20% of the data might be useful, or that 80% of your time will be spent on one part of the data science project (usually data cleaning)
- what is Ax = b?
2. how does one solve it?
- Ax = b is one way to specify a system of linear equations: A is an (m,n) matrix, b is a vector with m entries, x is an unknown vector with n entries (which we are trying to solve for)
- Ax: we are multiplying matrix A and vector x
- Ax = b has a solution iff b is a linear combination of the columns of A
- solution:
- if A is square and invertible, find x by taking the inverse of A: x = A^-1 b
- more generally, we can solve Ax = b by creating an augmented matrix [A b], attaching b to A as a column on the right
- reduce [A b] to reduced row echelon form
- if the system is solvable, any of its solutions will be a solution to your original equation
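A sketch with numpy (the 2x2 system below is an arbitrary invertible example):

```python
import numpy as np

# the system: 2x + y = 5, x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# np.linalg.solve is preferred over explicitly inverting A (more numerically stable)
x = np.linalg.solve(A, b)

print(x)                       # the solution vector
print(np.allclose(A @ x, b))   # verify Ax = b
```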
how does one multiply matrices?
- scalar multiplication: every entry is multiplied by a number(scalar)
- matrix multiplication (dot product): multiply 2 matrices A and B together
- can only be done if the number of columns in matrix A equals the number of rows in matrix B
- if the size of A is a x b and the size of B is b x c, the resulting matrix is a x c
- each entry of the result is the dot product of the corresponding row of A with the corresponding column of B
- matrix multiplication is not commutative (AB != BA)
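The shape rules can be illustrated with numpy (arbitrary small matrices):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])      # shape (3, 2)
B = np.array([[1, 0, 1],
              [0, 1, 1]])   # shape (2, 3)

C = A @ B                   # (3, 2) x (2, 3) -> shape (3, 3)
print(C.shape)

# non-commutative: B @ A has shape (2, 2), a different matrix entirely
print((B @ A).shape)
```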
what are:
a. eigenvalues
b. eigenvectors
A scalar λ is called an eigenvalue of an n x n matrix A if there is a nontrivial solution x of Ax = λx; x is the eigenvector corresponding to the eigenvalue λ
Eigenvectors tell you the directions in which the linear transformation represented by A acts like scalar multiplication
Eigenvalues are the amounts by which those eigenvectors are scaled
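This can be checked with numpy on a small matrix (the diagonal example below is chosen for illustration, since its eigenvalues are easy to read off):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# verify A v = lambda v for each eigenpair (columns of `eigenvectors`)
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(eigenvalues))  # the diagonal entries, 2.0 and 3.0
```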