Data Flashcards

Question 1

Q

Data components

Answer

A

data modeling
data infrastructure
statistical models
machine learning

heuristic vs. machine learning

Question 2

Q

Productizing Machine learning models

Answer

A

Model flexibility

Prototype with real data
Backtesting
Create components

Code quality

Iteration
Version control

Computational performance

Run time
Output frequency

Connectivity
- Databases and API
- Data validation
- Scoring: in real time or prescored offline

Retraining models

Question 3

Q

Similarity between data scientist and PM

Answer

A

Decision making with data.

Question 4

Q

ML Model evaluation

Answer

A

1- Measure the error rate and compare it against a baseline.
There must be no overfitting.

2- Make trade-offs between time, cost and accuracy

3- Soft launch

4- Have a backup for graceful degradation: keep a simpler (albeit probably less accurate) backup model, or even a rule-based system, ready to be deployed in place of your model of choice when predictions go south.

https://towardsdatascience.com/on-being-a-data-science-product-manager-5c8baf42e0a7
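The backup idea in point 4 can be sketched as a small wrapper. This is only an illustration: `predict_with_fallback`, `broken_model` and `rule_based` are hypothetical names, and the sanity check (a score between 0 and 1) is an assumption about what a "bad" prediction looks like.

```python
def predict_with_fallback(primary, fallback, features):
    """Return (prediction, source), using the fallback if the primary fails."""
    try:
        prediction = primary(features)
    except Exception:
        # The primary model crashed: degrade to the backup.
        return fallback(features), "fallback"
    # Assumed sanity check: a valid score is a probability in [0, 1].
    if prediction is None or not 0.0 <= prediction <= 1.0:
        return fallback(features), "fallback"
    return prediction, "primary"


def broken_model(features):
    # Hypothetical primary model that fails at prediction time.
    raise RuntimeError("model server unavailable")


def healthy_model(features):
    # Hypothetical primary model that returns a plausible score.
    return 0.8


def rule_based(features):
    # Hypothetical rule-based backup: flag large orders as risky.
    return 1.0 if features["amount"] > 1000 else 0.0


print(predict_with_fallback(broken_model, rule_based, {"amount": 1500}))
print(predict_with_fallback(healthy_model, rule_based, {"amount": 1500}))
```

The second element of the result records which path was taken, which makes it possible to monitor how often the backup is used.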

Question 5

Q

Root Mean Square Error

Answer

A

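Root Mean Square Error (RMSE) is the square root of the average squared residual (prediction error). When the residuals average to zero, it equals their standard deviation.

Used to evaluate how closely a regression model's predictions fit the observed data. Lower is better.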
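A minimal sketch of the RMSE calculation, using only the standard library. The function name `rmse` and the sample values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


observed = [3.0, 5.0, 7.0]    # made-up observations
predicted = [2.0, 5.0, 9.0]   # made-up model outputs
print(rmse(observed, predicted))  # sqrt((1 + 0 + 4) / 3)
```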

Question 6

Q

Predictive Model Markup Language

Answer

A

PMML is an XML-based file format. It specifies the data fields to use for the model, the type of calculation to perform (for example, regression), and the structure of the model.
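As a rough illustration, the sketch below reads a heavily simplified PMML-style regression model with Python's standard XML parser and scores one input. The document is an assumption made for this example: the field names (`size`, `price`) and the numbers are invented, and the XML namespace and header that real PMML files carry are left out to keep the parsing short.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free PMML-style document (illustrative only).
PMML_DOC = """<PMML version="4.4">
  <DataDictionary>
    <DataField name="size" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="size"/>
      <MiningField name="price" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="size" coefficient="2.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

root = ET.fromstring(PMML_DOC)

# Data fields declared for the model.
fields = [f.get("name") for f in root.iter("DataField")]

# Type of calculation and structure of the model.
model = root.find("RegressionModel")
table = model.find("RegressionTable")
intercept = float(table.get("intercept"))
coefficients = {
    p.get("name"): float(p.get("coefficient"))
    for p in table.findall("NumericPredictor")
}


def score(inputs):
    """Apply the linear formula described by the regression table."""
    return intercept + sum(coefficients[name] * inputs[name] for name in coefficients)


print(fields, model.get("functionName"), score({"size": 4.0}))
```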

Question 7

Q

Dataflow

Answer

A

Read
Apply
Write
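The three steps can be sketched in plain Python as chained stages. This is not the API of any particular dataflow product: the function names, the CSV layout and the in-memory source and sink are all assumptions made for the example.

```python
import csv
import io


def read(source):
    """Read: yield one record (a dict) per CSV row."""
    yield from csv.DictReader(source)


def apply_transform(records):
    """Apply: transform each record; here, compute a line total."""
    for row in records:
        yield {"item": row["item"], "total": int(row["qty"]) * float(row["price"])}


def write(records, sink):
    """Write: store the transformed records as CSV."""
    writer = csv.DictWriter(sink, fieldnames=["item", "total"])
    writer.writeheader()
    writer.writerows(records)


# In-memory stand-ins for an input file and an output file.
source = io.StringIO("item,qty,price\napple,2,0.5\npear,3,1.0\n")
sink = io.StringIO()

write(apply_transform(read(source)), sink)
print(sink.getvalue())
```

Because each stage is a generator, records stream through one at a time instead of being loaded all at once.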

Question 8

Q

P value

Answer

A

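Probability of seeing a difference at least as large as the observed one if the variations actually performed the same. Used to judge whether the results of an A/B test are reliable. By convention it must be below 0.05.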
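One common way to get a p-value for an A/B test on conversion rates is a two-sided two-proportion z-test. The sketch below assumes that test and uses made-up visitor and conversion counts. The function name is hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up data: 200 of 2000 visitors convert on A, 250 of 2000 on B.
p_value = two_proportion_p_value(200, 2000, 250, 2000)
print(p_value, p_value < 0.05)
```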

Question 9

Q

Statistical significance

Answer

A

Degree of certainty that an observed result is not due to chance. The usual threshold is 95%.

Question 10

Q

A/B test framework

Answer

A

1- Collect data
2- Identify goals
3- Generate hypothesis
4- Create variations
5- Run experiments
6- Analyze results.

Question 11

Q

Multivariate testing

Answer

A

The goal of multivariate testing is to determine which combination of variations performs the best out of all of the possible combinations.
Each combination tested is a separate group, so more traffic is needed.
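A short sketch of why traffic needs grow quickly: the number of groups is the product of the number of variations per element. The page elements, their variations and the per-group visitor figure are invented for the example.

```python
from itertools import product

# Hypothetical page elements and their variations.
elements = {
    "headline": ["A", "B"],
    "button_color": ["green", "blue", "red"],
    "image": ["photo", "illustration"],
}

# Every combination of variations is its own test group.
combinations = list(product(*elements.values()))

visitors_per_group = 400  # assumed minimum sample per group
total_visitors = len(combinations) * visitors_per_group

print(len(combinations), total_visitors)  # 2 * 3 * 2 = 12 groups
```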

Question 12

Q

Minimum sample size

Answer

A

Minimum traffic required to reach statistical significance.

For example, if you want a 5% margin of error at 95% confidence, your sample size will be approximately 1/(0.05²) = 400
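The 1/(margin²) shortcut is a rounded form of the usual sample-size formula for estimating a proportion, n = z² · p(1 − p) / margin². The sketch below assumes 95% confidence (z = 1.96) and the worst case p = 0.5. The function names are hypothetical, and this is not a full A/B power calculation.

```python
import math


def min_sample_size(margin_of_error, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within the given margin."""
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def rule_of_thumb(margin_of_error):
    """Shortcut 1 / margin^2, since 1.96^2 * 0.25 is close to 1."""
    return round(1 / margin_of_error ** 2)


print(min_sample_size(0.05))  # 385
print(rule_of_thumb(0.05))    # 400
```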

Question 13

Q

Margin of error

Answer

A

Precision of test results -> with a 5% margin of error, the true value is expected to lie between x-5% and x+5% at the chosen confidence level (x = measured result).
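For a measured proportion, the margin of error can be estimated as z · sqrt(p(1 − p) / n). The sketch below assumes 95% confidence (z = 1.96). The function name and the sample figures are made up.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the confidence interval around a measured proportion."""
    return z * math.sqrt(p * (1 - p) / n)


measured = 0.5   # made-up measured rate (x)
sample = 400     # made-up sample size
moe = margin_of_error(measured, sample)

# Interval x - margin .. x + margin
print(measured - moe, measured + moe)
```

With 400 observations the margin comes out just under 5%, which matches the 1/(0.05²) = 400 shortcut.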

Question 14

Q

Machine learning models

Answer

A

Supervised
- Classification
- Linear regression

Unsupervised
- Clustering

(14 cards)