Data Flashcards

1
Q

Data components

A

data modeling
data infrastructure
statistical models
machine learning

heuristic vs. machine learning

2
Q

Productizing Machine learning models

A

Model flexibility

  • Prototype with real data
  • Backtesting
  • Create components

Code quality

  • Iteration
  • Version control

Computational performance

  • Run time
  • Output frequency

Connectivity

  • Databases and APIs
  • Data validation

Scoring: real-time vs. prescored offline

Retraining models

3
Q

Similarity between data scientist and PM

A

Decision making with data.

4
Q

ML Model evaluation

A

1- Compute the error rate and compare it against a baseline. There must be no overfitting.

2- Make trade-offs between time, cost and accuracy

3- Soft launch

4- Have a back-up for graceful degradation: keep a simpler (albeit probably less accurate) back-up model or even a rule-based system ready to be deployed in place of your model of choice when predictions go south.

https://towardsdatascience.com/on-being-a-data-science-product-manager-5c8baf42e0a7
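
The back-up idea in step 4 can be sketched as a thin wrapper that falls back to a rule-based score when the main model fails. This is only an illustration: `predict_with_fallback`, `rule_based_score`, `flaky_model` and the `amount` threshold are hypothetical names and values, not taken from the linked article.

```python
def rule_based_score(features):
    """Hypothetical rule-based fallback: flag large amounts."""
    return 1.0 if features["amount"] > 1000 else 0.0


def flaky_model(features):
    """Stand-in for a primary model whose service is currently failing."""
    raise RuntimeError("model service unavailable")


def predict_with_fallback(primary, fallback, features):
    """Try the primary model; on any error, use the fallback instead.

    Returns the score together with the name of the path that produced it,
    so degraded predictions can be monitored.
    """
    try:
        return primary(features), "primary"
    except Exception:
        return fallback(features), "fallback"


result = predict_with_fallback(flaky_model, rule_based_score, {"amount": 2500})
print(result)
```

Returning which path was used makes it easier to notice when the system has been running in degraded mode for too long.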

5
Q

Root Mean Square Error

A

Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors).

Use it to evaluate how closely a regression model's predictions fit the observed data; lower is better.
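
A minimal sketch of the RMSE computation, using only the standard library. The function name `rmse` and the sample values are illustrative assumptions. Note that RMSE equals the standard deviation of the residuals only when the residuals average to zero.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each residual, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative data: residuals are 1, 0, -1, 0.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(rmse(actual, predicted))  # sqrt(0.5), about 0.707
```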

6
Q

Predictive Model Markup Language

A

The PMML file format specifies the data fields to use for the model, the type of calculation to perform (for example, regression), and the structure of the model.

7
Q

Dataflow

A

Read input data
Apply transforms
Write output

8
Q

P value

A

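Probability of observing a difference at least as large as the one measured if the variants actually performed the same. Used to decide whether the results of an A/B test are reliable; by convention it must be below 0.05.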
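
One common way to get a p-value for an A/B test on conversion rates is a two-proportion z-test. The sketch below uses only the standard library; the function name and the conversion counts are made-up assumptions, and other tests (for example chi-squared) are equally valid choices.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative counts: 200/1000 conversions for A, 250/1000 for B.
p_value = two_proportion_p_value(200, 1000, 250, 1000)
print(p_value, "significant" if p_value < 0.05 else "not significant")
```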

9
Q

Statistical significance

A

Degree of certainty that the results are not due to chance. The conventional threshold is 95% (p value below 0.05).

10
Q

A/B test framework

A
1- Collect data
2- Identify goals
3- Generate hypothesis
4- Create variations
5- Run experiments
6- Analyze results.
11
Q

Multivariate testing

A

The goal of multivariate testing is to determine which combination of variations performs the best out of all of the possible combinations.
Each combination tested is a separate group, so more traffic is needed.
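
To see how quickly the number of groups grows, the combinations can be enumerated with `itertools.product`. The page elements, their variations and the per-group visitor count below are hypothetical.

```python
from itertools import product

# Hypothetical page elements and their variations.
elements = {
    "headline": ["A", "B"],
    "button_color": ["green", "blue", "red"],
    "image": ["photo", "illustration"],
}

# Every combination of one variation per element is its own test group.
combinations = list(product(*elements.values()))
print(len(combinations))  # 2 * 3 * 2 = 12 groups

# Assumed number of visitors needed per group (illustrative only).
visitors_per_group = 400
total_visitors = len(combinations) * visitors_per_group
print(total_visitors)  # 4800
```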

12
Q

Minimum sample size

A

Minimum traffic required to reach statistical significance.

For example, if you want a 5% margin of error at a 95% confidence level, your sample size will be approximately 1/(0.05²) = 400.
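
The 1/e² rule of thumb is a rounded form of the usual sample size formula for a proportion, n = z² · p(1 − p) / e², with z = 1.96 (95% confidence) and the worst case p = 0.5. A small sketch, where the function name and defaults are assumptions:

```python
import math


def min_sample_size(margin_of_error, z=1.96, p=0.5):
    """Sample size for estimating a proportion within the given margin of error.

    z=1.96 corresponds to 95% confidence; p=0.5 is the most conservative choice.
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


print(min_sample_size(0.05))   # 385, close to the rule-of-thumb value
print(round(1 / 0.05 ** 2))    # 400, the 1/e² approximation
```

The shortcut overshoots slightly because it rounds 1.96² × 0.25 ≈ 0.96 up to 1.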

13
Q

Margin of error

A

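Precision of the test results: the range around the measured value in which the true value is expected to lie. For example, with a 5% margin of error at 95% confidence, there is a 95% chance that the true value is between x-5% and x+5% (x = measured result).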
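
As a rough illustration, the margin of error for a measured proportion can be computed from the sample size with the normal approximation. The function name and the numbers are assumptions chosen to match the 400-sample example.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a proportion p_hat measured on n samples.

    Uses the normal approximation; z=1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Illustrative case: a 50% measured rate on 400 visitors.
x = 0.5
moe = margin_of_error(x, 400)
print(moe)               # about 0.049, i.e. roughly 5%
print(x - moe, x + moe)  # bounds of the interval around x
```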

14
Q

Machine learning models

A

Supervised

  • Classification
  • Regression (e.g. linear regression)

Unsupervised

  • Clustering
