Data Science Flashcards

1
Q

Why pre-process raw data?

A

To drop/filter out data that has missing values or that appears to be incorrect
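
As a rough illustration of the answer above, the sketch below drops rows with missing values and filters out values that look incorrect. It assumes pandas is installed; the column names, the sample rows, and the 0 to 120 age range are all made up for the example.

```python
import pandas as pd

# Hypothetical raw data: one missing age and one implausible age.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "age": [34.0, None, 29.0, 420.0],
})

# Drop rows where the age is missing.
clean = raw.dropna(subset=["age"])

# Filter out rows whose age falls outside an assumed plausible range.
clean = clean[clean["age"].between(0, 120)]

print(clean["name"].tolist())
```

Only the rows for Ann and Cy survive both steps.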

2
Q

Different data types in pandas

A

-int64 (integers)
-float64 (decimals)
-object (text strings or mixed values; newer pandas versions may use a dedicated string dtype for text)
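
A quick way to see these types is to build a small DataFrame and inspect `dtypes`. This is a minimal sketch assuming pandas is installed; the column names and values are invented, and the dtype reported for the text column can differ between pandas versions.

```python
import pandas as pd

# Hypothetical columns: whole numbers, decimals, and text.
df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [9.99, 4.50, 12.00],
    "label": ["a", "b", "c"],
})

# One dtype per column; integers and decimals are typically
# reported as int64 and float64.
print(df.dtypes)
```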

3
Q

Define feature engineering

A

Create new features from existing ones
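
For instance, a new feature can be derived by combining two existing columns. The sketch below assumes pandas is installed; the housing data and the `price_per_sqft` feature are hypothetical.

```python
import pandas as pd

# Hypothetical existing features.
homes = pd.DataFrame({
    "price": [300000, 450000],
    "sqft": [1500, 1800],
})

# New feature built from the two existing ones.
homes["price_per_sqft"] = homes["price"] / homes["sqft"]

print(homes)
```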

4
Q

Why do exploratory data analysis?

A

-to uncover what is useful/interesting in the data
-to get ideas for how to start building a model

5
Q

Define model

A

A function that takes feature values as input and outputs a predicted target value

6
Q

Model types

A

-regression (numeric target)
-classification (categorical target, e.g. binary with 2 classes)

7
Q

What makes a good model?

A

Learns from signal (patterns) and ignores noise (randomness)

8
Q

Define underfitting (bias)

A

Model is not complex enough to capture the signal in the data

9
Q

Define sample/selection bias

A

When the data used to train/test a model is not representative enough of the population the model will be applied to

10
Q

Define overfitting

A

Model is too complex and has mistaken noise for signal

11
Q

How can you tell a model has been overfitted?

A

The model will perform well on data it has seen before but badly on new data

12
Q

How to avoid overfitting?

A

Use a train/test split - fit the model on training data but evaluate it on unseen testing data, so that overfitting shows up as a gap between training and testing performance
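
A train/test split can be sketched in plain Python as below. The helper name `split_rows`, the 25% test fraction, and the seed are assumptions made for illustration; in practice a library routine such as scikit-learn's `train_test_split` is commonly used instead.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held out for testing.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 data rows
train, test = split_rows(rows)

print(len(train), len(test))
```

The model would be fitted on `train` only, with `test` kept aside until evaluation.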

13
Q

Define MSE

A

Mean squared error - the average of the squared differences between the model’s predictions and the actual values

14
Q

Define RMSE

A

Root mean squared error - the square root of MSE. Similar to a standard deviation: it measures the typical distance between the model’s predictions and the actual values, in the same units as the target
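
Both metrics can be computed by hand, which makes the relationship between them concrete. This is a minimal sketch using only the standard library; the function names and the sample values are invented for the example.

```python
import math

def mse(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

print(mse(actual, predicted))   # squared errors 1, 0, 4, 0 -> 1.25
print(rmse(actual, predicted))
```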

15
Q

Is RMSE always positive?

A

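Yes - it is the square root of an average of squares, so it is never negative (it is 0 only when every prediction is exactly right)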

16
Q

Define R^2

A

Coefficient of determination - the proportion of the variation in the target variable that is explained by the input features
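
One common way to compute it is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below follows that formula in plain Python; the function name and sample values are assumptions for illustration, and it assumes the actual values are not all identical.

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # Variation left unexplained by the model.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the target around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

print(r_squared(actual, predicted))  # ss_res = 5, ss_tot = 20 -> 0.75
```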

17
Q

Is R^2 always positive?

18
Q

Can you compare models built on different slices of the data for RMSE?

19
Q

Can you compare models built on different slices of the data for R^2?

A

No - the slices may have different amounts of natural variation to begin with

20
Q

How to calculate precision in a confusion matrix?

A

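True positives divided by all predicted positives: the number in the 1/1 cell divided by the number in the 1/All cell (reading each cell as predicted/actual)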

21
Q

How to calculate recall in a confusion matrix?

A

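True positives divided by all actual positives: the number in the 1/1 cell divided by the number in the All/1 cell (reading each cell as predicted/actual)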

22
Q

Define precision

A

How often positive predictions are correct

23
Q

Define recall

A

How well the model identifies positive cases
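
Precision and recall can both be computed from the counts of true positives, false positives, and false negatives. This is a minimal sketch in plain Python; the function name and the sample labels are invented, and it assumes there is at least one predicted positive and at least one actual positive.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels coded as 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp)  # how often positive predictions are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return precision, recall

# Hypothetical labels: 4 actual positives, 5 predicted positives, 3 in common.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 1, 1, 0]

precision, recall = precision_recall(actual, predicted)
print(precision, recall)  # 3/5 and 3/4
```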