Data Science Flashcards

Question 1

Q

Why pre-process raw data?

Answer

A

To drop/filter out data that has missing values or that appears to be incorrect
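
As an illustration of the answer above, here is a minimal pandas sketch. The column names and the rule that a negative age counts as incorrect are assumptions made up for this example, not part of the card.

```python
import pandas as pd

# Hypothetical raw data: one missing age and one impossible (negative) age.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "age": [34, None, 29, -5],
})

# Drop rows whose age is missing.
clean = raw.dropna(subset=["age"])

# Filter out rows whose age appears to be incorrect (assumed rule: age < 0).
clean = clean[clean["age"] >= 0]

print(clean)
```

Whether to drop, correct, or fill in such rows depends on the data set; dropping is simply the option the card names.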

Question 2

Q

Different data types in pandas

Answer

A

-int64 (integers)
-float64 (decimals)
-object (text strings)
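
A small sketch of the three dtypes listed above. The dtypes are set explicitly here because the inferred defaults can vary by platform and pandas version; the column names are hypothetical.

```python
import pandas as pd

# Each column is built with an explicit dtype so the result matches the card.
df = pd.DataFrame({
    "count": pd.Series([1, 2, 3], dtype="int64"),         # integers
    "price": pd.Series([0.5, 1.25, 2.0], dtype="float64"), # decimals
    "label": pd.Series(["a", "b", "c"], dtype="object"),   # text strings
})

# One dtype per column.
print(df.dtypes)
```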

Question 3

Q

Define feature engineering

Answer

A

Create new features from existing ones
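
One way this might look in pandas, using a made-up housing table (all column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical existing features.
houses = pd.DataFrame({
    "price": [300000, 450000],
    "sqft": [1500, 1800],
    "year_built": [1990, 2010],
    "year_sold": [2020, 2021],
})

# New features derived from existing ones.
houses["price_per_sqft"] = houses["price"] / houses["sqft"]
houses["age_at_sale"] = houses["year_sold"] - houses["year_built"]

print(houses[["price_per_sqft", "age_at_sale"]])
```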

Question 4

Q

Why do exploratory data analysis?

Answer

A

-to uncover what is useful/interesting in the data
-to get ideas for how to start building a model
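
A first pass at exploratory analysis could be as simple as the sketch below: summary statistics plus one correlation. The data set is invented for the example.

```python
import pandas as pd

# Hypothetical data: hours studied and the resulting exam score.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 74],
})

# Summary statistics (count, mean, std, min, quartiles, max) per column.
summary = df.describe()

# Pearson correlation between a candidate feature and the target.
correlation = df["hours_studied"].corr(df["exam_score"])

print(summary)
print(correlation)
```

A strong correlation like this one is the kind of finding that suggests a feature is worth including in a first model.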

Question 5

Q

Define model

Answer

A

A function that takes feature values as input and outputs a predicted target value
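
In code, that idea can be sketched as an ordinary function. The features and coefficients below are hypothetical; in practice they would be learned from training data rather than written by hand.

```python
def predict_price(sqft: float, bedrooms: int) -> float:
    """Toy model: feature values in, predicted target value out."""
    # Hypothetical coefficients chosen only for illustration.
    return 50_000 + 120 * sqft + 10_000 * bedrooms


print(predict_price(1000, 2))
```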

Question 6

Q

Model types

Answer

A

-regression (numeric target)
-classification (categorical target; binary classification has 2 classes)

Question 7

Q

What makes a good model?

Answer

A

Learns from signal (patterns) and ignores noise (randomness)

Question 8

Q

Define underfitting (bias)

Answer

A

Model is not complex enough to capture the signal in the data

Question 9

Q

Define sample/selection bias

Answer

A

When the data used to train/test a model is not representative of the population the model will be applied to

Question 10

Q

Define overfitting

Answer

A

Model is too complex and has mistaken noise for signal

Question 11

Q

How can you tell a model has been overfitted?

Answer

A

The model will perform well on data it has seen before but badly on new data

Question 12

Q

How to avoid overfitting?

Answer

A

Use a train/test split - train the model on the training data but evaluate it on unseen testing data, so that overfitting shows up as a gap between training and testing performance
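
The split can be sketched in plain Python as follows. The helper names, the 25% test fraction, and the synthetic data are assumptions for this example; the "memorizer" is a deliberately overfitted model that just returns the target of the nearest training point.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mse(model, rows):
    """Mean squared error of a model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data: signal y = 2x plus random noise.
noise = random.Random(1)
data = [(x, 2 * x + noise.gauss(0, 1)) for x in range(40)]
train, test = train_test_split(data)


def memorizer(x):
    """Overfitted model: copy the target of the nearest training x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


train_error = mse(memorizer, train)
test_error = mse(memorizer, test)
print(train_error, test_error)
```

The memorizer has zero error on the data it has seen, so only the held-out test rows reveal that it has learned the noise along with the signal.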

Question 13

Q

Define MSE

Answer

A

Mean squared error - the average of the squared differences between the model's predictions and the actual values

Question 14

Q

Define RMSE

Answer

A

Root mean squared error - the square root of MSE. Similar to a standard deviation: it measures the typical distance between the actual values and the model's predictions, in the same units as the target
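
Both error measures can be sketched in a few lines of plain Python. The function names and sample values are made up for illustration.

```python
import math


def mse(actual, predicted):
    """Mean squared error: average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of MSE, in the target's units."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical actual values and predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

print(mse(actual, predicted))   # errors 1, 0, -2, 0 -> squares 1, 0, 4, 0 -> 1.25
print(rmse(actual, predicted))
```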

Question 15

Q

Is RMSE always positive?

Answer

A

It is never negative - it is the square root of an average of squared errors, and it is 0 only when every prediction is exact

Question 16

Q

Define R^2

Answer

A

Coefficient of determination - the proportion of the variation in the target variable that is explained by the model's input features

Question 17

Q

Is R^2 always positive?

Answer

A

No - it is negative when the model predicts worse than always guessing the mean of the target
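
A sketch of the usual formula, R^2 = 1 - SS_res / SS_tot, shows how a negative value can arise. The function name and the sample predictions are assumptions for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    # Squared error left over after using the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean of the target.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]   # close to the actual values
bad = [4.0, 3.0, 2.0, 1.0]    # worse than predicting the mean

print(r_squared(actual, good))
print(r_squared(actual, bad))
```

The second model has more squared error than the mean-only baseline, so its R^2 falls below zero.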

Question 18

Q

Can you compare models built on different slices of the data for RMSE?

Answer

A

Yes

Question 19

Q

Can you compare models built on different slices of the data for R^2

Answer

A

No - the slices may have different amounts of natural variation in the target to begin with

Question 20

Q

How to calculate precision in a confusion matrix

Answer

A

Number in the 1/1 cell (true positives) divided by the total number of predicted positives (true positives + false positives)

Question 21

Q

How to calculate recall in a confusion matrix

Answer

A

Number in the 1/1 cell (true positives) divided by the total number of actual positives (true positives + false negatives)
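
Precision and recall can be sketched from the four confusion matrix counts. The helper name and the labels below are hypothetical; 1 marks the positive class.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    return tp, fp, fn, tn


# Hypothetical labels.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(precision, recall)
```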

Question 22

Q

Define precision

Answer

A

How often positive predictions are correct

Question 23

Q

Define recall

Answer

A

How well the model identifies positive cases

(23 cards)