Data Science Flashcards
Why pre-process raw data?
To drop/filter out data that has missing values or that appears to be incorrect
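Example (a minimal sketch, assuming pandas is available; the DataFrame, its column names, and the 0 to 120 age range are made up for illustration):

```python
import pandas as pd

# Hypothetical raw data: one missing age and one implausible age.
raw = pd.DataFrame({
    "age": [34, None, 29, 410],
    "income": [52000, 61000, 48000, 58000],
})

# Drop rows that contain missing values.
clean = raw.dropna()

# Filter out rows whose age looks incorrect (assumed valid range: 0 to 120).
clean = clean[clean["age"].between(0, 120)]

print(clean)
```

Only the rows with ages 34 and 29 survive both steps.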
Different data types in pandas
-int64 (integers)
-float64 (decimals)
-object (text strings)
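Example (a minimal sketch; the column names are hypothetical). Checking `dtypes` shows which type pandas inferred for each column. Note that, depending on the pandas version, text columns may be reported with a dedicated string dtype instead of `object`:

```python
import pandas as pd

# Hypothetical table with one integer, one decimal, and one text column.
df = pd.DataFrame({
    "quantity": [1, 2, 3],
    "price": [9.99, 4.5, 12.0],
    "name": ["apple", "pear", "plum"],
})

# One dtype per column: quantity is int64, price is float64,
# name is a text dtype (object in many pandas versions).
print(df.dtypes)
```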
Define feature engineering
Create new features from existing ones
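Example (a minimal sketch; the housing columns and the two derived features are hypothetical choices, not a prescribed recipe):

```python
import pandas as pd

# Hypothetical housing data.
homes = pd.DataFrame({
    "price": [300000, 450000],
    "sqft": [1500, 1800],
    "year_built": [1990, 2010],
    "year_sold": [2020, 2020],
})

# New features built only from existing columns.
homes["price_per_sqft"] = homes["price"] / homes["sqft"]
homes["age_at_sale"] = homes["year_sold"] - homes["year_built"]

print(homes[["price_per_sqft", "age_at_sale"]])
```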
Why do exploratory data analysis?
-to uncover what is useful/interesting
-to get ideas on how to start building a model
Define model
A function that takes feature values as input and outputs a predicted target value
Model types
-regression (numeric target)
-classification (categorical target, e.g. binary = 2 classes)
What makes a good model?
Learns from signal (patterns) and ignores noise (randomness)
Define underfitting (bias)
Model is not complex enough to capture the signal in the data
Define sample/selection bias
When the data used to train/test a model is not representative enough of the population that the model is being applied to
Define overfitting
Model is too complex and has mistaken noise for signal
How can you tell a model has been overfitted?
The model will perform well on data it has seen before but badly on new data
How to avoid overfitting?
Use a train/test split - give the model training data but evaluate it on unseen testing data, so overfitting shows up before the model is relied on
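Example (a minimal sketch in plain Python; `split_rows` is a hypothetical helper written for this card, and the 25% test fraction is an arbitrary choice. In practice a library routine such as scikit-learn's `train_test_split` is commonly used instead):

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (feature, target) pairs.
data = [(x, 2 * x + 1) for x in range(20)]

train, test = split_rows(data)
print(len(train), len(test))  # 15 5
```

The model is fitted on `train` only; `test` is held back and used just for evaluation.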
Define MSE
Mean squared error - the average of the squared differences between the model's predictions and the actual values
Define RMSE
Root mean squared error - the square root of MSE. Similar to a standard deviation: the typical distance between the model's predictions and the actual values, in the same units as the target
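Example (a minimal sketch in plain Python; the `actual` and `predicted` values are made up):

```python
import math

def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 11]

# Errors are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4.
print(mse(actual, predicted))   # 1.5
print(rmse(actual, predicted))  # about 1.22
```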
Is RMSE always positive?
Yes - it is never negative (it is 0 only when every prediction is exactly right)
Define R^2
Coefficient of determination - the proportion of the variation in the target variable that can be attributed to variation in the input features
Is R^2 always positive?
No - it can be negative when the model fits worse than always predicting the mean of the target
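Example (a minimal sketch in plain Python, using the common definition R^2 = 1 - SS_res / SS_tot; the data is made up):

```python
def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [1, 2, 3, 4]
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predict the mean
reversed_fit = [4, 3, 2, 1]        # predictions worse than the mean

print(r_squared(actual, mean_only))     # 0.0
print(r_squared(actual, reversed_fit))  # -3.0
```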
Can you compare models built on different slices of the data for RMSE?
Yes
Can you compare models built on different slices of the data for R^2?
No - the slices may have different amounts of natural variation in the target to begin with
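Example (a minimal sketch in plain Python with two made-up slices): the prediction errors are identical in both slices, so RMSE matches, but R^2 differs because the second slice has far more variation in its target.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Slice A: targets vary little. Slice B: targets vary a lot.
actual_a, predicted_a = [1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5]
actual_b, predicted_b = [10, 20, 30, 40], [10.5, 19.5, 30.5, 39.5]

print(rmse(actual_a, predicted_a), rmse(actual_b, predicted_b))  # 0.5 0.5
print(r_squared(actual_a, predicted_a))  # about 0.8
print(r_squared(actual_b, predicted_b))  # about 0.998
```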
How to calculate precision in a confusion matrix?
Number in the 1/1 cell (true positives) divided by the number in the 1/All total (all predicted positives): TP / (TP + FP)
How to calculate recall in a confusion matrix?
Number in the 1/1 cell (true positives) divided by the number in the All/1 total (all actual positives): TP / (TP + FN)
Define precision
How often positive predictions are correct - the proportion of predicted positives that are actually positive
Define recall
How well the model identifies positive cases - the proportion of actual positives that the model predicts as positive
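Example (a minimal sketch in plain Python; `confusion_counts` is a hypothetical helper and the labels are made up, with 1 as the positive class):

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 1, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)

precision = tp / (tp + fp)  # correct positives / all predicted positives
recall = tp / (tp + fn)     # correct positives / all actual positives

print(precision)  # 0.6
print(recall)     # 0.75
```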