Terminology Flashcards
Overfitting
A phenomenon where a model performs very well on the training data but poorly during validation or on new data.
Underfitting
When a model performs poorly, even on the training set, because it failed to capture important features or distinctions in the data.
Validation set
A subset of the training data set, held out from fitting, used to check a model's accuracy during the tuning stages.
Training set
A data set used to fit the model.
Testing set
A data set, kept separate from training and tuning, used to provide an unbiased evaluation of the final model.
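The training, validation, and testing sets can all be carved out of one data set. A minimal sketch, assuming a 60/20/20 split; the helper name `split_dataset` and the fractions are illustrative choices, not a standard API:

```python
import random

def split_dataset(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(10))
```

Every row ends up in exactly one of the three sets, so the testing set is never seen while fitting or tuning.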
Model fitting
Adjusting a model's parameters so that it approximates the target function underlying the data.
Decision tree regression
Similar to a decision tree classifier, but it predicts a continuous value; mean squared error is used to choose where to split.
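How mean squared error can guide a single split is sketched below. This is a simplified illustration for one feature; the function names `mse` and `best_split` are hypothetical, and real libraries typically test midpoints between values and apply the search recursively:

```python
def mse(values):
    """Mean squared error of predicting the mean for every value."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return the threshold on xs that minimizes the weighted MSE of both sides."""
    best_threshold, best_score = None, float("inf")
    # Skip the smallest value so the left side is never empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.5, 4.5, 20.0, 21.0, 19.0]
threshold, score = best_split(xs, ys)  # threshold is 10 for this data
```

Each side of the split would then predict the mean of its own target values.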
Imputation
The filling of missing values in a dataset, for example with the column mean.
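One common strategy is to fill gaps with the mean of the observed values. A minimal sketch, assuming missing entries are represented as `None` and at least one value is present; `impute_mean` is an illustrative name:

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [20, None, 40, None, 30]
filled = impute_mean(ages)  # [20, 30.0, 40, 30.0, 30]
```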
Categorical attributes
Attributes whose values fall into a fixed set of categories (for example, colors or brand names) rather than onto a numeric scale.
What techniques are used to deal with categorical data?
1) Dropping categorical columns
2) Label encoding
3) One-hot encoding
Label encoding
Assigning a unique integer to each distinct value of a categorical attribute.
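A minimal sketch of label encoding, assuming integers are assigned in sorted order of the category names; `label_encode` is an illustrative helper, not a library function:

```python
def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {category: index
               for index, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

colors = ["red", "green", "blue", "green"]
encoded, mapping = label_encode(colors)
# encoded is [2, 1, 0, 1]; mapping is {"blue": 0, "green": 1, "red": 2}
```

The integers imply an ordering, which may mislead a model when the categories have no natural order.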
One-hot encoding
Creating a new column for each unique category value, holding 1 where a row has that value and 0 otherwise.
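A minimal sketch of one-hot encoding for a single attribute, assuming columns are ordered by sorted category name; `one_hot_encode` is an illustrative helper:

```python
def one_hot_encode(values):
    """Return the category names and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
# categories is ["blue", "green", "red"]
# rows is [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, so no ordering between categories is implied.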
Pipeline
Automating a workflow by bundling the preprocessing and modeling steps together.
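The idea of bundling steps can be sketched with a toy pipeline. All three classes below (`MeanImputer`, `ProportionalModel`, `SimplePipeline`) are hypothetical stand-ins written for illustration; they only mimic the fit/transform/predict convention that libraries such as scikit-learn use:

```python
class MeanImputer:
    """Preprocessing step: fill None with the mean learned during fit."""
    def fit(self, xs):
        observed = [x for x in xs if x is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, xs):
        return [self.mean_ if x is None else x for x in xs]

class ProportionalModel:
    """Toy model y = slope * x, fit by least squares through the origin."""
    def fit(self, xs, ys):
        self.slope_ = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, xs):
        return [self.slope_ * x for x in xs]

class SimplePipeline:
    """Run the preprocessing steps in order, then the model."""
    def __init__(self, preprocessors, model):
        self.preprocessors = preprocessors
        self.model = model

    def fit(self, xs, ys):
        for step in self.preprocessors:
            xs = step.fit(xs).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Reuse what each step learned during fit; do not refit on new data.
        for step in self.preprocessors:
            xs = step.transform(xs)
        return self.model.predict(xs)

pipeline = SimplePipeline([MeanImputer()], ProportionalModel())
pipeline.fit([1.0, None, 3.0], [2.0, 4.0, 6.0])
predictions = pipeline.predict([None, 5.0])  # [4.0, 10.0]
```

Because preprocessing lives inside the pipeline, new data passes through exactly the same steps as the training data.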
Cross validation
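Splitting the training data into several subsets (folds) and, in turn, validating on each fold while training on the rest, to provide a more reliable estimate of model performance.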
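A minimal sketch of how k-fold cross validation divides the row indices, assuming contiguous, unshuffled folds; `k_fold_splits` is an illustrative helper:

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_splits(6, 3))
# folds[0] is ([2, 3, 4, 5], [0, 1])
```

A model would be fit once per fold and the validation scores averaged.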
Variance
How much a model's results change when it is trained or tested on different data sets; high variance is a sign of overfitting.