Terminology Flashcards

1
Q

Overfitting

A

A phenomenon where a model performs very well on the training data but poorly on validation data or other unseen data.

2
Q

Underfitting

A

When a model performs poorly even on the training set because it fails to capture important patterns or distinctions in the data.

3
Q

Validation set

A

A subset of the data, held out from model fitting, used to evaluate the model during the tuning stages.

4
Q

Training set

A

A data set used to fit the model.

5
Q

Testing set

A

A data set, kept separate from training and tuning, used to provide an unbiased evaluation of the final model.
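
The three kinds of data set described in these cards can be produced by one shuffled split. This is a minimal sketch using only the standard library; the function name `train_val_test_split` and the 20% fractions are assumptions chosen for illustration, not part of any library.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing sets."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # touched only for the final evaluation
    val = shuffled[n_test:n_test + n_val]    # used while tuning
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(10))
print(len(train), len(val), len(test))
```

The three sets do not overlap, so a score on the testing set is not influenced by decisions made during fitting or tuning.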

6
Q

Model fitting

A

Adjusting a model's parameters so that the model approximates the target function underlying the data.

7
Q

Decision tree regression

A

Similar to a decision tree classifier, but it predicts a continuous value; mean squared error is typically the criterion used to choose where to split.
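
To make the role of mean squared error concrete, here is a rough sketch of how a single split on one numeric feature could be chosen. The helper names `mse` and `best_split` are hypothetical, and real implementations handle many features and repeat the search recursively.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(xs, ys):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weight each side's error by the number of samples it contains.
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, score = best_split(xs, ys)
print(threshold, score)
```

Each leaf would then predict the mean of the target values that fall into it.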

8
Q

Imputation

A

Filling in the missing values in a dataset, for example with the mean of the column.
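
One simple strategy is mean imputation. The sketch below assumes missing values are represented as `None`; the function name `impute_mean` is made up for this example.

```python
def impute_mean(column):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

ages = [20, None, 40, None]
filled = impute_mean(ages)
print(filled)
```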

9
Q

Categorical attributes

A

Attributes whose values come from a limited set of categories rather than from a numeric range.

10
Q

What techniques are used to deal with categorical data?

A

1) Dropping categorical columns
2) Label encoding
3) One-hot encoding

11
Q

Label encoding

A

Assigning a unique integer to each distinct value of a categorical attribute.
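
A small sketch of the idea, assuming integers are assigned in alphabetical order of the categories. That ordering, and the function name `label_encode`, are illustrative choices only.

```python
def label_encode(values):
    """Map each distinct category to an integer and encode the column."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

colors = ["red", "green", "blue", "green"]
codes, mapping = label_encode(colors)
print(codes)
print(mapping)
```

The integers imply an ordering (here blue < green < red) that a model may treat as meaningful, which is one reason one-hot encoding is often preferred for unordered categories.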

12
Q

One-hot encoding

A

Creating a new binary column for each unique value of a categorical attribute, set to 1 where the row has that value and 0 otherwise.
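
As a rough illustration, the sketch below builds one 0/1 column per category. The function name `one_hot_encode` and the alphabetical column order are assumptions made for this example.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "blue", "green"]
categories, rows = one_hot_encode(colors)
print(categories)
for row in rows:
    print(row)
```

Every row contains exactly one 1, in the column that matches its original category.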

13
Q

Pipeline

A

A way to bundle preprocessing and modeling steps so that they run together as a single automated workflow.
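
The bundling can be sketched with three toy classes. `MeanImputer`, `LinearModel`, and this `Pipeline` are hypothetical stand-ins written for illustration on a single numeric feature; they are not the API of any particular library.

```python
class MeanImputer:
    """Preprocessing step: learn the mean during fit, fill None during transform."""
    def fit(self, xs):
        present = [x for x in xs if x is not None]
        self.mean_ = sum(present) / len(present)
        return self

    def transform(self, xs):
        return [self.mean_ if x is None else x for x in xs]

class LinearModel:
    """Least-squares line through one feature."""
    def fit(self, xs, ys):
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, xs):
        return [self.slope_ * x + self.intercept_ for x in xs]

class Pipeline:
    """Run every preprocessing step, then the model, behind one fit/predict pair."""
    def __init__(self, preprocessors, model):
        self.preprocessors = preprocessors
        self.model = model

    def fit(self, xs, ys):
        for step in self.preprocessors:
            xs = step.fit(xs).transform(xs)
        self.model.fit(xs, ys)
        return self

    def predict(self, xs):
        # Reuse statistics learned during fit; nothing is re-fitted on new data.
        for step in self.preprocessors:
            xs = step.transform(xs)
        return self.model.predict(xs)

pipe = Pipeline([MeanImputer()], LinearModel())
pipe.fit([1.0, None, 3.0], [2.0, 4.0, 6.0])
predictions = pipe.predict([None, 4.0])
print(predictions)
```

Because the preprocessing statistics are learned only inside `fit`, the same object can be applied to validation or testing data without recomputing them on that data.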

14
Q

Cross validation

A

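Repeatedly splitting the training data into folds, training on some folds and validating on the remaining one, then averaging the scores to obtain a more reliable measure of model quality.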
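
The fold bookkeeping can be sketched as follows. This assumes plain k-fold cross validation without shuffling, and the generator name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(6, 3))
for train_idx, val_idx in folds:
    print(train_idx, val_idx)
```

Each sample lands in a validation fold exactly once, so the averaged score draws on all of the training data.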

15
Q

Variance

A

How much a model's results change when it is trained or evaluated on different data sets; high variance is associated with overfitting.

16
Q

Data leakage

A

When the training data contains information about the target that will not be available at prediction time, so the model scores very well on training (and often validation) data but performs poorly on unseen data.