Kaggle Flashcards

1
Q

How do we select the prediction target?

A

We pull out a variable with dot notation. The single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y.

Example:
y = melbourne_data.Price

2
Q

How do we choose the “features”?

A

The columns that are input into our model are called “features”. In our case, those are the columns used to determine the home price. Sometimes we use all columns except the target (y); other times we are better off with fewer features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
(Note: 'Lattitude' and 'Longtitude' are the column names as actually spelled in the Melbourne dataset.)

By convention, this data is called X:
X = melbourne_data[melbourne_features]

3
Q

How can we quickly review the data we’ll be using to predict prices?

A

We use the describe() and head() methods.
Example:
X.describe()
X.head()

4
Q

What are the steps to building and using a model?

A

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.

Fit: Capture patterns from provided data. This is the heart of modeling.

Predict: Just what it sounds like.

Evaluate: Determine how accurate the model’s predictions are.

Example of defining a decision tree with scikit-learn and fitting it with features and target variable:

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure the same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

# Make predictions
predictions = melbourne_model.predict(X)
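The Evaluate step isn’t shown in the snippet above. A minimal, self-contained sketch of it, using made-up toy data in place of the Melbourne dataset, might look like:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for X and y (hypothetical, not the real Melbourne data)
X = [[2, 1, 150], [3, 2, 300], [4, 2, 420], [3, 1, 210]]
y = [400000, 650000, 900000, 550000]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# Evaluate: compare predictions against the true values
predictions = model.predict(X)
mae = mean_absolute_error(y, predictions)
# Evaluating on the training data: an unconstrained tree memorizes it, so MAE is 0.0
print(mae)
```

The perfect score here is exactly the trap discussed in the validation-data card: scoring a model on its own training data flatters it, which is why held-out validation data is used instead.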

5
Q

What is the random_state=1 used for?

A

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run. This is considered a good practice. You can use any number, and model quality won’t depend meaningfully on exactly which value you choose.
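As a quick demonstration of that reproducibility, here is a sketch with hypothetical toy data: two models built with the same random_state produce identical predictions.

```python
from sklearn.tree import DecisionTreeRegressor

# Toy data, purely to illustrate seeding (not the Melbourne dataset)
X = [[1], [2], [3], [4]]
y = [10, 20, 30, 40]

model_a = DecisionTreeRegressor(random_state=1).fit(X, y)
model_b = DecisionTreeRegressor(random_state=1).fit(X, y)

# Same random_state -> same training decisions -> identical predictions
same = list(model_a.predict(X)) == list(model_b.predict(X))
print(same)  # True
```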

6
Q

What does MAE stand for?

A

Mean Absolute Error
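Spelled out, MAE is the average of the absolute prediction errors. A pure-Python illustration with made-up numbers:

```python
# Hypothetical actual and predicted prices
actual = [150, 200, 120]
predicted = [100, 220, 120]

# Absolute error for each prediction: |actual - predicted|
errors = [abs(a - p) for a, p in zip(actual, predicted)]  # [50, 20, 0]

# MAE is the mean of those absolute errors: (50 + 20 + 0) / 3
mae = sum(errors) / len(errors)
print(mae)  # about 23.33
```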

7
Q

How do we calculate the prediction error?

A

error = actual - predicted
So if a house costs 150 and we predict it costs 100, the error is 50.

8
Q

What is validation data?

A

Since a model’s practical value comes from making predictions on new data, we measure performance on data that wasn’t used to build the model. The most straightforward way to do this is to exclude some data from the model-building process, and then use that held-out data to test the model’s accuracy on data it hasn’t seen before.

9
Q

How do we code the mean_absolute_error?

A

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

10
Q

Explain overfitting

A

Overfitting is when a model matches the training data almost perfectly but does poorly on validation data and other new data.

Overfitting: capturing spurious patterns that won’t recur in the future, leading to less accurate predictions

11
Q

Explain underfitting

A

When a model has not learned the patterns in the training data well and is unable to generalize to new data, it is known as underfitting. An underfit model performs poorly even on the training data and will produce unreliable predictions.

Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

12
Q

How can we define a function for controlling the tree depth and provide a sensible way of controlling overfitting vs. underfitting?

A

We can use a utility function to help compare MAE scores from different values for max_leaf_nodes:

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred_val)
    return mae

13
Q

How can we use a for loop to compare the accuracy of models built with different values for max_leaf_nodes?

A

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes} \t Mean Absolute Error: {my_mae}")
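A natural follow-up, once the loop has produced a score per candidate, is to pick the value of max_leaf_nodes with the lowest MAE. A sketch using hypothetical MAE values in place of real get_mae calls (since those need the Melbourne data):

```python
# Hypothetical MAE per max_leaf_nodes candidate, standing in for get_mae results
scores = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# min with a key function returns the dict key whose score is smallest
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)  # 500
```

The chosen best_tree_size can then be used to fit a final model on all the data.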
