Machine Learning Flashcards
Using pandas to open CSV files
pandas is a software library that can be used to open and explore CSV files - it must be imported first
We use pd.read_csv() to read the file into a DataFrame
and data.describe() to show summary statistics for the data
data.columns shows the column names
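A minimal sketch (data.csv is a placeholder file name):
import pandas as pd
data = pd.read_csv('data.csv')  # placeholder path - read the CSV into a DataFrame
print(data.describe())          # summary statistics for each numeric column
print(data.columns)             # the column names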
Removing missing values using pandas
Use data.dropna(axis=0) - with axis=0 this drops every row that contains a missing value
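For example, assuming the data DataFrame from the sketch above:
data = data.dropna(axis=0)  # keep only complete rows
print(data.isna().sum())    # confirm no missing values remain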
Selecting the prediction target using pandas
The prediction target is the column we want to predict
We select it using dot notation
By convention, the prediction target is called y
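For example (Price is a hypothetical column name):
y = data.Price  # the column we want the model to predict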
Features
These are the columns that are input into the model and later used to make predictions
The list of features is conventionally denoted by X
e.g. X = data[features], where features is a list of the column names we want to use
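For example (these column names are hypothetical):
features = ['Rooms', 'Bathrooms', 'YearBuilt']
X = data[features]  # a DataFrame containing only the feature columns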
The steps to building and using a machine learning model
Define - what type of model is going to be used?
Fit - capture patterns from provided data
Predict - use the fitted model to make predictions on data
Evaluate - determine how accurate the predictions are
Using scikit-learn to produce a decision tree model
from sklearn.tree import DecisionTreeRegressor
# Define the model; random_state makes the results reproducible
model_name = DecisionTreeRegressor(random_state=1)
# Fit: learn patterns from the features X and target y
model_name.fit(X, y)
print(X.head())                      # the first 5 rows of the features
print(model_name.predict(X.head()))  # predictions for those 5 rows
The predict function uses the model to make predictions from the given data
Why do we specify random_state?
Many machine learning models involve some degree of randomness during training
Specifying a number ensures the results are the same each run
The DataFrame.head() pandas method
This returns the first 5 rows by default
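e.g., assuming data is a DataFrame:
print(data.head())    # the first 5 rows
print(data.head(10))  # pass a number to show a different amount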
Mean Absolute Error (MAE)
This is the average of how far off the model's predictions are from the real values, i.e. the mean of |actual - predicted|
syntax is as follows:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(actual_values, predicted_values)
e.g.
model_name.fit(train_X, train_y)
val_predictions = model_name.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
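For intuition, MAE can also be computed by hand (assuming val_y and val_predictions from the example above):
import numpy as np
manual_mae = np.mean(np.abs(val_y - val_predictions))  # mean of the absolute errors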
If we are using a single data set for training and testing our model, how do we ensure some of the data isn't used in building the model, so it can be used as validation data?
Use the train_test_split function, which randomly splits the data into a training set and a validation set
from sklearn.model_selection import train_test_split
# Split the features and target into training and validation portions
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model_name.fit(train_X, train_y)       # train on the training portion only
predicted = model_name.predict(val_X)  # predict on the held-out portion
Overfitting
When we use a deep decision tree with many splits, it ends up capturing spurious patterns in the data set
The model matches the training data perfectly but does poorly on validation data or new data
Underfitting
When we use an extremely shallow tree with few splits
The model fails to capture the main patterns and distinctions so it performs poorly even in the training data
Underfitting and overfitting - what do we want?
We want the sweet spot between these two extremes
max_leaf_nodes
An argument in the DecisionTreeRegressor class
It sets the maximum number of leaf nodes the decision tree may have
This can be used to fine-tune the model, as in the sketch below, to ensure we aren't under- or overfitting
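A minimal sketch of tuning max_leaf_nodes (assuming the train_X, val_X, train_y, val_y split from above):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

# Try a few candidate sizes and keep the one with the lowest validation MAE
for size in [5, 50, 500, 5000]:
    print(size, get_mae(size, train_X, val_X, train_y, val_y))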
model_name.predict()
takes the features to predict from, e.g. val_X
where val_X is the validation features we want to use to predict val_y
this returns an array of predictions - the model's estimates of val_y