Machine Learning Flashcards
using pandas to open CSV files
pandas is a software library that can be used to open and display CSV files - it must be imported first
We use pandas.read_csv() to read the file
and dataframe.describe() to show summary statistics of the data
dataframe.columns shows the column names
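For example, a minimal sketch (the file name melb_data.csv and the variable home_data are hypothetical):
import pandas as pd

# read the CSV file into a DataFrame
home_data = pd.read_csv('melb_data.csv')
print(home_data.describe())   # summary statistics for each numeric column
print(home_data.columns)      # the column names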
removing missing values using pandas
use dataframe.dropna(axis=0) to drop rows containing missing values
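For example, reusing the hypothetical home_data DataFrame from above:
# axis=0 drops any row that contains a missing value
filtered_data = home_data.dropna(axis=0)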
Selecting the prediction target using pandas
The prediction target is the column we want to predict
we select it using dot notation
the prediction target is conventionally called y
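For example, assuming home_data has a Price column we want to predict (a hypothetical column name):
y = home_data.Price   # dot notation selects the Price column as the prediction target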
Features
These are the columns that are input into the model and later used to make predictions
The list of features is denoted by X
e.g. X = data_name[features], where features is a list of the column names we want to use
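For example, with a hypothetical list of column names:
features = ['Rooms', 'Bathroom', 'Landsize']   # hypothetical column names
X = home_data[features]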
The steps to building and using a machine learning model
Define - what type of model is going to be used?
Fit - capture patterns from provided data
Predict - use the fitted model to generate predictions
Evaluate - determine how accurate the predictions are
Using scikit-learn to produce a decision tree model
from sklearn.tree import DecisionTreeRegressor
model_name = DecisionTreeRegressor(random_state=1)
model_name.fit(X, y)
print(X.head())
print(model_name.predict(X.head()))
The predict function uses the model to make predictions from the given data
Why do we specify random_state?
Many machine learning models involve some degree of randomness in training
Specifying a number ensures the results are the same on each run
the dataname.head() pandas method
This returns the first 5 rows of the DataFrame
Mean Absolute Error
This is the average absolute difference between the model's predictions and the real values
syntax is as follows:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(actualvalues, predictedvalues)
e.g.
model_name.fit(train_X, train_y)
val_predictions = model_name.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
If we are using a single data set for training and testing our model, how do we ensure some of the data isn't used in building the model so it can serve as validation data?
Use the train_test_split function; this randomly splits the data into data for training and data for validation
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model_name.fit(train_X, train_y)
predicted = model_name.predict(val_X)
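The held-out predictions can then be scored with mean_absolute_error, reusing the names above:
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, predicted))   # error on data the model never saw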
Overfitting
When we use a deep decision tree with many splits - it ends up capturing spurious patterns in the data set
The model matches training data perfectly but it does poorly when faced with validation data or new data
Underfitting
When we use an extremely shallow tree with few splits
The model fails to capture the main patterns and distinctions so it performs poorly even in the training data
Underfitting and Overfitting what do we want?
We want the sweet spot between these two extremes
max_leaf_nodes
An argument in the DecisionTreeRegressor class
It sets the maximum number of leaves in the decision tree
This can be used to fine-tune the model to ensure we aren't under- or overfitting
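A sketch of how this tuning might look, assuming train_X, val_X, train_y, val_y from train_test_split (the candidate sizes are arbitrary):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # fit a tree capped at max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

# try several tree sizes and keep the one with the lowest validation error
for max_leaf_nodes in [5, 50, 500, 5000]:
    print(max_leaf_nodes, get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y))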
model_name.predict()
takes one argument, e.g. val_X
where val_X is the validation features we want to use to predict val_y
this returns an array of predicted values (the model's estimates of val_y)
DecisionTreeRegressor model
Makes use of a decision tree to determine patterns from training data and then make predictions based off of those patterns
DecisionTreeRegressor() is a class imported from the sklearn.tree module of scikit-learn
It takes arguments such as max_leaf_nodes and random_state
RandomForestRegressor model
Makes use of multiple decision trees and makes predictions by averaging the predictions of each component tree
It generally makes better predictions than a single decision tree and doesn’t usually require the same amount of fine tuning
It is imported from the sklearn.ensemble module of scikit-learn
e.g. model_name = RandomForestRegressor(random_state=1)
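A fuller sketch, assuming the train/validation split from earlier:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)            # fit many trees on the training data
forest_preds = forest_model.predict(val_X)    # predictions are averaged across trees
print(mean_absolute_error(val_y, forest_preds))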
Dealing with missing data - imputation
Imputation replaces missing values with a substitute value - SimpleImputer fills in the mean of each column by default
from sklearn.impute import SimpleImputer
import pandas as pd

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
Vectorisation
This is the way a computer transforms words of a sentence into numbers
the computer gives each word an index; each sentence is then represented by an array in which the number at each index is the count of how many times the word with that index occurs
e.g. ‘nice pizza is nice’ , ‘what is pizza’ pizza:0, is:1, nice:2, what:3
so [1, 1, 1, 1, 0] and [1, 1, 0, 1]
Each sentence vector is always the same length as the total vocabulary size - this is called a ‘bag of words’, and we no longer know the order of the words in the sentence
To do this we use the CountVectorizer class from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
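For example, using the two sentences above (note that CountVectorizer sorts its vocabulary alphabetically, so the indices differ from the hand-worked example):
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['nice pizza is nice', 'what is pizza']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)  # learn the vocabulary and count each word
print(vectorizer.get_feature_names_out())     # ['is' 'nice' 'pizza' 'what']
print(counts.toarray())                       # [[1 2 1 0] [1 0 1 1]]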
Classifier - decision trees
A classifier is a statistical model that tries to predict a label for a given input
A machine learning classifier can be trained - if we give it labelled data it can learn rules based on that data
The simplest of these is a decision tree, which uses a set of rules with yes/no answers arranged in a tree structure
When the input makes it to a leaf node it acquires a label
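A minimal sketch using DecisionTreeClassifier from sklearn.tree, with made-up labelled data:
from sklearn.tree import DecisionTreeClassifier

# made-up labelled data: two features per example and a 0/1 label
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                   # learn yes/no rules from the labelled data
print(clf.predict([[1, 0]]))    # the input reaches a leaf node and acquires a label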