Machine learning Flashcards

1
Q

Using pandas to open CSV files

A

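pandas is a software library for loading and exploring tabular data such as CSV files - it must be imported first, conventionally with import pandas as pd

We use pd.read_csv() to read the file into a DataFrame (called data_name here)

and data_name.describe() to show summary statistics for the data

data_name.columns shows the column names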
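
A minimal sketch of the calls above. The column names and values are made up for illustration, and io.StringIO stands in for a CSV file on disk so the example runs on its own; pd.read_csv also accepts a file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice you would pass a path such as "homes.csv".
csv_text = "Rooms,Price\n2,300000\n3,450000\n4,600000\n"

data_name = pd.read_csv(io.StringIO(csv_text))

print(data_name.describe())  # summary statistics for each numeric column
print(data_name.columns)     # the column labels
```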

2
Q

Removing missing values using pandas

A

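Use data_name.dropna(axis = 0), where data_name is the DataFrame. This returns a copy with every row that contains a missing value removed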
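
A small illustration, assuming a made-up DataFrame with two incomplete rows. Note that dropna returns a new DataFrame and leaves the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows 1 and 2 each contain one missing value (NaN).
data_name = pd.DataFrame(
    {"Rooms": [2, 3, np.nan], "Price": [300000, np.nan, 600000]}
)

cleaned = data_name.dropna(axis=0)  # axis=0 drops rows, not columns

print(len(data_name), len(cleaned))
```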

3
Q

Selecting the prediction target using pandas

A

The prediction target is the column we want to predict

We select it using dot notation, e.g. data_name.column_name

By convention the prediction target is called y
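
A short sketch, assuming a made-up DataFrame whose target column is called Price. Bracket notation selects the same column and also works when a column name contains spaces.

```python
import pandas as pd

# Hypothetical data; "Price" is the column we want to predict.
data_name = pd.DataFrame({"Rooms": [2, 3, 4], "Price": [300000, 450000, 600000]})

y = data_name.Price          # dot notation
same_y = data_name["Price"]  # equivalent bracket notation
```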

4
Q

Features

A

These are the columns that are fed into the model and used to make predictions

The feature data is denoted by X

e.g. X = data_name[features] where features is a list of the column names we want to use
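
For illustration, with made-up column names. Indexing a DataFrame with a list of names returns a DataFrame holding just those columns.

```python
import pandas as pd

# Hypothetical data with two candidate features and a target column.
data_name = pd.DataFrame(
    {"Rooms": [2, 3, 4], "Bathroom": [1, 1, 2], "Price": [300000, 450000, 600000]}
)

features = ["Rooms", "Bathroom"]
X = data_name[features]  # DataFrame containing only the feature columns

print(X.shape)
```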

5
Q

The steps to building and using a machine learning model

A

Define - what type of model is going to be used?

Fit - capture patterns from provided data

Predict - use the fitted model to make predictions

Evaluate - determine how accurate the predictions are
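
The four steps can be sketched as follows, assuming a tiny made-up data set and a decision tree as the model type. For brevity the model is evaluated on the same rows it was fitted on, which flatters the score; a held-out validation set gives a more honest estimate.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data set.
data_name = pd.DataFrame({"Rooms": [1, 2, 3, 4], "Price": [100, 200, 300, 400]})
X = data_name[["Rooms"]]
y = data_name.Price

model_name = DecisionTreeRegressor(random_state=1)  # Define
model_name.fit(X, y)                                # Fit
predictions = model_name.predict(X)                 # Predict
mae = mean_absolute_error(y, predictions)           # Evaluate

print(mae)
```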

6
Q

Using scikit-learn to produce a decision tree model

A

from sklearn.tree import DecisionTreeRegressor

model_name = DecisionTreeRegressor(random_state = 1)

model_name.fit(X, y)

print(X.head())

print(model_name.predict(X.head()))

The predict function uses the model to make predictions from the given data

7
Q

Why do we specify random_state?

A

Many machine learning models involve some randomness when they are trained

Specifying a number for random_state fixes that randomness so the results are the same on each run
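
A small check of this, assuming randomly generated data. Two trees built with the same random_state on the same data should give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = rng.random(50)
X_new = rng.random((10, 3))

first = DecisionTreeRegressor(random_state=1).fit(X, y)
second = DecisionTreeRegressor(random_state=1).fit(X, y)

# Same seed and same data, so the two models agree on every prediction.
same = np.array_equal(first.predict(X_new), second.predict(X_new))
print(same)
```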

8
Q

The dataname.head() pandas method

A

This returns the first 5 rows by default; dataname.head(n) returns the first n rows
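
For example, with a made-up ten-row DataFrame:

```python
import pandas as pd

dataname = pd.DataFrame({"n": range(10)})  # hypothetical data

top_five = dataname.head()    # default: first 5 rows
top_three = dataname.head(3)  # first 3 rows
```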

9
Q

Mean Absolute Error

A

This is the average of the absolute differences between the model's predictions and the actual values, i.e. how far off the model was on average

syntax is as follows:

from sklearn.metrics import mean_absolute_error

mean_absolute_error(actualvalues, predictedvalues)

e.g.
model_name.fit(train_X, train_y)
val_predictions = model_name.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
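
A worked example with made-up numbers, computing the error by hand and with scikit-learn. The absolute errors are 1, 0 and 3, so the mean is 4/3.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values.
actual = np.array([3.0, 5.0, 8.0])
predicted = np.array([2.0, 5.0, 11.0])

by_hand = np.mean(np.abs(actual - predicted))  # (1 + 0 + 3) / 3
from_sklearn = mean_absolute_error(actual, predicted)

print(by_hand, from_sklearn)
```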

10
Q

If we are using a single data set for both training and testing our model, how do we ensure some of the data isn’t used to build the model so it can be used as validation data?

A

Use the train_test_split function. This randomly splits the data into data for training and data for validation (by default 25% of the rows are held out)

from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

model_name.fit(train_X, train_y)

predicted = model_name.predict(val_X)
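
A quick sketch of what the split produces, assuming a made-up data set of 8 rows. With the default settings, 2 rows are held out for validation and 6 are kept for training, and no row appears in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set with 8 rows.
data_name = pd.DataFrame({"Rooms": range(1, 9), "Price": range(100, 900, 100)})
X = data_name[["Rooms"]]
y = data_name.Price

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

print(len(train_X), len(val_X))
```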

11
Q

Overfitting

A

When a model is too complex, e.g. a deep decision tree with many splits - it ends up capturing spurious patterns in the training data that will not recur in new data

The model matches the training data almost perfectly but does poorly when faced with validation data or new data

12
Q

Underfitting

A

When we use an extremely shallow tree with few splits

The model fails to capture the main patterns and distinctions, so it performs poorly even on the training data

13
Q

Underfitting and overfitting - what do we want?

A

We want the sweet spot between these - the level of model complexity at which the error on validation data is lowest

14
Q

max_leaf_nodes

A

An argument in the DecisionTreeRegressor class

It describes the maximum number of leaves in a decision tree

This can be used to fine-tune the model so that it neither underfits nor overfits
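
One way to tune it is to try several values and compare the mean absolute error on validation data. This is a sketch on synthetic data; the helper get_mae and the candidate values are assumptions for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, made up for this example.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)


def get_mae(max_leaf_nodes):
    """Hypothetical helper: return (training MAE, validation MAE) for one tree size."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae


results = {n: get_mae(n) for n in [2, 5, 25, 150]}
for n, (train_mae, val_mae) in results.items():
    print(f"max_leaf_nodes={n}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")

# Pick the candidate with the lowest validation error.
best = min(results, key=lambda n: results[n][1])
print("best max_leaf_nodes:", best)
```

A very small value tends to leave both errors high (underfitting), while a very large value drives the training error towards zero without a matching improvement on validation data (overfitting).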

15
Q

model_name.predict()

A

Takes one argument, e.g. val_X,
where val_X is the validation features for which we want predictions

This returns an array of predicted values, one per row of val_X, which can then be compared with the actual values val_y

16
Q

DecisionTreeRegressor model

A

Uses a decision tree to learn patterns from training data and then makes predictions based on those patterns

DecisionTreeRegressor() is a class imported from the sklearn.tree module from scikit-learn

It takes arguments such as max_leaf_nodes and random_state

17
Q

RandomForestRegressor model

A

Makes use of multiple decision trees and makes predictions by averaging the predictions of each component tree

It generally makes better predictions than a single decision tree and doesn’t usually require the same amount of fine tuning

It is imported from the sklearn.ensemble module of scikit-learn

e.g. model_name = RandomForestRegressor(random_state = 1)
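
A sketch on synthetic data showing the averaging. The fitted trees are available in the estimators_ attribute, and the forest's prediction should match the mean of their individual predictions. n_estimators=10 is chosen here only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up for this example.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = X[:, 0] + 2 * X[:, 1]

model_name = RandomForestRegressor(n_estimators=10, random_state=1)
model_name.fit(X, y)

forest_pred = model_name.predict(X[:5])

# Average the predictions of the individual component trees by hand.
tree_preds = np.array([tree.predict(X[:5]) for tree in model_name.estimators_])
manual_average = tree_preds.mean(axis=0)

print(np.allclose(forest_pred, manual_average))
```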

18
Q

Dealing with missing data - imputation

A

Imputation replaces missing values with a substitute value. SimpleImputer's default strategy fills each gap with the mean value of that column

import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation: fit on the training data, then reuse the same column means for validation
my_imputer = SimpleImputer()  # default strategy is "mean"
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
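
A self-contained illustration with made-up data. The missing Rooms value in the training set is filled with the training mean (2.0), and the same training mean is reused for the validation set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and validation features with missing values (NaN).
X_train = pd.DataFrame({"Rooms": [1.0, np.nan, 3.0], "Area": [50.0, 70.0, 90.0]})
X_valid = pd.DataFrame({"Rooms": [np.nan], "Area": [60.0]})

my_imputer = SimpleImputer()  # default strategy is "mean"
imputed_X_train = pd.DataFrame(
    my_imputer.fit_transform(X_train), columns=X_train.columns
)
imputed_X_valid = pd.DataFrame(
    my_imputer.transform(X_valid), columns=X_valid.columns
)

print(imputed_X_train)
print(imputed_X_valid)
```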
19
Q

Vectorisation

A

This is the way the words of a sentence are transformed into numbers so that a model can use them

Each word in the vocabulary is given an index, and each sentence is then represented by an array in which the number at each index is the number of times that word occurs in the sentence

e.g. ‘nice pizza is nice’ , ‘what is pizza’ with pizza:0, is:1, nice:2, what:3
gives [1, 1, 2, 0] and [1, 1, 0, 1]

Each sentence vector always has the same length as the total vocabulary size. This representation is called a ‘bag of words’ because we no longer know the order of the words in the sentence

To do this we use the CountVectorizer class from scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
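
A short sketch using the two example sentences. CountVectorizer assigns indices in alphabetical order (is:0, nice:1, pizza:2, what:3), so the columns come out in a different order from a hand-assigned vocabulary, but the counts are the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["nice pizza is nice", "what is pizza"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix.
counts = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.vocabulary_)  # maps each word to its column index
print(counts)
```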

20
Q

Classifier - decision trees

A

A classifier is a statistical model that tries to predict a label for a given input

A machine learning classifier can be trained - if we give it labelled data it can learn rules based on that data

The simplest of these is a decision tree. It uses a set of rules with yes/no answers arranged in a tree structure

When the input makes it to a leaf node it acquires a label
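
A minimal sketch using scikit-learn's DecisionTreeClassifier. The fruit data, the feature meanings and the labels are all made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled data: [weight in grams, smooth skin (1 = yes, 0 = no)].
X = [[150, 1], [170, 1], [130, 0], [120, 0]]
y = ["apple", "apple", "orange", "orange"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)  # learn yes/no rules from the labelled examples

# A new input follows the rules down to a leaf and receives that leaf's label.
label = clf.predict([[160, 1]])[0]
print(label)
```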