Machine Learning Flashcards
using pandas to open CSV files
pandas is a software library that can be used to open and display CSV files - it must be imported first
We use pandas.read_csv() to read the file
and dataframe.describe() to show summary statistics of the data
dataframe.columns shows the column names
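For example, a minimal sketch (the file name melb_data.csv and the variable home_data are hypothetical):
import pandas as pd

# read the CSV file into a DataFrame
home_data = pd.read_csv('melb_data.csv')
print(home_data.describe())   # summary statistics for each numeric column
print(home_data.columns)      # the column names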
removing missing values using pandas
use dataframe.dropna(axis=0) to drop rows containing missing values
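For example, reusing the hypothetical home_data DataFrame from above:
# axis=0 drops any row that contains a missing value
filtered_data = home_data.dropna(axis=0)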
Selecting the prediction target using pandas
The prediction target is the column we want to predict
we select it using dot notation
the prediction target is conventionally called y
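For example, assuming home_data has a Price column we want to predict (a hypothetical column name):
y = home_data.Price   # dot notation selects the Price column as the prediction target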
Features
These are the columns that are input into the model and later used to make predictions
The list of features is denoted by X
e.g. X = data_name[features], where features is a list of the column names we want to use
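For example, with a hypothetical list of column names:
features = ['Rooms', 'Bathroom', 'Landsize']   # hypothetical column names
X = home_data[features]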
The steps to building and using a machine learning model
Define - what type of model is going to be used?
Fit - capture patterns from provided data
Predict - use the fitted model to generate predictions
Evaluate - determine how accurate the predictions are
Using scikit-learn to produce a decision tree model
from sklearn.tree import DecisionTreeRegressor
model_name = DecisionTreeRegressor(random_state=1)
model_name.fit(X, y)
print(X.head())
print(model_name.predict(X.head()))
The predict function uses the model to make predictions from the given data
Why do we specify random_state?
Many machine learning models involve some degree of randomness in training
Specifying a number ensures the results are the same on each run
the dataname.head() pandas method
This returns the first 5 rows of the DataFrame
Mean Absolute Error
This is the average absolute difference between the model's predictions and the real values
syntax is as follows:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(actualvalues, predictedvalues)
e.g.
model_name.fit(train_X, train_y)
val_predictions = model_name.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
If we are using a single data set for training and testing our model, how do we ensure some of the data isn't used in building the model so it can serve as validation data?
Use the train_test_split function; this randomly splits the data into data for training and data for validation
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
model_name.fit(train_X, train_y)
predicted = model_name.predict(val_X)
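The held-out predictions can then be scored with mean_absolute_error, reusing the names above:
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(val_y, predicted))   # error on data the model never saw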
Overfitting
When we use a deep decision tree with many splits - it ends up capturing spurious patterns in the data set
The model matches training data perfectly but it does poorly when faced with validation data or new data
Underfitting
When we use an extremely shallow tree with few splits
The model fails to capture the main patterns and distinctions so it performs poorly even in the training data
Underfitting and Overfitting what do we want?
We want the sweet spot between these two extremes
max_leaf_nodes
An argument in the DecisionTreeRegressor class
It sets the maximum number of leaves in the decision tree
This can be used to fine-tune the model to ensure we aren't under- or overfitting
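A sketch of how this tuning might look, assuming train_X, val_X, train_y, val_y from train_test_split (the candidate sizes are arbitrary):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # fit a tree capped at max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds = model.predict(val_X)
    return mean_absolute_error(val_y, preds)

# try several tree sizes and keep the one with the lowest validation error
for max_leaf_nodes in [5, 50, 500, 5000]:
    print(max_leaf_nodes, get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y))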
model_name.predict()
takes one argument, e.g. val_X
where val_X is the validation features we want to use to predict val_y
this returns an array of predicted values (the model's estimates of val_y)
DecisionTreeRegressor model
Makes use of a decision tree to determine patterns from training data and then make predictions based off of those patterns
DecisionTreeRegressor() is a class imported from the sklearn.tree module of scikit-learn
It takes arguments such as max_leaf_nodes and random_state
RandomForestRegressor model
Makes use of multiple decision trees and makes predictions by averaging the predictions of each component tree
It generally makes better predictions than a single decision tree and doesn’t usually require the same amount of fine tuning
It is imported from the sklearn.ensemble module of scikit-learn
e.g. model_name = RandomForestRegressor(random_state=1)
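A fuller sketch, assuming the train/validation split from earlier:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)            # fit many trees on the training data
forest_preds = forest_model.predict(val_X)    # predictions are averaged across trees
print(mean_absolute_error(val_y, forest_preds))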
Dealing with missing data - imputation
Imputation replaces missing values with a substitute value - SimpleImputer fills in the mean of each column by default
from sklearn.impute import SimpleImputer
import pandas as pd

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
Vectorisation
This is the way a computer transforms words of a sentence into numbers
the computer gives each word an index; each sentence is then represented by an array in which the number at each index is the count of how many times the word with that index occurs
e.g. ‘nice pizza is nice’ , ‘what is pizza’ pizza:0, is:1, nice:2, what:3
so [1, 1, 1, 1, 0] and [1, 1, 0, 1]
Each sentence vector is always the same length as the total vocabulary size - this is called a ‘bag of words’, and we no longer know the order of the words in the sentence
To do this we use the CountVectorizer class from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
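For example, using the two sentences above (note that CountVectorizer sorts its vocabulary alphabetically, so the indices differ from the hand-worked example):
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['nice pizza is nice', 'what is pizza']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)  # learn the vocabulary and count each word
print(vectorizer.get_feature_names_out())     # ['is' 'nice' 'pizza' 'what']
print(counts.toarray())                       # [[1 2 1 0] [1 0 1 1]]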
Classifier - decision trees
A classifier is a statistical model that tries to predict a label for a given input
A machine learning classifier can be trained - if we give it labelled data it can learn rules based on that data
The simplest of these is a decision tree, which uses a set of rules with yes/no answers arranged in a tree structure
When the input makes it to a leaf node it acquires a label
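A minimal sketch using DecisionTreeClassifier from sklearn.tree, with made-up labelled data:
from sklearn.tree import DecisionTreeClassifier

# made-up labelled data: two features per example and a 0/1 label
X = [[0, 0], [1, 1], [1, 0], [0, 1]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)                   # learn yes/no rules from the labelled data
print(clf.predict([[1, 0]]))    # the input reaches a leaf node and acquires a label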