Lesson 14 ML-SciKit Flashcards

Question 1

Q

Read in the data:

import pandas as pd
df = pd.read_csv(r"C:\Users\User\Documents\CFG_DATA\Data_files\BostonHousing.csv", sep=",")

Plot a histogram of the target variable “medv”

Answer

A

df['medv'].hist(bins=15, figsize=(20, 15))

Question 2

Q

Use seaborn to visually improve this histogram of the column df['medv'].

Answer

A

plt.figure(figsize=(15, 10))  # matplotlib
sns.histplot(df['medv'], bins=30, kde=True)  # seaborn; histplot replaces the deprecated distplot

Question 3

Q

Use seaborn to create a heatmap of a correlation matrix.

Answer

A

plt.figure(figsize = (15,10))
correlation_matrix = df.corr()

sns.heatmap(data = correlation_matrix, annot = True)
plt.show()

Question 4

Q

Create a scatterplot to visualize the relationship between MEDV and LSTAT

Use df['lstat'] and df['medv']

Answer

A

plt.figure(figsize = (11,11))
plt.scatter(df['lstat'], df['medv'], marker='o')
plt.xlabel("LSTAT")
plt.ylabel(“MEDV”)
plt.show()

Question 5

Q

Import the function needed to split the data for modelling, then define X from the columns ['lstat', 'rm'] and y from the column 'medv'.

Answer

A

from sklearn.model_selection import train_test_split

X = df[['lstat', 'rm']]  # our feature(s)
y = df['medv']  # our target variable

Question 6

Q

Split data into train and test parts

Answer

A

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=5)

Question 7

Q

Plot a histogram of df['medv'] using bins=15 and figsize=(20, 15)

Answer

A

df['medv'].hist(bins=15, figsize=(20, 15))

Question 8

Q

Examine the shapes of what you have so far: X_train, X_test, y_train and y_test

Answer

A

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
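
As a rough illustration of what these shapes look like, here is a minimal sketch on made-up data. The 100-row frame `toy` and its random values are assumptions for illustration only, not the Boston housing file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, two features and one target.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "lstat": rng.random(100),
    "rm": rng.random(100),
    "medv": rng.random(100),
})

X = toy[["lstat", "rm"]]
y = toy["medv"]

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5
)

for name, part in [("X_train", X_train), ("X_test", X_test),
                   ("y_train", y_train), ("y_test", y_test)]:
    print(name, part.shape)
```

With 100 rows and test_size=0.2, 80 rows go to training and 20 to testing; X keeps its two columns while y is one-dimensional.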

Question 9

Q

Let’s now use scikit-learn to train our simple linear regression model.

Answer

A

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
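
A fitted model can then be inspected and used to predict. The sketch below uses made-up data where the true relationship is known; the names `X_toy`, `y_toy` and `toy_model` are hypothetical and not part of the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: the target is an exact linear function of two features.
rng = np.random.default_rng(0)
X_toy = rng.random((50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(toy_model.coef_, toy_model.intercept_)

# predict returns one value per input row.
preds = toy_model.predict(X_toy[:3])
print(preds)
```

Because the toy target has no noise, the learned coefficients should recover the weights used to build it, up to floating-point error.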

Question 10

Q

What does a linear regression model do?

Answer

A

A linear regression model fits a linear equation to the data by choosing the coefficients
that minimize the sum of squared differences between the observed values and the predicted values.
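
To make the idea of minimizing the distance concrete, here is a minimal sketch using NumPy's least-squares solver rather than scikit-learn. The data points and the hand-picked guess line are made up for illustration.

```python
import numpy as np

# Hypothetical observations that lie roughly on a straight line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Design matrix: one column for x and a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Least squares picks (slope, intercept) minimizing sum((y - A @ coef) ** 2).
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

# Sum of squared errors for the fitted line.
sse_fit = float(np.sum((y - A @ coef) ** 2))

# Sum of squared errors for a hand-picked guess, y = 2x + 1.
sse_guess = float(np.sum((y - (2.0 * x + 1.0)) ** 2))

print(slope, intercept, sse_fit, sse_guess)
```

The fitted line can never have a larger sum of squared errors than the guess, because the least-squares solution is the minimizer over all lines.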

Question 11

Q

Work out the proportion of salary income over and under 50K from the income column.

Answer

A

df['income'].value_counts() / len(df['income'])
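
An equivalent way to get proportions is the `normalize=True` option of `value_counts`. A minimal sketch on a made-up Series (the four values below are hypothetical, not the real income column):

```python
import pandas as pd

# Hypothetical income labels: three low earners and one high earner.
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"])

# normalize=True returns fractions of the total instead of raw counts.
proportions = income.value_counts(normalize=True)
print(proportions)
```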

Question 12

Q

Create a pairplot for the df

Answer

A

sns.pairplot(df)

Question 13

Q

Create a violin plot for the categorical features

categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

Answer

A

for feature in categorical_features:
    plt.figure(figsize=(13, 10))
    ax = sns.violinplot(x=feature, y="hours.per.week", hue="income",
                        data=df, palette="Set2", split=True)
    plt.xticks(rotation=90)

Question 14

Q

Modelling categorical data requires converting it to numbers. Convert income values of '<=50K' to 0 and values of '>50K' to 1.

Answer

A

df.loc[df['income'] == '<=50K', 'income'] = 0
df.loc[df['income'] == '>50K', 'income'] = 1

Question 15

Q

Ensure that income is an integer.

Answer

A

df['income'] = df['income'].astype(int)
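
The recoding and the integer conversion can also be done in one step with `Series.map`. This is an alternative sketch on made-up values, not the lesson's own method; the Series `income` below is hypothetical.

```python
import pandas as pd

# Hypothetical income labels in the same format as the income column.
income = pd.Series(["<=50K", ">50K", "<=50K"])

# Map each label to its numeric code, then make the dtype an explicit integer.
income_encoded = income.map({"<=50K": 0, ">50K": 1}).astype(int)
print(income_encoded.tolist())
```

Note that `map` yields missing values for any label not in the dictionary, in which case `astype(int)` raises an error, so unexpected labels surface early.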

Question 16

Q

What is one-hot encoding?

Answer

A

One-hot encoding converts each categorical value into a new column and assigns a 1 or 0 (True/False) value to each row.
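
One common way to do this in pandas is `pd.get_dummies`. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column.
toy = pd.DataFrame({"sex": ["Male", "Female", "Female"], "age": [39, 50, 38]})

# Each distinct value of 'sex' becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded)
```

The original 'sex' column is replaced by 'sex_Female' and 'sex_Male', and each row has a 1 in exactly one of them.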

Question 17

Q

What is one advantage and one disadvantage of one-hot encoding?

Answer

A

Advantage: “neutral” representation of the data (does not assign an order)
Disadvantage: can significantly increase the number of columns in the dataset
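
The growth in columns can be seen on a small made-up example: one new column appears per distinct category. The frame `toy` below is hypothetical and far smaller than a real dataset.

```python
import pandas as pd

# Hypothetical frame: one categorical column with four distinct values.
toy = pd.DataFrame({
    "native.country": ["United-States", "Mexico", "India", "Cuba", "Mexico"],
    "hours.per.week": [40, 45, 38, 50, 40],
})

encoded = pd.get_dummies(toy, columns=["native.country"])

# One categorical column is replaced by one indicator column per category.
n_categories = toy["native.country"].nunique()
print(toy.shape[1], "columns before,", encoded.shape[1], "columns after")
```

A column with hundreds of distinct values would add hundreds of columns in the same way.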

(17 cards)