Lesson 14 ML-SciKit Flashcards

1
Q

Read in the data:

import pandas as pd
df = pd.read_csv(r"C:\Users\User\Documents\CFG_DATA\Data_files\BostonHousing.csv", sep=",")

Plot a histogram of the target variable "medv".

A

df['medv'].hist(bins=15, figsize=(20, 15))

2
Q

Use seaborn to visually improve this histogram of the column df['medv'].

A

plt.figure(figsize=(15, 10))  # matplotlib
sns.histplot(df['medv'], bins=30, kde=True)  # seaborn; replaces the deprecated sns.distplot

3
Q

Use seaborn to create a heatmap of a correlation matrix.

A

plt.figure(figsize = (15,10))
correlation_matrix = df.corr()

sns.heatmap(data = correlation_matrix, annot = True)
plt.show()

4
Q

Create a scatterplot to visualize the relationship between MEDV and LSTAT.

Use df['lstat'] and df['medv'].

A

plt.figure(figsize = (11,11))
plt.scatter(df['lstat'], df['medv'], marker='o')
plt.xlabel("LSTAT")
plt.ylabel("MEDV")
plt.show()

5
Q

Import the function needed to split the data for modelling, then define the X and y variables: use the columns ['lstat', 'rm'] for X and 'medv' for y.

A

from sklearn.model_selection import train_test_split

X = df[['lstat', 'rm']]  # our feature(s)
y = df['medv']  # our target variable

6
Q

Split the data into training and test sets, holding out 20% of the rows for testing.

A

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=5)
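To see what the split produces without needing the CSV file, here is a minimal sketch on synthetic data. The `demo` frame and its values are made up for illustration; only the column names and the `train_test_split` arguments come from the cards.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (values are invented for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "lstat": rng.uniform(2, 35, size=100),
    "rm": rng.uniform(4, 9, size=100),
})
demo["medv"] = 30 - 0.6 * demo["lstat"] + 4 * (demo["rm"] - 6) + rng.normal(0, 2, size=100)

X_demo = demo[["lstat", "rm"]]
y_demo = demo["medv"]

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

print(Xd_train.shape, Xd_test.shape)  # (80, 2) (20, 2)
print(yd_train.shape, yd_test.shape)  # (80,) (20,)
```

With 100 rows, 80 go to training and 20 to testing. The row indices of X and y stay aligned because both are split with the same shuffle.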

7
Q

Plot a histogram of df['medv'] using bins=15 and figsize=(20, 15).

A

df['medv'].hist(bins=15, figsize=(20, 15))

8
Q

Check the shapes of X_train, X_test, y_train and y_test.

A

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

9
Q

Now use scikit-learn to train a linear regression model on the training data.

A

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
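Once a model is fitted, the held-out test set can be used to check it. The cards do not cover this step, so treat the following as a sketch under assumptions: the data is synthetic, and RMSE and R^2 are just two common choices of metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship: y = 3*x0 - 2*x1 + 5 + noise.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(0, 0.5, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

demo_model = LinearRegression()
demo_model.fit(Xd_train, yd_train)

# The learned coefficients should land close to the true values 3 and -2.
print(demo_model.coef_, demo_model.intercept_)

# Score the model on rows it did not see during fitting.
yd_pred = demo_model.predict(Xd_test)
rmse = np.sqrt(mean_squared_error(yd_test, yd_pred))
r2 = r2_score(yd_test, yd_pred)
print(f"RMSE: {rmse:.3f}, R^2: {r2:.3f}")
```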

10
Q

What does a linear regression model do?

A

A linear regression model fits a linear equation whose coefficients minimize the
sum of squared differences between the observed values and the predicted values.
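The least-squares idea can be checked numerically. This sketch uses made-up data and a hypothetical helper `sse`; it solves the least-squares problem directly with NumPy, compares the result with scikit-learn, and shows that nudging the slope away from the fitted value increases the squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data scattered around the line y = 2x + 1.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)

# Direct least-squares solution: the column of ones carries the intercept.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# scikit-learn should arrive at the same line.
fitted = LinearRegression().fit(x.reshape(-1, 1), y)


def sse(m, b):
    """Sum of squared errors for the line y = m*x + b (illustrative helper)."""
    return float(np.sum((y - (m * x + b)) ** 2))


print(slope, fitted.coef_[0])
print(intercept, fitted.intercept_)
# Any other slope gives a larger sum of squared errors.
print(sse(slope, intercept) < sse(slope + 0.1, intercept))
```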

11
Q

Work out the proportion of incomes over and under 50K from the income column.

A

df['income'].value_counts() / len(df['income'])
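pandas can also compute the proportions directly through the `normalize` argument of `value_counts`. A small sketch on a made-up `income_demo` Series shows that both routes give the same numbers:

```python
import pandas as pd

# Made-up income labels: three at or below 50K, one above.
income_demo = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Route used on the card: counts divided by the number of rows.
by_division = income_demo.value_counts() / len(income_demo)

# Equivalent built-in route.
by_normalize = income_demo.value_counts(normalize=True)

print(by_normalize["<=50K"], by_normalize[">50K"])  # 0.75 0.25
```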

12
Q

Create a pairplot for the df

A

sns.pairplot(df)

13
Q

Create a violin plot for the categorical features

categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

A

for feature in categorical_features:
    plt.figure(figsize=(13, 10))
    ax = sns.violinplot(x=feature, y="hours.per.week", hue="income",
                        data=df, palette="Set2", split=True)
    plt.xticks(rotation=90)

14
Q

Modelling categorical data requires converting it to numbers. Convert income values of '<=50K' to 0 and values of '>50K' to 1.

A

df.loc[df['income'] == '<=50K', 'income'] = 0
df.loc[df['income'] == '>50K', 'income'] = 1

15
Q

Ensure that the income column has an integer dtype.

A

df['income'] = df['income'].astype(int)
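The replacement and the cast can also be done in one step with `Series.map`. This is an alternative sketch on a made-up `demo` frame, not the method used on the cards:

```python
import pandas as pd

# Made-up frame with the two income labels from the cards.
demo = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})

# Map each label to its numeric code, then cast to int.
# Labels missing from the dictionary become NaN, and astype(int) then raises,
# which surfaces unexpected values instead of silently keeping them.
demo["income"] = demo["income"].map({"<=50K": 0, ">50K": 1}).astype(int)

print(demo["income"].tolist())  # [0, 1, 0]
print(demo["income"].dtype)
```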

16
Q

What is one-hot encoding?

A

One-hot encoding converts each categorical value into a new column and assigns a 1 or 0 (True/False) value to each row.
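As an illustration, `pandas.get_dummies` is one common way to one-hot encode a column. The `demo` frame below is made up; the cards do not say which function the lesson used.

```python
import pandas as pd

# Made-up frame with one categorical column and one numeric column.
demo = pd.DataFrame({"sex": ["Male", "Female", "Female"], "hours": [40, 38, 45]})

# Each distinct value of 'sex' becomes its own 0/1 column; 'hours' is left alone.
encoded = pd.get_dummies(demo, columns=["sex"], dtype=int)

print(encoded.columns.tolist())      # ['hours', 'sex_Female', 'sex_Male']
print(encoded["sex_Male"].tolist())  # [1, 0, 0]
```

One column with two categories became two columns. A column with many categories, such as native.country, would add one column per distinct value.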

17
Q

What are an advantage and a disadvantage of one-hot encoding?

A

Advantage: “neutral” representation of the data (does not assign an order)
Disadvantage: can significantly increase the number of columns in the dataset