Lesson 14 ML-SciKit Flashcards
Read in the data:
import pandas as pd
df = pd.read_csv(r"C:\Users\User\Documents\CFG_DATA\Data_files\BostonHousing.csv", sep=",")
Plot a histogram of the target variable “medv”
df['medv'].hist(bins=15, figsize=(20,15))
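Use seaborn to visually improve this histogram using the column df['medv']
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15, 10)) # matplotlib: set the figure size
sns.histplot(df['medv'], bins=30, kde=True) # seaborn: histplot replaces the deprecated distplot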
Use seaborn to create a heatmap of a correlation matrix.
plt.figure(figsize = (15,10))
correlation_matrix = df.corr()
sns.heatmap(data = correlation_matrix, annot = True)
plt.show()
Create a scatterplot to visualize the relationship between MEDV and LSTAT
Use df['lstat'] and df['medv']
plt.figure(figsize = (11,11))
plt.scatter(df['lstat'], df['medv'], marker='o')
plt.xlabel("LSTAT")
plt.ylabel("MEDV")
plt.show()
Import the function needed to split the data for modelling, then define X from the columns 'lstat' and 'rm', and y from the column 'medv'.
from sklearn.model_selection import train_test_split
X = df[['lstat', 'rm']] # our feature(s)
y = df['medv'] # our target variable
Split data into train and test parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=5)
Plot a histogram of every column in the df using bins=15 and figsize=(20,15)
df.hist(bins=15, figsize=(20,15))
Check the shapes of X_train, X_test, y_train and y_test
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
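To try the split and shape check without the CSV, here is a minimal sketch on made-up data. The random 506-row DataFrame is an assumption standing in for BostonHousing.csv (506 is the row count of the standard Boston housing data), and the `_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the real file: random values, same column names as the cards
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "lstat": rng.uniform(1, 40, size=506),
    "rm": rng.uniform(3, 9, size=506),
    "medv": rng.uniform(5, 50, size=506),
})

X_demo = demo[["lstat", "rm"]]  # two feature columns -> 2-D
y_demo = demo["medv"]           # one target column -> 1-D

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5
)

print(X_tr.shape, X_te.shape)  # (404, 2) (102, 2)
print(y_tr.shape, y_te.shape)  # (404,) (102,)
```

The feature shapes have two dimensions because X is a DataFrame, while the target shapes have one because y is a Series.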
Let’s now use scikit-learn to train our simple linear regression model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
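Once a model is fitted, a natural next step is to look at what it learned and how well it predicts held-out data. The sketch below does this on synthetic data with a known linear relationship. The data, the coefficient values and the `demo_model` name are assumptions for illustration, not results from the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
# Synthetic target: y = 30 - 0.6*x0 + 5*x1 + noise
y_demo = 30.0 - 0.6 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=5)

demo_model = LinearRegression().fit(X_tr, y_tr)

# Learned parameters: one coefficient per feature, plus the intercept
print(demo_model.coef_, demo_model.intercept_)

# Evaluate on the held-out test rows
y_pred = demo_model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))  # typical error, in target units
r2 = r2_score(y_te, y_pred)                       # share of variance explained
print(rmse, r2)
```

The fitted coefficients should land close to -0.6 and 5.0 because the data were generated that way. On real data there is no such ground truth, so the test-set RMSE and R² are the main checks.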
What does a linear regression model do?
A linear regression model fits a linear equation whose coefficients minimize
the sum of squared differences between the observed values and the predicted values (ordinary least squares).
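A small check of the "minimizes" claim, as a sketch on synthetic data: solving the least-squares problem by hand with NumPy should give the same intercept and coefficients as LinearRegression, and nudging those coefficients should only increase the squared error. The data and the helper `sse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(0, 0.5, size=100)

ols = LinearRegression().fit(X_demo, y_demo)

# Same fit by hand: prepend a column of ones for the intercept, then solve least squares
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def sse(b):
    """Sum of squared differences between observed and predicted values."""
    return float(np.sum((y_demo - A @ b) ** 2))

# beta = [intercept, coef_0, coef_1] should match scikit-learn's fit
print(np.allclose(beta, np.r_[ols.intercept_, ols.coef_]))

# Moving away from the fitted coefficients makes the squared error larger
print(sse(beta), sse(beta + 0.1), sse(beta - 0.1))
```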
Work out the proportion of people earning over and under 50K from the income column.
df['income'].value_counts()/len(df['income'])
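The same proportions can also be obtained with the `normalize=True` option of `value_counts`. A minimal sketch on a four-row toy frame (the `toy` data is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the income column: three rows at or under 50K, one over
toy = pd.DataFrame({"income": ["<=50K", "<=50K", ">50K", "<=50K"]})

# Counts divided by the number of rows, as on the card
by_division = toy["income"].value_counts() / len(toy["income"])

# Equivalent: let pandas do the division
by_normalize = toy["income"].value_counts(normalize=True)

print(by_normalize)
```

Both give 0.75 for '<=50K' and 0.25 for '>50K' here.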
Create a pairplot for the df
sns.pairplot(df)
Create a violin plot for the categorical features
categorical_features = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']
for feature in categorical_features:
    plt.figure(figsize=(13, 10))
    ax = sns.violinplot(x=feature, y="hours.per.week", hue="income",
                        data=df, palette="Set2", split=True)
    plt.xticks(rotation=90)
Modelling categorical data: the values need to be converted to numbers. Convert income values of '<=50K' to 0 and values of '>50K' to 1.
df.loc[df['income'] == '<=50K', 'income'] = 0
df.loc[df['income'] == '>50K', 'income'] = 1
Ensure that income is an integer
df['income'] = df['income'].astype(int)
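An alternative sketch of the same encoding uses `Series.map` with a dictionary, which keeps the label-to-number mapping in one place. The three-row `toy` frame is made up for illustration, and it assumes the labels are exactly '<=50K' and '>50K' with no stray whitespace.

```python
import pandas as pd

toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K"]})

# Map each label to its number, then make the dtype an integer
income_codes = {"<=50K": 0, ">50K": 1}
toy["income"] = toy["income"].map(income_codes).astype(int)

print(toy["income"].tolist())  # [0, 1, 0]
```

Any label missing from the dictionary becomes NaN, and `astype(int)` then raises an error, which is a useful early warning that the column holds unexpected values.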