Lesson 17 Exploratory analysis Flashcards
Import the following CSV from this location, specifying that the data is separated by the delimiter ;
r'C:\Users\User\Documents\CFG_DATA\data\winequality-red.csv'
df = pd.read_csv(r'C:\Users\User\Documents\CFG_DATA\data\winequality-red.csv', sep=';')
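A minimal sketch of what sep=';' changes, using a small in-memory sample instead of the real file. The three rows and the column choice below are made up for illustration and are not taken from the actual CSV.

```python
import io
import pandas as pd

# Made-up sample in a semicolon-separated layout (not the real data).
raw = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.20;5\n10.5;3.30;6\n"

# The default separator is a comma, so each line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' each field becomes its own column.
sample = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(sample.shape)
```

If the shape shows only one column after loading, a wrong separator is a likely cause.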
Check the labels for each column
df.columns.values
or
df.keys()
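Both answers give the same labels in different containers. A small sketch, assuming a hypothetical two-column DataFrame named sample:

```python
import pandas as pd

# Hypothetical toy DataFrame, used only to compare the two calls.
sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})

labels_array = sample.columns.values   # NumPy array of the column labels
labels_index = sample.keys()           # pandas Index, the same object type as sample.columns

print(list(labels_array))
print(list(labels_index))
```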
What is the number of rows?
Columns?
df.shape[0]  # number of rows
df.shape[1]  # number of columns
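Since shape is a (rows, columns) tuple, both numbers can be read in one step. A short sketch on a made-up DataFrame:

```python
import pandas as pd

# Made-up data: 3 rows, 2 columns.
sample = pd.DataFrame({"alcohol": [9.4, 9.8, 10.5], "quality": [5, 5, 6]})

# Unpack the (rows, columns) tuple in one assignment.
n_rows, n_cols = sample.shape

# len() on a DataFrame also returns the number of rows.
print(n_rows, n_cols, len(sample))
```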
Check information for each column
df.info()
Return the unique values from a column called quality.
df.quality.unique()
Calculate the frequency of each unique value in the “quality” column of the DataFrame df (return a Series with the unique values as the index and their respective counts as the values.)
df.quality.value_counts()
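A quick sketch of how the two calls differ, using made-up quality scores. Bracket access (sample["quality"]) is used here; it should behave like the attribute form and also works for column names that contain spaces.

```python
import pandas as pd

# Made-up quality scores for illustration.
sample = pd.DataFrame({"quality": [5, 5, 6, 7, 5, 6]})

# unique() returns the distinct values in order of first appearance.
uniques = sample["quality"].unique()

# value_counts() returns a Series sorted by count, most frequent first.
counts = sample["quality"].value_counts()

# sort_index() reorders the counts by the quality value instead.
counts_by_value = counts.sort_index()

print(uniques)
print(counts_by_value)
```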
Check for missing values using a heatmap
cbar=False hides the colorbar; yticklabels=False hides the row labels
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
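The heatmap gives a visual impression; a numeric cross-check is to count the missing values per column. A minimal sketch on made-up data with two gaps:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
sample = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5],
    "quality": [5, 6, np.nan],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```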
Calculate the correlation between attributes
df.corr()
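A small sketch of what the correlation matrix looks like, on made-up numbers. corr() uses the Pearson coefficient by default; the result is symmetric with 1.0 on the diagonal. If a DataFrame also holds text columns, recent pandas versions may need numeric_only=True.

```python
import pandas as pd

# Made-up values: alcohol rises while density falls and quality rises.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "density": [0.9980, 0.9970, 0.9962, 0.9955, 0.9940],
    "quality": [5, 5, 6, 6, 7],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr()

print(corr.round(2))
```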
Build correlation heatmap
plt.figure(figsize=(6,4))
sns.heatmap(df.corr(),annot=False)
Increase the size of the heatmap.
plt.figure(figsize=(16, 6))
k = 12
specify the number of variables for the heatmap
Question: Why is it necessary to create this new heatmap of the correlation matrix?
Quality correlation matrix
Increase the size of the heatmap.
plt.figure(figsize=(16, 6))
k = 12 # number of variables for heatmap
cols = df.corr().nlargest(k, 'quality')['quality'].index
cm = df[cols].corr()
sns.heatmap(cm, annot=True)
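One way to read the nlargest step: it keeps the k columns whose correlation with quality is highest and orders them from strongest to weakest, so the heatmap is sorted around quality. A sketch on made-up data with k = 2 (the values are illustrative only):

```python
import pandas as pd

# Made-up data: alcohol tracks quality closely, pH only weakly, density negatively.
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "density": [0.9980, 0.9970, 0.9962, 0.9955, 0.9940],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51],
    "quality": [5, 5, 6, 6, 7],
})

k = 2  # number of variables to keep

# Rank the rows of the correlation matrix by their 'quality' value and keep the top k labels.
cols = sample.corr().nlargest(k, "quality")["quality"].index

# Correlation matrix restricted to those columns, in ranked order.
cm = sample[cols].corr()

print(list(cols))
print(cm.round(2))
```

Note that nlargest ranks by signed value, so strongly negative correlations come last. When k equals the total number of columns, nothing is dropped and the heatmap is only reordered.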
Create a boxplot
plt.boxplot(df_happy_gdp['Happiness_score'])
Set the title and labels
plt.title('Box Plot of Happiness Score')
plt.xlabel('Happiness Score')
plt.ylabel('Value')
plt.show()
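The numbers a box plot draws can also be computed directly. A minimal sketch, assuming a made-up Series called scores and the common 1.5 x IQR whisker rule (the matplotlib default):

```python
import pandas as pd

# Made-up scores for illustration, not the real happiness data.
scores = pd.Series([4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5])

q1 = scores.quantile(0.25)   # bottom edge of the box
median = scores.median()     # line inside the box
q3 = scores.quantile(0.75)   # top edge of the box
iqr = q3 - q1                # height of the box

# Points beyond these limits are drawn as individual outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = scores[(scores < lower_limit) | (scores > upper_limit)]

print(q1, median, q3, iqr)
print(len(outliers))
```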