Exploratory data analysis Flashcards
Find the correlation between the following columns
df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr()
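As a rough illustration of what `.corr()` returns, here is a minimal sketch on a small made-up frame. The name `toy` and all values are assumptions for demonstration only; they are not taken from the car dataset behind `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `df`.
toy = pd.DataFrame({
    "bore": [3.47, 2.68, 3.19, 3.19, 3.13],
    "stroke": [2.68, 3.47, 3.40, 3.40, 3.40],
    "horsepower": [111, 154, 102, 115, 110],
})

# Pairwise Pearson correlation (the default method) between the columns.
corr_matrix = toy.corr()
print(corr_matrix)
```

The result is a square, symmetric table with 1.0 on the diagonal, since every column correlates perfectly with itself.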
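Draw a scatterplot of two columns with a fitted regression line
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)  # start the y-axis at zero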
Draw a boxplot of one column grouped by another
sns.boxplot(x="body-style", y="price", data=df)
Compute basic statistics for all variables
df.describe()
value_counts() is a good way to see how many rows fall into each category of a variable.
df['drive-wheels'].value_counts()
fwd    118
rwd     75
4wd      8
Name: drive-wheels, dtype: int64
Convert the above result to a data frame and rename the column
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
# On pandas 2.x the resulting column is named 'count' rather than 'drive-wheels'; adjust the key if so.
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
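The column name produced by `to_frame()` depends on the pandas version, so a rename keyed on the old name can silently do nothing. A minimal sketch that avoids this by assigning the column name directly; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column standing in for df['drive-wheels'].
toy = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]})

counts = toy["drive-wheels"].value_counts().to_frame()

# The single column is 'drive-wheels' on older pandas and 'count' on pandas 2.x,
# so set the name directly instead of relying on rename().
counts.columns = ["value_counts"]
counts.index.name = "drive-wheels"
print(counts)
```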
Find the distinct groups
df['drive-wheels'].unique()
Select multiple columns and assign them to one variable
This is the first step of grouping.
df_group_one = df[['drive-wheels', 'body-style', 'price']]
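Calculate the average price for each category
# numeric_only=True skips the non-numeric 'body-style' column, which would otherwise raise an error on pandas 2.x
df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)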
df_group_one
df_gptest = df[['drive-wheels', 'body-style', 'price']]
grouped_test1 = df_gptest.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
grouped_test1
This groups the dataframe by each unique combination of 'drive-wheels' and 'body-style' and averages 'price' within each group.
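To see what grouping by two keys produces, here is a small sketch on invented data. The frame `toy` and its prices are assumptions for illustration, not values from the real dataset.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as df_gptest.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "sedan", "wagon", "wagon"],
    "price": [10000.0, 12000.0, 20000.0, 18000.0, 22000.0],
})

# One output row per (drive-wheels, body-style) pair, with the mean price.
grouped = toy.groupby(["drive-wheels", "body-style"], as_index=False).mean()
print(grouped)
```

Only combinations that actually occur in the data get a row, so the missing pair ('fwd', 'wagon') does not appear in the result.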
Convert it to pivot-table style
grouped_pivot = grouped_test1.pivot(index='drive-wheels', columns='body-style')
grouped_pivot
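A pivot leaves NaN wherever a combination has no data, which can matter for later plotting. The sketch below shows this on made-up values and fills the gaps with 0. The fill value is an assumption chosen for illustration; note that it makes a missing group look like a price of zero.

```python
import pandas as pd

# Hypothetical grouped result; ('fwd', 'wagon') is deliberately absent.
grouped = pd.DataFrame({
    "drive-wheels": ["fwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "wagon"],
    "price": [11000.0, 20000.0, 20000.0],
})

# Without `values=`, the remaining column ('price') becomes the top level
# of a two-level column index: ('price', 'sedan'), ('price', 'wagon').
pivot = grouped.pivot(index="drive-wheels", columns="body-style")

# Replace missing combinations with 0 (an illustrative choice, not the only one).
filled = pivot.fillna(0)
print(filled)
```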
Plot a heat map of the pivot table
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
Calculate the Pearson correlation coefficient and P-value
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, "with a P-value of P =", p_value)
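A self-contained sketch of the same calculation on invented numbers, cross-checked against NumPy. The arrays `width` and `price` below are hypothetical, and the 0.05 cut-off mentioned afterwards is a common convention rather than a fixed rule.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements standing in for df['width'] and df['price'].
width = np.array([64.1, 65.5, 66.2, 66.4, 68.9, 71.4])
price = np.array([13495.0, 16500.0, 13950.0, 17450.0, 23875.0, 30760.0])

# pearsonr returns the correlation coefficient and a two-sided p-value.
pearson_coef, p_value = stats.pearsonr(width, price)

# The coefficient should match the off-diagonal entry of NumPy's correlation matrix.
numpy_coef = np.corrcoef(width, price)[0, 1]

print("The Pearson Correlation Coefficient is", pearson_coef,
      "with a P-value of P =", p_value)
```

A coefficient near +1 or -1 indicates a strong linear relationship. A P-value below 0.05 is often read as statistically significant.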