Exploring data Flashcards
Learning how to explore data after you clean it.
Visualize how to get the variance of a dataframe using groupby
df.groupby(by="col1")[["col2", "col3", "col4"]].var()
Visualize how to use .describe() on groups and choose which percentiles it reports with the percentiles parameter.
df.groupby(by="col1")[["col2", "col3", "col4"]].describe(percentiles=[0.25, 0.5, 0.75])
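A small worked example may help make the two groupby cards above concrete. The data here is made up for illustration (the column names follow the cards, the values are hypothetical), so treat it as a sketch and not as output from a real dataset.

```python
import pandas as pd

# Hypothetical data: two groups ("a" and "b") with two numeric columns.
df = pd.DataFrame({
    "col1": ["a", "a", "a", "b", "b", "b"],
    "col2": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "col3": [5.0, 5.0, 5.0, 1.0, 2.0, 6.0],
})

# Sample variance (ddof=1) of each selected column within each group.
group_var = df.groupby(by="col1")[["col2", "col3"]].var()
print(group_var)

# Summary statistics per group, including the requested percentiles.
group_desc = df.groupby(by="col1")[["col2", "col3"]].describe(
    percentiles=[0.25, 0.5, 0.75]
)
print(group_desc["col2"])
```

The describe() result has two-level columns: the first level is the original column name and the second is the statistic (count, mean, std, min, 25%, 50%, 75%, max).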
What is a histogram? Visualize how to create one.
df.plot(kind="hist")
It displays the distribution of numerical data
It divides data into bins and shows frequency of observations in each bin
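To see what "bins" and "frequency" mean without drawing anything, the counting step can be sketched with numpy.histogram. Using NumPy here is an assumption made for illustration (the card itself uses pandas), and the values are hypothetical. Note that df.plot(kind="hist") uses 10 bins unless you pass bins=.

```python
import numpy as np

# Hypothetical sample values.
values = [1, 2, 2, 3, 3, 3, 7, 8, 9]

# Split the range [1, 9] into 4 equal-width bins and count the values in each.
# Every bin includes its left edge; only the last bin also includes its right edge.
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # observations per bin: 3, 3, 0, 3
```

A histogram plot draws one bar per bin, with the bar height equal to that bin's count.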
Pandas: Visualize how to create a bar chart
df.plot(kind="bar")
It compares different categories and shows values as bars of various lengths.
matplotlib: Visualize how to create a pie chart
import matplotlib.pyplot as plt
labels = "L1", "L2", "L3"
sizes = [10, 20, 25]
fig, ax = plt.subplots()
ax.pie(sizes, labels=labels, autopct="%1.1f%%", pctdistance=1.25, labeldistance=0.6, colors=["C1", "C2", "C3"])
Use pctdistance (greater than 1) to move the percentages outside the pie and labeldistance (less than 1) to move the labels inside it.
A. Visualize how you can use .agg() on all columns
B. Visualize how you can use .agg() on specific columns
C. Visualize how you can use .agg() with .groupby()
D. Visualize how to rename columns with .agg()
A. df.agg(["mean", "sum", "max"])
B. df.agg({"col1": "mean", "col2": ["sum", "min"], "col3": lambda x: x.std()})
C. df.groupby("col_group").agg({"col1": "mean", "col2": "sum", "col3": "max"})
D. df.groupby("group_column").agg(mean_col1=("col1", "mean"), sum_col2=("col2", "sum"))
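Here is a minimal sketch of cards A and D on a tiny DataFrame. The values are hypothetical; the column names follow the cards.

```python
import pandas as pd

# Hypothetical data with one grouping column and two numeric columns.
df = pd.DataFrame({
    "group_column": ["x", "x", "y", "y"],
    "col1": [1.0, 3.0, 5.0, 7.0],
    "col2": [10, 20, 30, 40],
})

# A. Several aggregations on every selected column.
# The result has one row per aggregation name and one column per input column.
all_cols = df[["col1", "col2"]].agg(["mean", "sum", "max"])
print(all_cols)

# D. Named aggregation: new_column_name=(source_column, function).
named = df.groupby("group_column").agg(
    mean_col1=("col1", "mean"),
    sum_col2=("col2", "sum"),
)
print(named)
```

Named aggregation is handy because the result comes back with flat, descriptive column names (mean_col1, sum_col2) instead of a two-level column index.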
Visualize how to reset the index of a DataFrame
df.reset_index()
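A common place to use reset_index() is right after a groupby, where the grouping column has become the index. The sketch below uses hypothetical car data to show the before and after.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "model": ["Car A", "Car A", "Car B"],
    "city_mpg": [20, 22, 25],
})

# groupby makes "model" the index of the result.
grouped = df.groupby("model")[["city_mpg"]].mean()
print(grouped)

# reset_index() returns a new DataFrame in which "model" is a regular column
# again and the index is the default 0, 1, 2, ...; grouped itself is unchanged.
flat = grouped.reset_index()
print(flat)
```

Pass drop=True if you want to discard the old index instead of keeping it as a column.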
Visualize an example of how to use groupby to calculate mean
import pandas as pd
data = {"model": ["Car A", "Car A", "Car B", "Car B", "Car C"], "city_mpg": [20, 22, 25, 27, 18]}
df = pd.DataFrame(data)
mean_mpg = df.groupby("model")["city_mpg"].mean()
print(mean_mpg)
Output:
model
Car A    21.0
Car B    26.0
Car C    18.0
Name: city_mpg, dtype: float64
A. Visualize how to calculate the mean on a dataframe.
B. Visualize how to calculate the mean for each group in a column
A. df.mean(numeric_only=True)
B. df.groupby("col").mean(numeric_only=True)
What does standard deviation measure?
How much each point differs from the mean, or how spread out the data is.
.std()
What is variance?
Variance helps us understand how the numbers in a group differ from the average, giving a sense of how scattered or clustered the data is.
.var()
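The link between the two cards above is that variance is the standard deviation squared. The sketch below checks this on a small hypothetical Series and also shows the ddof parameter: pandas divides by n - 1 by default (sample statistics), and ddof=0 divides by n (population statistics).

```python
import pandas as pd

# Hypothetical values; their mean is 5.
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Default: sample variance and standard deviation (ddof=1, divide by n - 1).
sample_var = s.var()
sample_std = s.std()

# Population versions (ddof=0, divide by n).
pop_var = s.var(ddof=0)   # 32 / 8 = 4.0
pop_std = s.std(ddof=0)   # sqrt(4.0) = 2.0

print(sample_var, sample_std)
print(pop_var, pop_std)
```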
What are quantiles? And how do you use them?
Quantiles are values that split a group of data into equal parts.
df[["col1", "col2", "col3"]].quantile(q=[0.25, 0.50, 0.75, 1])
Each value in q is a fraction between 0 and 1, and you can change the list to whichever quantiles you want.
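As a quick illustration of the quantile card, here is a sketch on hypothetical data. The result has one row per requested quantile and one column per input column.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": [10, 20, 30, 40, 50],
})

# Quartiles plus the maximum (q=1) for each column.
quartiles = df[["col1", "col2"]].quantile(q=[0.25, 0.50, 0.75, 1])
print(quartiles)
```

The row labelled 0.50 is the median, and the row labelled 1 is the maximum of each column.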
What method would you use to show the total capital gain and total capital loss?
dataframe[["capital-gain", "capital-loss"]].sum()
What are the different pandas plotting methods?
A. df.hist(figsize=(#,#)); or df[col].hist(figsize=(8,8));
B. df.plot(kind="box", figsize=(#,#)) or df[col].plot(kind="box", figsize=(#,#))
C. df.plot.bar() or df[col].plot.bar()
D. df.plot.pie(y=col) or df[col].plot.pie()
E. pd.plotting.scatter_matrix(df)
F. df.plot.scatter(x=col1, y=col2) (DataFrame only)
G. df.plot.box() or df.boxplot(); for one column, df[col].plot.box()
How would you plot a bar chart with the value counts?
df["col"].value_counts().plot(kind="bar");
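The one-liner above does two things: value_counts() tallies how often each category appears, and .plot(kind="bar") draws one bar per category. The sketch below separates the two steps on hypothetical data. The "Agg" backend line is an assumption so the example runs without a display; leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"col": ["red", "blue", "red", "green", "red", "blue"]})

# Step 1: count occurrences of each category, most frequent first.
counts = df["col"].value_counts()
print(counts)

# Step 2: draw one bar per category; the bar height is the count.
ax = counts.plot(kind="bar")
ax.set_ylabel("count")
```

Because value_counts() sorts in descending order, the bars run from the most common category to the least common.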