Communicating Results Flashcards
Summarizing descriptive statistics, plotting visualizations, drawing conclustions, and customizing visuals to communicate results
What is the .groupby() method
This allows you to group data by columns and aggregate info about groupings. The numeric_only excludes values that aren’t numeric.
df.groupby(“column_name”).mean(numeric_only=True)
or
df.groupby([“workclass”,”race”], as_index=False)[“capital-gain”].mean()
What is summation or .sum()?
It aggregates data vertically .sum(axis=0) or horizontally .sum(axis=1).
df_census[[“capital_gain”,”capital-loss”]].sum()
Visualize how to get the sum while using .groupby and then sort the values in descending order
df.groupby(by=”column”).sum(numeric_only=True).sort_values(by=”column2”, ascending=False)
What are the measures of center?
Mean = .mean()
Median = .medain()
Mode = .mode()
What is the mean?
It is the average or sum of all numbers in set/by number of values in the set
What is the median
The center value in a set. Always sort the values first then calculate the median.
What is the mode?
It is the value with the highest frequency in a set
If you had a column named color that contained the following values
df = { ‘color’ : [‘red,’ ‘blue,’ ‘red,’ ‘green,’ ‘blue,’ ‘blue’]}, what you would get if you used count as df[‘color’].value_counts()?
color : count
blue : 3
red : 2
green :1
What does df[‘col’].values do?
It returns a NumPy array containing all the values in the column, including duplicates, in the order they appear. It contains only the raw data. no column labels or indexes.
What does df.index do?
.index returns the row labels of a data frame or a series. You can use .index to view or inspect or change the row labels.
What does .columns do?
Provides an index object contianing the names of all the columns.
Why does df.index output: RangeIndex(start=0, stop=3, step=1)?
Pandas assigns a numerical RangeIndex starting from 0 when you create a DataFrame without specifying row labels.
This describes the range of the row indices in the DataFrame:
start=0: The first row index starts at 0.
stop=3: The range stops before 3, so the indices are [0, 1, 2].
step=1: The indices increment by 1 between rows.
Visualize how to:
- Use .index
to get the unique values.
- Use .values
to get the corresponding counts.
- Combine .index
and .values
to iterate through the value counts.
- Use .sort_index()
or .sort_values()
to sort as needed.
The .index gives you the unique values(categories)
unique_values = count.index
print(unique_values)
The counts themselves are accessed using .values
counts_values = counts.values
print(counts_values)
Sort the results by unique values
sorted_counts = counts.sort_index()
What is .zip()?
Zip is a python function that combines two or more iterables like lists, tuples or strings into a single iterator of tuples
ex.
list1 = [1,2,3]
list2 = [‘a’,’b’,’c’]
zipped = zip(list 1, list2)
print(list(zipped))
Output : [(1, ‘a’), (2, ‘b’), (3, ‘c’)]
What are the two ways to check if a column contains a value
A.
Uses the bitwise | to filter based on two conditions
var = df[(df[‘col’] == ‘data’) | (df[‘col’] == ‘data2’)]
B.
Uses the .isin() to check if the values exist in the column.
var = df[df[‘col’].isin([‘data1’, ‘data2’])]
What is the union of two events?
This calculates the probability that either event P(A) or P(B) happens (or both).
It’s P(A or B)
You add the probabilities of P(A) and P(B) because you’re counting all outcomes that include P(A) or P(B).
You subtract the overlap ((A∩B))
What is the formula for union of two events?
P (A or B) = P(A) + P(B) - P(A∩B)
What is the intersection of two events
It’s P(A and B), calculating that the events A and B happen at the same time
You only consider the overlap between P(A) and P(B)—where both events occur.
What is the formula for the intersection of two events?
P(A and B) = P(A∩B)
What is the conditional probability
This calculates the probability that A happens given B already happened. You’re “zooming in” on the subset of outcomes where P(B) happens and asking, “What proportion of those also include P(A)?”
What is the formula of conditional probablity
P(A | B) = (A∩B)/P(B)
Visualize how to find the count for unique values in a column?
filter for the unique values in the column
query_df = df.query(‘column == “data” or column == “data2”’)
#count the unique values
unique_column_value_count = query_df[‘column’].nunique()
Visualize how to convert the column names to lowercase
df[‘column’].str.lower()
Explain what each part of this code means
df.groupby(‘veh_class’).agg(mean_cmb_mpg=(‘cmb_mpg’,’mean’)).reset_index()
df.groupby(‘veh_class’) grows the rows by the veh_class column
mean_cmb_mpg is the name of the new column that will hold the mean values.
(‘cmb_mpg’, ‘mean’) specifies that you want to calculate the mean of the cmb_mpg column.
.reset_index() resets the index for the new dataframe created
T or F: you can use .agg() with NaNs
False. You have to use.dropna() or .fillna() first before calculating the mean.
Visualize how to calculate an increase for a column
Now, let’s merge the two DataFrames on ‘veh_class’ to compare them
Calculate mean for both years
veh_08 = df_08.groupby(‘veh_class’).agg(mean_cmb_mpg_08=(‘cmb_mpg’, ‘mean’)).reset_index()
veh_18 = df_18.groupby(‘veh_class’).agg(mean_cmb_mpg_18=(‘cmb_mpg’, ‘mean’)).reset_index()
inc = pd.merge(veh_08, veh_18, on=’veh_class’)
# Calculate the increase in fuel economy
inc[‘increase’] = inc[‘mean_cmb_mpg_18’] - inc[‘mean_cmb_mpg_08’]
# Display
inc
Explain what this code is doing?
pd.DataFrame(
{“year”: [“2008”, “2018”], “model_num”: [smart_08[“cmb_mpg”].mean(), smart_18[“cmb_mpg”].mean()]}
)
pd.DataFrame(…): This function creates a new DataFrame .
Dictionary Input: The DataFrame is constructed from a dictionary where:
“year”: This key corresponds to a list of years, [“2008”, “2018”].
“model_num”: This key corresponds to the average combined MPG for SmartWay vehicles for 2008 and 2018.
smart_08[“cmb_mpg”].mean(): This calculates the mean of the cmb_mpg column in the smart_08
smart_18[“cmb_mpg”].mean(): Similarly, this calculates the mean of the cmb_mpg column in the smart_18
Visualize how to convert “ 1970-01-01 00:00:00.000000162” to a regular datetime
Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)
# Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])
# Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”
# Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)
Visualize how to extract the name and genre from a library in a column [{“id”: 28, “name”: “Action”}
df_genre = df_m[‘genres’]
# Create a new DataFrame
df_g = pd.DataFrame({‘genres’: df_genre})
# Convert the string representation of lists to actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Extract the first genre’s dictionary
df_g[‘first_genre’] = df_g[‘genres’].apply(lambda x: x[0] if len(x) > 0 else None)
# Create separate columns for the genre name and id
df_g[‘first_genre_name’] = df_g[‘first_genre’].apply(lambda x: x[‘name’] if x is not None else None)
df_g[‘first_genre_id’] = df_g[‘first_genre’].apply(lambda x: x[‘id’] if x is not None else None)
# Display the resulting DataFrame
print(df_g[[‘first_genre_name’, ‘first_genre_id’]])
Visualize how to re-order data in a dataframe
df.reindex(columns=new_order)
Create a list with the desired column order.
Use that list to index the DataFrame or use the reindex() method.
This allows you to easily change the order of columns in your DataFrame as needed.
What is the ast.literal_eval() function?
Use ast.literal_eval to convert the string to a Python object
part of the ast (Abstract Syntax Trees). You can use it to convert strings to a Python object or a list of dictionaries that you can work with in your code
#import library
import ast
#String representation of a list of dictionaries
data_string = “[{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]”
data_list = ast.literal_eval(data_string)
print(data_list)
# Output: [{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]
how can you generate a random array with 25 rows and 5 columns?
note: make sure it rounds to 2 decimal points.
np.random.rand(25,5).round(decimals=2)
.tolist()
Converts a string representation of a list into a Python list
.eval()
Converts the string representation of a list into a python list
df[‘col1’] = “[0.53, 0.59, 0.45]”
df[‘col1’] = df[‘col1’].apply(lambda u: eval(u))
df[‘col1’] = [0.53, 0.59, 0.45]
Visualize how to explode a list into rows
df_explode = df.explode(column=”column_name”)
Visualize how to explode data into multiple columns
df_new = pd.DataFrame(df_explode[“column_name”].tolist(), columns=[“col1”,”col2”,”col3”])
What chart displays data in bins and shows frequency of observations in bins
A histogram
df.hist() or df[‘col_name’].hist()
What chart is good for categorical data?
Bar charts
df.bar() or df[‘col’].bar()
What chart shows the relationship between two numerical variables on a two-dimensional graph?
Scatter Plot
df.scatter() or df[‘col’].scatter()
pd.plotting.scatter_matrix()
What chart displays the distribution of numerical data in min, 1st quart., median, 3rd quart., and max
Box plot
df.box() or df[‘col’].box()
Visualize how to plot using the kind parameter
df.plot(x=”col1”, y=”col2”, kind=” “)
Visualize how to retrieve the lowest value of column 1 and find the corresponding value in column 2
.idxmin()
#Find the index of the minimum for column 1
lowest_value_index = df[‘column 1’].idxmin()
#Access the corresponding value in column 2 using that index
lowest_value = df.loc[lowest_value_index, ‘column 2’]
What function returns the index for the minimum value of a column?
.idxmin()
What parameter allows you to set the size of the figure or plot in Pandas?
figsize
.plot(figsize=(width,height))
what allows you to set the title for your Pandas plot?
df.plot(title=’title of plot.’)
How can you customize the labels of the x-axis and y-axis
df.plot(xlabel=’X axis label’)
df.plot(ylabel=’Y axis label’)
df.plot(ylim=(min_value, max_value)) #sets the min and max values for the y-axis
How do you change the color of the Pandas plot?
df.plot(color=’color_name’)
How do you provide the legend in a pandas plot
df.plot(legend=True)
Visualize the plotting of a bar chart in order
x = df[‘column’].value_counts().index
df[‘column’].value_counts()[x].plot(kind=’bar’);
Define the .value_count()
Counts the unique values in a series and returns a new series containing the counts of each unique value.
what would occur if you used
color = [blue, red, blue, blue, red, green]
df[‘color’].value_counts()
color | count
blue | 3
red | 2
green | 1
what would occur if you used
color = [blue, red, blue, blue, red, green]
value_counts = df[‘color’].value_counts()
Print(value_counts.index)
print(value_counts.values)
Index([‘blue’,’red’,’green’], dtype=’object’)
[3 2 1]
Visualize how to convert the data from .values to a Python list
values_numpy = df[‘column’].values
python_values = values_numpy.values.tolist()
print(python_values)
How do you filter rows based on multiple conditions?
Using the bitwise OR operator |
x = df[(df[‘column’] == ‘data) | (df_02[‘column’] == ‘data2’)]
Visualize how to use the .isin() to verify that a value is in a list of options
df[df[‘column’].isin([‘column_1’, ‘column_2’])]
What should do before calculating the mean on a column
Check for NaN values and drop them
df[‘column’].isnull().sum()
df.dropna(subset=[‘column’])
Visualize how to calculate the means of two columns for years 2024 and 2025 from two different dataframes so that you can calculate the increase
df_24[‘col_2024’]
df_25[‘col_2025’]
Calculate the mean for both years
value_24 = df24.groupby(‘data’).agg(mean_col_2024=(‘col_2024’,’mean’)).reset_index()
value_25= df25.groupby(‘data’).agg(mean_col_2025=(‘col_2025’,’mean’)).reset_index()
#merge the two dataframes on column ‘data’
merged_dfs = pd.merge(value_24, value_25, on=’data’)
#Calculate the increase
merged_dfs[‘increase’] = merged_dfs[‘mean_col_2024’] - merged_dfs[‘mean_col_2025’]
merged_dfs
What does this code do?
df.label.unique()
df refers to the dataframe
.label accesses a column with the label within the df dataframe.
.unique() returns an array of the unique values present in the label column
output:
array([‘no’,’yes’, dtype=object)
You can then use the labels to filter
df_filter = df.query(‘label == “yes”’)
When can you use data.column or “dot notation?”
ex. df.column.value_counts()
When the column name has no spaces, special characters, or numbers at the beginning
When the column name conflicts with Dataframe such as columns or values
When you want to use dynamic column access
col_name - ‘age’
print(df.col_name)
How can I convert a date and time string to a datetime datatype?
Sample DataFrame with runtime as datetime
pd.to_datetime()
Visualize using a function to convert a string “Xh Ym” to a hours and minutes
Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)
Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])
Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”
Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)
Visualize the steps to breaking up the below list into new columns from a dataframe titled df_m with the column labeled ‘genres’
[{“id”: 28, “name”: “Action”},{“id”: 12, “name”: “Adventure”}]
Create a new DataFrame
import pandas as pd
import ast
Assuming df_m is your original DataFrame
# Creating a new DataFrame to clean up the genre column
df_genre = df_m[‘genres’]
df_g = pd.DataFrame({‘genres’: df_genre})
# Converting the string representation of lists into actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Exploding the genres column to get each genre in a separate row
df_g = df_g.explode(‘genres’)
# Now extract the genre’s name and id into separate columns
df_g[‘genre_name’] = df_g[‘genres’].apply(lambda x: x[‘name’] if isinstance(x, dict) else None)
df_g[‘genre_id’] = df_g[‘genres’].apply(lambda x: x[‘id’] if isinstance(x, dict) else None)
# Dropping the original ‘genres’ column if you no longer need it
df_g = df_g.drop(columns=[‘genres’])
# Display the resulting DataFrame with genre names and IDs
print(df_g[[‘genre_name’, ‘genre_id’]])
How could you keep multiple descriptors for a value that would lead to producing duplicates
i.e. a movie with multiple genres, a person with multiple names, a car with multiple colors
Aggregate Information: If you want to summarize the information instead of listing all combinations, you could aggregate the data. For example, you could concatenate the genre names into a single string for each movie.
dataframe[‘genres’] = df_g.groupby(‘Movie Title’)[‘genre_name’].transform(lambda x: ‘, ‘.join(x))
df_g = df_g.drop_duplicates(subset=[‘Movie Title’]) # Keep only one row per movie
Visualize how to drop a row with a specific index
dataframe.drop(index=###, errors=’ignore’)