Communicating Results Flashcards

Summarizing descriptive statistics, plotting visualizations, drawing conclustions, and customizing visuals to communicate results

1
Q

What is the .groupby() method

A

This allows you to group data by columns and aggregate info about groupings. The numeric_only excludes values that aren’t numeric.

df.groupby(“column_name”).mean(numeric_only=True)
or
df.groupby([“workclass”,”race”], as_index=False)[“capital-gain”].mean()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is summation or .sum()?

A

It aggregates data vertically .sum(axis=0) or horizontally .sum(axis=1).

df_census[[“capital_gain”,”capital-loss”]].sum()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Visualize how to get the sum while using .groupby and then sort the values in descending order

A

df.groupby(by=”column”).sum(numeric_only=True).sort_values(by=”column2”, ascending=False)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the measures of center?

A

Mean = .mean()
Median = .medain()
Mode = .mode()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

What is the mean?

A

It is the average or sum of all numbers in set/by number of values in the set

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What is the median

A

The center value in a set. Always sort the values first then calculate the median.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What is the mode?

A

It is the value with the highest frequency in a set

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

If you had a column named color that contained the following values
df = { ‘color’ : [‘red,’ ‘blue,’ ‘red,’ ‘green,’ ‘blue,’ ‘blue’]}, what you would get if you used count as df[‘color’].value_counts()?

A

color : count
blue : 3
red : 2
green :1

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What does df[‘col’].values do?

A

It returns a NumPy array containing all the values in the column, including duplicates, in the order they appear. It contains only the raw data. no column labels or indexes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What does df.index do?

A

.index returns the row labels of a data frame or a series. You can use .index to view or inspect or change the row labels.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What does .columns do?

A

Provides an index object contianing the names of all the columns.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Why does df.index output: RangeIndex(start=0, stop=3, step=1)?

A

Pandas assigns a numerical RangeIndex starting from 0 when you create a DataFrame without specifying row labels.

This describes the range of the row indices in the DataFrame:
start=0: The first row index starts at 0.
stop=3: The range stops before 3, so the indices are [0, 1, 2].
step=1: The indices increment by 1 between rows.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Visualize how to:
- Use .index to get the unique values.
- Use .values to get the corresponding counts.
- Combine .index and .values to iterate through the value counts.
- Use .sort_index() or .sort_values() to sort as needed.

A

The .index gives you the unique values(categories)
unique_values = count.index
print(unique_values)
The counts themselves are accessed using .values
counts_values = counts.values
print(counts_values)
Sort the results by unique values
sorted_counts = counts.sort_index()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is .zip()?

A

Zip is a python function that combines two or more iterables like lists, tuples or strings into a single iterator of tuples

ex.
list1 = [1,2,3]
list2 = [‘a’,’b’,’c’]
zipped = zip(list 1, list2)
print(list(zipped))
Output : [(1, ‘a’), (2, ‘b’), (3, ‘c’)]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What are the two ways to check if a column contains a value

A

A.
Uses the bitwise | to filter based on two conditions
var = df[(df[‘col’] == ‘data’) | (df[‘col’] == ‘data2’)]
B.
Uses the .isin() to check if the values exist in the column.
var = df[df[‘col’].isin([‘data1’, ‘data2’])]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

What is the union of two events?

A

This calculates the probability that either event P(A) or P(B) happens (or both).
It’s P(A or B)
You add the probabilities of P(A) and P(B) because you’re counting all outcomes that include P(A) or P(B).
You subtract the overlap ((A∩B))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

What is the formula for union of two events?

A

P (A or B) = P(A) + P(B) - P(A∩B)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

What is the intersection of two events

A

It’s P(A and B), calculating that the events A and B happen at the same time
You only consider the overlap between P(A) and P(B)—where both events occur.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

What is the formula for the intersection of two events?

A

P(A and B) = P(A∩B)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

What is the conditional probability

A

This calculates the probability that A happens given B already happened. You’re “zooming in” on the subset of outcomes where P(B) happens and asking, “What proportion of those also include P(A)?”

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

What is the formula of conditional probablity

A

P(A | B) = (A∩B)/P(B)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Visualize how to find the count for unique values in a column?

A

filter for the unique values in the column

query_df = df.query(‘column == “data” or column == “data2”’)
#count the unique values
unique_column_value_count = query_df[‘column’].nunique()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Visualize how to convert the column names to lowercase

A

df[‘column’].str.lower()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Explain what each part of this code means

df.groupby(‘veh_class’).agg(mean_cmb_mpg=(‘cmb_mpg’,’mean’)).reset_index()

A

df.groupby(‘veh_class’) grows the rows by the veh_class column
mean_cmb_mpg is the name of the new column that will hold the mean values.
(‘cmb_mpg’, ‘mean’) specifies that you want to calculate the mean of the cmb_mpg column.
.reset_index() resets the index for the new dataframe created

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

T or F: you can use .agg() with NaNs

A

False. You have to use.dropna() or .fillna() first before calculating the mean.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

Visualize how to calculate an increase for a column

A

Now, let’s merge the two DataFrames on ‘veh_class’ to compare them

Calculate mean for both years
veh_08 = df_08.groupby(‘veh_class’).agg(mean_cmb_mpg_08=(‘cmb_mpg’, ‘mean’)).reset_index()
veh_18 = df_18.groupby(‘veh_class’).agg(mean_cmb_mpg_18=(‘cmb_mpg’, ‘mean’)).reset_index()
inc = pd.merge(veh_08, veh_18, on=’veh_class’)
# Calculate the increase in fuel economy
inc[‘increase’] = inc[‘mean_cmb_mpg_18’] - inc[‘mean_cmb_mpg_08’]
# Display
inc

27
Q

Explain what this code is doing?

pd.DataFrame(
{“year”: [“2008”, “2018”], “model_num”: [smart_08[“cmb_mpg”].mean(), smart_18[“cmb_mpg”].mean()]}
)

A

pd.DataFrame(…): This function creates a new DataFrame .
Dictionary Input: The DataFrame is constructed from a dictionary where:
“year”: This key corresponds to a list of years, [“2008”, “2018”].
“model_num”: This key corresponds to the average combined MPG for SmartWay vehicles for 2008 and 2018.
smart_08[“cmb_mpg”].mean(): This calculates the mean of the cmb_mpg column in the smart_08
smart_18[“cmb_mpg”].mean(): Similarly, this calculates the mean of the cmb_mpg column in the smart_18

28
Q

Visualize how to convert “ 1970-01-01 00:00:00.000000162” to a regular datetime

A

Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)
# Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])
# Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”
# Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)

29
Q

Visualize how to extract the name and genre from a library in a column [{“id”: 28, “name”: “Action”}

A

df_genre = df_m[‘genres’]
# Create a new DataFrame
df_g = pd.DataFrame({‘genres’: df_genre})
# Convert the string representation of lists to actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Extract the first genre’s dictionary
df_g[‘first_genre’] = df_g[‘genres’].apply(lambda x: x[0] if len(x) > 0 else None)
# Create separate columns for the genre name and id
df_g[‘first_genre_name’] = df_g[‘first_genre’].apply(lambda x: x[‘name’] if x is not None else None)
df_g[‘first_genre_id’] = df_g[‘first_genre’].apply(lambda x: x[‘id’] if x is not None else None)
# Display the resulting DataFrame
print(df_g[[‘first_genre_name’, ‘first_genre_id’]])

30
Q

Visualize how to re-order data in a dataframe

A

df.reindex(columns=new_order)

Create a list with the desired column order.
Use that list to index the DataFrame or use the reindex() method.
This allows you to easily change the order of columns in your DataFrame as needed.

31
Q

What is the ast.literal_eval() function?

A

Use ast.literal_eval to convert the string to a Python object

part of the ast (Abstract Syntax Trees). You can use it to convert strings to a Python object or a list of dictionaries that you can work with in your code
#import library
import ast
#String representation of a list of dictionaries
data_string = “[{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]”
data_list = ast.literal_eval(data_string)
print(data_list)
# Output: [{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]

32
Q

how can you generate a random array with 25 rows and 5 columns?
note: make sure it rounds to 2 decimal points.

A

np.random.rand(25,5).round(decimals=2)

33
Q

.tolist()

A

Converts a string representation of a list into a Python list

34
Q

.eval()

A

Converts the string representation of a list into a python list

df[‘col1’] = “[0.53, 0.59, 0.45]”

df[‘col1’] = df[‘col1’].apply(lambda u: eval(u))
df[‘col1’] = [0.53, 0.59, 0.45]

35
Q

Visualize how to explode a list into rows

A

df_explode = df.explode(column=”column_name”)

36
Q

Visualize how to explode data into multiple columns

A

df_new = pd.DataFrame(df_explode[“column_name”].tolist(), columns=[“col1”,”col2”,”col3”])

37
Q

What chart displays data in bins and shows frequency of observations in bins

A

A histogram
df.hist() or df[‘col_name’].hist()

38
Q

What chart is good for categorical data?

A

Bar charts
df.bar() or df[‘col’].bar()

39
Q

What chart shows the relationship between two numerical variables on a two-dimensional graph?

A

Scatter Plot
df.scatter() or df[‘col’].scatter()
pd.plotting.scatter_matrix()

40
Q

What chart displays the distribution of numerical data in min, 1st quart., median, 3rd quart., and max

A

Box plot
df.box() or df[‘col’].box()

41
Q

Visualize how to plot using the kind parameter

A

df.plot(x=”col1”, y=”col2”, kind=” “)

42
Q

Visualize how to retrieve the lowest value of column 1 and find the corresponding value in column 2

A

.idxmin()
#Find the index of the minimum for column 1
lowest_value_index = df[‘column 1’].idxmin()
#Access the corresponding value in column 2 using that index
lowest_value = df.loc[lowest_value_index, ‘column 2’]

43
Q

What function returns the index for the minimum value of a column?

A

.idxmin()

44
Q

What parameter allows you to set the size of the figure or plot in Pandas?

A

figsize

.plot(figsize=(width,height))

45
Q

what allows you to set the title for your Pandas plot?

A

df.plot(title=’title of plot.’)

46
Q

How can you customize the labels of the x-axis and y-axis

A

df.plot(xlabel=’X axis label’)
df.plot(ylabel=’Y axis label’)
df.plot(ylim=(min_value, max_value)) #sets the min and max values for the y-axis

47
Q

How do you change the color of the Pandas plot?

A

df.plot(color=’color_name’)

48
Q

How do you provide the legend in a pandas plot

A

df.plot(legend=True)

49
Q

Visualize the plotting of a bar chart in order

A

x = df[‘column’].value_counts().index
df[‘column’].value_counts()[x].plot(kind=’bar’);

50
Q

Define the .value_count()

A

Counts the unique values in a series and returns a new series containing the counts of each unique value.

51
Q

what would occur if you used
color = [blue, red, blue, blue, red, green]
df[‘color’].value_counts()

A

color | count
blue | 3
red | 2
green | 1

52
Q

what would occur if you used
color = [blue, red, blue, blue, red, green]
value_counts = df[‘color’].value_counts()
Print(value_counts.index)
print(value_counts.values)

A

Index([‘blue’,’red’,’green’], dtype=’object’)
[3 2 1]

53
Q

Visualize how to convert the data from .values to a Python list

A

values_numpy = df[‘column’].values
python_values = values_numpy.values.tolist()
print(python_values)

54
Q

How do you filter rows based on multiple conditions?

A

Using the bitwise OR operator |

x = df[(df[‘column’] == ‘data) | (df_02[‘column’] == ‘data2’)]

55
Q

Visualize how to use the .isin() to verify that a value is in a list of options

A

df[df[‘column’].isin([‘column_1’, ‘column_2’])]

56
Q

What should do before calculating the mean on a column

A

Check for NaN values and drop them
df[‘column’].isnull().sum()
df.dropna(subset=[‘column’])

57
Q

Visualize how to calculate the means of two columns for years 2024 and 2025 from two different dataframes so that you can calculate the increase

df_24[‘col_2024’]
df_25[‘col_2025’]

A

Calculate the mean for both years

value_24 = df24.groupby(‘data’).agg(mean_col_2024=(‘col_2024’,’mean’)).reset_index()
value_25= df25.groupby(‘data’).agg(mean_col_2025=(‘col_2025’,’mean’)).reset_index()
#merge the two dataframes on column ‘data’
merged_dfs = pd.merge(value_24, value_25, on=’data’)
#Calculate the increase
merged_dfs[‘increase’] = merged_dfs[‘mean_col_2024’] - merged_dfs[‘mean_col_2025’]

merged_dfs

58
Q

What does this code do?

df.label.unique()

A

df refers to the dataframe
.label accesses a column with the label within the df dataframe.
.unique() returns an array of the unique values present in the label column
output:
array([‘no’,’yes’, dtype=object)
You can then use the labels to filter
df_filter = df.query(‘label == “yes”’)

59
Q

When can you use data.column or “dot notation?”

ex. df.column.value_counts()

A

When the column name has no spaces, special characters, or numbers at the beginning
When the column name conflicts with Dataframe such as columns or values
When you want to use dynamic column access
col_name - ‘age’
print(df.col_name)

60
Q

How can I convert a date and time string to a datetime datatype?

A

Sample DataFrame with runtime as datetime

pd.to_datetime()

61
Q

Visualize using a function to convert a string “Xh Ym” to a hours and minutes

A

Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)

Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])

Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”

Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)

62
Q

Visualize the steps to breaking up the below list into new columns from a dataframe titled df_m with the column labeled ‘genres’
[{“id”: 28, “name”: “Action”},{“id”: 12, “name”: “Adventure”}]

A

Create a new DataFrame

import pandas as pd
import ast

Assuming df_m is your original DataFrame
# Creating a new DataFrame to clean up the genre column
df_genre = df_m[‘genres’]
df_g = pd.DataFrame({‘genres’: df_genre})
# Converting the string representation of lists into actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Exploding the genres column to get each genre in a separate row
df_g = df_g.explode(‘genres’)
# Now extract the genre’s name and id into separate columns
df_g[‘genre_name’] = df_g[‘genres’].apply(lambda x: x[‘name’] if isinstance(x, dict) else None)
df_g[‘genre_id’] = df_g[‘genres’].apply(lambda x: x[‘id’] if isinstance(x, dict) else None)
# Dropping the original ‘genres’ column if you no longer need it
df_g = df_g.drop(columns=[‘genres’])
# Display the resulting DataFrame with genre names and IDs
print(df_g[[‘genre_name’, ‘genre_id’]])

63
Q

How could you keep multiple descriptors for a value that would lead to producing duplicates
i.e. a movie with multiple genres, a person with multiple names, a car with multiple colors

A

Aggregate Information: If you want to summarize the information instead of listing all combinations, you could aggregate the data. For example, you could concatenate the genre names into a single string for each movie.
dataframe[‘genres’] = df_g.groupby(‘Movie Title’)[‘genre_name’].transform(lambda x: ‘, ‘.join(x))
df_g = df_g.drop_duplicates(subset=[‘Movie Title’]) # Keep only one row per movie

64
Q

Visualize how to drop a row with a specific index

A

dataframe.drop(index=###, errors=’ignore’)