Communicating Results Flashcards

Question 1

Q

What is the .groupby() method

Answer

A

This allows you to group data by columns and aggregate info about groupings. The numeric_only excludes values that aren’t numeric.

df.groupby(“column_name”).mean(numeric_only=True)
or
df.groupby([“workclass”,”race”], as_index=False)[“capital-gain”].mean()

Question 2

Q

What is summation or .sum()?

Answer

A

It aggregates data vertically .sum(axis=0) or horizontally .sum(axis=1).

df_census[[“capital_gain”,”capital-loss”]].sum()

Question 3

Q

Visualize how to get the sum while using .groupby and then sort the values in descending order

Answer

A

df.groupby(by=”column”).sum(numeric_only=True).sort_values(by=”column2”, ascending=False)

Question 4

Q

What are the measures of center?

Answer

A

Mean = .mean()
Median = .medain()
Mode = .mode()

Question 5

Q

What is the mean?

Answer

A

It is the average or sum of all numbers in set/by number of values in the set

Question 6

Q

What is the median

Answer

A

The center value in a set. Always sort the values first then calculate the median.

Question 7

Q

What is the mode?

Answer

A

It is the value with the highest frequency in a set

Question 8

Q

If you had a column named color that contained the following values
df = { ‘color’ : [‘red,’ ‘blue,’ ‘red,’ ‘green,’ ‘blue,’ ‘blue’]}, what you would get if you used count as df[‘color’].value_counts()?

Answer

A

color : count
blue : 3
red : 2
green :1

Question 9

Q

What does df[‘col’].values do?

Answer

A

It returns a NumPy array containing all the values in the column, including duplicates, in the order they appear. It contains only the raw data. no column labels or indexes.

Question 10

Q

What does df.index do?

Answer

A

.index returns the row labels of a data frame or a series. You can use .index to view or inspect or change the row labels.

Question 11

Q

What does .columns do?

Answer

A

Provides an index object contianing the names of all the columns.

Question 12

Q

Why does df.index output: RangeIndex(start=0, stop=3, step=1)?

Answer

A

Pandas assigns a numerical RangeIndex starting from 0 when you create a DataFrame without specifying row labels.

This describes the range of the row indices in the DataFrame:
start=0: The first row index starts at 0.
stop=3: The range stops before 3, so the indices are [0, 1, 2].
step=1: The indices increment by 1 between rows.

Question 13

Q

Visualize how to:
- Use .index to get the unique values.
- Use .values to get the corresponding counts.
- Combine .index and .values to iterate through the value counts.
- Use .sort_index() or .sort_values() to sort as needed.

Answer

A

The .index gives you the unique values(categories)
unique_values = count.index
print(unique_values)
The counts themselves are accessed using .values
counts_values = counts.values
print(counts_values)
Sort the results by unique values
sorted_counts = counts.sort_index()

Question 14

Q

What is .zip()?

Answer

A

Zip is a python function that combines two or more iterables like lists, tuples or strings into a single iterator of tuples

ex.
list1 = [1,2,3]
list2 = [‘a’,’b’,’c’]
zipped = zip(list 1, list2)
print(list(zipped))
Output : [(1, ‘a’), (2, ‘b’), (3, ‘c’)]

Question 15

Q

What are the two ways to check if a column contains a value

Answer

A

A.
Uses the bitwise | to filter based on two conditions
var = df[(df[‘col’] == ‘data’) | (df[‘col’] == ‘data2’)]
B.
Uses the .isin() to check if the values exist in the column.
var = df[df[‘col’].isin([‘data1’, ‘data2’])]

Question 16

Q

What is the union of two events?

Answer

A

This calculates the probability that either event P(A) or P(B) happens (or both).
It’s P(A or B)
You add the probabilities of P(A) and P(B) because you’re counting all outcomes that include P(A) or P(B).
You subtract the overlap ((A∩B))

Question 17

Q

What is the formula for union of two events?

Answer

A

P (A or B) = P(A) + P(B) - P(A∩B)

Question 18

Q

What is the intersection of two events

Answer

A

It’s P(A and B), calculating that the events A and B happen at the same time
You only consider the overlap between P(A) and P(B)—where both events occur.

Question 19

Q

What is the formula for the intersection of two events?

Answer

A

P(A and B) = P(A∩B)

Question 20

Q

What is the conditional probability

Answer

A

This calculates the probability that A happens given B already happened. You’re “zooming in” on the subset of outcomes where P(B) happens and asking, “What proportion of those also include P(A)?”

Question 21

Q

What is the formula of conditional probablity

Answer

A

P(A | B) = (A∩B)/P(B)

Question 22

Q

Visualize how to find the count for unique values in a column?

Answer

A

filter for the unique values in the column

query_df = df.query(‘column == “data” or column == “data2”’)
#count the unique values
unique_column_value_count = query_df[‘column’].nunique()

Question 23

Q

Visualize how to convert the column names to lowercase

Answer

A

df[‘column’].str.lower()

Question 24

Q

Explain what each part of this code means

df.groupby(‘veh_class’).agg(mean_cmb_mpg=(‘cmb_mpg’,’mean’)).reset_index()

Answer

A

df.groupby(‘veh_class’) grows the rows by the veh_class column
mean_cmb_mpg is the name of the new column that will hold the mean values.
(‘cmb_mpg’, ‘mean’) specifies that you want to calculate the mean of the cmb_mpg column.
.reset_index() resets the index for the new dataframe created

Question 25

Q

T or F: you can use .agg() with NaNs

Answer

A

False. You have to use.dropna() or .fillna() first before calculating the mean.

Question 26

Q

Visualize how to calculate an increase for a column

Answer

A

Now, let’s merge the two DataFrames on ‘veh_class’ to compare them

Calculate mean for both years
veh_08 = df_08.groupby(‘veh_class’).agg(mean_cmb_mpg_08=(‘cmb_mpg’, ‘mean’)).reset_index()
veh_18 = df_18.groupby(‘veh_class’).agg(mean_cmb_mpg_18=(‘cmb_mpg’, ‘mean’)).reset_index()
inc = pd.merge(veh_08, veh_18, on=’veh_class’)
# Calculate the increase in fuel economy
inc[‘increase’] = inc[‘mean_cmb_mpg_18’] - inc[‘mean_cmb_mpg_08’]
# Display
inc

Question 27

Q

Explain what this code is doing?

pd.DataFrame(
{“year”: [“2008”, “2018”], “model_num”: [smart_08[“cmb_mpg”].mean(), smart_18[“cmb_mpg”].mean()]}
)

Answer

A

pd.DataFrame(…): This function creates a new DataFrame .
Dictionary Input: The DataFrame is constructed from a dictionary where:
“year”: This key corresponds to a list of years, [“2008”, “2018”].
“model_num”: This key corresponds to the average combined MPG for SmartWay vehicles for 2008 and 2018.
smart_08[“cmb_mpg”].mean(): This calculates the mean of the cmb_mpg column in the smart_08
smart_18[“cmb_mpg”].mean(): Similarly, this calculates the mean of the cmb_mpg column in the smart_18

Question 28

Q

Visualize how to convert “ 1970-01-01 00:00:00.000000162” to a regular datetime

Answer

A

Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)
# Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])
# Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”
# Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)

Question 29

Q

Visualize how to extract the name and genre from a library in a column [{“id”: 28, “name”: “Action”}

Answer

A

df_genre = df_m[‘genres’]
# Create a new DataFrame
df_g = pd.DataFrame({‘genres’: df_genre})
# Convert the string representation of lists to actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Extract the first genre’s dictionary
df_g[‘first_genre’] = df_g[‘genres’].apply(lambda x: x[0] if len(x) > 0 else None)
# Create separate columns for the genre name and id
df_g[‘first_genre_name’] = df_g[‘first_genre’].apply(lambda x: x[‘name’] if x is not None else None)
df_g[‘first_genre_id’] = df_g[‘first_genre’].apply(lambda x: x[‘id’] if x is not None else None)
# Display the resulting DataFrame
print(df_g[[‘first_genre_name’, ‘first_genre_id’]])

Question 30

Q

Visualize how to re-order data in a dataframe

Answer

A

df.reindex(columns=new_order)

Create a list with the desired column order.
Use that list to index the DataFrame or use the reindex() method.
This allows you to easily change the order of columns in your DataFrame as needed.

Question 31

Q

What is the ast.literal_eval() function?

Answer

A

Use ast.literal_eval to convert the string to a Python object

part of the ast (Abstract Syntax Trees). You can use it to convert strings to a Python object or a list of dictionaries that you can work with in your code
#import library
import ast
#String representation of a list of dictionaries
data_string = “[{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]”
data_list = ast.literal_eval(data_string)
print(data_list)
# Output: [{‘name’: ‘Action’, ‘id’: 28}, {‘name’: ‘Adventure’, ‘id’: 12}]

Question 32

Q

how can you generate a random array with 25 rows and 5 columns?
note: make sure it rounds to 2 decimal points.

Answer

A

np.random.rand(25,5).round(decimals=2)

Question 33

Q

.tolist()

Answer

A

Converts a string representation of a list into a Python list

Question 34

Q

.eval()

Answer

A

Converts the string representation of a list into a python list

df[‘col1’] = “[0.53, 0.59, 0.45]”

df[‘col1’] = df[‘col1’].apply(lambda u: eval(u))
df[‘col1’] = [0.53, 0.59, 0.45]

Question 35

Q

Visualize how to explode a list into rows

Answer

A

df_explode = df.explode(column=”column_name”)

Question 36

Q

Visualize how to explode data into multiple columns

Answer

A

df_new = pd.DataFrame(df_explode[“column_name”].tolist(), columns=[“col1”,”col2”,”col3”])

Question 37

Q

What chart displays data in bins and shows frequency of observations in bins

Answer

A

A histogram
df.hist() or df[‘col_name’].hist()

Question 38

Q

What chart is good for categorical data?

Answer

A

Bar charts
df.bar() or df[‘col’].bar()

Question 39

Q

What chart shows the relationship between two numerical variables on a two-dimensional graph?

Answer

A

Scatter Plot
df.scatter() or df[‘col’].scatter()
pd.plotting.scatter_matrix()

Question 40

Q

What chart displays the distribution of numerical data in min, 1st quart., median, 3rd quart., and max

Answer

A

Box plot
df.box() or df[‘col’].box()

Question 41

Q

Visualize how to plot using the kind parameter

Answer

A

df.plot(x=”col1”, y=”col2”, kind=” “)

Question 42

Q

Visualize how to retrieve the lowest value of column 1 and find the corresponding value in column 2

Answer

A

.idxmin()
#Find the index of the minimum for column 1
lowest_value_index = df[‘column 1’].idxmin()
#Access the corresponding value in column 2 using that index
lowest_value = df.loc[lowest_value_index, ‘column 2’]

Question 43

Q

What function returns the index for the minimum value of a column?

Answer

A

.idxmin()

Question 44

Q

What parameter allows you to set the size of the figure or plot in Pandas?

Answer

A

figsize

.plot(figsize=(width,height))

Question 45

Q

what allows you to set the title for your Pandas plot?

Answer

A

df.plot(title=’title of plot.’)

Question 46

Q

How can you customize the labels of the x-axis and y-axis

Answer

A

df.plot(xlabel=’X axis label’)
df.plot(ylabel=’Y axis label’)
df.plot(ylim=(min_value, max_value)) #sets the min and max values for the y-axis

Question 47

Q

How do you change the color of the Pandas plot?

Answer

A

df.plot(color=’color_name’)

Question 48

Q

How do you provide the legend in a pandas plot

Answer

A

df.plot(legend=True)

Question 49

Q

Visualize the plotting of a bar chart in order

Answer

A

x = df[‘column’].value_counts().index
df[‘column’].value_counts()[x].plot(kind=’bar’);

Question 50

Q

Define the .value_count()

Answer

A

Counts the unique values in a series and returns a new series containing the counts of each unique value.

Question 51

Q

what would occur if you used
color = [blue, red, blue, blue, red, green]
df[‘color’].value_counts()

Answer

A

color | count
blue | 3
red | 2
green | 1

Question 52

Q

what would occur if you used
color = [blue, red, blue, blue, red, green]
value_counts = df[‘color’].value_counts()
Print(value_counts.index)
print(value_counts.values)

Answer

A

Index([‘blue’,’red’,’green’], dtype=’object’)
[3 2 1]

Question 53

Q

Visualize how to convert the data from .values to a Python list

Answer

A

values_numpy = df[‘column’].values
python_values = values_numpy.values.tolist()
print(python_values)

Question 54

Q

How do you filter rows based on multiple conditions?

Answer

A

Using the bitwise OR operator |

x = df[(df[‘column’] == ‘data) | (df_02[‘column’] == ‘data2’)]

Question 55

Q

Visualize how to use the .isin() to verify that a value is in a list of options

Answer

A

df[df[‘column’].isin([‘column_1’, ‘column_2’])]

Question 56

Q

What should do before calculating the mean on a column

Answer

A

Check for NaN values and drop them
df[‘column’].isnull().sum()
df.dropna(subset=[‘column’])

Question 57

Q

Visualize how to calculate the means of two columns for years 2024 and 2025 from two different dataframes so that you can calculate the increase

df_24[‘col_2024’]
df_25[‘col_2025’]

Answer

A

Calculate the mean for both years

value_24 = df24.groupby(‘data’).agg(mean_col_2024=(‘col_2024’,’mean’)).reset_index()
value_25= df25.groupby(‘data’).agg(mean_col_2025=(‘col_2025’,’mean’)).reset_index()
#merge the two dataframes on column ‘data’
merged_dfs = pd.merge(value_24, value_25, on=’data’)
#Calculate the increase
merged_dfs[‘increase’] = merged_dfs[‘mean_col_2024’] - merged_dfs[‘mean_col_2025’]

merged_dfs

Question 58

Q

What does this code do?

df.label.unique()

Answer

A

df refers to the dataframe
.label accesses a column with the label within the df dataframe.
.unique() returns an array of the unique values present in the label column
output:
array([‘no’,’yes’, dtype=object)
You can then use the labels to filter
df_filter = df.query(‘label == “yes”’)

Question 59

Q

When can you use data.column or “dot notation?”

ex. df.column.value_counts()

Answer

A

When the column name has no spaces, special characters, or numbers at the beginning
When the column name conflicts with Dataframe such as columns or values
When you want to use dynamic column access
col_name - ‘age’
print(df.col_name)

Question 60

Q

How can I convert a date and time string to a datetime datatype?

Answer

A

Sample DataFrame with runtime as datetime

pd.to_datetime()

Question 61

Q

Visualize using a function to convert a string “Xh Ym” to a hours and minutes

Answer

A

Sample DataFrame with runtime as datetime
data = {‘runtime’: [‘1970-01-01 00:00:00.000000162’, ‘1970-01-01 00:00:00.000000120’, ‘1970-01-01 00:00:00.00000090’]}
df_m = pd.DataFrame(data)

Convert ‘runtime’ to datetime
df_m[‘runtime’] = pd.to_datetime(df_m[‘runtime’])

Function to convert datetime to “Xh Ym” format
def format_runtime(dt):
# Extract total seconds and convert to minutes
total_minutes = int(dt.timestamp() // 60) # Get total minutes
hours = total_minutes // 60 # Calculate hours
minutes = total_minutes % 60 # Calculate remaining minutes
return f”{hours}h {minutes}m” # Format as “Xh Ym”

Apply the function to the ‘runtime’ column
df_m[‘runtime’] = df_m[‘runtime’].apply(format_runtime)

Question 62

Q

Visualize the steps to breaking up the below list into new columns from a dataframe titled df_m with the column labeled ‘genres’
[{“id”: 28, “name”: “Action”},{“id”: 12, “name”: “Adventure”}]

Answer

A

Create a new DataFrame

import pandas as pd
import ast

Assuming df_m is your original DataFrame
# Creating a new DataFrame to clean up the genre column
df_genre = df_m[‘genres’]
df_g = pd.DataFrame({‘genres’: df_genre})
# Converting the string representation of lists into actual lists
df_g[‘genres’] = df_g[‘genres’].apply(ast.literal_eval)
# Exploding the genres column to get each genre in a separate row
df_g = df_g.explode(‘genres’)
# Now extract the genre’s name and id into separate columns
df_g[‘genre_name’] = df_g[‘genres’].apply(lambda x: x[‘name’] if isinstance(x, dict) else None)
df_g[‘genre_id’] = df_g[‘genres’].apply(lambda x: x[‘id’] if isinstance(x, dict) else None)
# Dropping the original ‘genres’ column if you no longer need it
df_g = df_g.drop(columns=[‘genres’])
# Display the resulting DataFrame with genre names and IDs
print(df_g[[‘genre_name’, ‘genre_id’]])

Question 63

Q

How could you keep multiple descriptors for a value that would lead to producing duplicates
i.e. a movie with multiple genres, a person with multiple names, a car with multiple colors

Answer

A

Aggregate Information: If you want to summarize the information instead of listing all combinations, you could aggregate the data. For example, you could concatenate the genre names into a single string for each movie.
dataframe[‘genres’] = df_g.groupby(‘Movie Title’)[‘genre_name’].transform(lambda x: ‘, ‘.join(x))
df_g = df_g.drop_duplicates(subset=[‘Movie Title’]) # Keep only one row per movie

Question 64

Q

Visualize how to drop a row with a specific index

Answer

A

dataframe.drop(index=###, errors=’ignore’)

Answer 65

A

import sqlite3
conn = sqlite3.connect()
c = conn.cursor()
data = pd.read()

Answer 66

A

var = pd.read_sql(“SQL clause”, sqlite3.connect(‘file_path_to_database’))

Communicating Results Flashcards

Summarizing descriptive statistics, plotting visualizations, drawing conclustions, and customizing visuals to communicate results