Communicating Results Flashcards
Summarizing descriptive statistics, plotting visualizations, drawing conclusions, and customizing visuals to communicate results
What is the .groupby() method?
It groups rows by the values in one or more columns so you can aggregate each group. Passing numeric_only=True restricts the aggregation to numeric columns.
df.groupby("column_name").mean(numeric_only=True)
or
df.groupby(["workclass", "race"], as_index=False)["capital-gain"].mean()
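A minimal runnable sketch of both groupby forms. The toy rows are invented for illustration; only the column names workclass, race, and capital-gain come from the card.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
df = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov", "State-gov"],
    "race": ["White", "Black", "White", "White"],
    "capital-gain": [100, 300, 0, 200],
    "age": [30, 40, 50, 60],
})

# Mean of every numeric column per workclass; the text column "race" is skipped.
by_workclass = df.groupby("workclass").mean(numeric_only=True)
print(by_workclass)

# Mean capital-gain per (workclass, race) pair.
# as_index=False keeps the group keys as ordinary columns.
by_pair = df.groupby(["workclass", "race"], as_index=False)["capital-gain"].mean()
print(by_pair)
```

With as_index=False the result is a flat table, which is usually easier to plot or merge.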
What is summation or .sum()?
It aggregates data vertically (down each column) with .sum(axis=0), the default, or horizontally (across each row) with .sum(axis=1).
df_census[["capital-gain", "capital-loss"]].sum()
Visualize how to get the sum while using .groupby() and then sort the values in descending order
df.groupby(by="column").sum(numeric_only=True).sort_values(by="column2", ascending=False)
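The following sketch puts column sums, row sums, and a grouped, sorted sum side by side. The data values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "capital-gain": [100, 500, 50, 0],
    "capital-loss": [10, 0, 20, 5],
})

# axis=0 (default): one total per column.
col_totals = df[["capital-gain", "capital-loss"]].sum()

# axis=1: one total per row.
row_totals = df[["capital-gain", "capital-loss"]].sum(axis=1)

# Sum per group, then sort the groups from largest to smallest capital-gain.
totals = (
    df.groupby(by="workclass")
      .sum(numeric_only=True)
      .sort_values(by="capital-gain", ascending=False)
)
print(totals)
```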
What are the measures of center?
Mean = .mean()
Median = .median()
Mode = .mode()
What is the mean?
It is the average: the sum of all values in a set divided by the number of values in the set.
What is the median?
The middle value in a sorted set. Always sort the values first; with an even number of values, the median is the average of the two middle values.
What is the mode?
It is the value with the highest frequency in a set
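A small worked example of the three measures of center on a made-up Series. Note that .mode() returns a Series rather than a single number, because several values can tie for the highest frequency.

```python
import pandas as pd

# Hypothetical values chosen so each measure is easy to check by hand.
s = pd.Series([4, 1, 3, 3, 9])

mean = s.mean()      # (4 + 1 + 3 + 3 + 9) / 5
median = s.median()  # middle of the sorted values 1, 3, 3, 4, 9
modes = s.mode()     # most frequent value(s), returned as a Series

print(mean, median, modes.tolist())
```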
If you had a column named color that contained the following values
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green', 'blue', 'blue']}), what would you get from df['color'].value_counts()?
color : count
blue : 3
red : 2
green : 1
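The card above can be reproduced with a short script. By default, value_counts() sorts from the most frequent value to the least frequent.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green", "blue", "blue"]})

# One row per distinct color, ordered by descending count.
counts = df["color"].value_counts()
print(counts)
```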
What does df['col'].values do?
It returns a NumPy array containing all the values in the column, including duplicates, in the order they appear. It contains only the raw data, with no column labels or index.
What does df.index do?
.index returns the row labels of a DataFrame or a Series. You can use .index to inspect or change the row labels.
What does .columns do?
Provides an Index object containing the names of all the columns.
Why does df.index output: RangeIndex(start=0, stop=3, step=1)?
Pandas assigns a numerical RangeIndex starting from 0 when you create a DataFrame without specifying row labels.
This describes the range of the row indices in the DataFrame:
start=0: The first row index starts at 0.
stop=3: The range stops before 3, so the indices are [0, 1, 2].
step=1: The indices increment by 1 between rows.
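A short sketch that inspects .index, .columns, and .values on a three-row DataFrame, then replaces the default row labels. The column names and the labels "a", "b", "c" are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame created without explicit row labels.
df = pd.DataFrame({"color": ["red", "blue", "green"], "size": [1, 2, 3]})

default_index = df.index
print(default_index)          # the default RangeIndex described above
print(list(default_index))    # the labels it represents

print(df.columns.tolist())    # column names
print(df["size"].values)      # raw values of one column, without labels

# Assigning to .index replaces the row labels.
df.index = ["a", "b", "c"]
print(df.index)
```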
Visualize how to:
- Use .index to get the unique values.
- Use .values to get the corresponding counts.
- Combine .index and .values to iterate through the value counts.
- Use .sort_index() or .sort_values() to sort as needed.
Start from the value counts of a column
counts = df['color'].value_counts()
The .index gives you the unique values (categories)
unique_values = counts.index
print(unique_values)
The counts themselves are accessed using .values
counts_values = counts.values
print(counts_values)
Sort the results by unique values
sorted_counts = counts.sort_index()
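The steps above can be combined into one runnable sketch. It pairs each unique value with its count using Python's built-in zip(), which is one reasonable way to iterate; the color data is the same toy example used earlier.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "blue", "blue"], name="color")
counts = colors.value_counts()

# Pair each unique value (.index) with its count (.values).
pairs = list(zip(counts.index, counts.values))
for value, count in pairs:
    print(f"{value}: {count}")

# Sort alphabetically by label, or ascending by count.
sorted_by_label = counts.sort_index()
sorted_by_count = counts.sort_values()
```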
What is zip()?
zip() is a built-in Python function that combines two or more iterables, such as lists, tuples, or strings, into a single iterator of tuples
ex.
list1 = [1, 2, 3]
list2 = ['a', 'b', 'c']
zipped = zip(list1, list2)
print(list(zipped))
Output : [(1, 'a'), (2, 'b'), (3, 'c')]
What are the two ways to check if a column contains one of several values?
A.
Use the | operator (element-wise OR) to filter on two conditions
var = df[(df['col'] == 'data1') | (df['col'] == 'data2')]
B.
Use .isin() to check whether each value in the column appears in a list.
var = df[df['col'].isin(['data1', 'data2'])]
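A quick check that the two filters select the same rows, using hypothetical values in col. The .isin() form tends to scale better as the list of accepted values grows.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col": ["data1", "data2", "other", "data1"]})

# A. Two conditions joined with | (each condition needs its own parentheses).
with_or = df[(df["col"] == "data1") | (df["col"] == "data2")]

# B. A single membership test.
with_isin = df[df["col"].isin(["data1", "data2"])]

print(with_or.equals(with_isin))
```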
What is the union of two events?
It is the probability that event A or event B happens (or both).
It’s P(A or B)
You add P(A) and P(B) because you’re counting all outcomes that belong to A or to B.
You subtract the overlap, P(A∩B), because those outcomes were counted twice.
What is the formula for union of two events?
P (A or B) = P(A) + P(B) - P(A∩B)
What is the intersection of two events?
It’s P(A and B), the probability that events A and B both happen.
You only consider the overlap between A and B, where both events occur.
What is the formula for the intersection of two events?
P(A and B) = P(A∩B)
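A worked example of the union and intersection formulas, assuming one roll of a fair six-sided die. The events A (even roll) and B (roll greater than 3) are chosen only for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                  # faces of a fair die
A = {n for n in outcomes if n % 2 == 0}      # even roll: {2, 4, 6}
B = {n for n in outcomes if n > 3}           # roll above 3: {4, 5, 6}

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event), len(outcomes))

p_intersection = prob(A & B)                 # outcomes in both A and B
p_union = prob(A | B)                        # outcomes in A or B (or both)

# The addition rule gives the same answer as counting the union directly.
p_union_by_formula = prob(A) + prob(B) - p_intersection
print(p_intersection, p_union, p_union_by_formula)
```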
What is conditional probability?
It is the probability that A happens given that B has already happened. You’re “zooming in” on the subset of outcomes where B happens and asking, “What proportion of those also include A?”
What is the formula for conditional probability?
P(A | B) = P(A∩B) / P(B), defined when P(B) > 0
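A worked example of conditional probability, again assuming one roll of a fair six-sided die with the illustrative events A (even roll) and B (roll greater than 3). It computes P(A | B) from the formula and by "zooming in" on B.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                  # faces of a fair die
A = {n for n in outcomes if n % 2 == 0}      # even roll: {2, 4, 6}
B = {n for n in outcomes if n > 3}           # roll above 3: {4, 5, 6}

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event), len(outcomes))

# Formula: P(A | B) = P(A and B) / P(B)
p_a_given_b = prob(A & B) / prob(B)

# Same idea by restricting attention to B and counting how much of it is in A.
p_by_counting = Fraction(len(A & B), len(B))
print(p_a_given_b, p_by_counting)
```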
Visualize how to find the count of unique values in a column
#filter for the values of interest in the column
query_df = df.query('column == "data" or column == "data2"')
#count the unique values
unique_column_value_count = query_df['column'].nunique()
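A runnable version of the query-then-count pattern on hypothetical data. .nunique() returns how many distinct values remain after filtering, while .value_counts() returns how many rows each of those values has.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"column": ["data", "data2", "data", "other", "data2"]})

# Keep only the rows whose value is "data" or "data2".
query_df = df.query('column == "data" or column == "data2"')

# Number of distinct values left after filtering.
unique_column_value_count = query_df["column"].nunique()

# Number of rows per distinct value.
rows_per_value = query_df["column"].value_counts()
print(unique_column_value_count)
print(rows_per_value)
```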
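Visualize how to convert the column names to lowercase
df.columns = df.columns.str.lower()
Note that df['column'].str.lower() lowercases the values in one column, not the column names.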
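The sketch below separates two operations that are easy to confuse: lowercasing the column labels and lowercasing the text values inside a column. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with mixed-case column names and values.
df = pd.DataFrame({"Veh_Class": ["SUV", "Small Car"], "CMB_MPG": [22, 30]})

# Lowercase the column labels.
df.columns = df.columns.str.lower()
print(df.columns.tolist())

# Lowercase the string values inside one column (returns a new Series).
lower_values = df["veh_class"].str.lower()
print(lower_values.tolist())
```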
Explain what each part of this code means
df.groupby('veh_class').agg(mean_cmb_mpg=('cmb_mpg', 'mean')).reset_index()
df.groupby('veh_class') groups the rows by the veh_class column
mean_cmb_mpg is the name of the new column that will hold the mean values.
('cmb_mpg', 'mean') specifies that you want to calculate the mean of the cmb_mpg column.
.reset_index() turns the veh_class group labels back into a regular column and restores the default 0, 1, 2, ... index.
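A runnable sketch of the named aggregation explained above, using invented mileage figures.

```python
import pandas as pd

# Hypothetical data; only the column names come from the card.
df = pd.DataFrame({
    "veh_class": ["SUV", "SUV", "small car", "small car"],
    "cmb_mpg": [20, 24, 30, 34],
})

# new_column_name=(source_column, aggregation) is pandas' named-aggregation syntax.
result = (
    df.groupby("veh_class")
      .agg(mean_cmb_mpg=("cmb_mpg", "mean"))
      .reset_index()
)
print(result)
```

Without .reset_index(), veh_class would be the index of the result instead of a column.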
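T or F: you can use .agg() with NaNs
True. pandas aggregations such as mean skip NaN values by default, so .agg() still returns a result. Use .dropna() or .fillna() first only when you want missing values handled differently.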
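A quick way to check how .agg() treats missing values. In pandas, aggregations such as mean skip NaN by default, so the grouped mean below is computed from the non-missing rows. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing mileage value.
df = pd.DataFrame({
    "veh_class": ["SUV", "SUV", "small car"],
    "cmb_mpg": [20.0, np.nan, 30.0],
})

# NaN is ignored: the SUV mean uses only the one non-missing value.
skipped = df.groupby("veh_class").agg(mean_cmb_mpg=("cmb_mpg", "mean"))

# Filling first changes the answer, because the filled value now counts.
filled = (
    df.fillna({"cmb_mpg": 0.0})
      .groupby("veh_class")
      .agg(mean_cmb_mpg=("cmb_mpg", "mean"))
)
print(skipped)
print(filled)
```

Whether to skip, drop, or fill missing values is an analysis decision, since each choice can produce a different mean.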