Merging Dataframes + handle null values + Aggregation Flashcards by Karthik H

How do you merge two DataFrames in pandas?

You can merge two DataFrames using the pd.merge() function. For example, newBios2 = pd.merge(newBios, noc, left_on='born_country', right_on='NOC', how='left', suffixes=['', '_nocdf']) merges newBios and noc DataFrames on the born_country and NOC columns, respectively.

How well did you know this?

Not at all

Perfectly

What is the purpose of the suffixes parameter in pd.merge()?

The suffixes parameter is used to differentiate columns with the same name in the two DataFrames being merged. For example, suffixes=['', '_nocdf'] will add _nocdf to the columns from the right DataFrame that have the same name as columns in the left DataFrame.

How well did you know this?

Not at all

Perfectly

How can you handle null values in a DataFrame?

You can handle null values using methods like fillna(), interpolate(), and dropna(). For example, coffee.fillna(coffee['Units Sold'].mean()) replaces null values with the mean of the column, while coffee.dropna() removes rows with null values.

How well did you know this?

Not at all

Perfectly

What does the interpolate() method do?

The interpolate() method fills null values by identifying a pattern in the column. It is useful when you want to fill missing values based on the values of neighboring rows. For example, coffee['Units Sold'].interpolate() fills null values in the ‘Units Sold’ column by interpolation.

How well did you know this?

Not at all

Perfectly

How do you aggregate data in pandas?

You can aggregate data using the groupby() method followed by an aggregation function like sum(), mean(), etc. For example, coffee.groupby(['Day', 'Coffee Type']).agg({'Units Sold': 'sum'}) groups the data by ‘Day’ and ‘Coffee Type’ and calculates the sum of ‘Units Sold’ for each group.

How well did you know this?

Not at all

Perfectly

What is the purpose of the agg() function in pandas?

The agg() function is used to apply one or more aggregation functions to a DataFrame or a grouped DataFrame. For example, coffee.groupby('Coffee Type').agg({'Units Sold': 'mean'}) calculates the mean of ‘Units Sold’ for each ‘Coffee Type’.

How well did you know this?

Not at all

Perfectly

How do you pivot a DataFrame in pandas?

You can pivot a DataFrame using the pivot() method. For example, pivot = coffee.pivot(columns='Coffee Type', index='Day', values='Units Sold') creates a pivot table where ‘Coffee Type’ becomes the columns, ‘Day’ becomes the index, and ‘Units Sold’ are the values.

How well did you know this?

Not at all

Perfectly

What does the pivot() method do?

The pivot() method reshapes the DataFrame by turning unique values in a column into new columns. It is useful for creating a summary table where you want to see the relationship between two variables. For example, pivot = coffee.pivot(columns='Coffee Type', index='Day', values='Units Sold') creates a pivot table showing ‘Units Sold’ for each ‘Coffee Type’ and ‘Day’.

How well did you know this?

Not at all

Perfectly

How do you calculate the sum of values in a pivot table?

You can calculate the sum of values in a pivot table using the sum() method. For example, pivot.sum() calculates the sum of values for each column, while pivot.sum(axis=1) calculates the sum of values for each row.

How well did you know this?

Not at all

Perfectly

How do you access a specific value in a pivot table?

You can access a specific value in a pivot table using the loc[] method. For example, pivot.loc['Monday', 'Latte'] retrieves the value for ‘Latte’ on ‘Monday’.

How well did you know this?

Not at all

Perfectly

How do you load data from a CSV file into a DataFrame?

Use pd.read_csv(‘file_path.csv’)

How well did you know this?

Not at all

Perfectly

What does coffee.loc[3] do?

Returns the row at index position 3 in the DataFrame

How well did you know this?

Not at all

Perfectly

How to select multiple rows by index using loc?

coffee.loc[[2, 3, 5]]

How well did you know this?

Not at all

Perfectly

What’s the difference between loc and iloc?

loc uses labels/names, iloc uses numerical indexes

How well did you know this?

Not at all

Perfectly

How to slice rows and select specific columns using loc?

coffee.loc[3:8, ‘Day’]

How well did you know this?

Not at all

Perfectly

What does sort_values() do?

Sorts DataFrame by specified column values

How to sort by multiple columns with different orders?

df.sort_values([col1, col2], ascending=[False, True])

What’s the purpose of iterrows()?

Iterates through DataFrame rows as (index, Series) pairs

How to merge two DataFrames?

pd.merge(df1, df2, on=’common_column’, how=’join_type’)

What’s the difference between inner and left joins?

Inner keeps only matching rows, left keeps all from left DataFrame

How to handle missing values using fillna()?

df.fillna(value) replaces nulls with specified value

What does interpolate() do?

Fills null values by estimating missing values between existing data points

How to drop rows with missing values?

df.dropna()

What’s the purpose of groupby()?

Groups data by specified columns for aggregation

How to calculate mean sales per coffee type?

coffee.groupby('Coffee Type')['Units Sold'].mean()

What's the difference between sum() and mean() in aggregation?

Sum calculates total, mean calculates average

How to create a pivot table in Pandas?

df.pivot(index='rows', columns='columns', values='values')

What does pivot.sum() calculate?

Column-wise totals of the pivot table

How to get row-wise totals in a pivot?

pivot.sum(axis=1)

What's the purpose of the suffixes parameter in merge()?

Adds suffixes to duplicate column names from merged DataFrames

How to rename columns after merging?

Use df.rename(columns={'old':'new'}, inplace=True)

What does coffee.isna().sum() show?

Count of null values in each column

How to filter rows where 'Units Sold' is null?

coffee[coffee['Units Sold'].isna()]

What's the difference between fillna(mean()) and interpolate()?

Mean fills with average, interpolate estimates based on neighbors

How to aggregate multiple statistics at once?

df.agg({'col': ['mean', 'sum', 'max']})

What's the output of coffee['Units Sold'].unique()?

Array of unique values in that column

How to sort a DataFrame by index?

df.sort_index()