Merging Dataframes + handle null values + Aggregation Flashcards
How do you merge two DataFrames in pandas?
You can merge two DataFrames using the pd.merge()
function. For example, newBios2 = pd.merge(newBios, noc, left_on='born_country', right_on='NOC', how='left', suffixes=['', '_nocdf'])
merges newBios
and noc
DataFrames on the born_country
and NOC
columns, respectively.
What is the purpose of the suffixes
parameter in pd.merge()
?
The suffixes
parameter is used to differentiate columns with the same name in the two DataFrames being merged. For example, suffixes=['', '_nocdf']
will add _nocdf
to the columns from the right DataFrame that have the same name as columns in the left DataFrame.
How can you handle null values in a DataFrame?
You can handle null values using methods like fillna()
, interpolate()
, and dropna()
. For example, coffee.fillna(coffee['Units Sold'].mean())
replaces null values with the mean of the column, while coffee.dropna()
removes rows with null values.
What does the interpolate()
method do?
The interpolate()
method fills null values by identifying a pattern in the column. It is useful when you want to fill missing values based on the values of neighboring rows. For example, coffee['Units Sold'].interpolate()
fills null values in the ‘Units Sold’ column by interpolation.
How do you aggregate data in pandas?
You can aggregate data using the groupby()
method followed by an aggregation function like sum()
, mean()
, etc. For example, coffee.groupby(['Day', 'Coffee Type']).agg({'Units Sold': 'sum'})
groups the data by ‘Day’ and ‘Coffee Type’ and calculates the sum of ‘Units Sold’ for each group.
What is the purpose of the agg()
function in pandas?
The agg()
function is used to apply one or more aggregation functions to a DataFrame or a grouped DataFrame. For example, coffee.groupby('Coffee Type').agg({'Units Sold': 'mean'})
calculates the mean of ‘Units Sold’ for each ‘Coffee Type’.
How do you pivot a DataFrame in pandas?
You can pivot a DataFrame using the pivot()
method. For example, pivot = coffee.pivot(columns='Coffee Type', index='Day', values='Units Sold')
creates a pivot table where ‘Coffee Type’ becomes the columns, ‘Day’ becomes the index, and ‘Units Sold’ are the values.
What does the pivot()
method do?
The pivot()
method reshapes the DataFrame by turning unique values in a column into new columns. It is useful for creating a summary table where you want to see the relationship between two variables. For example, pivot = coffee.pivot(columns='Coffee Type', index='Day', values='Units Sold')
creates a pivot table showing ‘Units Sold’ for each ‘Coffee Type’ and ‘Day’.
How do you calculate the sum of values in a pivot table?
You can calculate the sum of values in a pivot table using the sum()
method. For example, pivot.sum()
calculates the sum of values for each column, while pivot.sum(axis=1)
calculates the sum of values for each row.
How do you access a specific value in a pivot table?
You can access a specific value in a pivot table using the loc[]
method. For example, pivot.loc['Monday', 'Latte']
retrieves the value for ‘Latte’ on ‘Monday’.
How do you load data from a CSV file into a DataFrame?
Use pd.read_csv(‘file_path.csv’)
What does coffee.loc[3] do?
Returns the row at index position 3 in the DataFrame
How to select multiple rows by index using loc?
coffee.loc[[2, 3, 5]]
What’s the difference between loc and iloc?
loc uses labels/names, iloc uses numerical indexes
How to slice rows and select specific columns using loc?
coffee.loc[3:8, ‘Day’]
What does sort_values() do?
Sorts DataFrame by specified column values
How to sort by multiple columns with different orders?
df.sort_values([col1, col2], ascending=[False, True])
What’s the purpose of iterrows()?
Iterates through DataFrame rows as (index, Series) pairs
How to merge two DataFrames?
pd.merge(df1, df2, on=’common_column’, how=’join_type’)
What’s the difference between inner and left joins?
Inner keeps only matching rows, left keeps all from left DataFrame
How to handle missing values using fillna()?
df.fillna(value) replaces nulls with specified value
What does interpolate() do?
Fills null values by estimating missing values between existing data points
How to drop rows with missing values?
df.dropna()
What’s the purpose of groupby()?
Groups data by specified columns for aggregation