Pandas Flashcards
How do you import pandas?
import pandas as pd
How do you load a CSV file into a dataframe?
df = pd.read_csv('data.csv')
How do you load a JSON file into a dataframe?
df = pd.read_json('data.json')
How do you view the first five rows of a dataframe?
df.head()
How do you view the last five rows of a dataframe?
df.tail()
How do you view information about the data?
df.info()
How do you return a new dataframe with no empty cells?
df_new = df.dropna()
How do you return the same dataframe with no empty cells?
df.dropna(inplace = True)
How do you replace empty cells with a value in the entire dataframe?
df.fillna("value", inplace = True)
How do you replace empty cells in specified column(s) with a value?
df["column"].fillna("value", inplace = True)
How do you calculate the MEAN of a column?
x = df["column"].mean()
How do you calculate the MEDIAN of a column?
x = df["column"].median()
How do you calculate the MODE of a column?
x = df["column"].mode()[0]
How do you convert a column to datetime?
df["date"] = pd.to_datetime(df["date"])
How do you remove NULL rows using specific column(s) as a reference?
df.dropna(subset=["column"...], inplace = True)
How do you replace a specific cell value?
df.loc[idx, "column"] = "new_value"
How do you loop through rows?
for x in df.index: print(x)
How do you drop a specific row?
df.drop(idx, inplace = True)
How do you view duplicate rows?
df.duplicated()
How do you drop duplicated rows?
df.drop_duplicates(inplace = True)
How do you drop column(s)?
df.drop(columns=["column", ...], inplace=True)
How do you convert a column to a different data type?
df["col"] = df["col"].astype(type)
Available types: “int”, “float”, “str”, “bool”, “datetime64”
How do you sort by column(s)?
df.sort_values(by=["col", ...], inplace=True)
How do you limit a float column to x decimal places?
df["col"] = df["col"].round(x)
How do you group a DataFrame by a column and calculate the mean of another column?
df.groupby('column_name')['another_column'].mean().reset_index()
How do you perform multiple aggregations (mean, sum) on a column after grouping?
df.groupby('column_name')['another_column'].agg(['mean', 'sum']).reset_index()
How do you specify custom names for the aggregated columns instead of using default names like ‘mean’ or ‘sum’?
df.groupby('column_name')['another_column'].agg( custom_name_1='mean', custom_name_2='sum' ).reset_index()
How do you extract the year from a datetime column?
df['year'] = df['datetime_column'].dt.year
How do you extract the month from a datetime column?
df['month'] = df['datetime_column'].dt.month
How do you extract the day of the week from a datetime column (where Monday is 0 and Sunday is 6)?
df['weekday'] = df['datetime_column'].dt.weekday
How do you extract the day of the month from a datetime column?
df['day_of_month'] = df['datetime_column'].dt.day
How do you extract the hour from a datetime column?
df['hour'] = df['datetime_column'].dt.hour
How do you calculate the difference between two datetime columns (in days)?
df['date_diff'] = (df['datetime_column2'] - df['datetime_column1']).dt.days
How do you filter rows where the datetime column is within a specific date range?
df_filtered = df[(df['datetime_column'] >= '2023-01-01') & (df['datetime_column'] <= '2023-12-31')]
How do you get the current date and time in pandas?
current_datetime = pd.to_datetime('now')
How do you add a specific number of days to a datetime column?
df['new_datetime'] = df['datetime_column'] + pd.Timedelta(days=7)
How do you calculate the difference between two datetime columns in hours?
df['time_diff_hours'] = (df['datetime_column2'] - df['datetime_column1']).dt.total_seconds() / 3600
How do you extract the quarter from a datetime column?
df['quarter'] = df['datetime_column'].dt.quarter
How do you convert a datetime column to a string in a specific format?
df['datetime_str'] = df['datetime_column'].dt.strftime('%Y-%m-%d %H:%M:%S')
How do you extract the week number from a datetime column?
df['week_number'] = df['datetime_column'].dt.isocalendar().week
How do you filter rows where the datetime column is in the last 30 days?
df_filtered = df[df['datetime_column'] >= pd.to_datetime('today') - pd.Timedelta(days=30)]
How do you convert a datetime column to a different timezone?
df['datetime_column'] = df['datetime_column'].dt.tz_localize('UTC').dt.tz_convert('America/New_York')
How do you replace all occurrences of a specific value in a column with another value?
df['column_name'] = df['column_name'].replace(old_value, new_value)
How do you filter rows based on a condition applied to a column?
df_filtered = df[df['column_name'] > threshold]
How do you rename columns in a DataFrame?
df = df.rename(columns={'old_name': 'new_name', ...})
How do you create a new column based on applying a function to another column?
df['new_column'] = df['existing_column'].apply(function)
How do you sort a DataFrame by one or more columns?
df_sorted = df.sort_values(by=['column_name1', 'column_name2'], ascending=[True, False])
How do you get the unique values from a column?
unique_values = df['column_name'].unique()
How do you count the number of occurrences of each unique value in a column?
value_counts = df['column_name'].value_counts()
How do you calculate the cumulative sum of a column?
df['cumulative_sum'] = df['column_name'].cumsum()
How do you merge two DataFrames on multiple columns?
df_merged = df1.merge(df2, on=['column1', 'column2'], how='inner')
How do you concatenate multiple DataFrames vertically (stacking them on top of each other)?
df_concat = pd.concat([df1, df2], ignore_index=True)
How do you sample a random subset of rows from a DataFrame?
df_sample = df.sample(n=100)
How do you create a new column with the result of the multiplication between two columns?
df['new_column'] = df['column1'] * df['column2']
How do you set a specific column as the index of a DataFrame?
df.set_index('column_name', inplace=True)
How do you create a new DataFrame by filtering rows based on multiple conditions?
df_filtered = df[(df['column1'] > threshold1) & (df['column2'] < threshold2)]