Pandas Flashcards
Create a DataFrame from a dictionary where keys are column names and values are lists of data.
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie'],
        'age': [25, 30, 35]}
df = pd.DataFrame(data)
Select a single column named ‘age’ from a DataFrame named ‘df’.
age_column = df['age']
Filter rows where the ‘age’ column is greater than 30.
filtered_df = df[df['age'] > 30]
Add a new column named ‘income’ to the DataFrame ‘df’ with random income values.
import numpy as np
df['income'] = np.random.randint(30000, 80000, size=len(df))
Group the DataFrame ‘df’ by the ‘gender’ column and calculate the mean age for each group.
grouped_df = df.groupby('gender')['age'].mean()
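To try this card on concrete data, here is a minimal sketch. The `people` frame and its values are made up for illustration; the deck itself never defines a 'gender' column.

```python
import pandas as pd

# Hypothetical sample data (not from the deck); 'gender' is assumed to exist.
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'gender': ['female', 'male', 'male', 'female'],
    'age': [25, 30, 35, 45],
})

# Mean age per gender; the result is a Series indexed by the group labels.
mean_age_by_gender = people.groupby('gender')['age'].mean()
print(mean_age_by_gender)
```

The result is a Series, not a DataFrame; calling `.reset_index()` on it gives a two-column DataFrame if that shape is easier to work with.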
Sort the DataFrame ‘df’ by the ‘age’ column in descending order.
sorted_df = df.sort_values(by='age', ascending=False)
Replace missing values in the ‘income’ column with the mean income.
df['income'] = df['income'].fillna(df['income'].mean())
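A small sketch of this card on sample data, using the assign-back form, since calling `fillna(..., inplace=True)` on a selected column triggers a chained-assignment warning in recent pandas versions and may not update `df` at all. The `incomes` frame is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing income.
incomes = pd.DataFrame({'income': [40000.0, np.nan, 60000.0]})

# mean() skips NaN by default, so the fill value is the mean of the known values.
fill_value = incomes['income'].mean()

# Assign the filled column back instead of mutating a selection in place.
incomes['income'] = incomes['income'].fillna(fill_value)
print(incomes)
```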
Merge two DataFrames ‘df1’ and ‘df2’ based on a common key column.
merged_df = pd.merge(df1, df2, on='common_key_column')
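`pd.merge` defaults to an inner join, so rows whose key appears in only one frame are dropped. The sketch below contrasts that default with `how='left'`; the `left` and `right` frames and their values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames that share only some key values.
left = pd.DataFrame({'common_key_column': [1, 2, 3],
                     'name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'common_key_column': [2, 3, 4],
                      'income': [50000, 60000, 70000]})

# Default how='inner': only keys present in both frames (2 and 3) survive.
inner = pd.merge(left, right, on='common_key_column')

# how='left': every row of `left` is kept; unmatched rows get NaN for income.
left_join = pd.merge(left, right, on='common_key_column', how='left')

print(inner)
print(left_join)
```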
Drop the column ‘income’ from the DataFrame ‘df’.
df.drop(columns=['income'], inplace=True)
Rename the column ‘age’ to ‘years’ in the DataFrame ‘df’.
df.rename(columns={'age': 'years'}, inplace=True)
Calculate the sum of the ‘income’ column in the DataFrame ‘df’.
total_income = df['income'].sum()
Apply a function that converts all ‘age’ values to months in the DataFrame ‘df’.
df['age'] = df['age'].apply(lambda x: x * 12)
Pivot the DataFrame ‘df’ with ‘gender’ as index and ‘age’ as columns, filling NaNs with 0.
pivoted_df = df.pivot_table(index='gender', columns='age', fill_value=0)
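The card leaves `values` and `aggfunc` unspecified, so pandas aggregates the remaining columns with the default mean. A sketch that names both explicitly may make the result easier to read; the `people` frame and the choice of 'income' as the value column are assumptions.

```python
import pandas as pd

# Hypothetical data; 'income' is the column being aggregated.
people = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'female'],
    'age': [25, 30, 30, 45],
    'income': [40000, 50000, 70000, 60000],
})

# Rows are genders, columns are distinct ages, cells are mean income.
# Combinations that never occur are filled with 0 instead of NaN.
pivoted = people.pivot_table(index='gender', columns='age',
                             values='income', aggfunc='mean', fill_value=0)
print(pivoted)
```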
Convert a column named ‘date’ to datetime format in the DataFrame ‘df’.
df['date'] = pd.to_datetime(df['date'])
Select rows where the ‘gender’ column is ‘female’.
female_rows = df[df['gender'] == 'female']
Create dummy variables for the ‘gender’ column in the DataFrame ‘df’.
dummy_df = pd.get_dummies(df, columns=['gender'])
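As a quick illustration of what the call produces, the sketch below runs it on a made-up `people` frame. Each distinct value becomes its own indicator column named `gender_<value>`, and the original 'gender' column is removed.

```python
import pandas as pd

# Hypothetical sample data.
people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'gender': ['female', 'male', 'male'],
})

# One indicator column per distinct gender value; other columns pass through.
dummies = pd.get_dummies(people, columns=['gender'])
print(dummies.columns.tolist())
```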
Concatenate two DataFrames ‘df1’ and ‘df2’ vertically.
concatenated_df = pd.concat([df1, df2])
Randomly sample 10 rows from the DataFrame ‘df’.
sampled_df = df.sample(n=10)
Replace all occurrences of ‘Male’ with ‘M’ and ‘Female’ with ‘F’ in the ‘gender’ column.
df['gender'] = df['gender'].replace({'Male': 'M', 'Female': 'F'})
Export the DataFrame ‘df’ to a CSV file named ‘data.csv’.
df.to_csv('data.csv', index=False)
Find the unique values in the ‘category’ column of the DataFrame ‘df’.
unique_categories = df['category'].unique()
Calculate the mean, median, and standard deviation of the ‘height’ column in the DataFrame ‘df’.
mean_height = df['height'].mean()
median_height = df['height'].median()
std_dev_height = df['height'].std()
Calculate the 75th percentile of the ‘income’ column in the DataFrame ‘df’.
percentile_75 = df['income'].quantile(0.75)
Extract the first word from each entry in the ‘name’ column of the DataFrame ‘df’.
first_word = df['name'].str.split().str.get(0)
Convert the ‘grade’ column in the DataFrame ‘df’ to categorical type.
df['grade'] = df['grade'].astype('category')
Remove duplicate rows from the DataFrame ‘df’ based on all columns.
df.drop_duplicates(inplace=True)
Fill missing values in the ‘weight’ column with the median weight in the DataFrame ‘df’.
median_weight = df['weight'].median()
df['weight'] = df['weight'].fillna(median_weight)
Merge two DataFrames ‘df1’ and ‘df2’ on ‘key_column’ from ‘df1’ and ‘other_key_column’ from ‘df2’.
merged_df = pd.merge(df1, df2, left_on='key_column', right_on='other_key_column')
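When the key columns have different names, the merged result keeps both of them. The sketch below shows this and one way to drop the redundant copy afterwards; the `left` and `right` frames are made-up examples.

```python
import pandas as pd

# Hypothetical frames whose key columns are named differently.
left = pd.DataFrame({'key_column': [1, 2], 'name': ['Alice', 'Bob']})
right = pd.DataFrame({'other_key_column': [1, 2], 'income': [40000, 50000]})

# Both key columns appear in the result because their names differ.
merged = pd.merge(left, right,
                  left_on='key_column', right_on='other_key_column')

# Optionally drop the duplicate key once the join is done.
deduped = merged.drop(columns=['other_key_column'])
print(deduped)
```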
Calculate the cumulative sum of the ‘sales’ column in the DataFrame ‘df’.
cumulative_sales = df['sales'].cumsum()
Convert the timezone of the ‘timestamp’ column to ‘UTC’ in the DataFrame ‘df’.
# Requires tz-aware timestamps; localize naive ones first with .dt.tz_localize(...)
df['timestamp'] = df['timestamp'].dt.tz_convert('UTC')
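`dt.tz_convert` raises a TypeError on timezone-naive timestamps, so naive data has to be given a zone first. A minimal sketch, assuming the raw timestamps were recorded in 'America/New_York' (that zone and the `events` frame are assumptions for illustration):

```python
import pandas as pd

# Hypothetical naive timestamps (no timezone attached).
events = pd.DataFrame({
    'timestamp': pd.to_datetime(['2024-01-01 12:00', '2024-06-01 12:00']),
})

# Step 1: declare which zone the naive values were recorded in.
events['timestamp'] = events['timestamp'].dt.tz_localize('America/New_York')

# Step 2: convert the now tz-aware values to UTC.
events['timestamp'] = events['timestamp'].dt.tz_convert('UTC')
print(events)
```

`tz_localize` only labels the existing wall-clock times with a zone, while `tz_convert` shifts the displayed time so that it refers to the same instant in the target zone.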