Data Cleaning and Wrangling Flashcards
How do you check for missing values in a DataFrame?
Use df.isnull() to get a DataFrame of the same shape with True for missing values and False otherwise. Use df.isnull().sum() to get the count of missing values in each column.
import pandas as pd
data = {‘A’: [1, 2, None], ‘B’: [4, None, 6]}
df = pd.DataFrame(data)
print(df.isnull()) # Shows True for missing values
print(df.isnull().sum()) # Count of missing values per column
What function would you use to drop rows with missing values?
Use df.dropna(). You can specify axis=0 (default) to drop rows or axis=1 to drop columns.
df = df.dropna() # Drops rows with any missing values
How do you fill missing values with the mean of a column?
Use df[‘column_name’].fillna(df[‘column_name’].mean(), inplace=True).
df[‘A’].fillna(df[‘A’].mean(), inplace=True)
print(df)
What method allows you to rename columns in a DataFrame?
Use df.rename(columns={‘old_name’: ‘new_name’}, inplace=True).
df.rename(columns={‘A’: ‘Column_A’, ‘B’: ‘Column_B’}, inplace=True)
print(df)
How can you merge two DataFrames on a common column?
Use pd.merge(df1, df2, on=’common_column’, how=’inner’). Replace ‘inner’ with ‘outer’, ‘left’, or ‘right’ for different types of joins.
df1 = pd.DataFrame({‘id’: [1, 2], ‘value’: [‘A’, ‘B’]})
df2 = pd.DataFrame({‘id’: [2, 3], ‘value’: [‘C’, ‘D’]})
merged = pd.merge(df1, df2, on=’id’, how=’inner’)
print(merged)
What function do you use to concatenate DataFrames vertically or horizontally?
Use pd.concat([df1, df2], axis=0) for vertical concatenation (rows) or axis=1 for horizontal concatenation (columns).
concat_rows = pd.concat([df1, df2], axis=0)
concat_cols = pd.concat([df1, df2], axis=1)
How do you reorder columns in a DataFrame?
Use df = df[[‘col1’, ‘col2’, ‘col3’]] to specify the desired order of columns.
df = df[[‘B’, ‘A’]]
print(df)
How can you split a column containing delimited strings into multiple columns?
Use df[‘column_name’].str.split(delimiter, expand=True).
df[‘FullName’] = [‘John Doe’, ‘Jane Smith’]
split = df[‘FullName’].str.split(‘ ‘, expand=True)
print(split)
How do you convert a column to a specific data type, e.g., integer or datetime?
Use df[‘column_name’] = df[‘column_name’].astype(int) for integers, or pd.to_datetime(df[‘column_name’]) for datetime.
df[‘Date’] = [‘2024-12-10’, ‘2024-12-11’]
df[‘Date’] = pd.to_datetime(df[‘Date’])
print(df)
What method can you use to find and replace substrings in a column?
Use df[‘column_name’].str.replace(‘old_substring’, ‘new_substring’).
df[‘A’] = [‘abc’, ‘def’]
df[‘A’] = df[‘A’].str.replace(‘a’, ‘z’)
print(df)
How do you pivot a DataFrame from long format to wide format?
Use df.pivot(index=’index_column’, columns=’column_to_pivot’, values=’values_column’).
long_df = pd.DataFrame({‘ID’: [1, 1, 2], ‘Year’: [2023, 2024, 2023], ‘Value’: [10, 20, 15]})
wide_df = long_df.pivot(index=’ID’, columns=’Year’, values=’Value’)
print(wide_df)
How can you reshape a DataFrame from wide format to long format?
Use pd.melt(df, id_vars=’id_column’, value_vars=[‘col1’, ‘col2’]).
wide_df = pd.DataFrame({‘ID’: [1, 2], ‘2023’: [10, 15], ‘2024’: [20, 25]})
long_df = pd.melt(wide_df, id_vars=’ID’, var_name=’Year’, value_name=’Value’)
print(long_df)
What method would you use to remove duplicate rows in a DataFrame?
Use df.drop_duplicates(inplace=True).
df = pd.DataFrame({‘A’: [1, 1, 2], ‘B’: [3, 3, 4]})
df.drop_duplicates(inplace=True)
print(df)
How do you sort a DataFrame by multiple columns?
Use df.sort_values(by=[‘col1’, ‘col2’], ascending=[True, False]).
df = pd.DataFrame({‘A’: [2, 1, 2], ‘B’: [3, 4, 1]})
df.sort_values(by=[‘A’, ‘B’], ascending=[True, False], inplace=True)
print(df)
How can you add a new column to a DataFrame based on a condition?
Use df[‘new_column’] = np.where(df[‘condition_column’] > threshold, ‘Value1’, ‘Value2’).
import numpy as np
df[‘New_Col’] = np.where(df[‘A’] > 1, ‘High’, ‘Low’)
print(df)