Data Cleaning and Wrangling Flashcards

Question 1

Q

How do you check for missing values in a DataFrame?

Answer

A

Use df.isnull() to get a DataFrame of the same shape with True for missing values and False otherwise. Use df.isnull().sum() to get the count of missing values in each column.

import pandas as pd
data = {‘A’: [1, 2, None], ‘B’: [4, None, 6]}
df = pd.DataFrame(data)
print(df.isnull()) # Shows True for missing values
print(df.isnull().sum()) # Count of missing values per column

Question 2

Q

What function would you use to drop rows with missing values?

Answer

A

Use df.dropna(). You can specify axis=0 (default) to drop rows or axis=1 to drop columns.

df = df.dropna() # Drops rows with any missing values

Question 3

Q

How do you fill missing values with the mean of a column?

Answer

A

Use df[‘column_name’].fillna(df[‘column_name’].mean(), inplace=True).

df[‘A’].fillna(df[‘A’].mean(), inplace=True)
print(df)

Question 4

Q

What method allows you to rename columns in a DataFrame?

Answer

A

Use df.rename(columns={‘old_name’: ‘new_name’}, inplace=True).

df.rename(columns={‘A’: ‘Column_A’, ‘B’: ‘Column_B’}, inplace=True)
print(df)

Question 5

Q

How can you merge two DataFrames on a common column?

Answer

A

Use pd.merge(df1, df2, on=’common_column’, how=’inner’). Replace ‘inner’ with ‘outer’, ‘left’, or ‘right’ for different types of joins.

df1 = pd.DataFrame({‘id’: [1, 2], ‘value’: [‘A’, ‘B’]})
df2 = pd.DataFrame({‘id’: [2, 3], ‘value’: [‘C’, ‘D’]})
merged = pd.merge(df1, df2, on=’id’, how=’inner’)
print(merged)

Question 6

Q

What function do you use to concatenate DataFrames vertically or horizontally?

Answer

A

Use pd.concat([df1, df2], axis=0) for vertical concatenation (rows) or axis=1 for horizontal concatenation (columns).

concat_rows = pd.concat([df1, df2], axis=0)
concat_cols = pd.concat([df1, df2], axis=1)

Question 7

Q

How do you reorder columns in a DataFrame?

Answer

A

Use df = df[[‘col1’, ‘col2’, ‘col3’]] to specify the desired order of columns.

df = df[[‘B’, ‘A’]]
print(df)

Question 8

Q

How can you split a column containing delimited strings into multiple columns?

Answer

A

Use df[‘column_name’].str.split(delimiter, expand=True).

df[‘FullName’] = [‘John Doe’, ‘Jane Smith’]
split = df[‘FullName’].str.split(‘ ‘, expand=True)
print(split)

Question 9

Q

How do you convert a column to a specific data type, e.g., integer or datetime?

Answer

A

Use df[‘column_name’] = df[‘column_name’].astype(int) for integers, or pd.to_datetime(df[‘column_name’]) for datetime.

df[‘Date’] = [‘2024-12-10’, ‘2024-12-11’]
df[‘Date’] = pd.to_datetime(df[‘Date’])
print(df)

Question 10

Q

What method can you use to find and replace substrings in a column?

Answer

A

Use df[‘column_name’].str.replace(‘old_substring’, ‘new_substring’).

df[‘A’] = [‘abc’, ‘def’]
df[‘A’] = df[‘A’].str.replace(‘a’, ‘z’)
print(df)

Question 11

Q

How do you pivot a DataFrame from long format to wide format?

Answer

A

Use df.pivot(index=’index_column’, columns=’column_to_pivot’, values=’values_column’).

long_df = pd.DataFrame({‘ID’: [1, 1, 2], ‘Year’: [2023, 2024, 2023], ‘Value’: [10, 20, 15]})
wide_df = long_df.pivot(index=’ID’, columns=’Year’, values=’Value’)
print(wide_df)

Question 12

Q

How can you reshape a DataFrame from wide format to long format?

Answer

A

Use pd.melt(df, id_vars=’id_column’, value_vars=[‘col1’, ‘col2’]).

wide_df = pd.DataFrame({‘ID’: [1, 2], ‘2023’: [10, 15], ‘2024’: [20, 25]})
long_df = pd.melt(wide_df, id_vars=’ID’, var_name=’Year’, value_name=’Value’)
print(long_df)

Question 13

Q

What method would you use to remove duplicate rows in a DataFrame?

Answer

A

Use df.drop_duplicates(inplace=True).

df = pd.DataFrame({‘A’: [1, 1, 2], ‘B’: [3, 3, 4]})
df.drop_duplicates(inplace=True)
print(df)

Question 14

Q

How do you sort a DataFrame by multiple columns?

Answer

A

Use df.sort_values(by=[‘col1’, ‘col2’], ascending=[True, False]).

df = pd.DataFrame({‘A’: [2, 1, 2], ‘B’: [3, 4, 1]})
df.sort_values(by=[‘A’, ‘B’], ascending=[True, False], inplace=True)
print(df)

Question 15

Q

How can you add a new column to a DataFrame based on a condition?

Answer

A

Use df[‘new_column’] = np.where(df[‘condition_column’] > threshold, ‘Value1’, ‘Value2’).

import numpy as np
df[‘New_Col’] = np.where(df[‘A’] > 1, ‘High’, ‘Low’)
print(df)

Data Cleaning and Wrangling Flashcards

(15 cards)