Data Preprocessing/Cleaning Flashcards

Question 1

Q

How do you print the first 20 rows of a dataframe called df?

Answer

A

df.head(20)

Question 2

Q

What does the .info() method called on a pandas dataframe show?

Answer

A

The object type, the column names and their data types, the non-null count for each column, the memory usage of the dataframe, and the range of the row index.

Question 3

Q

How do you print the column names of a pandas dataframe, df?

Answer

A

df.columns

Question 4

Q

How do you remove spaces from all the column names of a dataframe, df?

Answer

A

df.columns = df.columns.str.replace(" ", "")
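
A minimal sketch of this card, assuming a small made-up dataframe (the column names are hypothetical). Because str.replace returns a new index rather than modifying the existing one, the result has to be assigned back to df.columns for the rename to take effect.

```python
import pandas as pd

# Hypothetical dataframe whose column names contain spaces
df = pd.DataFrame({" model year": [70, 71], "time to 60": [12.0, 11.5]})

# str.replace returns a new Index; assign it back to rename the columns.
# regex=False treats " " as a literal string rather than a pattern.
df.columns = df.columns.str.replace(" ", "", regex=False)

cleaned = list(df.columns)
print(cleaned)
```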

Question 5

Q

How do you return basic summary statistics for the continuous variables in a pandas dataframe, df, and what does this method show?

Answer

A

df.describe():

Count (non-null values)
Mean
Std
Min
25% Quantile
50% Quantile
75% Quantile
Max

Question 6

Q

How do you print a column (series) of a dataframe, df?

What does this command also return?

Answer

A

df['column name']
df.column_name

This will also return the number of rows as length and the datatype.

Question 7

Q

How do you change the data type of a pandas dataframe, df, column?

Answer

A

df['column'] = df.column.astype('x')
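
As an illustration, the sketch below converts a column of numeric strings to integers. The dataframe and the target type "int64" are assumptions made for the example. astype returns a new Series, so it is assigned back to the column.

```python
import pandas as pd

# Hypothetical column stored as strings
df = pd.DataFrame({"cylinders": ["4", "6", "8"]})

# astype returns a new Series; assign it back to change the column's dtype
df["cylinders"] = df["cylinders"].astype("int64")

print(df["cylinders"].dtype)
```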

Question 8

Q

Use a lambda function to strip the leading whitespace from every value in a categorical column of a dataframe, df, and remove all periods from the category names.

Answer

A

df.column.apply(lambda x: x.lstrip().replace(".", ""))
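
A short sketch of the lambda approach on made-up category labels, next to a vectorized alternative that uses the .str accessor. The column name "brand" and its values are assumptions for the example.

```python
import pandas as pd

# Hypothetical labels with leading spaces and trailing periods
df = pd.DataFrame({"brand": [" US.", " Europe.", "Japan."]})

# Row-wise version from the card: strip leading whitespace, drop periods
via_lambda = df["brand"].apply(lambda x: x.lstrip().replace(".", ""))

# Vectorized equivalent using the .str accessor
via_str = df["brand"].str.lstrip().str.replace(".", "", regex=False)

print(via_lambda.tolist())
```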

Question 9

Q

How would you show how many cars were manufactured for each origin country in the cars dataset, for every year?

Answer

A

pd.crosstab(cars.year, cars.brand)
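
To see what the cross-tabulation returns, here is a sketch on a tiny made-up stand-in for the cars dataset. It assumes, as the answer above does, that the brand column holds the origin (US, Europe, Japan).

```python
import pandas as pd

# Tiny made-up stand-in for the cars dataset
cars = pd.DataFrame({
    "year": [1970, 1970, 1970, 1971, 1971],
    "brand": ["US", "US", "Japan", "Europe", "US"],
})

# Rows are years, columns are brands, cells are row counts
counts = pd.crosstab(cars["year"], cars["brand"])
print(counts)
```

Combinations that never occur, such as Japan in 1971 here, appear as 0 rather than being omitted.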

Question 10

Q

What are two methods for returning rows and columns from a dataframe, df, with a specified condition?

Answer

A

df.loc[] (label-based; also accepts a boolean condition)

df.iloc[] (integer position-based)
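
The difference between the two indexers is easiest to see side by side. The sketch below uses a hypothetical three-row dataframe with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"mpg": [18.0, 30.5, 24.0], "cylinders": [8, 4, 6]},
    index=["a", "b", "c"],
)

# loc: label-based; accepts a boolean condition for rows and column labels
efficient = df.loc[df["mpg"] > 20, ["mpg"]]

# iloc: position-based; first two rows, second column
first_two = df.iloc[:2, 1]

print(efficient)
print(first_two)
```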

Question 11

Q

How do you return the number of NaN values in each column of a dataframe, df?

Answer

A

df.isna().sum()
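
A small sketch on made-up data: isna() yields a boolean dataframe, and summing it counts the True values column by column. Summing a second time gives the total for the whole dataframe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"mpg": [18.0, np.nan, 24.0], "hp": [np.nan, np.nan, 95.0]})

per_column = df.isna().sum()      # NaN count for each column
total = int(per_column.sum())     # NaN count for the whole dataframe

print(per_column)
```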

Question 12

Q

How do you fill all of the NaN values of the cylinders column with the rounded mean of that column?

Answer

A

cars["cylinders"] = cars.cylinders.fillna(round(np.nanmean(cars.cylinders)))
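
A sketch of the fill on a made-up cylinders column. Assigning the result back to the column, rather than relying on inplace=True on a selected column, avoids chained-assignment problems in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
cars = pd.DataFrame({"cylinders": [4.0, 8.0, np.nan, 6.0, 8.0, 8.0]})

# Mean of the non-missing values is 34 / 5 = 6.8, which rounds to 7
fill_value = round(np.nanmean(cars["cylinders"]))

cars["cylinders"] = cars["cylinders"].fillna(fill_value)
print(cars["cylinders"].tolist())
```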

Question 13

Q

How do you calculate the mode of a categorical column, brand, excluding its “Missing” category?

Then how do you replace all of the “Missing” values with the computed mode value?

Then how do you remove the “Missing” category?

Answer

A

mode_brand = cars.brand[cars.brand != "Missing"].mode()

cars.loc[cars.brand == "Missing", "brand"] = mode_brand[0]
cars["brand"] = cars.brand.cat.remove_unused_categories()
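
The three steps can be run end to end on a made-up brand column, as sketched below. It assumes the column has the pandas category dtype, which the .cat accessor requires.

```python
import pandas as pd

cars = pd.DataFrame({"brand": ["US", "Missing", "Japan", "US", "Missing"]})
cars["brand"] = cars["brand"].astype("category")

# Mode computed over rows that are not "Missing"
mode_brand = cars["brand"][cars["brand"] != "Missing"].mode()

# Overwrite the placeholder rows, then drop the now-unused category
cars.loc[cars["brand"] == "Missing", "brand"] = mode_brand[0]
cars["brand"] = cars["brand"].cat.remove_unused_categories()

print(cars["brand"].tolist())
```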

Question 14

Q

How do you calculate the min and max of a column of a pandas dataframe, df, using numpy?

Answer

A

np.min(df.column)

np.max(df.column)

Question 15

Q

How do you perform min-max normalization?

Answer

A

X(min-max) = (X - X(min)) / (X(max) - X(min))
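
As a worked instance of the formula, the sketch below rescales a small made-up array with numpy. Both the numerator and the denominator need their own parentheses, since division binds more tightly than subtraction.

```python
import numpy as np

x = np.array([10.0, 20.0, 25.0, 50.0])

# (X - min) / (max - min); the parentheses matter
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax)
```

The smallest value maps to 0 and the largest to 1.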

Question 16

Q

What is one benefit and one drawback of min-max normalization?

Answer

A

Benefit: every feature is rescaled to the range 0 to 1, with no negative values, so all features share exactly the same scale.

Drawback: it is sensitive to outliers.
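
The outlier sensitivity can be seen in a small experiment. The sketch below uses a hypothetical helper, min_max, and made-up values. Adding a single extreme point compresses all the ordinary points into a tiny slice of the 0 to 1 range.

```python
import numpy as np

def min_max(x):
    """Min-max normalization of a 1-D numpy array (illustrative helper)."""
    return (x - x.min()) / (x.max() - x.min())

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 1000.0)

scaled = min_max(values)
scaled_outlier = min_max(with_outlier)

# The five ordinary points are squeezed below 0.005 once 1000 is present
print(scaled)
print(scaled_outlier)
```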

Question 17

Q

How do you import MinMaxScaler from sklearn and use it to transform all the columns of the cars dataset (besides the last categorical column)?

Answer

A

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
cars_minmax = scaler.fit_transform(cars.iloc[:, :-1])
cars_minmax = pd.DataFrame(cars_minmax)

Question 18

Q

How is the Z-Score transformation function defined?

Answer

A

X(z-score) = (X - X(mean)) / X(std)
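
A worked instance of the z-score formula in numpy, using a made-up array whose mean is 5 and whose population standard deviation is 2. The sketch assumes the population standard deviation (ddof=0), which is numpy's default.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation (ddof=0), the convention np.std defaults to
z = (x - x.mean()) / x.std()

print(z)
```

After the transformation the values have mean 0 and standard deviation 1.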

Question 19

Q

How do you calculate the standard deviation of a dataframe, df, column using numpy?

Answer

A

np.nanstd(df.column)

Question 20

Q

How do you fit a standard normalization transformation to all columns of the cars dataframe except for the last one using sklearn.preprocessing?

Answer

A

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cars_zscore = scaler.fit_transform(cars.iloc[:,:-1])
cars_zscore = pd.DataFrame(cars_zscore)
cars_zscore.columns = cars.iloc[:, :-1].columns

Question 21

Q

How do you achieve normality?

Answer

A

Achieve symmetry

Question 22

Q

How do you achieve symmetry?

Answer

A

Reduce skewness

Question 23

Q

How do you calculate the skewness of a distribution?

Answer

A

3 * (mean - median) / std (Pearson's second skewness coefficient)
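
The formula above is known as Pearson's second (median) skewness coefficient. Below is a minimal sketch of it in numpy. The function name and the sample data are made up for illustration, and the population standard deviation is assumed.

```python
import numpy as np

def pearson_median_skewness(x):
    """3 * (mean - median) / std, using the population std (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return 3 * (np.mean(x) - np.median(x)) / np.std(x)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # long tail to the right
symmetric = [1, 2, 3, 4, 5]                # mean equals median

skew_right = pearson_median_skewness(right_skewed)
skew_sym = pearson_median_skewness(symmetric)

print(skew_right, skew_sym)
```

A positive result indicates a right skew, and a result of zero means the mean and median coincide.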

Question 24

Q

How do you interpret skewness based on mean and median values?

Answer

A

mean > median | Right skew
mean < median | Left skew
mean = median | No skew (symmetric distribution, e.g. normal)
(positive number : right)
(negative number: left)

Question 25

Q

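Does min-max or z-score standardization affect skewness?

Answer

A

No. Both are linear rescalings (subtract a constant, then divide by a positive constant), so they change the location and scale of a distribution but not its shape.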
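
One way to check this empirically is to compute the same skewness measure before and after each rescaling. The sketch below uses made-up data and the mean-median skewness formula from the earlier card. Both rescalings are linear, so the skewness comes out unchanged.

```python
import numpy as np

def skewness(x):
    """3 * (mean - median) / std, using the population std (ddof=0)."""
    return 3 * (x.mean() - np.median(x)) / x.std()

x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0])

x_minmax = (x - x.min()) / (x.max() - x.min())   # min-max normalization
x_zscore = (x - x.mean()) / x.std()              # z-score standardization

print(skewness(x), skewness(x_minmax), skewness(x_zscore))
```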

Question 26

Q

What are the three ways to reduce skewness?

Answer

A

Log Transformation
Square root transformation
Inverted Square Root Transformation

Question 27

Q

How do you apply a log transformation to a numerical column?

Answer

A

df.column.apply(np.log)

Question 28

Q

How do you apply a square root transformation to a numerical column?

Answer

A

df.column.apply(np.sqrt)

Question 29

Q

How do you apply an inverted square root transformation to a numerical column?

Answer

A

df.column.apply(lambda x: np.reciprocal(np.sqrt(x)))
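
The three transformations from the preceding cards can be compared on a made-up right-skewed column, as sketched below. Skewness is measured here with pandas' Series.skew(), a moment-based measure that differs from the mean-median formula used earlier. All three transformations assume strictly positive values.

```python
import numpy as np
import pandas as pd

# Made-up, strictly positive, right-skewed data
df = pd.DataFrame({"column": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 50.0, 200.0]})

skew_raw = df["column"].skew()
skew_log = df["column"].apply(np.log).skew()
skew_sqrt = df["column"].apply(np.sqrt).skew()
skew_inv_sqrt = df["column"].apply(lambda x: np.reciprocal(np.sqrt(x))).skew()

print(skew_raw, skew_log, skew_sqrt, skew_inv_sqrt)
```

On this sample each transformed column is less skewed in absolute terms than the raw one. How much each transformation helps depends on the data.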

Question 30

Q

How do you show a Q-Q plot for a data distribution using scipy.stats?

Answer

A

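import matplotlib.pyplot as plt
import scipy.stats as stats

# Compare the column's quantiles against a normal distribution
stats.probplot(distribution_column, dist="norm", plot=plt)
plt.show()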
