Data Preprocessing/Cleaning Flashcards

1
Q

How do you print the first 20 rows of a dataframe called df?

A

df.head(20)

2
Q

What does the .info() method called on a pandas dataframe show?

A

The object type, the column names and their data types, the non-null count for each column, the memory usage of the dataframe, and the range of the row index.

3
Q

How do you print the column names of a pandas dataframe, df?

A

df.columns

4
Q

How do you remove the spaces from every column name of a dataframe, df?

A

df.columns = df.columns.str.replace(" ", "")
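
A minimal sketch of this on a toy dataframe; the column names here are made up for illustration. The string method returns a new Index rather than changing df, so the result has to be assigned back to df.columns.

```python
import pandas as pd

# Toy dataframe with hypothetical column names that contain spaces
df = pd.DataFrame({" model year": [70, 71], "time to 60": [12.0, 11.5]})

# str.replace returns a new Index, so assign it back to df.columns
df.columns = df.columns.str.replace(" ", "")

print(list(df.columns))  # ['modelyear', 'timeto60']
```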

5
Q

How do you return basic summary statistics for the continuous variables in a pandas dataframe, df, and what does this method show?

A

df.describe():

Count (non-null values)
Mean
Std
Min
25% Quantile
50% Quantile
75% Quantile
Max
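
As a quick check of the list above, the sketch below calls describe() on a toy dataframe (the values are invented for illustration) and inspects the row labels of the result. By default only numeric columns are summarized.

```python
import pandas as pd

# Toy data: two numeric columns and one text column (hypothetical values)
df = pd.DataFrame({
    "mpg": [14.0, 31.9, 17.0, 25.4],
    "cylinders": [8, 4, 8, 6],
    "brand": ["US", "Japan", "US", "Europe"],
})

summary = df.describe()

print(list(summary.index))    # ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(list(summary.columns))  # ['mpg', 'cylinders'] (the text column is skipped)
```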
6
Q

How do you print a column (series) of a dataframe, df?

What does this command also return?

A

df['column name']
df.column_name

The printed output also shows the column name and the data type, plus the length (number of rows) when the display is truncated.

7
Q

How do you change the data type of a pandas dataframe, df, column?

A

df["column"] = df["column"].astype("x")
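
A small sketch of a type conversion, assuming a hypothetical column of numbers stored as strings. astype returns a new series, so the result is assigned back to the column.

```python
import pandas as pd

# Hypothetical column where the numbers were read in as strings
df = pd.DataFrame({"cylinders": ["4", "6", "8"]})

# astype returns a converted copy; assign it back to keep the change
df["cylinders"] = df["cylinders"].astype("int64")

print(df["cylinders"].dtype)  # int64
```

If some entries cannot be parsed as numbers, astype raises an error; pd.to_numeric(df["cylinders"], errors="coerce") is one alternative that turns such entries into NaN instead.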

8
Q

Use a lambda function to strip the leading whitespace from every value of a categorical column of a dataframe, df, and remove all periods from the category names.

A

df.column.apply(lambda x: x.lstrip().replace(".", ""))
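
A sketch of this cleanup on a toy column; the values are assumptions chosen to show a leading space and a trailing period. The vectorized .str version is shown next to the lambda only as an equivalent alternative.

```python
import pandas as pd

# Hypothetical raw category labels with a leading space and a trailing period
df = pd.DataFrame({"brand": [" US.", " Europe.", " Japan."]})

# Lambda version: strip left whitespace, then drop every period
cleaned = df["brand"].apply(lambda x: x.lstrip().replace(".", ""))

# Equivalent vectorized version; regex=False treats "." as a literal character
vectorized = df["brand"].str.lstrip().str.replace(".", "", regex=False)

print(cleaned.tolist())  # ['US', 'Europe', 'Japan']
```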

9
Q

How would you show how many cars were manufactured in each origin country (stored in the brand column) of the cars dataset, for every year?

A

pd.crosstab(cars.year, cars.brand)
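
A sketch of the resulting table on a few invented rows, assuming (as the answer does) that the brand column holds the origin. Rows are years, columns are origins, and each cell is a count.

```python
import pandas as pd

# Toy stand-in for the cars dataset (hypothetical rows)
cars = pd.DataFrame({
    "year": [1971, 1971, 1971, 1972, 1972],
    "brand": ["US", "US", "Japan", "Europe", "US"],
})

# One row per year, one column per brand, cells hold the number of cars
counts = pd.crosstab(cars["year"], cars["brand"])

print(counts)
```

Combinations that never occur, such as Japan in 1972 here, appear as 0 rather than being dropped.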

10
Q

What are two methods for returning rows and columns from a dataframe, df, with a specified condition?

A

df.loc[] (label-based; also accepts a boolean condition)

df.iloc[] (integer position-based)
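
The difference between the two indexers can be sketched on a toy dataframe (values and index labels are invented for illustration). Note that iloc takes a boolean array but not a boolean Series, so the mask is converted first.

```python
import pandas as pd

df = pd.DataFrame(
    {"mpg": [14.0, 31.9, 17.0, 25.4], "cylinders": [8, 4, 8, 6]},
    index=["a", "b", "c", "d"],
)

# loc: boolean condition for the rows, column labels for the columns
efficient = df.loc[df["mpg"] > 20, ["mpg"]]

# iloc: integer positions; first two rows, first column
first_two = df.iloc[:2, [0]]

# iloc with a condition: convert the boolean Series to a NumPy array
same_rows = df.iloc[(df["mpg"] > 20).to_numpy(), [0]]

print(efficient.index.tolist())  # ['b', 'd']
```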

11
Q

How do you return the number of NaN values in each column of a dataframe, df?

A

df.isna().sum()

12
Q

How do you fill all of the NaN values of the cylinders column with the rounded mean of that column?

A

cars["cylinders"] = cars["cylinders"].fillna(round(np.nanmean(cars["cylinders"])))
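
A sketch that counts the missing values before and after the fill, on an invented stand-in for the cars data. It assigns the result of fillna back to the column, since calling fillna with inplace=True on a selected column may not update the parent dataframe in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cars dataset with some missing values
cars = pd.DataFrame({
    "cylinders": [4.0, np.nan, 8.0, 6.0, np.nan],
    "mpg": [30.0, 22.0, np.nan, 19.0, 25.0],
})

missing_before = cars.isna().sum()  # NaN count per column

# nanmean ignores NaN: the mean of 4, 8 and 6 is 6
fill_value = round(np.nanmean(cars["cylinders"]))
cars["cylinders"] = cars["cylinders"].fillna(fill_value)

missing_after = cars.isna().sum()
print(missing_before["cylinders"], missing_after["cylinders"])  # 2 0
```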

13
Q

How do you calculate the mode for a categorical column, brand, excluding “Missing” category for the brand column?

Then how do you replace all of the “Missing” values with the computed mode value?

Then how do you remove the “Missing” category?

A

mode_brand = cars.brand[cars.brand != "Missing"].mode()

cars.loc[cars.brand == "Missing", "brand"] = mode_brand[0]
cars["brand"] = cars.brand.cat.remove_unused_categories()
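
The three steps can be sketched end to end on a small categorical column (the values are invented). The last step assigns the result back, because remove_unused_categories returns a new series rather than changing the column.

```python
import pandas as pd

# Toy categorical column that uses "Missing" as a placeholder category
cars = pd.DataFrame({
    "brand": pd.Categorical(["US", "Missing", "Japan", "US", "Missing", "Europe"])
})

# 1. Mode of the column, ignoring the placeholder
mode_brand = cars["brand"][cars["brand"] != "Missing"].mode()

# 2. Overwrite the placeholder rows with the mode
cars.loc[cars["brand"] == "Missing", "brand"] = mode_brand[0]

# 3. Drop the now unused "Missing" category
cars["brand"] = cars["brand"].cat.remove_unused_categories()

print(list(cars["brand"].cat.categories))  # ['Europe', 'Japan', 'US']
```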

14
Q

How do you calculate the min and max of a column of a pandas dataframe, df, using numpy?

A

np.min(df.column)

np.max(df.column)

15
Q

How do you perform min-max normalization?

A

X(min-max) = (X - X(min)) / (X(max) - X(min))
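
A worked sketch of the formula on four made-up values. The subtraction in the numerator and in the denominator must each be grouped before dividing; the smallest value maps to 0 and the largest to 1.

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 25.0, 50.0])

# (X - min) / (max - min), computed element-wise
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax.tolist())  # [0.0, 0.25, 0.375, 1.0]
```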

16
Q

What is one benefit and one drawback of min-max normalization?

A

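Benefit: every feature is rescaled to the range 0 to 1 (no negative values), so all features end up on exactly the same scale.

Drawback: it is sensitive to outliers, because a single extreme value sets the min or max and compresses the remaining values.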

17
Q

How do you import MinMaxScaler from sklearn and use it to transform all the columns of the cars dataset (besides the last categorical column)?

A

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
cars_minmax = scaler.fit_transform(cars.iloc[:, :-1])
cars_minmax = pd.DataFrame(cars_minmax)
18
Q

How is the Z-Score transformation function defined?

A

X(z-score) = (X - X(mean)) / X(std)

19
Q

How do you calculate the standard deviation of a dataframe, df, column using numpy?

A

np.nanstd(df.column)

20
Q

How do you apply z-score standardization to all columns of the cars dataframe except for the last one using sklearn.preprocessing?

A

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
cars_zscore = scaler.fit_transform(cars.iloc[:,:-1])
cars_zscore = pd.DataFrame(cars_zscore)
cars_zscore.columns = cars.iloc[:, :-1].columns
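
As a sanity check, the sketch below compares StandardScaler with the z-score formula computed by hand on a toy stand-in for the cars data (values invented). StandardScaler divides by the population standard deviation (ddof=0), which is also the default of np.nanstd.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in: two numeric columns followed by a categorical last column
cars = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0, 40.0],
    "weight": [2.0, 4.0, 4.0, 6.0],
    "brand": ["US", "Japan", "Europe", "US"],
})
numeric = cars.iloc[:, :-1]

scaler = StandardScaler()
cars_zscore = pd.DataFrame(scaler.fit_transform(numeric), columns=numeric.columns)

# Manual z-score: (X - mean) / std, with the population std of each column
manual = (numeric - numeric.mean()) / np.nanstd(numeric.to_numpy(), axis=0)

print(np.allclose(cars_zscore.to_numpy(), manual.to_numpy()))  # True
```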
21
Q

How do you achieve normality?

A

Achieve symmetry

22
Q

How do you achieve symmetry?

A

Reduce skewness

23
Q

How do you calculate the skewness of a distribution?

A

Using Pearson's second (median) skewness coefficient: 3 * (mean - median) / std

24
Q

How do you interpret skewness based on mean and median values?

A
mean > median | Right skew (positive skewness)
mean < median | Left skew (negative skewness)
mean = median | No skew (symmetric, e.g. a normal distribution)
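
The sign rule can be checked with a small helper that computes 3 * (mean - median) / std. The function name and the sample values are hypothetical, and pandas' default sample standard deviation (ddof=1) is assumed here; the choice of ddof changes the magnitude but not the sign.

```python
import pandas as pd

def pearson_median_skew(values):
    """Return 3 * (mean - median) / std for a sequence of numbers."""
    s = pd.Series(values, dtype="float64")
    return 3 * (s.mean() - s.median()) / s.std()

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]        # long tail to the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]  # long tail to the left
symmetric = [1, 2, 3, 4, 5]                      # mean equals median

print(pearson_median_skew(right_skewed) > 0)  # True
print(pearson_median_skew(left_skewed) < 0)   # True
print(pearson_median_skew(symmetric))         # 0.0
```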
25
Q

Does min-max normalization or z-score standardization affect skewness?

A

No. Both are linear rescalings, so the shape of the distribution does not change.
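
A short sketch that checks this numerically on made-up values, using pandas' Series.skew() as the skewness measure (an assumption; any skewness statistic that is invariant to shifting and positive scaling behaves the same way).

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0])

x_minmax = (x - x.min()) / (x.max() - x.min())
x_zscore = (x - x.mean()) / x.std(ddof=0)

# All three values agree up to floating-point rounding
skews = [x.skew(), x_minmax.skew(), x_zscore.skew()]
print(skews)
```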

26
Q

What are the three ways to reduce skewness?

A
  1. Log Transformation
  2. Square root transformation
  3. Inverted Square Root Transformation
27
Q

How do you apply a log transformation to a numerical column?

A

df.column.apply(np.log)

28
Q

How do you apply a square root transformation to a numerical column?

A

df.column.apply(np.sqrt)

29
Q

How do you apply an inverted square root transformation to a numerical column?

A

df.column.apply(lambda x: np.reciprocal(np.sqrt(x)))
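
To see the effect of the three transformations, the sketch below applies them to a synthetic right-skewed column and compares the skewness before and after. The column name, the lognormal data, and the use of Series.skew() are assumptions for illustration. All three transformations need strictly positive values: log and the inverted square root are undefined at 0.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive data (hypothetical "income" column)
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=10, sigma=1, size=5000)})

log_t = df["income"].apply(np.log)
sqrt_t = df["income"].apply(np.sqrt)
inv_sqrt_t = df["income"].apply(lambda x: np.reciprocal(np.sqrt(x)))

print("original:", df["income"].skew())
print("log:", log_t.skew())
print("sqrt:", sqrt_t.skew())
print("inverted sqrt:", inv_sqrt_t.skew())
```

On this data the log transformation should bring the skewness closest to zero, since the logarithm of lognormal data is normally distributed; on real data the best choice depends on the distribution.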

30
Q

How do you show a Q-Q plot for a data distribution using scipy.stats?

A

import scipy.stats as stats
import matplotlib.pyplot as plt

stats.probplot(distribution_column, dist="norm", plot=plt)
plt.show()
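
Without the plot argument, probplot returns the numbers behind the Q-Q plot, which gives a quick numeric check. The sketch below uses a synthetic normal sample (an assumption for illustration): the correlation r of the fitted line should be close to 1 when the data are close to normal.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample drawn from a normal distribution (mean 50, std 5)
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=500)

# Returns the plotted points and the least-squares line fitted through them
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print(slope, intercept, r)
```

For a normal sample, the slope approximates the standard deviation and the intercept approximates the mean.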