Data Preprocessing/Cleaning Flashcards
How do you print the first 20 rows of a dataframe called df?
df.head(20)
What does the .info() method called on a pandas dataframe show?
The object type, the range of the row index, the column names with their non-null counts and data types, and the dataframe's memory usage.
How do you print the column names of a pandas dataframe, df?
df.columns
How do you remove spaces from the column names of a dataframe, df?
df.columns = df.columns.str.replace(" ", "")
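A minimal sketch of the column renaming, assuming a made-up frame with hypothetical column names. Note that str.replace returns a new Index, so the result has to be assigned back:

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces
df = pd.DataFrame({"time to 60": [12, 9], "weight lbs": [4209, 1925]})

# str.replace returns a new Index; assign it back to rename the columns
df.columns = df.columns.str.replace(" ", "")
print(list(df.columns))
```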
How do you return basic summary statistics for the numeric columns in a pandas dataframe, df, and what does this method show?
df.describe()
For each numeric column: count of non-null values, mean, standard deviation, min, 25% quantile, 50% quantile (median), 75% quantile, and max.
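As an illustration, here is describe() on a small made-up frame (the column names and values are assumptions). By default the non-numeric column is left out of the summary:

```python
import pandas as pd

# Made-up data: one numeric column and one text column
df = pd.DataFrame({"mpg": [14.0, 31.9, 17.0, 15.0],
                   "label": ["a", "b", "c", "d"]})

# By default describe() summarizes only the numeric columns
summary = df.describe()
print(summary)
```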
How do you print a column (Series) of a dataframe, df?
What does this command also return?
df['column name']
df.column_name
This also returns the number of rows (as Length) and the data type (dtype).
How do you change the data type of a pandas dataframe, df, column?
df['column'] = df['column'].astype('x'), where 'x' is the target data type (for example 'int64' or 'category')
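A short sketch of a type conversion, assuming a hypothetical cylinders column stored as strings. astype returns a new Series, so it is assigned back to the column:

```python
import pandas as pd

# Hypothetical column read in as strings
df = pd.DataFrame({"cylinders": ["8", "4", "6"]})

# astype returns a new Series; assign it back to change the column
df["cylinders"] = df["cylinders"].astype("int64")
print(df.dtypes)
```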
How do you use a lambda function to strip the leading whitespace and remove all periods from the category names in a column of a dataframe, df?
df.column.apply(lambda x: x.lstrip().replace(".", ""))
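For example, on a made-up brand column whose values carry a leading space and a trailing period (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up category names with a leading space and a trailing period
df = pd.DataFrame({"brand": [" US.", " Europe.", " Japan."]})

# lstrip() drops the leading whitespace, replace() removes every period
df["brand"] = df["brand"].apply(lambda x: x.lstrip().replace(".", ""))
print(df["brand"].tolist())
```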
How would you show how many cars were manufactured in each origin country (the brand column) for every year in the cars dataset?
pd.crosstab(cars.year, cars.brand)
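A small sketch of the resulting table, using a made-up stand-in for the cars data (the rows below are assumptions, not the real dataset). Each cell is the number of rows with that year and brand:

```python
import pandas as pd

# Made-up stand-in for the cars dataset
cars = pd.DataFrame({
    "year": [1972, 1972, 1980, 1980, 1980],
    "brand": ["US", "Europe", "US", "US", "Japan"],
})

# Rows are years, columns are brands, cells are counts (0 where no rows match)
counts = pd.crosstab(cars.year, cars.brand)
print(counts)
```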
What are two methods for returning rows and columns from a dataframe, df?
df.loc[] (selects by label; also accepts a boolean condition)
df.iloc[] (selects by integer position)
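To make the difference concrete, here is a sketch on a made-up frame with string row labels (all names and values are assumptions). loc takes labels or a boolean mask, while iloc takes integer positions:

```python
import pandas as pd

# Made-up frame with string row labels
df = pd.DataFrame({"mpg": [14.0, 31.9, 17.0], "cylinders": [8, 4, 8]},
                  index=["a", "b", "c"])

# loc: rows where the condition holds, and the column named "mpg"
eight_cyl = df.loc[df["cylinders"] == 8, ["mpg"]]

# iloc: the first two rows of the first column, by position
first_two = df.iloc[0:2, 0]

print(eight_cyl)
print(first_two)
```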
How do you count the NaN values in each column of a dataframe, df?
df.isna().sum()
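A quick sketch on made-up data: isna() marks each missing cell as True, and sum() then counts the True values per column. Summing a second time gives the total for the whole frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values
df = pd.DataFrame({"mpg": [14.0, np.nan, 17.0],
                   "cylinders": [8, np.nan, np.nan]})

missing_per_column = df.isna().sum()      # one count per column
total_missing = missing_per_column.sum()  # count for the whole frame
print(missing_per_column)
print(total_missing)
```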
How do you fill all of the NaN values of the cylinders column with the rounded mean of that column?
cars['cylinders'] = cars['cylinders'].fillna(round(np.nanmean(cars['cylinders'])))
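A minimal sketch on a made-up cylinders column. It assigns the filled Series back to the column rather than relying on inplace=True, which may not update the original dataframe in recent pandas versions. Series.mean() skips NaN by default, so it is used here in place of np.nanmean:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the cars dataset
cars = pd.DataFrame({"cylinders": [8.0, 4.0, np.nan, 6.0]})

# mean() ignores NaN by default, so this is the mean of 8, 4 and 6
fill_value = round(cars["cylinders"].mean())

# fillna returns a new Series; assign it back to the column
cars["cylinders"] = cars["cylinders"].fillna(fill_value)
print(cars)
```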
How do you calculate the mode of a categorical column, brand, excluding the "Missing" category?
Then how do you replace all of the "Missing" values with the computed mode value?
Then how do you remove the "Missing" category?
mode_brand = cars.brand[cars.brand != "Missing"].mode()
cars.loc[cars.brand == "Missing", "brand"] = mode_brand[0]
cars["brand"] = cars.brand.cat.remove_unused_categories()
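The three steps can be sketched end to end on a made-up categorical brand column (the values are assumptions). The sketch uses .loc for the assignment and assigns the result of remove_unused_categories() back, since that method returns a new Series:

```python
import pandas as pd

# Made-up categorical column with a "Missing" placeholder category
cars = pd.DataFrame({
    "brand": pd.Categorical(["US", "Missing", "Europe", "US", "Missing"])
})

# 1. Mode of the column, ignoring the placeholder
mode_brand = cars["brand"][cars["brand"] != "Missing"].mode()

# 2. Overwrite the placeholder rows with the mode
cars.loc[cars["brand"] == "Missing", "brand"] = mode_brand[0]

# 3. Drop the now-unused "Missing" category
cars["brand"] = cars["brand"].cat.remove_unused_categories()
print(cars["brand"].cat.categories)
```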
How do you calculate the min and max of a column of a pandas dataframe, df, using numpy?
np.min(df.column)
np.max(df.column)
How do you perform min-max normalization?
X(min-max) = (X - X(min)) / (X(max) - X(min))
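A minimal sketch of min-max normalization on a made-up weight column (the column names are assumptions). The parentheses matter: the minimum is subtracted first, and the result is divided by the range, so the values end up between 0 and 1:

```python
import pandas as pd

# Made-up column to rescale
df = pd.DataFrame({"weight": [2000.0, 3000.0, 4000.0]})

col = df["weight"]
# Subtract the minimum, then divide by the range (max - min)
df["weight_scaled"] = (col - col.min()) / (col.max() - col.min())
print(df)
```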