Data wrangling Flashcards
Replace missing values with naN
# replace "?" to NaN df.replace("?", np.nan, inplace = True)
.replace(A, B, inplace = True)
Identify missing data
The missing values are converted to default. We use the following functions to identify these missing values. There are two methods to detect missing data:
.isnull()
.notnull()
missing_data = df.isnull()
The output is a boolean value indicating whether the value that is passed into the argument is in fact missing data.
Count missing values in each column
for column in missing_data.columns.values.tolist():
print(column)
print (missing_data[column].value_counts())
print(“”)
See which value is most common
df[‘num-of-doors’].value_counts().idxmax()
Drop rows
# simply drop whole row with NaN in "price" column df.dropna(subset=["price"], axis=0, inplace=True)
axis = 0 drops the entire row axis = 1 drops the entire column
Reset index
# reset index, because we droped two rows df.reset_index(drop=True, inplace=True)
List data types for each column
df.dtypes
Convert data types to proper format
use .dtypes() to identify data types
use .astype() to convert data type
df[[“bore”, “stroke”]] = df[[“bore”, “stroke”]].astype(“float”)
df[[“normalized-losses”]] = df[[“normalized-losses”]].astype(“int”)
df[[“price”]] = df[[“price”]].astype(“float”)
df[[“peak-rpm”]] = df[[“peak-rpm”]].astype(“float”)
Rename a column
rename column name from “highway-mpg” to “highway-L/100km”
df.rename(columns={‘“highway-mpg”’:’highway-L/100km’}, inplace=True)
Design the bins
bins = np.linspace(min(df[“horsepower”]), max(df[“horsepower”]), 4)
bins
Apply the cut function to a certain column
df[‘horsepower-binned’] = pd.cut(df[‘horsepower’], bins, labels=group_names, include_lowest=True )
df[[‘horsepower’,’horsepower-binned’]].head(20)
Assign dummy variables to each column
dummy_variable_1 = pd.get_dummies(df[“fuel-type”])
dummy_variable_1.head()
Change column names
dummy_variable_1.rename(columns={‘gas’:’fuel-type-gas’, ‘diesel’:’fuel-type-diesel’}, inplace=True)
dummy_variable_1.head()
merge the data frams
# merge data frame "df" and "dummy_variable_1" df = pd.concat([df, dummy_variable_1], axis=1)
# drop original column "fuel-type" from "df" df.drop("fuel-type", axis = 1, inplace=True)
Methods of normalizing data
(1) simple feature scaling: Xnew = X old/Xmax (2) Min-Max: Xnew = (Xold - Xmin) / (Xmax - Xmin) (3) Z-score: Xnew = (Xold - mean)/SD