Data wrangling Flashcards

Question 1

Q

Replace missing-value placeholders (e.g. "?") with NaN

Answer

A

# replace "?" with NaN
df.replace("?", np.nan, inplace=True)

General form: df.replace(A, B, inplace=True) replaces every occurrence of A with B.
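
As a minimal sketch of the replacement step, assuming a small made-up DataFrame (the rows are hypothetical; only the column names come from the cards), the call could be tried like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing entry
df = pd.DataFrame({
    "price": ["13495", "?", "16500"],
    "num-of-doors": ["two", "four", "?"],
})

# Replace every "?" with NaN, modifying df in place
df.replace("?", np.nan, inplace=True)

print(df)
```

With inplace=True the method returns None, so the result should not be assigned back to df.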

Question 2

Q

Identify missing data

Answer

A

Once the missing values have been converted to NaN (the default missing-value marker in pandas), two methods detect them:

.isnull()
.notnull()

missing_data = df.isnull()

The output is a DataFrame of booleans with the same shape as df: True where a value is missing, False otherwise. .notnull() returns the opposite.
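
A short sketch of both methods on a hypothetical two-column DataFrame (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0],
    "num-of-doors": ["two", "four", np.nan],
})

missing_data = df.isnull()    # True where a value is missing
present_data = df.notnull()   # True where a value is present

print(missing_data)
```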

Question 3

Q

Count missing values in each column

Answer

A

for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
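
A more compact alternative to the loop, sketched on hypothetical data: because True counts as 1, summing the boolean mask gives the number of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "price": [13495.0, np.nan, 16500.0],
    "bore": [3.47, np.nan, np.nan],
})

# Number of missing values in each column
missing_counts = df.isnull().sum()

print(missing_counts)
```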

Question 4

Q

See which value is most common

Answer

A

df["num-of-doors"].value_counts().idxmax()
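
A small sketch of how this might be used, on invented data. value_counts() ignores NaN by default, and idxmax() returns the label with the highest count. Filling the gaps with that value is shown as one possible follow-up, not as something the card prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"num-of-doors": ["four", "two", "four", np.nan, "four"]})

# Label that occurs most often (NaN is excluded from the counts)
most_common = df["num-of-doors"].value_counts().idxmax()

# Possible follow-up: fill missing entries with the most common value
filled = df["num-of-doors"].fillna(most_common)

print(most_common)
```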

Question 5

Q

Drop rows

Answer

A

# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)

axis = 0 drops the entire row 
axis = 1 drops the entire column
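
A minimal sketch on hypothetical data, showing that only rows with NaN in the listed column are removed and that the surviving rows keep their original index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the middle row has no price
df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "price": [13950.0, np.nan, 6377.0],
})

# Drop rows (axis=0) whose "price" is NaN
df.dropna(subset=["price"], axis=0, inplace=True)

print(df)
```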

Question 6

Q

Reset index

Answer

A

# reset index, because we dropped two rows
df.reset_index(drop=True, inplace=True)
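
A sketch of why the reset is needed, using invented data: after rows are dropped the index has gaps, and drop=True discards the old index instead of keeping it as a new column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; dropping the NaN row leaves index labels 0 and 2
df = pd.DataFrame({"price": [13950.0, np.nan, 6377.0]}).dropna(subset=["price"])
index_before = list(df.index)

# Renumber the rows 0..n-1 and discard the old index
df.reset_index(drop=True, inplace=True)
index_after = list(df.index)

print(index_before, index_after)
```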

Question 7

Q

List data types for each column

Answer

A

df.dtypes

Question 8

Q

Convert data types to proper format

use the .dtypes attribute to identify data types
use .astype() to convert data type

Answer

A

df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")

df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")

df[["price"]] = df[["price"]].astype("float")

df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
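
A hedged sketch of the same conversion on invented data. It uses the dict form of astype(), which converts several columns in one call; this is an alternative to the per-column assignments above, not the form used on the card.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings (object dtype)
df = pd.DataFrame({
    "bore": ["3.47", "2.68"],
    "normalized-losses": ["164", "158"],
    "price": ["13950", "17450"],
})

# Map each column to its target type
df = df.astype({"bore": "float", "normalized-losses": "int", "price": "float"})

print(df.dtypes)
```

Note that conversion to "int" fails if the column still contains NaN, so missing values have to be handled first.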

Question 9

Q

Rename a column

Answer

A

# rename column "highway-mpg" to "highway-L/100km"
df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)
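
A minimal sketch on invented data. rename() only changes the column label; the values stay as they were, so any unit conversion has to be applied separately.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"highway-mpg": [27, 26]})

# Change the label only; the numbers in the column are untouched
df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)

print(df.columns.tolist())
```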

Question 10

Q

Design the bins

Answer

A

# 4 equally spaced edges from min to max define 3 bins
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins
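
A worked sketch with invented horsepower values, to show what the edges look like: np.linspace returns evenly spaced numbers including both endpoints, so 4 edges delimit 3 equal-width bins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"horsepower": [48, 100, 150, 200, 288]})

# 4 edges between the minimum (48) and maximum (288): width (288 - 48) / 3 = 80
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)

print(bins)
```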

Question 11

Q

Apply the cut function to a certain column

Answer

A

# group_names is a list of labels, one per bin (len(bins) - 1 entries)
df["horsepower-binned"] = pd.cut(df["horsepower"], bins, labels=group_names, include_lowest=True)
df[["horsepower", "horsepower-binned"]].head(20)
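
A self-contained sketch of the binning step on invented data. The label list group_names is not defined on the card, so the three names used here are an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"horsepower": [48, 100, 150, 200, 288]})

# 4 edges -> 3 bins: [48, 128], (128, 208], (208, 288]
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)

# Assumed labels, one per bin
group_names = ["Low", "Medium", "High"]

# include_lowest=True puts the minimum value (48) into the first bin
df["horsepower-binned"] = pd.cut(
    df["horsepower"], bins, labels=group_names, include_lowest=True
)

print(df)
```

Without include_lowest=True the smallest value would fall outside the first interval and be binned as NaN.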

Question 12

Q

Create dummy variables for a categorical column

Answer

A

dummy_variable_1 = pd.get_dummies(df["fuel-type"])

dummy_variable_1.head()
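
A minimal sketch on invented data. get_dummies() creates one indicator column per distinct category, with the columns in sorted order; depending on the pandas version the indicators are booleans or 0/1 integers.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One indicator column per category
dummy_variable_1 = pd.get_dummies(df["fuel-type"])

print(dummy_variable_1)
```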

Question 13

Q

Change column names

Answer

A

dummy_variable_1.rename(columns={"gas": "fuel-type-gas", "diesel": "fuel-type-diesel"}, inplace=True)
dummy_variable_1.head()

Question 14

Q

Merge the data frames

Answer

A

# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
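
An end-to-end sketch of the dummy-variable workflow on invented data: create the indicators, rename them, attach them to the original frame column-wise, then drop the source column.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "fuel-type": ["gas", "diesel", "gas"],
})

dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.rename(
    columns={"gas": "fuel-type-gas", "diesel": "fuel-type-diesel"}, inplace=True
)

# axis=1 places the frames side by side, aligned on the row index
df = pd.concat([df, dummy_variable_1], axis=1)

# The original categorical column is now redundant
df.drop("fuel-type", axis=1, inplace=True)

print(df.columns.tolist())
```

Because concat aligns on the index, both frames need matching row labels; resetting the index after dropping rows avoids misaligned results.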

Question 15

Q

Methods of normalizing data

Answer

A

(1) Simple feature scaling:
Xnew = Xold / Xmax
(2) Min-max:
Xnew = (Xold - Xmin) / (Xmax - Xmin)
(3) Z-score:
Xnew = (Xold - mean) / SD
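
The three formulas above can be sketched side by side on a tiny invented series, which makes the resulting ranges easy to compare. The values are hypothetical. Note that pandas' .std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data: mean 175, sample SD 25
length = pd.Series([150.0, 175.0, 200.0])

# (1) Simple feature scaling: divide by the maximum
simple = length / length.max()

# (2) Min-max: map the smallest value to 0 and the largest to 1
min_max = (length - length.min()) / (length.max() - length.min())

# (3) Z-score: centre on the mean, scale by the standard deviation
z_score = (length - length.mean()) / length.std()

print(simple.tolist())    # [0.75, 0.875, 1.0]
print(min_max.tolist())   # [0.0, 0.5, 1.0]
print(z_score.tolist())   # [-1.0, 0.0, 1.0]
```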

Question 16

Q

Simple feature scaling code

Answer

A

df["length"] = df["length"] / df["length"].max()

Question 17

Q

Min-max in Python

Answer

A

df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

Question 18

Q

Z score

Answer

A

df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()

(18 cards)