Data wrangling Flashcards

1
Q

Replace missing values with NaN

A
# replace "?" with NaN
df.replace("?", np.nan, inplace=True)

General form: df.replace(A, B, inplace=True) replaces every occurrence of A with B and modifies df in place.
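A minimal sketch of the replacement on toy data. The column values below are made up for illustration; only the "?" placeholder and the replace call come from the card.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "?" stands in for a missing entry.
df = pd.DataFrame({"price": ["13495", "?", "16500"],
                   "num-of-doors": ["two", "four", "?"]})

# Replace every "?" with NaN, modifying df in place.
df.replace("?", np.nan, inplace=True)

# Count the NaN values per column to confirm the replacement.
print(df.isnull().sum())
```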

2
Q

Identify missing data

A

Once missing values have been converted to NaN, pandas can detect them. There are two methods to detect missing data:

.isnull()
.notnull()

missing_data = df.isnull()

The output is a DataFrame of booleans with the same shape as df. With .isnull(), True marks a missing value; .notnull() returns the opposite.
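A small sketch of both detection methods on a made-up two-column frame (the values are illustrative, not from the original data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one NaN in each column.
df = pd.DataFrame({"bore": [3.47, np.nan], "stroke": [np.nan, 3.40]})

missing_data = df.isnull()    # True where a value is missing
present_data = df.notnull()   # the opposite: True where a value is present

print(missing_data)
print(present_data)
```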

3
Q

Count missing values in each column

A

for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
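If only the number of missing values per column is needed, summing the boolean mask is a shorter alternative to the loop. This is a swapped-in technique, not the card's method, and the data below is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN in "price", two in "bore".
df = pd.DataFrame({"price": [13495.0, np.nan, 16500.0],
                   "bore": [np.nan, np.nan, 3.19]})

# True counts as 1 when summed, so this gives the NaN count per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```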

4
Q

See which value is most common

A

df["num-of-doors"].value_counts().idxmax()
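A short sketch of how the two calls chain together, using a made-up Series: value_counts() tallies each distinct value, and idxmax() returns the label with the highest tally.

```python
import pandas as pd

# Hypothetical toy column.
doors = pd.Series(["four", "two", "four", "four", "two"], name="num-of-doors")

counts = doors.value_counts()   # occurrences of each distinct value
most_common = counts.idxmax()   # label with the largest count

print(counts)
print(most_common)
```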

5
Q

Drop rows

A
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)
axis=0 drops each row that has NaN in the checked columns
axis=1 drops each column that contains NaN
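To see the difference between the two axis values, here is a sketch on a made-up frame. It returns new frames instead of using inplace=True so both results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
df = pd.DataFrame({"price": [13495.0, np.nan, 16500.0],
                   "make": ["audi", "bmw", "dodge"]})

# axis=0: remove the row whose "price" is NaN.
rows_dropped = df.dropna(subset=["price"], axis=0)

# axis=1: remove every column that contains at least one NaN.
cols_dropped = df.dropna(axis=1)

print(rows_dropped)
print(cols_dropped)
```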
6
Q

Reset index

A
# reset index, because we dropped two rows
df.reset_index(drop=True, inplace=True)
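A sketch of why the reset is needed, on made-up data: dropping rows leaves gaps in the index, and reset_index(drop=True) renumbers from 0 without keeping the old index as a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing prices.
df = pd.DataFrame({"price": [13495.0, np.nan, 16500.0, np.nan]})

df.dropna(subset=["price"], axis=0, inplace=True)
index_before = list(df.index)   # gaps remain where rows were dropped

df.reset_index(drop=True, inplace=True)
index_after = list(df.index)    # consecutive again, starting at 0

print(index_before, index_after)
```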
7
Q

List data types for each column

A

df.dtypes

8
Q

Convert data types to proper format

Use .dtypes to identify data types (it is an attribute, not a method)
Use .astype() to convert a data type

A

df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")

df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")

df[["price"]] = df[["price"]].astype("float")

df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
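A sketch of checking types before and after conversion. The values are made up, and the single-column form df["col"] is used here instead of the card's double-bracket form; both select the same data. Note that astype("int") raises an error if the column still contains NaN, so missing values need handling first.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings.
df = pd.DataFrame({"bore": ["3.47", "2.68"],
                   "normalized-losses": ["164", "158"]})

print(df.dtypes)  # both columns hold strings at this point

df["bore"] = df["bore"].astype("float")
df["normalized-losses"] = df["normalized-losses"].astype("int")

print(df.dtypes)  # now a float column and an integer column
```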

9
Q

Rename a column

A

Rename the column "highway-mpg" to "highway-L/100km":

df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)
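Renaming changes only the label, not the values. The sketch below assumes the values are converted first, using roughly 235.215 / mpg for US gallons; that conversion step is an assumption and is not stated on the card. The data is made up.

```python
import pandas as pd

# Hypothetical toy data in miles per gallon.
df = pd.DataFrame({"highway-mpg": [27, 30]})

# Assumed conversion step (US gallons): L/100km is about 235.215 / mpg.
df["highway-mpg"] = 235.215 / df["highway-mpg"]

# Rename so the label matches the converted unit.
df.rename(columns={"highway-mpg": "highway-L/100km"}, inplace=True)

print(df)
```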

10
Q

Design the bins

A

bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins
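A sketch showing why the last argument is 4: np.linspace returns bin edges, and 4 equally spaced edges delimit 3 bins. The horsepower values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"horsepower": [48, 100, 150, 200, 288]})

# 4 equally spaced edges from the minimum to the maximum give 3 bins.
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)

print(bins)           # the 4 edges
print(len(bins) - 1)  # number of bins
```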

11
Q

Apply the cut function to a certain column

A

df["horsepower-binned"] = pd.cut(df["horsepower"], bins, labels=group_names, include_lowest=True)
df[["horsepower", "horsepower-binned"]].head(20)
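The card uses group_names without defining it. A self-contained sketch, assuming three hypothetical labels (one per bin) and made-up horsepower values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"horsepower": [48, 100, 150, 200, 288]})

bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
group_names = ["Low", "Medium", "High"]  # assumed labels, one per bin

# include_lowest=True keeps the minimum value inside the first bin.
df["horsepower-binned"] = pd.cut(df["horsepower"], bins,
                                 labels=group_names, include_lowest=True)

print(df)
```

The number of labels must equal the number of bins, which is one less than the number of edges.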

12
Q

Assign dummy variables to each column

A

dummy_variable_1 = pd.get_dummies(df["fuel-type"])

dummy_variable_1.head()
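A sketch of what get_dummies produces on a made-up column: one indicator column per distinct category, with a true/1 entry marking the rows that belong to it.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"fuel-type": ["gas", "diesel", "gas"]})

# One indicator column per category, named after the category.
dummy_variable_1 = pd.get_dummies(df["fuel-type"])

print(dummy_variable_1)
```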

13
Q

Change column names

A

dummy_variable_1.rename(columns={"gas": "fuel-type-gas", "diesel": "fuel-type-diesel"}, inplace=True)
dummy_variable_1.head()

14
Q

Merge the data frames

A
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)
# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
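The dummy-variable steps can be run end to end as a sketch. The data is made up, and reassignment is used here in place of inplace=True; the result is the same.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"make": ["audi", "vw", "bmw"],
                   "fuel-type": ["gas", "diesel", "gas"]})

# 1. Build indicator columns and give them descriptive names.
dummies = pd.get_dummies(df["fuel-type"])
dummies = dummies.rename(columns={"gas": "fuel-type-gas",
                                  "diesel": "fuel-type-diesel"})

# 2. Attach them column-wise (axis=1); rows are aligned on the index.
df = pd.concat([df, dummies], axis=1)

# 3. Drop the original categorical column.
df = df.drop("fuel-type", axis=1)

print(df)
```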
15
Q

Methods of normalizing data

A
(1) simple feature scaling:
Xnew = Xold / Xmax
(2) Min-Max:
Xnew = (Xold - Xmin) / (Xmax - Xmin) 
(3) Z-score:
Xnew = (Xold - mean)/SD
16
Q

Simple feature scaling code

A

df["length"] = df["length"] / df["length"].max()

17
Q

Min-max in Python

A

df["length"] = (df["length"] - df["length"].min()) / (df["length"].max() - df["length"].min())

18
Q

Z score

A

df["length"] = (df["length"] - df["length"].mean()) / df["length"].std()
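To compare the three normalization methods side by side, here is a sketch on a made-up Series. Note that pandas .std() computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical toy data.
length = pd.Series([150.0, 170.0, 200.0], name="length")

# (1) Simple feature scaling: divide by the maximum.
simple = length / length.max()

# (2) Min-max: rescale to the range [0, 1].
min_max = (length - length.min()) / (length.max() - length.min())

# (3) Z-score: subtract the mean, divide by the standard deviation.
z_score = (length - length.mean()) / length.std()

print(pd.DataFrame({"simple": simple, "min_max": min_max, "z_score": z_score}))
```

Simple scaling and min-max both end at 1 for the largest value, but only min-max maps the smallest value to 0. Z-scores are centred on 0 and can be negative.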