Cleaning Data Flashcards
Learn the basics of Data Analysis
5 steps of data analysis
S1. Ask questions
S2. Wrangle data - gather, assess, clean
S3. Perform EDA (exploratory data analysis)
S4. Draw conclusions
S5. Communicate your results
parts of .read_csv()
filepath = filepath of the csv
header = header of the CSV file
index_col = which columns should be index
sep = separator
ex.
labels = [‘id’,’name’,’scores’]
df = pd.read_csv(‘scores.csv’, sep=’;’, header=None, names=labels, index_col=’Name’)
.shape
returns the overall size/shape of the dataframe: (462, 32)
.dtypes
return each datatype of the columns
.info()
returns
column, non-null count, and Dtype
.nunique()
returns number of unique values for each column
.describe()
returns summary statistics: count, mean, std, min, max, Q1 and Q3
.head()
returns first few lines of Dataframe
df.head(5)
.tail()
returns last few lines of Dataframe
df.tail(2)
A. .iloc
B. .iloc[0]
C. .iloc[0:5]
D. .iloc[:,1]
E. .iloc[0:5,1:3]
F. radi_means = .iloc[0:3,2]
A. a method in pandas used for integer-location based on indexing
B. Selects the first row
C. Selects the first five rows
D. Selects the second column (index 1)
E. Selects the first five rows and the second and third columns
F. Outputs first three rows of column three.
A. .loc
B. .loc[‘row_index’]
C. .loc[[‘row_label’,’row_label2’]]
D. .loc[:, ‘column_label’]
E. .loc[‘row_label1’:’row_label3’,[‘col_lab1’,’column_lab2’]]
A. A method in pandas used for label-based indexing.
B. Selects the row with the specified label
C. Selects multiple rows
D. Selects all rows for the specified column
E. Selects a range of rows and specific columns
A. Get the diagnosis of the patient with id 842302
B. Get the radius_mean for patients with id 842302 and 842517
A. diagnosis_patient = df.loc[df[‘id’]==842302,’diagnosis’]
B. radius_means = df.loc[df[‘id’].isin([842302,842517]), ‘radius_mean’] #Output
Visualize how to create a subset of a dataframe
make a copy of the data frame with relevant columns
copy_df = df[[‘col_1’,’col_2’,’col_3’,col_4’]]
subset = df[df[‘col_1’] == ‘data’]
Visualize how to use subsets in dropping duplicates, filling NaNs, and .isin() method.
A. Dropping duplicates
unique_df = df.drop_duplicates(subset=[‘col1’, ‘col2’])
B. Fill missing values
df.fillna(value=0, subset=[‘col3’], inplace=True)
C. Using pd.DataFrame.isin()
filtered_df = df[df[‘col1’].isin([‘data1’,’data2’])]
Fill null values in your DataFrame with values
.fillna()
ex.
1. df[‘col1’] = df[‘col1’].fillna(pd.to_datetime(‘1900-01-01’)
2. mean_date = df[‘col1’].mean()
df[‘col1’] = df[‘col1’].fillna(mean_date)
A. Using .query() to discover null values in rows of a column
B. Using .query() for all rows with null values
A.
1. df_null = df[df[‘col1’].isnull()]
2. df_null = df.query(‘col1.isnull()’)
B.
df_null = df[df.isnull().any(axis=1)]
Clean data
A. .duplicated() - True or false for rows that are duplicates = df.duplicated() and then .drop_duplicates(): df = df.drop_duplicates()
B. Converting to right dtypes: .to_datetime(): df[‘time’] = pd.to_datetime(df[‘time’])
C. .isnull(): df[‘col’].isnull()
D. .fillna() : df[‘col’] = df[‘col’].fillna(mean)
E. Correcting column names
F. Merging dataframes
G. Saving the cleaned version of the dataframes.