VA Session 7 - Flashcards
Subsetting & Slicing methods with pandas
- loc
- iloc
df.loc[]
Access rows or columns by labels or Boolean array; slices: start & stop included
df.loc[]:
- single element
- row with label "A" as Series
- row with label "A" as DataFrame
- rows from label "A" to label "C" (both included)
- all rows & named column
- df.loc["row_name", "column_name"]
- df.loc["A"]
- df.loc[["A"]]
- df.loc["A":"C"]
- df.loc[:, "Nr-in_Alphabet"]
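A minimal sketch of the df.loc[] calls listed above. The DataFrame, its row labels and the "Is_Vowel" column are hypothetical and only serve as illustration.

```python
import pandas as pd

# Hypothetical example data: letters as row labels.
df = pd.DataFrame(
    {"Nr-in_Alphabet": [1, 2, 3, 4], "Is_Vowel": [True, False, False, False]},
    index=["A", "B", "C", "D"],
)

element = df.loc["A", "Nr-in_Alphabet"]  # single element
row_series = df.loc["A"]                 # row with label "A" as Series
row_frame = df.loc[["A"]]                # row with label "A" as DataFrame
rows_a_to_c = df.loc["A":"C"]            # label slice: "A" and "C" both included
column = df.loc[:, "Nr-in_Alphabet"]     # all rows, one named column
vowels = df.loc[df["Is_Vowel"]]          # selection with a Boolean array
```

Note that the label slice returns three rows ("A", "B", "C"), because the stop label is included.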
df.iloc[]
- Access rows or columns by their integer position or with a Boolean array (positions go from 0 to length - 1)
- slice: end number not included
- usually somewhat faster than loc for selecting rows (no label lookup needed)
df.iloc[]:
- single element
- row at position 1 as Series
- row at position 1 as DataFrame
- rows from position 0 to 4
- all rows & first two columns
- df.iloc[row_index, column_index]
- df.iloc[1]
- df.iloc[[1]]
- df.iloc[0:5]
- df.iloc[:,0:2]
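A minimal sketch of the df.iloc[] calls listed above, assuming a hypothetical DataFrame with a default integer index. The last line contrasts the slice behaviour with df.loc[].

```python
import pandas as pd

# Hypothetical example data with the default index 0..5.
df = pd.DataFrame(
    {
        "x": [10, 20, 30, 40, 50, 60],
        "y": list("abcdef"),
        "z": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    }
)

element = df.iloc[1, 0]           # row position 1, column position 0
row_series = df.iloc[1]           # row at position 1 as Series
row_frame = df.iloc[[1]]          # row at position 1 as DataFrame
first_five = df.iloc[0:5]         # positions 0 to 4, position 5 not included
first_two_cols = df.iloc[:, 0:2]  # all rows, first two columns

# With integer labels, the loc slice includes the stop label.
loc_rows = df.loc[0:5]
```

Here df.iloc[0:5] returns five rows, while df.loc[0:5] returns six.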
Subsetting rows with index 0 to 4
df[0:5]
Subset columns
- df["column_name"]
- df[["col_name1", "col_name2"]]
Subset rows & columns
df[0:5][["col_name1", "col_name2"]]
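A short sketch of subsetting with plain brackets, assuming a hypothetical DataFrame with a default integer index. The df.loc[] line is added only for comparison and is not part of the cards above.

```python
import pandas as pd

# Hypothetical example data with the default index 0..7.
df = pd.DataFrame(
    {
        "col_name1": range(8),
        "col_name2": list("abcdefgh"),
        "col_name3": [0.5] * 8,
    }
)

rows = df[0:5]                                 # rows at positions 0 to 4
one_col = df["col_name1"]                      # single column as Series
two_cols = df[["col_name1", "col_name2"]]      # several columns as DataFrame
both = df[0:5][["col_name1", "col_name2"]]     # rows first, then columns

# Same selection in one step; with loc the stop label 4 is included.
same = df.loc[0:4, ["col_name1", "col_name2"]]
```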
Subsetting vs referencing of datasets
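- Copying: df.copy() -> creates a new, independent copy of the DataFrame
- Referencing: assigning a DataFrame to a new name (e.g. df2 = df) does not copy it -> both names reference the same data: when changing the original, the "copy" changes too. A stored subset may also still share data with the original, depending on the operation and the pandas version; use df.copy() to be safe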
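A minimal sketch of the difference, using a hypothetical one-column DataFrame. It only shows the two unambiguous cases (plain assignment vs. df.copy()); whether a stored subset shares data with the original depends on the operation and the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df               # referencing: second name for the same object
independent = df.copy()  # copying: new DataFrame with its own data

df.loc[0, "a"] = 100     # change the original

changed_too = alias.loc[0, "a"]       # follows the original
unchanged = independent.loc[0, "a"]   # keeps the old value
```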
Filtering data: Subset a DataFrame's rows or columns according to specified row or column labels
df.filter(like="culmen", axis=1)
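A brief sketch of df.filter(), assuming a hypothetical penguin-style table; the column names and values are made up for illustration. The items and regex parameters are shown in addition to the like parameter from the card.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "culmen_length_mm": [39.1, 46.1],
        "culmen_depth_mm": [18.7, 13.2],
        "body_mass_g": [3750, 4500],
    }
)

culmen_cols = df.filter(like="culmen", axis=1)          # labels containing "culmen"
by_items = df.filter(items=["species", "body_mass_g"])  # exact column labels
by_regex = df.filter(regex="_mm$", axis=1)              # labels ending in "_mm"
```

df.filter() only looks at labels; it does not filter on the values inside the DataFrame.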
Cleaning data - Checking & removing duplicates
- Duplicates: already exist in stored data or created when merging datasets
- df.duplicated(): Check duplicates row-wise
- df.nunique(): Count unique values per column (fewer unique values than rows -> column contains repeated values)
- df.drop_duplicates(): Remove duplicate rows
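A minimal sketch of the three calls above on a hypothetical table in which row 2 repeats row 0.

```python
import pandas as pd

# Hypothetical example data: row 2 is an exact duplicate of row 0.
df = pd.DataFrame(
    {
        "city": ["Copenhagen", "Aarhus", "Copenhagen", "Odense"],
        "year": [2020, 2020, 2020, 2021],
    }
)

dup_mask = df.duplicated()      # True for rows that repeat an earlier row
n_dups = int(dup_mask.sum())    # number of duplicate rows
unique_counts = df.nunique()    # number of distinct values per column
deduped = df.drop_duplicates()  # keeps the first occurrence of each row
```

drop_duplicates() returns a new DataFrame and keeps the original index labels, so the result here has the index 0, 1, 3.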
Cleaning data - Remapping values (e.g. because the data is faulty or the analysis requires a transformation)
df.replace()
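A short sketch of df.replace(), assuming hypothetical data in which "." and -999 stand for faulty entries. The replacement values are illustrative choices, not a recommendation.

```python
import pandas as pd

# Hypothetical example data with placeholder values "." and -999.
df = pd.DataFrame(
    {"sex": ["MALE", "FEMALE", ".", "MALE"], "score": [1, -999, 3, 4]}
)

# Remap values in one column with a nested dict: {column: {old: new}}.
remapped = df.replace(
    {"sex": {"MALE": "male", "FEMALE": "female", ".": "unknown"}}
)

# Replace a single value wherever it occurs in the DataFrame.
whole = df.replace(-999, 0)
```

Both calls return a new DataFrame; df itself is left unchanged.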
Cleaning data - Dealing with text
- Common issue in text data quality -> more categories than actually exist
- e.g. "Copenhagen", "COPENHAGEN", "Copenhagen "
- df["column"].str.strip(): Remove leading & trailing spaces
- df["column"].str.lower()
- df["column"].str.upper(): Change capitalization
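A minimal sketch of how stripping spaces and unifying capitalization collapses spelling variants into one category. The "city" column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical example data: four spellings of the same city.
df = pd.DataFrame(
    {"city": ["Copenhagen", "COPENHAGEN", "Copenhagen ", " copenhagen"]}
)

before = df["city"].nunique()  # every spelling counts as its own category

# Remove surrounding spaces, then unify capitalization.
df["city"] = df["city"].str.strip().str.upper()

after = df["city"].nunique()   # only one category is left
```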
Reshaping data - Unpivot DataFrame from wide to long format
df.melt()
pd.wide_to_long()
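A brief sketch of both functions on a hypothetical wide table with one score column per year. The column names are assumptions chosen so that wide_to_long can split them into the stub "score" and a numeric suffix.

```python
import pandas as pd

# Hypothetical wide data: one row per id, one column per year.
wide = pd.DataFrame(
    {"id": [1, 2], "score2020": [10, 20], "score2021": [11, 21]}
)

# melt: every non-id column becomes a (variable, value) pair.
long_melt = wide.melt(id_vars="id", var_name="variable", value_name="value")

# wide_to_long: splits "score2020" into stub "score" and suffix 2020.
long_w2l = pd.wide_to_long(
    wide, stubnames="score", i="id", j="year"
).reset_index()
```

melt keeps the old column names as text in the "variable" column, while wide_to_long separates the year from the stub name into its own column.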
Reshaping data - Change from long to wide format
df.pivot()
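A minimal sketch of pivot(), assuming a hypothetical long table with one row per id and year.

```python
import pandas as pd

# Hypothetical long data: one row per (id, year) combination.
long_df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "year": [2020, 2021, 2020, 2021],
        "score": [10, 11, 20, 21],
    }
)

# One row per id, one column per year, cells filled with score.
wide = long_df.pivot(index="id", columns="year", values="score")

score_1_2021 = wide.loc[1, 2021]
```

pivot() raises an error if an (index, columns) combination occurs more than once; in that case pivot_table() with an aggregation function is a possible alternative.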