manipulation Flashcards
Union two datasets
pd.concat([df1, df2], ignore_index = True)
ignore_index is set so that the index continues counting and doesn’t restart at 0 when new dataset starts
Union two datasets with different column names (and keep only the common columns)
pd.concat([df1, df2], join = ‘inner’)
union two datasets with different columns (and keep all columns)
pd.concat([df1, df2], sort = True)
Inner join
df = df1.merge(df2, on = “key”, how = “inner”)
Inner join is the default for .merge() so not necessary
Inner join and add to the variables the original table as source
df = df1.merge(df2, on = “key”, how = “inner”, suffixes = (“_t1”, “_t2”))
Left join with datasets where key is named differently
df = df1.merge(df2, left_on = “left_key”, right_on = “right_key”, how = “left”)
Left join three datasets
df = df1.merge(df2, on = “key”, how = “left”) \
.merge(df3, on = “key”, how = “left”)
Full join
df = df1.merge(df2, on = “key”, how = “outer”)
Manipulate a wide into a long dataset
Example scores per year:
var1 var2 2016 2017 2018
df = df.melt(id_vars = [‘var1’, ‘var2’], var_name = [‘years’], value_name = “score”]
Find complete duplicates, and drop complete duplicates
Find duplicates:
df[df.duplicated()]
Drop duplicates:
df.drop_duplicates(inplace = True)
Hardcode with upper limit
ex: check if any score above 10, if yes replace with 10
df.loc[df[“var”] > 10, “var”] = 10
First part of loc identifies the rows (where df[‘var’] > 10), second part the column ‘var’
Sort the dataframe based on a variable
df.sort_values(by = “id”)
Drop rows via filtering
Example: drop all rows where var < 0
df = df[df[“var] > 0]
Drop row via drop statement
Example: all cases where var < 0
df.drop(df[df[‘var’] < 0].index, inplace = True)
Change string into integer
df[‘var’] = df[‘var’].astype(‘int’)