manipulation Flashcards by Szn B

Union two datasets

pd.concat([df1, df2], ignore_index = True)

ignore_index = True discards the original indexes and renumbers the result from 0, so the index keeps counting instead of restarting at 0 where the second dataset begins.
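A minimal sketch of the difference, assuming two small made-up DataFrames (the column names and values are only for illustration):

```python
import pandas as pd

# Hypothetical example data
df1 = pd.DataFrame({"id": [1, 2], "score": [7, 8]})
df2 = pd.DataFrame({"id": [3, 4], "score": [5, 9]})

# Without ignore_index, each frame keeps its own index labels
kept = pd.concat([df1, df2])

# With ignore_index=True, the result is renumbered from 0
renumbered = pd.concat([df1, df2], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```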

Union two datasets with different column names (and keep only the common columns)

pd.concat([df1, df2], join = 'inner')

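Union two datasets with different columns (and keep all columns)

pd.concat([df1, df2], join = 'outer', sort = True)

join = 'outer' is the default and is what keeps all columns; sort = True only sorts the resulting columns.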
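To compare keeping only the common columns with keeping all of them, here is a small sketch. The DataFrames and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical example data: only "id" is shared between the two frames
df1 = pd.DataFrame({"id": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"id": [3], "b": [30]})

# join="inner" keeps only the columns present in both frames
common = pd.concat([df1, df2], join="inner", ignore_index=True)

# join="outer" (the default) keeps every column and fills gaps with NaN;
# sort=True orders the columns
everything = pd.concat([df1, df2], join="outer", sort=True, ignore_index=True)

print(list(common.columns))
print(list(everything.columns))
```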

Inner join

df = df1.merge(df2, on = "key", how = "inner")

Inner join is the default for .merge(), so how = "inner" can be omitted.

Inner join and add a suffix to overlapping variables showing which table they came from

df = df1.merge(df2, on = "key", how = "inner", suffixes = ("_t1", "_t2"))
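A short sketch of what the suffixes do, assuming two made-up tables that share a non-key column called score. Suffixes are applied only to column names that appear in both tables; the key itself is left unchanged.

```python
import pandas as pd

# Hypothetical example data: "score" exists in both tables
df1 = pd.DataFrame({"key": [1, 2, 3], "score": [7, 8, 9]})
df2 = pd.DataFrame({"key": [2, 3, 4], "score": [5, 6, 4]})

merged = df1.merge(df2, on="key", how="inner", suffixes=("_t1", "_t2"))

print(list(merged.columns))
print(list(merged["key"]))
```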

Left join with datasets where key is named differently

df = df1.merge(df2, left_on = "left_key", right_on = "right_key", how = "left")
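As an illustration, the sketch below uses two invented tables (customers and orders, both hypothetical) whose key columns have different names. Note that both key columns are kept in the result.

```python
import pandas as pd

# Hypothetical example data
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"cust": [1, 1, 3], "amount": [10, 20, 30]})

# Left join: every customer is kept, even those without orders
joined = customers.merge(orders, left_on="customer_id", right_on="cust", how="left")

print(joined)
```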

Left join three datasets

df = df1.merge(df2, on = "key", how = "left") \
        .merge(df3, on = "key", how = "left")

Full join

df = df1.merge(df2, on = "key", how = "outer")

Manipulate a wide into a long dataset

Example: scores per year, with columns
var1 var2 2016 2017 2018

df = df.melt(id_vars = ['var1', 'var2'], var_name = 'years', value_name = 'score')
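A runnable sketch of the wide-to-long reshape, assuming a small made-up table with the columns from the example:

```python
import pandas as pd

# Hypothetical wide data: one column per year
wide = pd.DataFrame({
    "var1": ["a", "b"],
    "var2": ["x", "y"],
    "2016": [1, 2],
    "2017": [3, 4],
    "2018": [5, 6],
})

# Each year column becomes rows: the column name goes to "years",
# the cell value goes to "score"
long = wide.melt(id_vars=["var1", "var2"], var_name="years", value_name="score")

print(long)
```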

Find complete duplicates, and drop complete duplicates

Find duplicates:
df[df.duplicated()]

Drop duplicates:
df.drop_duplicates(inplace = True)
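A small sketch with invented data. By default, duplicated() flags only the second and later occurrences of a row, so the first copy is the one that is kept.

```python
import pandas as pd

# Hypothetical example data: the row (2, 6) appears twice
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [5, 6, 6, 7]})

# Rows that repeat an earlier row in every column
dupes = df[df.duplicated()]

# Same data with the repeated rows removed (returns a new DataFrame)
deduped = df.drop_duplicates()

print(dupes)
print(deduped)
```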

Hardcode an upper limit

Example: replace any score above 10 with 10

df.loc[df["var"] > 10, "var"] = 10

The first argument to .loc selects the rows (where df["var"] > 10); the second selects the column "var".
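The sketch below applies the .loc assignment to made-up values. As an alternative not mentioned on the card, Series.clip(upper=...) gives the same result without modifying the data in place.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"var": [3, 12, 10, 25]})

# Alternative: clip returns a capped copy and leaves df untouched
clipped = df["var"].clip(upper=10)

# Card's approach: overwrite the offending cells in place
df.loc[df["var"] > 10, "var"] = 10

print(list(df["var"]))
```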

Sort the dataframe based on a variable

df.sort_values(by = "id")

Drop rows via filtering

Example: drop all rows where var < 0

df = df[df["var"] >= 0]

Drop row via drop statement

Example: all cases where var < 0

df.drop(df[df['var'] < 0].index, inplace = True)
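Filtering and dropping are two routes to the same result. A minimal sketch with invented values, where keeping var >= 0 is the complement of dropping var < 0:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"var": [-2, 0, 5, -1, 3]})

# Route 1: keep the rows that pass the filter
filtered = df[df["var"] >= 0]

# Route 2: look up the index labels of the unwanted rows and drop them
dropped = df.drop(df[df["var"] < 0].index)

print(filtered.equals(dropped))
```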

Change string into integer

df['var'] = df['var'].astype('int')

Change integer into categorical

df['var'] = df['var'].astype('category')

Change string into date

df['date'] = pd.to_datetime(df['date']).dt.date
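A short sketch of the conversion, assuming ISO-formatted date strings (made up for illustration). pd.to_datetime alone yields pandas timestamps; adding .dt.date turns each value into a plain Python datetime.date.

```python
import datetime

import pandas as pd

# Hypothetical example data: dates stored as strings
df = pd.DataFrame({"date": ["2021-01-15", "2022-06-30"]})

# Timestamps (a datetime64 column)
as_timestamp = pd.to_datetime(df["date"])

# Plain datetime.date objects, as on the card
as_date = pd.to_datetime(df["date"]).dt.date

print(as_timestamp.dtype)
print(type(as_date.iloc[0]))
```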

Remove $ sign from a string variable

df["var"] = df["var"].str.strip('$')

How to continue code on the next line

End the line with a backslash (\), as in the three-table left join, or wrap the whole expression in parentheses.

Find all cases with a date in the future

import datetime as dt

df[df['date'] > dt.date.today()]
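A minimal sketch of the future-date filter. It assumes the date column holds plain datetime.date objects (for example, as produced by .dt.date); the rows are generated relative to today so the example does not depend on when it is run.

```python
import datetime as dt

import pandas as pd

today = dt.date.today()

# Hypothetical example data: yesterday, today and tomorrow
df = pd.DataFrame({
    "date": [today - dt.timedelta(days=1), today, today + dt.timedelta(days=1)]
})

# Only rows strictly after today count as "in the future"
future = df[df["date"] > today]

print(future)
```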

Today’s date

import datetime as dt

today_date = dt.date.today()

Validate the type of join to avoid unexpected additional rows

.merge(validate = 'one_to_one')

Other options:
'one_to_many'
'many_to_many'
'many_to_one'
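To show what validate does when the relationship is not what you expected, here is a sketch with invented tables. The key is unique in the left table but repeated in the right one, so one_to_many passes while one_to_one raises pandas.errors.MergeError.

```python
import pandas as pd

# Hypothetical example data: key 1 appears twice in the right table
left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [1, 1, 2], "b": [10, 11, 20]})

# Passes: keys are unique on the left side
ok = left.merge(right, on="key", validate="one_to_many")

# Fails: keys are not unique on the right side
try:
    left.merge(right, on="key", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(ok), raised)
```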

Joins: add an indicator showing which table(s) each case was present in

.merge(indicator = True)
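With indicator = True, the result gains a _merge column whose values are left_only, right_only or both. A small sketch with made-up tables:

```python
import pandas as pd

# Hypothetical example data: only key 2 is in both tables
left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [2, 3], "b": [20, 30]})

merged = left.merge(right, on="key", how="outer", indicator=True)

# Map each key to its source for easy inspection
source = dict(zip(merged["key"], merged["_merge"].astype(str)))
print(source)
```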

Check if a value is in a list of IDs from another dataset

df['id'].isin(df2['id'])
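isin returns a boolean Series, which can be used directly as a row filter (or negated with ~). A sketch with invented IDs:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [2, 4, 6]})

mask = df["id"].isin(df2["id"])

matched = df[mask]      # rows whose id also occurs in df2
unmatched = df[~mask]   # rows whose id does not occur in df2

print(list(mask))
```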

Method to join time series

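pd.merge_ordered()
Main difference from .merge(): the default is an outer join, and it is designed for ordered data such as time series. It can also fill gaps via fill_method; for joining on the nearest match, use pd.merge_asof().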

Join time series data and forward-fill missing values

pd.merge_ordered(df1, df2, on = "key", fill_method = "ffill")
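A sketch of the forward fill, assuming two made-up daily series where one of them has no observation on the second day. After the outer join, the gap is filled with the last known value.

```python
import pandas as pd

# Hypothetical example data: prices has no row for 2020-01-02
prices = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-03"]),
    "price": [100, 102],
})
rates = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "rate": [1.0, 1.1, 1.2],
})

merged = pd.merge_ordered(prices, rates, on="date", fill_method="ffill")

print(merged)
```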

Join time series on nearest matching date

pd.merge_asof(df1, df2, on = "key")

The direction can be specified ('backward' is the default; 'forward' and 'nearest' are also available). Both tables must be sorted on the key beforehand.
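The sketch below contrasts the default backward match with direction = "nearest". The trades and quotes tables are invented, and both are already sorted on the key.

```python
import pandas as pd

# Hypothetical example data, sorted on "time"
trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-01-01 10:00:04", "2020-01-01 10:00:13"]),
    "qty": [1, 2],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2020-01-01 10:00:00",
        "2020-01-01 10:00:10",
        "2020-01-01 10:00:14",
    ]),
    "price": [100.0, 101.0, 102.0],
})

# Default: match each trade to the last quote at or before its time
backward = pd.merge_asof(trades, quotes, on="time")

# Nearest: match each trade to the closest quote in either direction
nearest = pd.merge_asof(trades, quotes, on="time", direction="nearest")

print(list(backward["price"]))
print(list(nearest["price"]))
```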

SQL-like query in Python

df.query('VAR > 0')

Sort the dataframe descending

df.sort_values("var", ascending = False)

Sorting by two different variables, the first ascending, the second descending

df.sort_values(["var1", "var2"], ascending = [True, False])

Subsetting a column

df["var"]

Subsetting multiple columns

df[["var1", "var2"]]

Subsetting rows
- greater than zero
- part of a list

df[df['var'] > 0]
df[df['var'].isin(['option1', 'option2'])]

Setting an index

df = df.set_index('var')

Removing an index

df = df.reset_index()

The index becomes a variable again unless the argument drop = True is used.
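A short round trip through set_index and reset_index, using a made-up table, to show what drop = True changes:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"var": ["a", "b"], "score": [1, 2]})

indexed = df.set_index("var")              # "var" moves into the index
restored = indexed.reset_index()           # "var" comes back as a column
dropped = indexed.reset_index(drop=True)   # "var" is discarded

print(list(indexed.columns))
print(list(restored.columns))
print(list(dropped.columns))
```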

Two ways to select a column

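df['var']
df.var, but only if the column name is a valid Python identifier (letters, numbers, underscores, not starting with a digit) and does not clash with an existing DataFrame attribute or method. A column literally named var would clash, because df.var is the variance method.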

Select two columns

df[['col1', 'col2']]

Difference between df['var'] and df[['var']]

df['var'] returns the column as a pandas Series.
df[['var']] returns the column as a pandas DataFrame.
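A quick check of the two return types, with invented data. The sketch also shows why attribute access is unsafe for a column named var: the name resolves to the DataFrame's variance method, not the column.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"var": [1, 2, 3], "population": [4, 5, 6]})

series = df["var"]     # single brackets: pandas Series
frame = df[["var"]]    # double brackets: one-column DataFrame

print(type(series).__name__)
print(type(frame).__name__)

# Attribute access works for a non-clashing name ...
same_column = df.population.equals(df["population"])

# ... but df.var is the variance method, not the "var" column
var_is_method = callable(df.var)
print(same_column, var_is_method)
```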

Difference loc and iloc

loc is label-based: specify rows and columns by their row and column labels.
iloc is integer-position based: specify rows and columns by their integer position (zero-based indexing!).

df.loc['BE', 'population'] --> selects the row with index label BE and the column population
df.loc[['BE', 'NL'], 'population'] --> selects the rows BE and NL and the column population
df.iloc[1, 0] --> selects the second row and the first column
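The three selections can be tried on a small table. The sketch assumes a DataFrame indexed by country code; the population figures are illustrative placeholders, not real statistics.

```python
import pandas as pd

# Hypothetical example data (placeholder numbers)
df = pd.DataFrame(
    {"population": [11.5, 17.4, 67.4], "capital": ["Brussels", "Amsterdam", "Paris"]},
    index=["BE", "NL", "FR"],
)

by_label = df.loc["BE", "population"]             # single value, by labels
two_rows = df.loc[["BE", "NL"], "population"]     # Series with two entries
by_position = df.iloc[1, 0]                       # second row, first column

print(by_label, by_position)
print(two_rows)
```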

manipulation Flashcards

(39 cards)