manipulation Flashcards

1
Q

Union two datasets

A

pd.concat([df1, df2], ignore_index = True)

ignore_index is set so that the index continues counting and doesn’t restart at 0 when new dataset starts

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Union two datasets with different column names (and keep only the common columns)

A

pd.concat([df1, df2], join = ‘inner’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

union two datasets with different columns (and keep all columns)

A

pd.concat([df1, df2], sort = True)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Inner join

A

df = df1.merge(df2, on = “key”, how = “inner”)

Inner join is the default for .merge() so not necessary

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Inner join and add to the variables the original table as source

A

df = df1.merge(df2, on = “key”, how = “inner”, suffixes = (“_t1”, “_t2”))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Left join with datasets where key is named differently

A

df = df1.merge(df2, left_on = “left_key”, right_on = “right_key”, how = “left”)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Left join three datasets

A

df = df1.merge(df2, on = “key”, how = “left”) \

.merge(df3, on = “key”, how = “left”)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Full join

A

df = df1.merge(df2, on = “key”, how = “outer”)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Manipulate a wide into a long dataset

Example scores per year:
var1 var2 2016 2017 2018

A

df = df.melt(id_vars = [‘var1’, ‘var2’], var_name = [‘years’], value_name = “score”]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Find complete duplicates, and drop complete duplicates

A

Find duplicates:
df[df.duplicated()]

Drop duplicates:
df.drop_duplicates(inplace = True)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Hardcode with upper limit

ex: check if any score above 10, if yes replace with 10

A

df.loc[df[“var”] > 10, “var”] = 10

First part of loc identifies the rows (where df[‘var’] > 10), second part the column ‘var’

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Sort the dataframe based on a variable

A

df.sort_values(by = “id”)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Drop rows via filtering

Example: drop all rows where var < 0

A

df = df[df[“var] > 0]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Drop row via drop statement

Example: all cases where var < 0

A

df.drop(df[df[‘var’] < 0].index, inplace = True)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Change string into integer

A

df[‘var’] = df[‘var’].astype(‘int’)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Change integer into categorical

A

df[‘var’] = df[‘var’].astype(‘category’)

17
Q

Change string into date

A

df[‘date”] = pd.to_datetime(df[‘date’]).dt.date

18
Q

Remove $ sign from a string variable

A

df[“var”] = df[“var”].str.strip(‘$’)

19
Q

How to continue code on the next line

A

\

20
Q

Find all cases with a date in the future

A

import datetime as dt

df[df[‘date’] > dt.date.today()]

21
Q

Today’s date

A

import datetime as dt

today_date = dt.date.today()

22
Q

Get indication on type of join to avoid additional rows

A

.merge(validate = ‘one_to_one’)

of ook:
‘one_to_many’
‘many_to_many’
‘many_to_one’

23
Q

Joins - add indicator from in which table(s) the case was present

A

.merge(indicator = True)

24
Q

Check if value is in a list of ID’s from another dataset

A

df[‘id’].isin(df2[‘id’]

25
Q

Method to join time series

A

.merge_ordered()

Main difference with .merge -> default outer join.
More options to join based on dates and nearest match

26
Q

Join time series data with data lag on missing values

A

pd.merge_ordered(df1, df2, on = “key”, fill_method = “ffill”

27
Q

Join time series on nearest matching date

A

pd.merge_asof(df1, df2, on = “key”)

direction can be specified.
tables must be presorted on key

28
Q

SQL query in python

A

df.query(‘VAR > 0’)

29
Q

Sort the dataframe descending

A

df.sort_values(“var”, ascending = False)

30
Q

Sorting by two different variables, the first ascending, the second descending

A

df.sort_values([“var1”, “var2”], ascending = [True, False])

31
Q

Subsetting a column

A

df[“var”]

32
Q

Subsetting multiple columns

A

df[[“var1”, “var2”]]

33
Q

Subsetting rows

  • greater than zero
  • part of a list
A

df[df[‘var”] > 0]

df[df[‘var’].isin([‘option1’, ‘option2’])]

34
Q

Setting an index

A

df = df.set_index(‘var’)

35
Q

Removing an index

A

df = df.reset_index()

index will become a variable again unless argument (drop = True) is used

36
Q

Two ways to select a column

A

df[‘var’]
df.var

if var has only letters/numbers/underscore

37
Q

Select two columns

A

df[[‘col1’, ‘col2’]]

38
Q

Difference between df[‘var’] and df[[‘var’]]

A

df[‘var’] will return the column as a pandas series

df[[‘var’]] will return the column as a pandas dataframe

39
Q

Difference loc and iloc

A

Loc is label-based: specify rows and columns by their row and column labels
iloc is index based: specifcy rows and columns by their integer index (zero based indexing!)

df. loc[‘BE’, ‘population’] –> selects row with index BE and column population
df. loc[[‘BE’, ‘NL’], ‘population’]

df.iloc[1, 0]