Pandas Merge Join Flashcards

1
Q

pd.merge()

A

The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

pd.merge(df1, df2)

A

The pd.merge() function recognizes that each DataFrame has an “employee” column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the “employee” column differs between df1 and df2, and the pd.merge() function correctly accounts for this. Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the left_index and right_index keywords, discussed momentarily).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

pd.merge(df3, df4)

A

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example of a many-to-one join:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Many-to-many joins
pd.merge(df1, df5)

where both data sets contain duplicate columns

A

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group. By performing a many-to-many join, we can recover the skills associated with any individual person:

0	Bob	Accounting	math
1	Bob	Accounting	spreadsheets
2	Jake	Engineering	coding
3	Jake	Engineering	linux
4	Lisa	Engineering	coding
5	Lisa	Engineering	linux
6	Sue	HR	spreadsheets
7	Sue	HR	organization
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

pd.merge(df1, df2, on=’employee’)

A

Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

pd.merge(df1, df3, left_on=”employee”, right_on=”name”)

A

At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

pd.merge(df1, df3, left_on=”employee”, right_on=”name”).drop(‘name’, axis=1)

A

The result has a redundant column that we can drop if desired–for example, by using the drop() method of DataFrames:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

The left_index and right_index keywords
df1a = df1.set_index(‘employee’)
df2a = df2.set_index(‘employee’)

A

Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

pd.merge(df1a, df2a, left_index=True, right_index=True)

A

You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

‘df1a’, ‘df2a’, ‘df1a.join(df2a)’

A

For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

‘df1a’, ‘df3’, “pd.merge(df1a, df3, left_index=True, right_on=’name’)

A

If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Specifying Set Arithmetic for Joins

In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other.

A

Here we have merged two datasets that have only a single “name” entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to “inner”:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

‘df6’, ‘df7’, “pd.merge(df6, df7, how=’outer’)

A

Other options for the how keyword are ‘outer’, ‘left’, and ‘right’. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

pd.merge(df6, df7, how=’left’)

A

The left join and right join return joins over the left entries and right entries, respectively. For example

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Overlapping Column Names

pd.merge(df8, df9, on=”name”)

A

Because the output would have two conflicting column names, the merge function automatically appends a suffix _x or _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

pd.merge(df8, df9, on=”name”, suffixes=[“_L”, “_R”])

A

These suffixes work in any of the possible join patterns, and work also if there are multiple overlapping columns.