Pandas Merge Join Flashcards
pd.merge()
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data.
pd.merge(df1, df2)
The pd.merge() function recognizes that each DataFrame has an “employee” column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the “employee” column differs between df1 and df2, and the pd.merge() function correctly accounts for this. Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the left_index and right_index keywords, discussed momentarily).
pd.merge(df3, df4)
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example of a many-to-one join:
Many-to-many joins
pd.merge(df1, df5)
where both data sets contain duplicate columns
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group. By performing a many-to-many join, we can recover the skills associated with any individual person:
0 Bob Accounting math 1 Bob Accounting spreadsheets 2 Jake Engineering coding 3 Jake Engineering linux 4 Lisa Engineering coding 5 Lisa Engineering linux 6 Sue HR spreadsheets 7 Sue HR organization
pd.merge(df1, df2, on=’employee’)
Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:
pd.merge(df1, df3, left_on=”employee”, right_on=”name”)
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:
pd.merge(df1, df3, left_on=”employee”, right_on=”name”).drop(‘name’, axis=1)
The result has a redundant column that we can drop if desired–for example, by using the drop() method of DataFrames:
The left_index and right_index keywords
df1a = df1.set_index(‘employee’)
df2a = df2.set_index(‘employee’)
Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:
pd.merge(df1a, df2a, left_index=True, right_index=True)
You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():
‘df1a’, ‘df2a’, ‘df1a.join(df2a)’
For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:
‘df1a’, ‘df3’, “pd.merge(df1a, df3, left_index=True, right_on=’name’)
If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior:
Specifying Set Arithmetic for Joins
In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other.
Here we have merged two datasets that have only a single “name” entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to “inner”:
‘df6’, ‘df7’, “pd.merge(df6, df7, how=’outer’)
Other options for the how keyword are ‘outer’, ‘left’, and ‘right’. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:
pd.merge(df6, df7, how=’left’)
The left join and right join return joins over the left entries and right entries, respectively. For example
Overlapping Column Names
pd.merge(df8, df9, on=”name”)
Because the output would have two conflicting column names, the merge function automatically appends a suffix _x or _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword: