Joining Data with pandas Flashcards
Inner join
Merging is same as joining.
Inner join - retutns rows that that having matching values in both data. Basically an intersetmn
wards_census = wards.merge(census, on=’ward’)
print(wards_census.head(4))
Suffixes
wards_census = wards.merge(census, on=’ward’, suffixes=(‘_cen’, ‘_ward’))
print(wards_census.head())
print(wards_census.shape)
One-to-many relationships
One-To-One = Every row in the left table is related to only one row in the right table One-To-Many = Every row in left table is related to one or more rows in the right table
One-to-many example
ward_licenses = wards.merge(licenses, on=’ward’, suffixes=(‘_ward’, ‘_lic’))
print(ward_licenses.head())
Merging multiple DataFrames
Theoretical merge
grants_licenses = grants.merge(licenses, on=’zip’)
print(grants_licenses.loc[grants_licenses[‘business’]==”REGGIE’S BAR & GRILL”, [‘grant’,’company’,’account’,’ward’,’business’]])
Single merge
grants.merge(licenses, on=[‘address’, ‘zip’])
Merging multiple tables
grants_licenses_ward = grants.merge(licenses, on=[‘address’,
‘zip’]) \
.merge(wards, on=’ward’, suffixes=(‘_bus’, ‘_ward’))
grants_licenses_ward.head()
Merging even more…
Three tables:
df1.merge(df2, on=’col’) \
.merge(df3, on=’col’)
Four tables:
df1.merge(df2, on=’col’) \
.merge(df3, on=’col’) \
.merge(df4, on=’col’)
Left join
Inner join is default behavior. Inner join refers to the join that produces the rows of data with matching values in the key columns of both tables. Essentially , it is the data common to both tables (intersection of two sets)
Left Join - returns all rows of data from the left table and rows from the right table where the key columns match. The LEFT JOIN of two tables is nothing more than finding the set difference or the relative complement of the two tables.
Merge with leftjoin
movies_taglines = movies.merge(taglines, on=’id’, how=
‘left’)
print(movies_taglines.head())
Other joins
Right Join - It is the mirror opposite of the left join. Only rows from the left table where the key columns match and ll rows of data on the right table are returned.
Filtering the data
m = movie_to_genres[‘genre’] == ‘TV Movie’
tv_genre = movie_to_genres[m]
Merge with right join
tv_movies = movies.merge(tv_genre, how=’right’,
left_on=’id’, right_on=’movie_id’)
print(tv_movies.head())
Outer Join - Returns all rows from both tables and null for values where the key columns used to join the tables has no match.
Merge with outer join
family_comedy = family.merge(comedy, on=’movie_id’, how=’outer’, suffixes=(‘_fam’, ‘_com’))
print(family_comedy)
Merging a table to itself
Aka self join -
Merging a table to itself (inner)
original_sequels = sequels.merge(sequels, left_on=
‘sequel’, right_on=’id’, suffixes=(‘_org’, ‘_seq’))
print(original_sequels.head())
Merging a table to itself with leftjoin
original_sequels = sequels.merge(sequels, left_on=
‘sequel’, right_on=’id’, how=’left’, suffixes=(‘_org’, ‘_seq’))
print(original_sequels.head())
Common situations when to merge attable to itself:
Hierarchical relationships
Sequential relationships
Graph data
Merging on indexes
Setting an index
movies = pd.read_csv(‘tmdb_movies.csv’, index_col=[‘id’])
Merging on index
movies_taglines = movies.merge(taglines, on=’id’, how=
‘left’)
MultiIndex merge samuel_casts = samuel.merge(casts, on=['movie_id', 'cast_id']) print(samuel_casts.head()) print(samuel_casts.shape)
Index merge with left_on and right_on
movies_genres = movies.merge(movie_to_genres, left_on=’id’, left_index=True,right_on=’movie_id’, right_index=True)
print(movies_genres.head())
Filtering joins
Mutating versus ltering joins
Mutating joins: Combines data from two tables based on matching observations in both tables
Filtering joins: Filter observations from table based on whether or not they match an observation in another
table
Semi-joins
Returns the intersection, similar to an inner join
Returns only columns from the left table and not the right
No duplicates
Semi-join
genres_tracks = genres.merge(top_tracks, on=’gid’)
top_genres = genres[genres[‘gid’].isin(genres_tracks[‘gid’])]
print(top_genres.head())
Anti-join:
Returns the left table, excluding the intersection
Returns only columns from the left table and not the right
Anti-join
genres_tracks = genres.merge(top_tracks, on=’gid’, how=
‘left’, indicator=True)
gid_list = genres_tracks.loc[genres_tracks[‘_merge’] ==
‘left_only’, ‘gid’]
non_top_genres = genres[genres[‘gid’].isin(gid_list)]
print(non_top_genres.head())
Concatenate DataFrames together vertically
Concatenate two tables vertically
Pandas .concat() method can concatenate both vertical and horizontal.
axis=0 , vertical
Basic concatenation
pd.concat([inv_jan, inv_feb, inv_mar])
Ignoring the index
pd.concat([inv_jan, inv_feb, inv_mar],
ignore_index=True)
Setting labels to original tables
pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=False, keys=[‘jan’, ‘feb’, ‘mar’])
Concatenate tables with different column names
pd.concat([inv_jan, inv_feb],
sort=True)
Concatenate tables with different column names
pd.concat([inv_jan, inv_feb], join=’inner’)
Using append method .append() Simplied version ofthe .concat() method Supports: ignore_index , and sort Does Not Support: keys and join Always join = outer
Append the tables
inv_jan.append([inv_feb, inv_mar], ignore_index=True,
sort=True)
Verifying integrity
Validating merges .merge(validate=None) : Checks if merge is of specified type 'one_to_one' 'one_to_many' 'many_to_one' 'many_to_many'
Merge validate: one_to_one
tracks.merge(specs, on=’tid’,validate=’one_to_one’)
Merge validate: one_to_many
albums.merge(tracks, on=’aid’,validate=’one_to_many’)
Verifying concatenations
.concat(verify_integrity=False) :
Check whether the new concatenated index contains duplicates. Default value is False
Verifying concatenation: example
pd.concat([inv_feb, inv_mar],
verify_integrity=True) - checks whether there are duplicate values in the index
pd.concat([inv_feb, inv_mar],
verify_integrity=False)
Why verify integrity and what to do Why: Real world data is often NOT clean What to do: Fix incorrect data Drop duplicate rows
Using merge_ordered()
Merging stock data
import pandas as pd
pd.merge_ordered(appl, mcd, on=’date’, suffixes=(‘_aapl’, ‘_mcd’))
Forward fill example
pd.merge_ordered(appl, mcd, on=’date’,
suffixes=(‘_aapl’, ‘_mcd’), fill_method=’ffill’)
When to use merge_ordered()?
Ordered data / time series
Filling in missing values
Using merge_asof()
merge_asof() example
pd.merge_asof(visa, ibm, on=’date_time’,
suffixes=(‘_visa’, ‘_ibm’))
merge_asof() example with direction
pd.merge_asof(visa, ibm, on=[‘date_time’],
suffixes=(‘_visa’, ‘_ibm’), direction=’forward’)
When to use merge_asof()
Data sampled from a process
Developing a training set (no data leakage)
Selecting data with .query()
The .query() method
.query(‘SOME SELECTION STATEMENT’)
Accepts an input string
Input string used to determine what rows are returned
Input string similar to statement afterWHERE clause in SQL statement
Prior knowledge of SQL is not necessary
Querying on a single condition
stocks.query(‘nike >= 90’)
Querying on a multiple conditions, “and”, “or”
stocks. query(‘nike > 90 and disney < 140’)
stocks. query(‘nike > 96 or disney < 98’)
Using .query() to select text
stocks_long.query(‘stock==”disney” or (stock==”nike” and close < 90)’)
Reshaping data with.melt()
Example of .melt()
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’])
print(social_fin_tall.head(10))
Melting with value_vars
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’], value_vars=[‘2018’, ‘2017’])
print(social_fin_tall.head(9))
Melting with column names
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’], value_vars=[‘2018’, ‘2017’], var_name=’year’], value_name=’dollars’)
print(social_fin_tall.head(8))