Joining Data with pandas Flashcards

Question 1

Q

Inner join

Answer

A

Merging is same as joining.
Inner join - retutns rows that that having matching values in both data. Basically an intersetmn
wards_census = wards.merge(census, on=’ward’)
print(wards_census.head(4))

Suffixes
wards_census = wards.merge(census, on=’ward’, suffixes=(‘_cen’, ‘_ward’))
print(wards_census.head())
print(wards_census.shape)

Question 2

Q

One-to-many relationships

Answer

A

One-To-One = Every row in the left table is related to only one row in the right table
One-To-Many = Every row in left table is related to one or more rows in the right table

One-to-many example
ward_licenses = wards.merge(licenses, on=’ward’, suffixes=(‘_ward’, ‘_lic’))
print(ward_licenses.head())

Question 3

Q

Merging multiple DataFrames

Answer

A

Theoretical merge
grants_licenses = grants.merge(licenses, on=’zip’)
print(grants_licenses.loc[grants_licenses[‘business’]==”REGGIE’S BAR & GRILL”, [‘grant’,’company’,’account’,’ward’,’business’]])

Single merge
grants.merge(licenses, on=[‘address’, ‘zip’])

Merging multiple tables
grants_licenses_ward = grants.merge(licenses, on=[‘address’,
‘zip’]) \
.merge(wards, on=’ward’, suffixes=(‘_bus’, ‘_ward’))
grants_licenses_ward.head()

Merging even more…
Three tables:
df1.merge(df2, on=’col’) \
.merge(df3, on=’col’)

Four tables:
df1.merge(df2, on=’col’) \
.merge(df3, on=’col’) \
.merge(df4, on=’col’)

Question 4

Q

Left join

Answer

A

Inner join is default behavior. Inner join refers to the join that produces the rows of data with matching values in the key columns of both tables. Essentially , it is the data common to both tables (intersection of two sets)

Left Join - returns all rows of data from the left table and rows from the right table where the key columns match. The LEFT JOIN of two tables is nothing more than finding the set difference or the relative complement of the two tables.

Merge with leftjoin
movies_taglines = movies.merge(taglines, on=’id’, how=
‘left’)
print(movies_taglines.head())

Question 5

Q

Other joins

Answer

A

Right Join - It is the mirror opposite of the left join. Only rows from the left table where the key columns match and ll rows of data on the right table are returned.

Filtering the data
m = movie_to_genres[‘genre’] == ‘TV Movie’
tv_genre = movie_to_genres[m]

Merge with right join
tv_movies = movies.merge(tv_genre, how=’right’,
left_on=’id’, right_on=’movie_id’)
print(tv_movies.head())

Outer Join - Returns all rows from both tables and null for values where the key columns used to join the tables has no match.

Merge with outer join
family_comedy = family.merge(comedy, on=’movie_id’, how=’outer’, suffixes=(‘_fam’, ‘_com’))
print(family_comedy)

Question 6

Q

Merging a table to itself

Answer

A

Aka self join -
Merging a table to itself (inner)
original_sequels = sequels.merge(sequels, left_on=
‘sequel’, right_on=’id’, suffixes=(‘_org’, ‘_seq’))
print(original_sequels.head())

Merging a table to itself with leftjoin
original_sequels = sequels.merge(sequels, left_on=
‘sequel’, right_on=’id’, how=’left’, suffixes=(‘_org’, ‘_seq’))
print(original_sequels.head())

Common situations when to merge attable to itself:
Hierarchical relationships
Sequential relationships
Graph data

Question 7

Q

Merging on indexes

Answer

A

Setting an index
movies = pd.read_csv(‘tmdb_movies.csv’, index_col=[‘id’])

Merging on index
movies_taglines = movies.merge(taglines, on=’id’, how=
‘left’)

MultiIndex merge
samuel_casts = samuel.merge(casts, on=['movie_id',
'cast_id'])
print(samuel_casts.head())
print(samuel_casts.shape)

Index merge with left_on and right_on
movies_genres = movies.merge(movie_to_genres, left_on=’id’, left_index=True,right_on=’movie_id’, right_index=True)
print(movies_genres.head())

Question 8

Q

Filtering joins

Answer

A

Mutating versus ltering joins
Mutating joins: Combines data from two tables based on matching observations in both tables
Filtering joins: Filter observations from table based on whether or not they match an observation in another
table

Semi-joins
Returns the intersection, similar to an inner join
Returns only columns from the left table and not the right
No duplicates

Semi-join
genres_tracks = genres.merge(top_tracks, on=’gid’)
top_genres = genres[genres[‘gid’].isin(genres_tracks[‘gid’])]
print(top_genres.head())

Anti-join:
Returns the left table, excluding the intersection
Returns only columns from the left table and not the right

Anti-join
genres_tracks = genres.merge(top_tracks, on=’gid’, how=
‘left’, indicator=True)
gid_list = genres_tracks.loc[genres_tracks[‘_merge’] ==
‘left_only’, ‘gid’]
non_top_genres = genres[genres[‘gid’].isin(gid_list)]
print(non_top_genres.head())

Question 9

Q

Concatenate DataFrames together vertically

Answer

A

Concatenate two tables vertically
Pandas .concat() method can concatenate both vertical and horizontal.
axis=0 , vertical

Basic concatenation
pd.concat([inv_jan, inv_feb, inv_mar])

Ignoring the index
pd.concat([inv_jan, inv_feb, inv_mar],
ignore_index=True)

Setting labels to original tables
pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=False, keys=[‘jan’, ‘feb’, ‘mar’])

Concatenate tables with different column names
pd.concat([inv_jan, inv_feb],
sort=True)

Concatenate tables with different column names
pd.concat([inv_jan, inv_feb], join=’inner’)

Using append method .append()
Simplied version ofthe .concat() method
Supports: ignore_index , and sort
Does Not Support: keys and join
Always join = outer

Append the tables
inv_jan.append([inv_feb, inv_mar], ignore_index=True,
sort=True)

Question 10

Q

Verifying integrity

Answer

A

Validating merges
.merge(validate=None) :
Checks if merge is of specified type
'one_to_one'
'one_to_many'
'many_to_one'
'many_to_many'

Merge validate: one_to_one
tracks.merge(specs, on=’tid’,validate=’one_to_one’)

Merge validate: one_to_many
albums.merge(tracks, on=’aid’,validate=’one_to_many’)

Verifying concatenations
.concat(verify_integrity=False) :
Check whether the new concatenated index contains duplicates. Default value is False

Verifying concatenation: example
pd.concat([inv_feb, inv_mar],
verify_integrity=True) - checks whether there are duplicate values in the index

pd.concat([inv_feb, inv_mar],
verify_integrity=False)

Why verify integrity and what to do
Why:
Real world data is often NOT clean
What to do:
Fix incorrect data
Drop duplicate rows

Question 11

Q

Using merge_ordered()

Answer

A

Merging stock data
import pandas as pd
pd.merge_ordered(appl, mcd, on=’date’, suffixes=(‘_aapl’, ‘_mcd’))

Forward fill example
pd.merge_ordered(appl, mcd, on=’date’,
suffixes=(‘_aapl’, ‘_mcd’), fill_method=’ffill’)

When to use merge_ordered()?
Ordered data / time series
Filling in missing values

Question 12

Q

Using merge_asof()

Answer

A

merge_asof() example
pd.merge_asof(visa, ibm, on=’date_time’,
suffixes=(‘_visa’, ‘_ibm’))

merge_asof() example with direction
pd.merge_asof(visa, ibm, on=[‘date_time’],
suffixes=(‘_visa’, ‘_ibm’), direction=’forward’)

When to use merge_asof()
Data sampled from a process
Developing a training set (no data leakage)

Question 13

Q

Selecting data with .query()

Answer

A

The .query() method
.query(‘SOME SELECTION STATEMENT’)
Accepts an input string
Input string used to determine what rows are returned
Input string similar to statement afterWHERE clause in SQL statement
Prior knowledge of SQL is not necessary

Querying on a single condition
stocks.query(‘nike >= 90’)

Querying on a multiple conditions, “and”, “or”

stocks. query(‘nike > 90 and disney < 140’)
stocks. query(‘nike > 96 or disney < 98’)

Using .query() to select text
stocks_long.query(‘stock==”disney” or (stock==”nike” and close < 90)’)

Question 14

Q

Reshaping data with.melt()

Answer

A

Example of .melt()
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’])
print(social_fin_tall.head(10))

Melting with value_vars
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’], value_vars=[‘2018’, ‘2017’])
print(social_fin_tall.head(9))

Melting with column names
social_fin_tall = social_fin.melt(id_vars=[‘financial’,
‘company’], value_vars=[‘2018’, ‘2017’], var_name=’year’], value_name=’dollars’)
print(social_fin_tall.head(8))

Brainscape's Knowledge GenomeTM

Joining Data with pandas Flashcards

Brainscape's Knowledge Genome^TM