Merging Dataframes / Datacamp(1,2,3 not project) Flashcards
Use a loop to store these files in a list as data frames and unpacked the list into variables called gold,silver, bronze
Gold.csv
Silver.csv
Bronze.csv
import pandas as pd
filenames = [‘Gold.csv’, ‘Silver.csv’, ‘Bronze.csv’]
dataframes = []
for filename in filenames:
dataframes.append(pd.read_csv(filename))
gold,silver,bronze=dataframes
Make a copy of the df gold and call it medals
medals = gold.copy()
Rename columns of df as Code, Country, Total
A B C
0 USA United States 2088.0
1 URS Soviet Union 838.0
new_labels = ['NOC', 'Country', 'Gold'] medals.columns = new_labels
Add columns ‘Total’ from data frames Silver and Bronze into dataframe gold
medals['Silver'] = silver['Total'] medals['Bronze'] = bronze['Total']
Change order of the dataframe
0 1
2013-12-31 -0.31 0.83
2014-12-31 -0.63 -0.19
2015-12-31 -0.53 -0.85
0 1
2015-12-31 -0.53 -0.85
2014-12-31 -0.63 -0.19
2013-12-31 -0.31 0.83
df.sort_index(ascending=False,inplace=True)
Change order of the dataframe
0 1
2013-12-31 -0.31 0.83
2014-12-31 -0.63 -0.19
2015-12-31 -0.53 -0.85
#Outcome: 0 1 2014-12-31 -0.63 -0.19 2015-12-31 -0.53 -0.85 2013-12-31 -0.31 0.83
df.sort_values(0, ascending=True, inplace=True)
Fill forward NaN
A B C D 0 5.0 NaN 4 5.0 1 3.0 2.0 3 4.0 2 NaN 4.0 8 2.0 3 4.0 3.0 5 NaN
A B C D 0 5.0 NaN 4 5.0 1 3.0 2.0 3 4.0 2 3.0 4.0 8 2.0 3 4.0 3.0 5 2.0
df.ffill(axis = 0)
Fill forward NaN
A B C D 0 5.0 NaN 4 5.0 1 3.0 2.0 3 4.0 2 NaN 4.0 8 2.0 3 4.0 3.0 5 NaN
A B C D 0 5.0 5.0 4.0 5.0 1 3.0 2.0 3.0 4.0 2 NaN 4.0 8.0 2.0 3 4.0 3.0 5.0 5.0
df.ffill(axis = 1)
Outcome after reindexing df
a b c
A 0.338244 0.548040 0.898135
B 0.709872 0.965031 0.521506
C 0.605053 0.340734 0.139700
df1.reindex([‘Ro’, ‘A’, ‘B’, ‘Ri’,])
a b c Ro NaN NaN NaN A 0.187632 0.587335 0.141530 B 0.900119 0.984410 0.648169 Ri NaN NaN NaN
Transform the first dataframe into the second one using a list called new_indexes
a b c
A 0.800152 0.485063 0.607966
B 0.509770 0.856334 0.820924
C 0.249211 0.675843 0.096703
a b c Ro missing missing missing A 0.800152 0.485063 0.607966 B 0.50977 0.856334 0.820924 Ri missing missing missing
new_index=[‘Ro’, ‘A’, ‘B’, ‘Ri’,]
print(df1.reindex(new_index, axis =’rows’, fill_value =’missing’))
Add these two dataframes and guess the output
a b c
A 0.8 0.4 0.3
B 0.3 0.9 0.2
C 0.9 0.9 0.4
a b c
Z 0.4 0.5 0.3
B 0.1 0.4 0.4
C 0.4 0.7 0.7
df0+df1
print(df0+df1)
aN NaN NaN
B 0.4 1.3 0.6
C 1.3 1.6 1.1
Z NaN NaN NaN
What is the output of multiplying 10 with this dataframe?
ax bx cx
Z 0.1 0.7 0.9
B 0.8 0.6 0.1
C 0.1 0.5 0.1
ax bx cx
Z 1.0 7.0 9.0
B 8.0 6.0 1.0
C 1.0 5.0 1.0
Resample into year and get the mean
0 1 2013-01-31 -2.10 -1.73 2013-02-28 -1.15 -0.74 2013-03-31 -0.51 1.65 2013-04-30 -1.10 -2.13
yearly = df.resample(‘A’).mean()
yearly
0 1
2013-12-31 -1.215 -0.7375
Percentage growth of ‘c’
a b c
A 0.5 1.0 0.4
B 0.2 0.8 0.5
C 0.8 0.5 0.6
df_1=df[‘growth_c%’] = df0[‘c’].pct_change()*100
Make sure the dataframe gets the dates
sp500 = pd.read_csv(‘sp500.csv’,index_col=’Date’,…)
sp500 = pd.read_csv(‘sp500.csv’,index_col=’Date’, parse_dates=True)
Convert the sp500 into GBP
#df_sp500 in USD Open Close Date 2015-01-02 2058.899902 2058.199951 2015-01-05 2054.439941 2020.579956 2015-01-06 2022.150024 2002.609985
#df_exchange GBP/USD Date 2015-01-02 0.65101 2015-01-05 0.65644 2015-01-06 0.65896
pounds = df_sp500.multiply(df_exchange[‘GBP/USD’], axis=’rows’)
Append df_silver and df_bronze into a df called combined and print United States result
#df_silver Country United States 1052.0 Soviet Union 584.0 United Kingdom 505.0
#df_bronze Country United States 1195.0 Soviet Union 627.0 United Kingdom 591.0
combined=bronze.append(silver)
combined.loc[‘United States’]
Append the Series Units for both df into a series called q1 and add all the units sold
#df_jan Date 2015-01-21 Streeplex Hardware 11 2015-01-09 Streeplex Service 8 2015-01-06 Initech Hardware 17
#df_feb Company Product Units Date 2015-02-26 Streeplex Service 4 2015-02-16 Hooli Software 10 2015-02-03 Initech Software 13
jan_units = jan[‘Units’]
feb_units = feb[‘Units’]
q1=jan_units.append(feb_units)
q1.sum()
Use a loop to concatenate data frames (jan,feb, mar) by its Units into a dataframe called quarter1
Company Product Units Date 2015-02-26 Streeplex Service 4 2015-02-16 Hooli Software 10 2015-02-03 Initech Software 13
units = []
for month in [jan, feb, mar]:
units.append(month[‘Units’])
quarter1 = pd.concat(units, axis=’rows’)
Concatenate both data frames
df1 A B C D 0 A0 B0 C0 D0 1 A1 B1 C1 D1 2 A2 B2 C2 D2
df2 A B C D 4 A4 B4 C4 D4 5 A5 B5 C5 D5 6 A6 B6 C6 D6
Import pandas as pd
pd.concat([df1, df2])
difference between append and concat method in pandas
Append combines the rows of one dataframe to another dataframe. To be honest, I don’t use this method ever.
Concat can take a group of 2+ dataframes and combines the dataframes via the rows or columns.
Outcome of:
df1.append(df2)
#df1 a b 0 1 5 1 2 6 2 3 7
#df2 a b 0 9 4 1 6 2 2 3 10
a b 0 1 5 1 2 6 2 3 7 3 4 8 0 9 4 1 6 2 2 3 10
Append df1 and df2 and make the index to be a sequence of integers starting from 0
df1.append(df2, ignore_index = True)
Concatenate horizontally df1 and df2
pd.concat([df1,df2],axis=1)
Concatenate vertically df1 and df2, and make the index to be a sequence of integers starting from 0
pd.concat([df1,df2],axis=0,ignore_index=True)
Use a loop to read csv files named [gold_top5.csv, silver.., bronze] . Create a df with 4 columns (Country, medals) and make the Country the index. Use append method
#Outcome: bronze silver gold France 475.0 461.0 NaN Germany 454.0 NaN 407.0 Italy NaN 394.0 460.0
medals_type=[‘bronze’, ‘silver’, ‘gold’]
medals =[]
for medal in medal_types: file_name = "%s_top5.csv" % medal columns = ['Country', medal] medal_df = pd.read_csv(file_name, index_col='Country', names=columns) medals.append(medal_df)
medals_df = pd.concat(medals, axis=’columns’)
print(medals_df.head(3)
#Concatenating vertically to get MultiIndexed rows #Concatenate a list of dataframes
To do
medals = pd.concat(medals,keys=[‘bronze’, ‘silver’, ‘gold’])
#Code to get the given outcome: df=medals
Total Country bronze United States 1052.0 Soviet Union 584.0 United Kingdom 505.0 … gold United Kingdom 498.0 Italy 460.0 Germany 407.0
#Outcome: Total Country bronze France 475.0 silver France 461.0 bronze Germany 454.0 gold Germany 407.0
medals.sort_index(level=1)
Slice index to get medals from UK
Country bronze United States 1052.0 Soviet Union 584.0 United Kingdom 505.0 ...
idx = pd.IndexSlice
medals_sorted.loc[idx[:,’United Kingdom’],:]
#Outcome: Total Country bronze United Kingdom 505.0 gold United Kingdom 498.0 silver United Kingdom 591.0
Get silver medals
Country
bronze United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
medals_sorted.loc[‘silver’]
Outcome: Total Country France 461.0 Italy 394.0 Soviet Union 627.0 United Kingdom 591.0 United States 1195.0
You have 3 dataframes in a dataframes list. Concatenate horizontally using their product a hierarchical column that contains the rest of the information
#df1 Company Product Units Date 2015-02-04 Acme Coporation Hardware 14 2015-02-07 Acme Coporation Hardware 1 2015-02-19 Mediacore Hardware 16
#Outcome: Hardware Software Service Company Product Units ……. Date 2015-02-02 Hooli Software 3 …..
pd.concat(dataframes, keys=[‘Hardware’,’Software’,’Service’], axis=1)
In a multi-indexed column dataframe with Hardware’,’Software’,’Service as top columns and Company Product Units second level coluumns, get Company data for 2015-2-2 to 2015-2-8
Hardware Software Service Company Company Company Date 2015-02-02 08:33:01 NaN Hooli NaN 2015-02-02 20:54:49 Mediacore NaN NaN
idx = pd.IndexSlice
february.loc[‘2015-2-2’:’2015-2-8’, idx[:,’Company’]]
Use the list month_list to populate a dict the value as df. Grouped by Company. Sum its values. Then concatenate the dictionary into a dataframe called sales
month_list = [(‘january’, jan), (‘february’, feb), (‘march’, mar)]
Outcome:
month_dict = {}
for month_name, month_data in month_list:
month_dict[month_name] = month_data.groupby(‘Company’).sum()
sales = pd.concat(month_dict)
Units Company february Acme Coporation 34 Hooli 30 Initech 30 Mediacore 45
Inner join of three similar dataframes (gold, silver, bronze) and visualize more or less the final dataframe
#bronze Total Country United States 1052.0 Soviet Union 584.0 United Kingdom 505.0
medals=pd.concat(medal_list,keys=[‘bronze’, ‘silver’, ‘gold’],axis=1,join=’inner’)
#Outcome bronze silver gold Total Total Total Country United States 1052.0 1195.0 2088.0 Soviet Union 584.0 627.0 838.0
What does inner join do?
keep only rows that share common index labels.
Change Date to annually using the last value of each year, chain a percentage growth and delete null values
China Year 1961-01-01 49.557050 1962-01-01 46.685179 1963-01-01 50.097303
Outcome:
china_annual = china.resample(‘A’).last().pct_change(10).dropna()
China
Year
1971-12-31 0.988860
1972-12-31 1.402472
Use inner join with two dataframes called china, us with similar structure
#Outcome China US Year 1971-12-31 0.988860 1.052270 1972-12-31 1.402472 1.172566
pd.concat([china_annual, us_annual], axis=1, join=’inner’)
Merge both df. How many rows will have the new df?
#df_1 city revenue 0 Austin 100 1 Denver 83 2 Springfield 4
#df_2 city manager 0 Austin Charlers 1 Denver Joel 2 Mendocino Brett
combined = pd.merge(revenue, managers, on=’city’)
Two lines
Merge both using cities. What will be the structure of the new dataframe?
city branch_id state revenue
0 Austin 10 TX 100
1 Denver 20 CO 83
branch branch_id state manager
0 Austin 10 TX Charlers
1 Denver 20 CO Joel
combined = pd.merge(revenue,managers,left_on=’city’,right_on=’branch’)
Columns of the new df=city branch_id_x state_x revenue branch branch_id_y state_y manager
Outcome of:
pd.merge(revenue, managers, on=’city’)
city branch_id state revenue
0 Austin 10 TX 100
1 Denver 20 CO 83
branch branch_id state manager
0 Austin 10 TX Charlers
1 Denver 20 CO Joel
Traceback (most recent call last): ... ... pd.merge(revenue, managers, on='city') ... ... KeyError: 'city'
Join both dataframes df1,df2 and visualize Outcome:
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
C D
K0 C0 D0
K1 C1 D1
K2 C2 D2
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
K2 A2 B2 C2 D2
Join using suffix and prefix. Visualize outcome
#revenue city state revenue branch_id 10 Austin TX 100 20 Denver CO 83 30 Springfield IL 4
#managers branch state manager branch_id 10 Austin TX Charlers 20 Denver CO Joel 47 Mendocino CA Brett
revenue.join(managers,lsuffix=’_rev’,rsuffix=’_mng’,how=’outer’)
city state_rev revenue branch state_mng manager branch_id 10 Austin TX 100.0 Austin TX Charlers 20 Denver CO 83.0 Denver CO Joel 30 Springfield IL 4.0 NaN NaN NaN 31 NaN NaN NaN Springfield MO Sally
What is the difference between merge and join in pandas?
join() methods as a convenient way to access the capabilities of pandas. … join(df2) always joins via the index of df2 , but df1. merge(df2) can join to one or more columns of df2 (default) or to the index of df2 (with right_index=True ). lookup on left table: by default, df1
What does left join do?
Can add suffixes
Left outer join produces a complete set of records from Table A, with the matching records (where available) in Table B. If there is no match, the right side will contain null.
pd.merge(df_a, df_b, on=’subject_id’, how=’left’)
pd.merge(df_a, df_b, on=’subject_id’, how=’left’, suffixes=(‘_left’, ‘_right’)
What does merge_ordered do?
Perform merge with optional filling/interpolation.
Designed for ordered data like time series data