Pandas Data Frame Flashcards
import pandas as pd
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.head()
The heart of pandas library is DataFrame. Works on 2d arrays like excel table.
Cost Item Purchased Name Store 1 22.5 Dog Food Chris Store 1 2.5 Kitty Litter Kevyn Store 2 5.0 Bird Seed Vinod
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.loc[‘Store 2’]
give me all info under index Store 2
Cost 5
Item Purchased Bird Seed
Name Vinod
Name: Store 2, dtype: object
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
type(df.loc[‘Store 2’])
type is a series when we use loc
pandas.core.series.Series
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.loc[‘Store 1’]
returns columns indexed to ‘Store 1’, the first two purchase series.
Cost Item Purchased Name Store 1 22.5 Dog Food Chris Store 1 2.5 Kitty Litter Kevyn
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.loc[‘Store 1’, ‘Cost’]
if we only want the ‘Cost’ column for Store 1.
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.T
swap all of the columns with rows.
Store 1 Store 1 Store 2 Cost 22.5 2.5 5 Item Purchased Dog Food Kitty Litter Bird Seed Name Chris Kevyn Vinod
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.T.loc[‘Cost’]
iloc and loc are used for row selection. That’s why we swapped them.
Store 1 22.5
Store 1 2.5
Store 2 5
Name: Cost, dtype: object
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df[‘Cost’]
columns always have a label in pandas databases.
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.loc[:,[‘Name’, ‘Cost’]]
Give me all ‘Name’ and ‘Cost values for all stores. Can use list or String.
.loc also supports slicing. ‘:’ returns all values. Could use ‘Store 1’ instead
Name Cost
Store 1 Chris 22.5
Store 1 Kevyn 2.5
Store 2 Vinod 5.0
purchase_1 = pd.Series({‘Name’: ‘Chris’,
‘Item Purchased’: ‘Dog Food’,
‘Cost’: 22.50})
purchase_2 = pd.Series({‘Name’: ‘Kevyn’,
‘Item Purchased’: ‘Kitty Litter’,
‘Cost’: 2.50})
purchase_3 = pd.Series({‘Name’: ‘Vinod’,
‘Item Purchased’: ‘Bird Seed’,
‘Cost’: 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=[‘Store 1’, ‘Store 1’, ‘Store 2’])
df.drop(‘Store 1’)
delete copy of ‘Store 1’ from data frame. Original data frame is still intact.
drop has 2 optional params. If inplace param True, it will change original dataframe.
Second parameter is axis which can be dropped. 0 is default for row. 1 switches to column.
copy_df.drop(labels, axis=0, level=None, inplace=False, errors=’raise’)
Cost Item Purchased Name Store 2 5.0 Bird Seed Vinod
df.drop(‘Store 1’)
print(df)
Full data frame is still in tact.
Cost Item Purchased Name Store 1 22.5 Dog Food Chris Store 1 2.5 Kitty Litter Kevyn Store 2 5.0 Bird Seed Vinod
del copy_df[‘Name’]
easier way of dropping a column.
df[‘Location’] = None
df
creates a new column and broadcasts default value to new column.
df[‘Location’] = None # creates a new column and broadcasts default value to new column.
df
Cost Item Purchased Name Location Store 1 22.5 Dog Food Chris None Store 1 2.5 Kitty Litter Kevyn None Store 2 5.0 Bird Seed Vinod None
costs = df[‘Cost’]
costs
gives me same results as df[‘Costs’]
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
costs = df[‘Cost’]
costs + 2
Before:
Store 1 22.5
Store 1 2.5
Store 2 5.0
After: Store 1 24.5 Store 1 4.5 Store 2 7.0 Name: Cost, dtype: float64
df = pd.read_csv(‘olympics.csv’)
reads csv file as is and stores it in data frame.
df = pd.read_csv(‘olympics.csv’, index_col = 0, skiprows=1)
The code reads the csv file in.
skiprows skips the top row of labels so it changes column names.
Params indicate which column should be index & skip first row.
df.columns
pandas stores all columns in this attribute.
Index([‘№ Summer’, ‘01 !’, ‘02 !’, ‘03 !’, ‘Total’, ‘№ Winter’, ‘01 !.1’, ‘02 !.1’, ‘03 !.1’, ‘Total.1’, ‘№ Games’, ‘01 !.2’, ‘02 !.2’, ‘03 !.2’, ‘Combined total’], dtype=’object’)
for col in df.columns:
if col[:2]==’01’:
df.rename(columns={col:’Gold’ + col[4:]}, inplace=True)
if col[:2]==’02’:
df.rename(columns={col:’Silver’ + col[4:]}, inplace=True)
if col[:2]==’03’:
df.rename(columns={col:’Bronze’ + col[4:]}, inplace=True)
if col[:1]==’№’:
df.rename(columns={col:’#’ + col[1:]}, inplace=True)
renaming the column names so they are clearer.
if the first two numbers in an element are equal.
inplace tells pandas to update df directly.
df[‘Feedback’] = [‘Positive’, None, ‘Negative’]
df
Need to make sure there are enough values.
Create a column w/ the list values in each row.
adf = df.reset_index()
adf[‘Date’] = pd.Series({0: ‘December 1’, 2: ‘mid-May’})
adf
It’s better to make sure there are key values so we don’t have to type None values.
If you have indexes when adding a list of values to a new column, NaN will be added to rows that are not referenced in dictionary
pd.merge(staff_df, student_df, how=’outer’, left_index=True, right_index=True)
# we want to do an outer join # if we want to merge based on a column name and use that column as the index, (left_index = True, right_on = "Column Name")
Role School Name James Grader Business Kelly Director of HR NaN Mike NaN Law Sally Course liasion Engineering
pd.merge(staff_df, student_df, how=’inner’, left_index=True, right_index=True)
we want to do an inner join.
Role School
Name
James Grader Business
Sally Course liasion Engineering
pd.merge(staff_df, student_df, how=’left’, left_index=True, right_index=True)
we want all staff but we also want their student info if the’re a student (left join)
Role School Name Kelly Director of HR NaN Sally Course liasion Engineering James Grader Business
pd.merge(staff_df, student_df, how=’right’, left_index=True, right_index=True)
we want all student but we also want their staff info if the’re a staff (right join)
Role School Name James Grader Business Mike NaN Law Sally Course liasion Engineering
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how=’left’, left_on=’Name’, right_on=’Name’)
you can use columns instead of indices to join on.
Name Role School
0 Kelly Director of HR NaN
1 Sally Course liasion Engineering
2 James Grader Business