Pandas DataFrame Flashcards
import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()
The heart of the pandas library is the DataFrame. It works on two-dimensional data, like an Excel table.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod
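A DataFrame can equivalently be built from a dictionary of columns; a minimal sketch (this dict-of-lists construction is my own illustration, not from the course):
data = {'Name': ['Chris', 'Kevyn', 'Vinod'],
        'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
        'Cost': [22.50, 2.50, 5.00]}
pd.DataFrame(data, index=['Store 1', 'Store 1', 'Store 2'])  # same table as above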
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 2']
Gives all of the information stored under the index label 'Store 2'.
Cost                    5
Item Purchased  Bird Seed
Name                Vinod
Name: Store 2, dtype: object
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
type(df.loc['Store 2'])
The type is a Series when we use .loc with a label that matches a single row.
pandas.core.series.Series
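When the label matches more than one row, however, .loc returns a DataFrame instead; a quick check on the same df:
type(df.loc['Store 1'])  # pandas.core.frame.DataFrame, since 'Store 1' labels two rows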
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 1']
Returns the rows indexed by 'Store 1', i.e. the first two purchase Series.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 1', 'Cost']
Use this if we only want the 'Cost' column for 'Store 1'.
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.T
Transposes the DataFrame, swapping all of the rows and columns.
                 Store 1       Store 1    Store 2
Cost                22.5           2.5          5
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Name               Chris         Kevyn      Vinod
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.T.loc['Cost']
.iloc and .loc select rows, so to pull out a column by label this way we first transpose, turning the 'Cost' column into a row.
Store 1 22.5
Store 1 2.5
Store 2 5
Name: Cost, dtype: object
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df['Cost']
Columns always have a label in pandas DataFrames, so a column can be selected directly with bracket notation.
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc[:, ['Name', 'Cost']]
Gives the 'Name' and 'Cost' values for all stores. The column selector can be a list of labels or a single string.
.loc also supports slicing: ':' selects all rows; we could use 'Store 1' instead to restrict the rows.
          Name  Cost
Store 1  Chris  22.5
Store 1  Kevyn   2.5
Store 2  Vinod   5.0
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.drop('Store 1')
Returns a copy of the DataFrame with 'Store 1' dropped; the original DataFrame is still intact.
drop has two optional parameters. If the inplace parameter is True, it changes the original DataFrame.
The second parameter is the axis to drop along: 0 (the default) drops rows, 1 switches to columns.
copy_df.drop(labels, axis=0, level=None, inplace=False, errors='raise')
         Cost Item Purchased   Name
Store 2   5.0      Bird Seed  Vinod
df.drop('Store 1')
print(df)
The full DataFrame is still intact.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod
del copy_df['Name']
An easier way of dropping a column; unlike drop, del works in place immediately and returns nothing.
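A minimal sketch contrasting the two approaches (copy_df as a copy of df is assumed here, as in the course):
copy_df = df.copy()
copy_df = copy_df.drop('Name', axis=1)  # returns a new frame without the column
del copy_df['Item Purchased']           # removes the column from copy_df in place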
df['Location'] = None
df
Creates a new column and broadcasts the default value (None) to every row of the new column.
df['Location'] = None  # creates a new column and broadcasts the default value to the new column
df
         Cost Item Purchased   Name Location
Store 1  22.5       Dog Food  Chris     None
Store 1   2.5   Kitty Litter  Kevyn     None
Store 2   5.0      Bird Seed  Vinod     None
costs = df['Cost']
costs
Gives the same result as df['Cost']; we've just stored the column in a variable.
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
costs = df['Cost']
costs + 2
Before:
Store 1 22.5
Store 1 2.5
Store 2 5.0
After:
Store 1    24.5
Store 1     4.5
Store 2     7.0
Name: Cost, dtype: float64
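Adding a scalar broadcasts it across every value in the Series. To record the increase in the DataFrame itself, assign the result back; a quick sketch:
df['Cost'] = df['Cost'] + 2  # broadcasts and writes the new values into the frame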
df = pd.read_csv('olympics.csv')
Reads the CSV file as-is and stores it in a DataFrame.
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
Reads the CSV file in. index_col=0 uses the first column as the index, and skiprows=1 skips the top row of labels so the second row supplies the column names.
df.columns
pandas stores all of the column labels in this attribute.
Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1', '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2', 'Combined total'], dtype='object')
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
Renames the columns so they are clearer. Each test checks whether the first characters of a column label match a given prefix (e.g. '01' for gold-medal counts).
inplace=True tells pandas to update df directly.
df['Feedback'] = ['Positive', None, 'Negative']
df
Creates a column w/ the list values assigned row by row. Need to make sure the list has exactly one value per row.
adf = df.reset_index()
adf['Date'] = pd.Series({0: 'December 1', 2: 'mid-May'})
adf
Assigning a Series keyed by index values is better, since we don't have to type None for the missing rows: any row whose index is not referenced in the dictionary gets NaN in the new column.
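The merge cards below rely on a staff_df and a student_df; a minimal reconstruction from the outputs shown (assumed, not copied from the course):
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}]).set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}]).set_index('Name')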
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
# we want to do an outer join (the union of the keys).
# to merge on a column on one side and use the index on the other, combine left_index=True with right_on="Column Name".
                 Role       School
Name
James          Grader     Business
Kelly  Director of HR          NaN
Mike              NaN          Law
Sally  Course liaison  Engineering
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
we want to do an inner join (the intersection of the keys).
                 Role       School
Name
James          Grader     Business
Sally  Course liaison  Engineering
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
we want all staff, but we also want their student info if they're a student (left join).
                 Role       School
Name
Kelly  Director of HR          NaN
Sally  Course liaison  Engineering
James          Grader     Business
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
we want all students, but we also want their staff info if they're staff (right join).
                 Role       School
Name
James          Grader     Business
Mike              NaN          Law
Sally  Course liaison  Engineering
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
You can use columns instead of indices to join on.
    Name            Role       School
0  Kelly  Director of HR          NaN
1  Sally  Course liaison  Engineering
2  James          Grader     Business
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
Demonstrates what happens when the two frames have conflicting columns: pandas appends _x to the left frame's column and _y to the right frame's.
pd.merge(staff_df, student_df, how='inner', left_on=['First Name', 'Last Name'], right_on=['First Name', 'Last Name'])
# many staff and students might have matching first names but not matching last names.
# the inner join only returns rows in which both keys match, so only Sally Brooks is returned.
(df.where(df['SUMLEV'] == 50)
   .dropna()
   .set_index(['STNAME', 'CTYNAME'])
   .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
# demonstrates method chaining.
# if you begin w/ an open parenthesis you can continue the expression over multiple lines.
import numpy as np

def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})
df.apply(min_max, axis=1)
# compare the map function: it takes a function and applies it to all items in a list.
# apply takes a function and the axis on which we want the function to operate.
# axis=1 passes each row to the function.
                             max       min
STNAME  CTYNAME
Alabama Autauga County   55347.0   54660.0
        Baldwin County  203709.0  183193.0
        Barbour County   27341.0   26489.0
# rows is assumed to be the list of population-estimate columns (defined here for completeness)
rows = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
# apply functions are used often w/ lambdas.
# you can chain several apply calls w/ lambdas together.
# calculates the max over those columns for each row.
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))
# here we use the census data and get a list of unique states.
# for each state we reduce the DataFrame and calculate the average.
# this takes a long time to run.
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))
# this does the same thing as the for loop w/ the unique method, but it is much faster because it uses a groupby object.
# we're interested in grouping by state name, and we calculate the average using one column and all of the data in that column.
df = df.set_index('STNAME')

def fun(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

df.groupby(fun)
groupby is used 99% of the time on one or more columns, but you can also provide a function to group by.
Pass the fun function to groupby to tell it how to split up the DataFrame; the function is applied to each index value.
If the first letter of the state name sorts before 'M' we return 0, before 'Q' we return 1, and otherwise we return 2.
df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
# common for groupby to split data, apply some function, and then combine the results. # split, apply, combine pattern.
# agg or aggregate applies function to columns of data in group and returns results. # builds a summary data frame # pass in dictionary w/ column and function you want to apply.
            CENSUS2010POP
STNAME
Alabama      71339.343284
Alaska       24490.724138
Arizona     426134.466667
Arkansas     38878.906667
California  642309.586207
Colorado     78581.187500
# there are two different groupbys: the Series groupby and the DataFrame groupby
print(type(df.groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]))
DataFrame groupby
# there are two different groupbys: the Series groupby and the DataFrame groupby
print(type(df.groupby(level=0)['POPESTIMATE2010']))  # Series groupby
Series Groupby
(df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
    .agg({'avg': np.average, 'sum': np.sum}))
# calls set_index and tells pandas to group by the index using the level parameter.
# level=0 tells it to group by the first (outermost) level of the index.
# since there's only one column of data, it applies both functions to that column.
# notice this is a Series groupby; the dictionary keys become the result columns.
              sum            avg
STNAME
Alabama   4779736   71339.343284
Alaska     710231   24490.724138
Arizona   6392017  426134.466667
Arkansas  2915918   38878.906667
(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
    .agg({'avg': np.average, 'sum': np.sum}))
# calls set_index and tells pandas to group by the index using the level parameter.
# does the same thing as above, but on two different columns.
# notice this is a DataFrame groupby, producing hierarchical result columns.
                    sum                              avg
        POPESTIMATE2010 POPESTIMATE2011  POPESTIMATE2010 POPESTIMATE2011
STNAME
Alabama         4785161         4801108     71420.313433    71658.328358
Alaska           714021          722720     24621.413793    24921.379310
(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
    .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
# when the dictionary keys match column names, pandas maps the functions directly to those columns
# instead of creating hierarchical columns, and uses the dictionary keys as the column names.
         POPESTIMATE2011  POPESTIMATE2010
STNAME
Alabama          4801108     71420.313433
Alaska            722720     24621.413793
Four Scales
1. Ratio scale: units are equally spaced and the mathematical operations +, -, *, / are all valid. E.g. height and weight measurements.
2. Interval scale: units are equally spaced, but there is no true zero. Temperature is an example: 0 degrees doesn't represent a real zero.
3. Ordinal scale: the order of the units is important, but they are not evenly spaced. Letter grades such as A+ and A are a good example: when converted to point values, the gap between adjacent grades is not uniform.
4. Nominal scale: categories of data, but the categories have no order w/ respect to one another. E.g. teams of a sport. Categories w/ only two possible values are referred to as binary.
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df
Starts w/ nominal data, which is called categorical data in pandas.
          Grades
excellent     A+
excellent      A
excellent     A-
good          B+
good           B
good          B-
ok            C+
df['Grades'].astype('category').head()
# instructing pandas to refer to data as categorical data. # dtype has been set to category.
grades = df['Grades'].astype('category',
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True)
grades.head()
# you can change this to ordinal data w/ the ordered=True flag.
# Ordinal data helps for boolean masking.
# if we compared the values lexicographically, C+ and C- would both be greater than C.
# declaring the category order makes comparisons follow the actual grade ranking instead.
grades = df['Grades'].astype('category',
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True)
grades > 'C'
excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
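In newer pandas versions the categories/ordered arguments moved off astype onto a dtype object; a hedged sketch of the modern equivalent:
from pandas.api.types import CategoricalDtype

grade_dtype = CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                               ordered=True)
grades = df['Grades'].astype(grade_dtype)
grades[grades > 'C']  # the ordered dtype lets the boolean mask respect the grade ranking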
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average})
pd.cut(df['avg'], 10)
# cut takes an array-like structure and an int representing the number of bins.
# separates the states into categories by average census population.
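pd.cut can also name the bins via the labels parameter; a small sketch (these three labels are my own illustration):
pd.cut(df['avg'], 3, labels=['small', 'medium', 'large'])  # hypothetical named size bins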
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
# a pivot table is itself a DataFrame.
# it lets us pivot one column out as the column headers and compare its values against the other columns.
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean, np.min], margins=True)
# you can pass aggfunc a list of functions you want to apply. # you can pass any function you want to aggfunc including those you write yourself.
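For instance, a hedged sketch passing a user-written aggregator (kw_range is my own illustration, using the same (kW)/YEAR/Make columns these cards assume):
def kw_range(values):
    # spread between the largest and smallest power rating in each group
    return np.max(values) - np.min(values)

df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=kw_range, margins=True)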
pd.Timestamp('9/1/2016 10:05AM')
A Timestamp is interchangeable w/ a Python datetime in most cases.
pd.Period('1/2016')
Creates a time period, here the whole month of January 2016.
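Arithmetic on a Period moves in units of its frequency; a quick sketch:
pd.Period('1/2016') + 5   # Period('2016-06', 'M'): monthly frequency, five months later
pd.Period('3/5/2016') - 2 # Period('2016-03-03', 'D'): daily frequency, two days earlier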
t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])
t1
Here each value has a Timestamp index; the Series index is a DatetimeIndex.
2016-09-01 a
2016-09-02 b
2016-09-03 c
t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2
Has a PeriodIndex.
2016-09 d
2016-10 e
2016-11 f
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4, 2)), index=d1, columns=list('ab'))
ts3
# uses assorted date strings as the index and fills the rows w/ random values; we convert the index to Datetime next.
               a   b
2 June 2013   99  69
Aug 29, 2014  73  76
2015-06-26    82  71
7/12/16       83  30
ts3.index = pd.to_datetime(ts3.index)
ts3
Converts the indices into Datetime format.
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')
Tells us exactly how much time passed between two dates; the result is a Timedelta of 2 days.
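Timedeltas can also be added onto Timestamps; a quick sketch:
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')  # Timestamp('2016-09-14 11:10:00')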
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
dates
# we want to look at 9 measurements taken bi-weekly, every other Sunday.
# we use date_range to create the indices.
DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df
Creates a DataFrame using the dates as the index and random data as the values.
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.index.weekday_name
Creates a list of the day of the week for each date in the index.
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.diff()
Finds the difference between each date's values and the previous date's.
            Count 1  Count 2
2016-10-02      NaN      NaN
2016-10-16      1.0      6.0
2016-10-30      4.0      3.0
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.resample('M').mean()
Finds the mean count for each month in the DataFrame ('M' resamples to month-end).
            Count 1  Count 2
2016-10-31    103.0    120.0
2016-11-30    105.5    118.0
2016-12-31    107.0    123.0
2017-01-31    106.0    121.0
df['2017']
Use partial string indexing to find values from a particular year.
            Count 1  Count 2
2017-01-08      108      116
2017-01-22      104      126
df['2016-12']
Use partial string indexing to find values from a particular year and month.
            Count 1  Count 2
2016-12-11      104      119
2016-12-25      110      127
df['2016-12':]
# slice on a range of dates starting December 2016.
# here we only want the dates from Dec. 2016 onward.
df.asfreq('W', method='ffill')
Changes the frequency of the dates from bi-weekly to weekly and uses forward fill to fill in the values for the new dates.
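Without a fill method the new weekly rows would hold NaN; a quick sketch of the contrast:
df.asfreq('W')                  # new in-between Sundays appear w/ NaN values
df.asfreq('W', method='ffill')  # new rows repeat the previous known value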
import matplotlib.pyplot as plt
%matplotlib inline
df.plot()
Allows you to visualize the time series in the notebook.