Pandas DataFrame Flashcards
import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()
The heart of the pandas library is the DataFrame. It works on two-dimensional data, like an Excel table.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod
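A DataFrame can equivalently be built from a dictionary of columns; a minimal sketch (this dict-of-lists construction is my own illustration, not from the course):
data = {'Name': ['Chris', 'Kevyn', 'Vinod'],
        'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
        'Cost': [22.50, 2.50, 5.00]}
pd.DataFrame(data, index=['Store 1', 'Store 1', 'Store 2'])  # same table as above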
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 2']
Gives all of the information stored under the index label 'Store 2'.
Cost                    5
Item Purchased  Bird Seed
Name                Vinod
Name: Store 2, dtype: object
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
type(df.loc['Store 2'])
The type is a Series when we use .loc with a label that matches a single row.
pandas.core.series.Series
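When the label matches more than one row, however, .loc returns a DataFrame instead; a quick check on the same df:
type(df.loc['Store 1'])  # pandas.core.frame.DataFrame, since 'Store 1' labels two rows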
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 1']
Returns the rows indexed by 'Store 1', i.e. the first two purchase Series.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc['Store 1', 'Cost']
Use this if we only want the 'Cost' column for 'Store 1'.
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.T
Transposes the DataFrame, swapping all of the rows and columns.
                 Store 1       Store 1    Store 2
Cost                22.5           2.5          5
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Name               Chris         Kevyn      Vinod
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.T.loc['Cost']
.iloc and .loc select rows, so to pull out a column by label this way we first transpose, turning the 'Cost' column into a row.
Store 1 22.5
Store 1 2.5
Store 2 5
Name: Cost, dtype: object
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df['Cost']
Columns always have a label in pandas DataFrames, so a column can be selected directly with bracket notation.
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.loc[:, ['Name', 'Cost']]
Gives the 'Name' and 'Cost' values for all stores. The column selector can be a list of labels or a single string.
.loc also supports slicing: ':' selects all rows; we could use 'Store 1' instead to restrict the rows.
          Name  Cost
Store 1  Chris  22.5
Store 1  Kevyn   2.5
Store 2  Vinod   5.0
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.drop('Store 1')
Returns a copy of the DataFrame with 'Store 1' dropped; the original DataFrame is still intact.
drop has two optional parameters. If the inplace parameter is True, it changes the original DataFrame.
The second parameter is the axis to drop along: 0 (the default) drops rows, 1 switches to columns.
copy_df.drop(labels, axis=0, level=None, inplace=False, errors='raise')
         Cost Item Purchased   Name
Store 2   5.0      Bird Seed  Vinod
df.drop('Store 1')
print(df)
The full DataFrame is still intact.
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod
del copy_df['Name']
An easier way of dropping a column; unlike drop, del works in place immediately and returns nothing.
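A minimal sketch contrasting the two approaches (copy_df as a copy of df is assumed here, as in the course):
copy_df = df.copy()
copy_df = copy_df.drop('Name', axis=1)  # returns a new frame without the column
del copy_df['Item Purchased']           # removes the column from copy_df in place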
df['Location'] = None
df
Creates a new column and broadcasts the default value (None) to every row of the new column.
df['Location'] = None  # creates a new column and broadcasts the default value to the new column
df
         Cost Item Purchased   Name Location
Store 1  22.5       Dog Food  Chris     None
Store 1   2.5   Kitty Litter  Kevyn     None
Store 2   5.0      Bird Seed  Vinod     None
costs = df['Cost']
costs
Gives the same result as df['Cost']; we've just stored the column in a variable.
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
costs = df['Cost']
costs + 2
Before:
Store 1 22.5
Store 1 2.5
Store 2 5.0
After:
Store 1    24.5
Store 1     4.5
Store 2     7.0
Name: Cost, dtype: float64
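Adding a scalar broadcasts it across every value in the Series. To record the increase in the DataFrame itself, assign the result back; a quick sketch:
df['Cost'] = df['Cost'] + 2  # broadcasts and writes the new values into the frame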
df = pd.read_csv('olympics.csv')
Reads the CSV file as-is and stores it in a DataFrame.
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
Reads the CSV file in. index_col=0 uses the first column as the index, and skiprows=1 skips the top row of labels so the second row supplies the column names.
df.columns
pandas stores all of the column labels in this attribute.
Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1', '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2', 'Combined total'], dtype='object')
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)
Renames the columns so they are clearer. Each test checks whether the first characters of a column label match a given prefix (e.g. '01' for gold-medal counts).
inplace=True tells pandas to update df directly.
df['Feedback'] = ['Positive', None, 'Negative']
df
Creates a column w/ the list values assigned row by row. Need to make sure the list has exactly one value per row.
adf = df.reset_index()
adf['Date'] = pd.Series({0: 'December 1', 2: 'mid-May'})
adf
Assigning a Series keyed by index values is better, since we don't have to type None for the missing rows: any row whose index is not referenced in the dictionary gets NaN in the new column.
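The merge cards below rely on a staff_df and a student_df; a minimal reconstruction from the outputs shown (assumed, not copied from the course):
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liaison'},
                         {'Name': 'James', 'Role': 'Grader'}]).set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}]).set_index('Name')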
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)
# we want to do an outer join (the union of the keys).
# to merge on a column on one side and use the index on the other, combine left_index=True with right_on="Column Name".
                 Role       School
Name
James          Grader     Business
Kelly  Director of HR          NaN
Mike              NaN          Law
Sally  Course liaison  Engineering
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
we want to do an inner join (the intersection of the keys).
                 Role       School
Name
James          Grader     Business
Sally  Course liaison  Engineering
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)
we want all staff, but we also want their student info if they're a student (left join).
                 Role       School
Name
Kelly  Director of HR          NaN
Sally  Course liaison  Engineering
James          Grader     Business
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)
we want all students, but we also want their staff info if they're staff (right join).
                 Role       School
Name
James          Grader     Business
Mike              NaN          Law
Sally  Course liaison  Engineering
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
You can use columns instead of indices to join on.
    Name            Role       School
0  Kelly  Director of HR          NaN
1  Sally  Course liaison  Engineering
2  James          Grader     Business
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')
Demonstrates what happens when the two frames have conflicting columns: pandas appends _x to the left frame's column and _y to the right frame's.
pd.merge(staff_df, student_df, how='inner', left_on=['First Name', 'Last Name'], right_on=['First Name', 'Last Name'])
# many staff and students might have matching first names but not matching last names.
# the inner join only returns rows in which both keys match, so only Sally Brooks is returned.
(df.where(df['SUMLEV'] == 50)
   .dropna()
   .set_index(['STNAME', 'CTYNAME'])
   .rename(columns={'ESTIMATESBASE2010': 'Estimates Base 2010'}))
# demonstrates method chaining.
# if you begin w/ an open parenthesis you can continue the expression over multiple lines.
import numpy as np

def min_max(row):
    data = row[['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})
df.apply(min_max, axis=1)
# compare the map function: it takes a function and applies it to all items in a list.
# apply takes a function and the axis on which we want the function to operate.
# axis=1 passes each row to the function.
                             max       min
STNAME  CTYNAME
Alabama Autauga County   55347.0   54660.0
        Baldwin County  203709.0  183193.0
        Barbour County   27341.0   26489.0
# rows is assumed to be the list of population-estimate columns (defined here for completeness)
rows = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
        'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df.apply(lambda x: np.max(x[rows]), axis=1)
# apply functions are used often w/ lambdas.
# you can chain several apply calls w/ lambdas together.
# calculates the max over those columns for each row.
for state in df['STNAME'].unique():
    avg = np.average(df.where(df['STNAME'] == state).dropna()['CENSUS2010POP'])
    print('Counties in state ' + state + ' have an average population of ' + str(avg))
# here we use the census data and get a list of unique states.
# for each state we reduce the DataFrame and calculate the average.
# this takes a long time to run.
for group, frame in df.groupby('STNAME'):
    avg = np.average(frame['CENSUS2010POP'])
    print('Counties in state ' + group + ' have an average population of ' + str(avg))
# this does the same thing as the for loop w/ the unique method, but it is much faster because it uses a groupby object.
# we're interested in grouping by state name, and we calculate the average using one column and all of the data in that column.
df = df.set_index('STNAME')

def fun(item):
    if item[0] < 'M':
        return 0
    if item[0] < 'Q':
        return 1
    return 2

df.groupby(fun)
groupby is used 99% of the time on one or more columns, but you can also provide a function to group by.
Pass the fun function to groupby to tell it how to split up the DataFrame; the function is applied to each index value.
If the first letter of the state name sorts before 'M' we return 0, before 'Q' we return 1, and otherwise we return 2.
df.groupby('STNAME').agg({'CENSUS2010POP': np.average})
# common for groupby to split data, apply some function, and then combine the results. # split, apply, combine pattern.
# agg or aggregate applies function to columns of data in group and returns results. # builds a summary data frame # pass in dictionary w/ column and function you want to apply.
            CENSUS2010POP
STNAME
Alabama      71339.343284
Alaska       24490.724138
Arizona     426134.466667
Arkansas     38878.906667
California  642309.586207
Colorado     78581.187500
# there are two different groupbys: the Series groupby and the DataFrame groupby
print(type(df.groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]))
DataFrame groupby
# there are two different groupbys: the Series groupby and the DataFrame groupby
print(type(df.groupby(level=0)['POPESTIMATE2010']))  # Series groupby
Series Groupby
(df.set_index('STNAME').groupby(level=0)['CENSUS2010POP']
    .agg({'avg': np.average, 'sum': np.sum}))
# calls set_index and tells pandas to group by the index using the level parameter.
# level=0 tells it to group by the first (outermost) level of the index.
# since there's only one column of data, it applies both functions to that column.
# notice this is a Series groupby; the dictionary keys become the result columns.
              sum            avg
STNAME
Alabama   4779736   71339.343284
Alaska     710231   24490.724138
Arizona   6392017  426134.466667
Arkansas  2915918   38878.906667
(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
    .agg({'avg': np.average, 'sum': np.sum}))
# calls set_index and tells pandas to group by the index using the level parameter.
# does the same thing as above, but on two different columns.
# notice this is a DataFrame groupby, producing hierarchical result columns.
                    sum                              avg
        POPESTIMATE2010 POPESTIMATE2011  POPESTIMATE2010 POPESTIMATE2011
STNAME
Alabama         4785161         4801108     71420.313433    71658.328358
Alaska           714021          722720     24621.413793    24921.379310
(df.set_index('STNAME').groupby(level=0)[['POPESTIMATE2010', 'POPESTIMATE2011']]
    .agg({'POPESTIMATE2010': np.average, 'POPESTIMATE2011': np.sum}))
# when the dictionary keys match column names, pandas maps the functions directly to those columns
# instead of creating hierarchical columns, and uses the dictionary keys as the column names.
         POPESTIMATE2011  POPESTIMATE2010
STNAME
Alabama          4801108     71420.313433
Alaska            722720     24621.413793
Four Scales
1. Ratio scale: units are equally spaced and the mathematical operations +, -, *, / are all valid. E.g. height and weight measurements.
2. Interval scale: units are equally spaced, but there is no true zero. Temperature is an example: 0 degrees doesn't represent a real zero.
3. Ordinal scale: the order of the units is important, but they are not evenly spaced. Letter grades such as A+ and A are a good example: when converted to point values, the gap between adjacent grades is not uniform.
4. Nominal scale: categories of data, but the categories have no order w/ respect to one another. E.g. teams of a sport. Categories w/ only two possible values are referred to as binary.
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'])
df.rename(columns={0: 'Grades'}, inplace=True)
df
Starts w/ nominal data, which is called categorical data in pandas.
          Grades
excellent     A+
excellent      A
excellent     A-
good          B+
good           B
good          B-
ok            C+
df['Grades'].astype('category').head()
# instructing pandas to refer to data as categorical data. # dtype has been set to category.
grades = df['Grades'].astype('category',
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True)
grades.head()
# you can change this to ordinal data w/ the ordered=True flag.
# Ordinal data helps for boolean masking.
# if we compared the values lexicographically, C+ and C- would both be greater than C.
# declaring the category order makes comparisons follow the actual grade ranking instead.
grades = df['Grades'].astype('category',
                             categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                             ordered=True)
grades > 'C'
excellent     True
excellent     True
excellent     True
good          True
good          True
good          True
ok            True
ok           False
ok           False
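In newer pandas versions the categories/ordered arguments moved off astype onto a dtype object; a hedged sketch of the modern equivalent:
from pandas.api.types import CategoricalDtype

grade_dtype = CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                               ordered=True)
grades = df['Grades'].astype(grade_dtype)
grades[grades > 'C']  # the ordered dtype lets the boolean mask respect the grade ranking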
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg({'avg': np.average})
pd.cut(df['avg'], 10)
# cut takes an array-like structure and an int representing the number of bins.
# separates the states into categories by average census population.
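pd.cut can also name the bins via the labels parameter; a small sketch (these three labels are my own illustration):
pd.cut(df['avg'], 3, labels=['small', 'medium', 'large'])  # hypothetical named size bins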
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=np.mean)
# a pivot table is itself a DataFrame.
# it lets us pivot one column out as the column headers and compare its values against the other columns.
df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=[np.mean, np.min], margins=True)
# you can pass aggfunc a list of functions you want to apply. # you can pass any function you want to aggfunc including those you write yourself.
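For instance, a hedged sketch passing a user-written aggregator (kw_range is my own illustration, using the same (kW)/YEAR/Make columns these cards assume):
def kw_range(values):
    # spread between the largest and smallest power rating in each group
    return np.max(values) - np.min(values)

df.pivot_table(values='(kW)', index='YEAR', columns='Make', aggfunc=kw_range, margins=True)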
pd.Timestamp('9/1/2016 10:05AM')
A Timestamp is interchangeable w/ a Python datetime in most cases.
pd.Period('1/2016')
Creates a time period, here the whole month of January 2016.
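Arithmetic on a Period moves in units of its frequency; a quick sketch:
pd.Period('1/2016') + 5   # Period('2016-06', 'M'): monthly frequency, five months later
pd.Period('3/5/2016') - 2 # Period('2016-03-03', 'D'): daily frequency, two days earlier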
t1 = pd.Series(list('abc'), [pd.Timestamp('2016-09-01'), pd.Timestamp('2016-09-02'), pd.Timestamp('2016-09-03')])
t1
Here each value has a Timestamp index; the Series index is a DatetimeIndex.
2016-09-01 a
2016-09-02 b
2016-09-03 c
t2 = pd.Series(list('def'), [pd.Period('2016-09'), pd.Period('2016-10'), pd.Period('2016-11')])
t2
Has a PeriodIndex.
2016-09 d
2016-10 e
2016-11 f
d1 = ['2 June 2013', 'Aug 29, 2014', '2015-06-26', '7/12/16']
ts3 = pd.DataFrame(np.random.randint(10, 100, (4, 2)), index=d1, columns=list('ab'))
ts3
# uses assorted date strings as the index and fills the rows w/ random values; we convert the index to Datetime next.
               a   b
2 June 2013   99  69
Aug 29, 2014  73  76
2015-06-26    82  71
7/12/16       83  30
ts3.index = pd.to_datetime(ts3.index)
ts3
Converts the indices into Datetime format.
pd.Timestamp('9/3/2016') - pd.Timestamp('9/1/2016')
Tells us exactly how much time passed between two dates; the result is a Timedelta of 2 days.
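Timedeltas can also be added onto Timestamps; a quick sketch:
pd.Timestamp('9/2/2016 8:10AM') + pd.Timedelta('12D 3H')  # Timestamp('2016-09-14 11:10:00')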
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
dates
# we want to look at 9 measurements taken bi-weekly, every other Sunday.
# we use date_range to create the indices.
DatetimeIndex(['2016-10-02', '2016-10-16', '2016-10-30', '2016-11-13',
               '2016-11-27', '2016-12-11', '2016-12-25', '2017-01-08',
               '2017-01-22'],
              dtype='datetime64[ns]', freq='2W-SUN')
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df
Creates a DataFrame using the dates as the index and random data as the values.
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.index.weekday_name
Creates a list of the day of the week for each date in the index.
dates = pd.date_range('10-01-2016', periods=9, freq='2W-SUN')
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.diff()
Finds the difference between each date's values and the previous date's.
            Count 1  Count 2
2016-10-02      NaN      NaN
2016-10-16      1.0      6.0
2016-10-30      4.0      3.0
df = pd.DataFrame({'Count 1': 100 + np.random.randint(-5, 10, 9).cumsum(),
                   'Count 2': 120 + np.random.randint(-5, 10, 9)}, index=dates)
df.resample('M').mean()
Finds the mean count for each month in the DataFrame ('M' resamples to month-end).
            Count 1  Count 2
2016-10-31    103.0    120.0
2016-11-30    105.5    118.0
2016-12-31    107.0    123.0
2017-01-31    106.0    121.0
df['2017']
Use partial string indexing to find values from a particular year.
            Count 1  Count 2
2017-01-08      108      116
2017-01-22      104      126
df['2016-12']
Use partial string indexing to find values from a particular year and month.
            Count 1  Count 2
2016-12-11      104      119
2016-12-25      110      127
df['2016-12':]
# slice on a range of dates starting December 2016.
# here we only want the dates from Dec. 2016 onward.
df.asfreq('W', method='ffill')
Changes the frequency of the dates from bi-weekly to weekly and uses forward fill to fill in the values for the new dates.
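Without a fill method the new weekly rows would hold NaN; a quick sketch of the contrast:
df.asfreq('W')                  # new in-between Sundays appear w/ NaN values
df.asfreq('W', method='ffill')  # new rows repeat the previous known value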
import matplotlib.pyplot as plt
%matplotlib inline
df.plot()
Allows you to visualize the time series in the notebook.