11 Pandas Udemy Flashcards
Pandas is an open source library built on top of NumPy
Allows for fast analysis, data cleaning, and preparation
It excels in performance and productivity
It also has built-in visualization features
It can work with data from a wide variety of sources
Pandas is the Excel of Python
or
the Python version of R's DataFrames
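The snippets below assume the usual imports (a minimal setup sketch; randn is imported separately because some cards call it without the np. prefix):
import numpy as np
import pandas as pd
from numpy.random import randn   # used to build the random example DataFrames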
# Creating a Series
labels = ['a','b','c']
my_list = [10,20,30]
pd.Series(data=my_list,index=labels)
==>
### pd.Series(dic) with dic = {'a':10,'b':20,'c':30} gives the same result (data and index passed at once)
a 10
b 20
c 30
dtype: int64
Data in a Series
Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])
==>
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
ser1 = pd.Series([1,2,3,4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4], index=['USA', 'Germany', 'Italy', 'Japan'])
ser1 + ser2
==>
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
Note that the country names are the index and the numbers are the data.
NaN - Not a Number
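If you would rather treat a missing label as 0 instead of getting NaN, Series.add accepts a fill_value (a sketch using the same ser1/ser2):
ser1.add(ser2, fill_value=0)
==>
Germany    4.0
Italy      5.0
Japan      8.0
USA        2.0
USSR       3.0
dtype: float64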
df = pd.DataFrame(randn(5,4), index='A B C D E'.split(), columns='W X Y Z'.split())
df.round(2)
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['W'].round(2)
==>
A    2.71
B    0.65
C   -2.02
D    0.19
E    0.19
Name: W, dtype: float64
# SQL Syntax (NOT RECOMMENDED!)
df.W   # same as the previous line, but can be confused with DataFrame methods
type(df['W'])
==>
pandas.core.series.Series
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['new'] = df['Z'] + 1
df
==>
      W     X     Y     Z   new
A  2.71  0.63  0.91  0.50  1.50
B  0.65 -0.32 -0.85  0.61  1.61
C -2.02  0.74  0.53 -0.59  0.41
D  0.19 -0.76 -0.93  0.96  1.96
E  0.19  1.98  2.61  0.68  1.68
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new', axis=1)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
# Not inplace unless specified!
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new', axis=1, inplace=True)
==>
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
With inplace=True the change is applied to the original DataFrame as well.
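Equivalently, instead of inplace=True you can reassign the result (a common idiom; just a sketch):
df = df.drop('new', axis=1)   # same end result as inplace=True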
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.drop('E', axis=0)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
**Selecting Rows**
df.loc['A']    # by label
df.iloc[0]     # by integer position
==>
W    0.30
X    1.69
Y   -1.71
Z   -1.16
Name: A, dtype: float64
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.loc[['A','B'],['W','Y']]
==>
W Y
A 0.30 -1.71
B -0.13 0.17
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df > 0
==>
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True
df[df>0]
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65   NaN   NaN  0.61
C   NaN  0.74  0.53   NaN
D  0.19   NaN   NaN  0.96
E  0.19  1.98  2.61  0.68
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df[df['W']>0][['Y','X']]
==>
      Y     X
A  0.91  0.63
B -0.85 -0.32
D -0.93 -0.76
E  2.61  1.98
There is no "and" for these comparisons; use "&" instead.
df[(df['W']>0) & (df['Y'] > 1)]
==>
W X Y Z new
E 0.19 1.98 2.61 0.68 1.68
There is no "or" either; use "|" instead.
df[(df['W']>0) | (df['Y'] > 1)]
==>
      W     X     Y     Z   new
A  2.71  0.63  0.91  0.50  1.50
B  0.65 -0.32 -0.85  0.61  1.61
D  0.19 -0.76 -0.93  0.96  1.96
E  0.19  1.98  2.61  0.68  1.68
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df.reset_index()
==>
  index     W     X     Y     Z
0     A  2.71  0.63  0.91  0.50
1     B  0.65 -0.32 -0.85  0.61
2     C -2.02  0.74  0.53 -0.59
3     D  0.19 -0.76 -0.93  0.96
4     E  0.19  1.98  2.61  0.68
(inplace=True if you want to change the original)
newind = 'CA NY WY OR CO'.split()
newind
==>
['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = newind
df
==>
inplace=True if you want the change to persist
      W     X     Y     Z States
A  2.71  0.63  0.91  0.50     CA
B  0.65 -0.32 -0.85  0.61     NY
C -2.02  0.74  0.53 -0.59     WY
D  0.19 -0.76 -0.93  0.96     OR
E  0.19  1.98  2.61  0.68     CO
df.set_index('States')
==>
           W     X     Y     Z
States
CA      2.71  0.63  0.91  0.50
NY      0.65 -0.32 -0.85  0.61
WY     -2.02  0.74  0.53 -0.59
OR      0.19 -0.76 -0.93  0.96
CO      0.19  1.98  2.61  0.68
The 'States' line is just the index name; the rest of that row is blank.
Multi-Index and Index Hierarchy
Let's go over how to work with a MultiIndex. First we'll create a quick example of what a multi-indexed DataFrame looks like:
Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
==>
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)])
df = pd.DataFrame(np.random.randn(6,2), index=hier_index, columns=['A','B'])
df = df.round(2)
df
==>
          A     B
G1 1  -0.50 -0.75
   2  -0.94  0.48
   3  -0.12  1.90
G2 1   0.24  2.00
   2  -0.99  0.20
   3  -1.14  0.00
Now let's show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
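For instance, selecting just the outer level gives back the inner sub-DataFrame (a sketch; values taken from the rounded df above):
df.loc['G1']
==>
      A     B
1 -0.50 -0.75
2 -0.94  0.48
3 -0.12  1.90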
df.loc['G1'].loc[1]
==>
A   -0.50
B   -0.75
Name: 1, dtype: float64
df.index.names ==>
FrozenList([None, None])
df.index.names = ['Group','Num']
df
==>
              A     B
Group Num
G1    1   -0.50 -0.75
      2   -0.94  0.48
      3   -0.12  1.90
G2    1    0.24  2.00
      2   -0.99  0.20
      3   -1.14  0.00
df.loc['G2'].loc[2].loc['B']
==>
0.2
# xs = Cross Section
df.xs(1, level='Num')
==>
           A     B
Group
G1     -0.50 -0.75
G2      0.24  2.00
df = pd.DataFrame({'A':[1,2,np.nan],
                   'B':[5,np.nan,np.nan],
                   'C':[1,2,3]})
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
Drops every row containing a NaN:
df.dropna() ==>
A B C
0 1.0 5.0 1
# drops every column containing a NaN
df.dropna(axis=1)
==>
   C
0  1
1  2
2  3
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df.dropna(thresh=2)
==>
A B C
0 1.0 5.0 1
1 2.0 NaN 2
With thresh=2, a row needs at least 2 non-NaN (non-null) values to be kept; rows with fewer are dropped.
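In other words, thresh is the minimum number of non-NaN values a row needs to survive; raising it drops more rows (a sketch on the same df):
df.dropna(thresh=3)   # keep only rows with at least 3 non-NaN values
==>
     A    B  C
0  1.0  5.0  1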
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df.fillna(value='FILL VALUE')
==>
            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3
Replaces each NaN with whatever is passed to value=.
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df['A'].fillna(value=df['A'].mean())
==>
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
The NaN was replaced with the mean of the rest of column A.
Create dataframe
data = {'Company':['GGL','GGL','IBM','IBM','FB','FB'],
        'Person':['Sam','Cass','Amy','Van','Carl','Sara'],
        'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
==>
  Company Person  Sales
0     GGL    Sam    200
1     GGL   Cass    120
2     IBM    Amy    340
3     IBM    Van    124
4      FB   Carl    243
5      FB   Sara    350
**Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group based on Company. This will create a DataFrameGroupBy object:**
df.groupby('Company')
==>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
# You can save this object as a new variable:
by_comp = df.groupby("Company")
# And then call aggregate methods off the object:
by_comp.mean()   # recent pandas versions may require by_comp.mean(numeric_only=True)
==>
         Sales
Company
FB       296.5
GGL      160.0
IBM      232.0
One-liner:
df.groupby('Company').mean()
==>
(combines the two previous steps)
Other aggregate methods: sum(), std(), min(), max(), count(), describe()
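Any of these can be called on the groupby object, e.g. (a sketch; sums worked out from the table above):
by_comp['Sales'].sum()
==>
Company
FB     593
GGL    320
IBM    464
Name: Sales, dtype: int64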
by_comp.count().loc['FB']
==>
Person 2
Sales 2
Name: FB, dtype: int64
by_comp.describe().transpose()['GGL']
==>
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GGL, dtype: float64
# shows the stats vertically instead of horizontally
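The concatenation examples that follow use small frames like df1 and df2 below (a sketch of how they could be built to match the tables shown):
df1 = pd.DataFrame({'A':['A0','A1','A2','A3'],
                    'B':['B0','B1','B2','B3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']},
                   index=[0,1,2,3])

df2 = pd.DataFrame({'A':['A4','A5','A6','A7'],
                    'B':['B4','B5','B6','B7'],
                    'C':['C4','C5','C6','C7'],
                    'D':['D4','D5','D6','D7']},
                   index=[4,5,6,7])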
df1
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

df2
==>
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

pd.concat([df1,df2])
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
# default is axis=0 (rows)
df1
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

df2
==>
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

pd.concat([df1,df2], axis=1)
==>
     A    B    C    D    A    B    C    D
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
3   A3   B3   C3   D3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   A4   B4   C4   D4
5  NaN  NaN  NaN  NaN   A5   B5   C5   D5
6  NaN  NaN  NaN  NaN   A6   B6   C6   D6
7  NaN  NaN  NaN  NaN   A7   B7   C7   D7
axis=1 is the columns axis.
left
==>
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3

right
==>
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3
pd.merge(left, right, how='inner', on='key')
==>
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
# Merging
The merge function allows you to merge DataFrames together using similar logic to merging SQL tables. For example:
Merging matches on columns, not on the index.
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, on=['key1', 'key2'])
==>
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
Note that it matched L0 with R0, L2 with R1, and L2 with R2 (row 2 of right).
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, how='outer', on=['key1', 'key2'])
==>
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
==>
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
The row only needs to exist in the left DataFrame; with how='right' it would be the other way around.
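For completeness, a right merge on the same left/right (a sketch; the output values are reasoned from the tables above, and column order may differ slightly):
pd.merge(left, right, how='right', on=['key1', 'key2'])
==>
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3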
left
==>
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

right
==>
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3

left.join(right)
==>
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
# Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
# The difference between join and merge: join matches on the index, merge matches on columns.
left
==>
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

right
==>
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3

left.join(right, how='outer')
==>
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
df = pd.DataFrame({'col1':[1,2,3,4],
                   'col2':[444,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
df['col2'].unique()
==>
array([444, 555, 666], dtype=int64)
df['col2'].nunique()   # same as len(df['col2'].unique())
==>
3
df['col2'].value_counts()
==>
444    2
555    1
666    1
Name: col2, dtype: int64
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
==>
col1 col2 col3
3 4 444 xyz
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
def times2(x):
    return x*2

df['col1'].apply(times2)
==>
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
df['col1'].sum()
==>
10
df['col2'].apply(lambda x: x*2)
==>
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
** Permanently Removing a Column**
del df['col1']
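An equivalent way, as a sketch, is drop along the columns axis:
df.drop('col1', axis=1, inplace=True)   # same effect as del df['col1']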
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
df.sort_values(by='col2')   # same as df.sort_values('col2'); inplace=False by default
==>
   col1  col2 col3
0     1   444  abc
3     4   444  xyz
1     2   555  def
2     3   666  ghi
PIVOT TABLES
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
==>
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1

df.pivot_table(values='D', index=['A','B'], columns=['C'])
==>
C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN
Pandas can work with a variety of file types: CSV, Excel, HTML, SQL.
df = pd.read_csv('example')
df
==>
Pandas can read from many other file types as well.
df.to_csv('example', index=False)
index=False keeps the DataFrame index from being written out as an extra column.
Excel
Pandas can read and write Excel files. Keep in mind this only imports data, not formulas or images; having images or macros may cause the read_excel method to crash.
It does not import images, charts, or formulas.
Pandas only imports data from Excel.
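A minimal read/write sketch (file, sheet, and URL names here are just placeholders; read_excel also needs an engine such as openpyxl installed):
df = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')    # hypothetical file name
df.to_excel('Excel_Sample_out.xlsx', sheet_name='Sheet1')

# HTML: read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://example.com/some_table.html')    # hypothetical URL
tables[0].head()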