11 Pandas Udemy Flashcards
Pandas is an open source library built on top of NumPy
It allows for fast analysis, data cleaning, and data preparation
It excels in performance and productivity
It also has built-in visualization features
It can work with data from a wide variety of sources
Pandas is the Excel of Python
or
The Python version of R's DataFrames
# Creating a Series
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
pd.Series(data=my_list,index=labels)
==>
### pd.Series(dic) with dic = {'a':10,'b':20,'c':30} gives the same result (the dict supplies both data and index)
a 10
b 20
c 30
dtype: int64
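A minimal sketch of the two construction routes on this card, reusing the card's own labels, my_list, and dic values. The names from_list and from_dict are only illustrative.

```python
import pandas as pd

labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
dic = {'a': 10, 'b': 20, 'c': 30}

# Explicit data and index
from_list = pd.Series(data=my_list, index=labels)

# A dict supplies both: keys become the index, values become the data
from_dict = pd.Series(dic)

# Both routes produce the same Series
print(from_list.equals(from_dict))  # True
```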
Data in a Series
A Series can hold any Python object, even functions (although you are unlikely to need this)
pd.Series([sum,print,len])
==>
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser1 + ser2
==>
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
Note that the country name is the index and the number is the data
NaN - Not a Number
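The addition above matches values by index label rather than by position. The sketch below reuses ser1 and ser2 from the card to show where the NaN entries come from; the names total and missing are illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

total = ser1 + ser2  # values are matched by index label, not by position

# Labels found in only one Series have no partner, so their result is NaN
missing = sorted(total[total.isna()].index)
print(missing)       # ['Italy', 'USSR']
print(total.dtype)   # float64, because NaN is a float
```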
from numpy.random import randn
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df.round(2)
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['W'].round(2)
==>
A    2.71
B    0.65
C   -2.02
D    0.19
E    0.19
Name: W, dtype: float64
# SQL-style attribute syntax (NOT RECOMMENDED!)
df.W # same as df['W'], but easily confused with DataFrame methods
type(df['W'])
==>
pandas.core.series.Series
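One way to see why attribute access is discouraged: a column whose name collides with a DataFrame method is shadowed by that method. The frame below is hypothetical; its 'count' column exists only to trigger the collision.

```python
import pandas as pd

# Hypothetical frame: one column shares its name with a DataFrame method
df = pd.DataFrame({'W': [2.71, 0.65], 'count': [1, 2]}, index=['A', 'B'])

print(type(df['W']))          # bracket access returns the column as a Series
print(df.W.equals(df['W']))   # True: attribute access works for a name like 'W'

# 'count' is also a DataFrame method, so the attribute resolves to the method
print(callable(df.count))     # True
print(type(df['count']))      # the brackets still return the column
```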
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['new'] = df['Z'] + 1
==>
      W     X     Y     Z   new
A  2.71  0.63  0.91  0.50  1.50
B  0.65 -0.32 -0.85  0.61  1.61
C -2.02  0.74  0.53 -0.59  0.41
D  0.19 -0.76 -0.93  0.96  1.96
E  0.19  1.98  2.61  0.68  1.68
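A small sketch of the column assignment above, using only the Z values shown on the card so the result can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({'Z': [0.50, 0.61, -0.59, 0.96, 0.68]}, index='A B C D E'.split())

# Assigning to a name that does not exist yet appends a new column;
# the arithmetic on the right-hand side is applied element by element
df['new'] = df['Z'] + 1

print(df.columns.tolist())  # ['Z', 'new']
print(df.round(2))
```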
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new',axis=1)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
# Not in place unless specified!
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new',axis=1,inplace=True)
==>
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
With inplace=True, the change is also applied to the original DataFrame
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.drop('E',axis=0)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
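The drop cards can be summarized in one sketch: axis=1 removes a column, axis=0 removes a row, and nothing changes in df unless inplace=True is passed. The two-row frame is a cut-down copy of the card's data; without_new, without_b, and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [0.30, -0.13], 'X': [1.69, 0.39], 'new': [-0.16, 1.18]},
    index=['A', 'B'],
)

without_new = df.drop('new', axis=1)   # axis=1 drops a column and returns a copy
print('new' in df.columns)             # True: df itself is unchanged

without_b = df.drop('B', axis=0)       # axis=0 drops a row by its label
print(without_b.index.tolist())        # ['A']

result = df.drop('new', axis=1, inplace=True)  # modifies df and returns None
print(result, df.columns.tolist())     # None ['W', 'X']
```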
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
**Selecting Rows**
df.loc['A'] # by label
df.iloc[0] # by integer position
==>
W    0.30
X    1.69
Y   -1.71
Z   -1.16
Name: A, dtype: float64
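A short sketch contrasting the two row selectors, using the first two rows of the card's data. by_label and by_position are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [0.30, -0.13], 'X': [1.69, 0.39], 'Y': [-1.71, 0.17], 'Z': [-1.16, 0.18]},
    index=['A', 'B'],
)

by_label = df.loc['A']      # row whose index label is 'A'
by_position = df.iloc[0]    # row at integer position 0

print(by_label.equals(by_position))  # True: both select the same row
print(by_label.name)                 # A
print(type(by_label))                # a single row also comes back as a Series
```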
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.loc[['A','B'],['W','Y']]
==>
      W     Y
A  0.30 -1.71
B -0.13  0.17
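The same subset can be written with labels or with integer positions. This sketch uses the first three rows of the card's data; the iloc form and the single-cell lookup are additions for comparison and are not on the original card.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [0.30, -0.13, 0.81], 'X': [1.69, 0.39, 0.07],
     'Y': [-1.71, 0.17, 0.64], 'Z': [-1.16, 0.18, 0.33]},
    index=['A', 'B', 'C'],
)

sub = df.loc[['A', 'B'], ['W', 'Y']]   # list of row labels, list of column labels
same = df.iloc[[0, 1], [0, 2]]         # the same cells addressed by position

print(sub.equals(same))    # True
print(sub.shape)           # (2, 2)
print(df.loc['B', 'Y'])    # 0.17: two scalars select a single cell
```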
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df>0
==>
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True
df[df>0]
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65   NaN   NaN  0.61
C   NaN  0.74  0.53   NaN
D  0.19   NaN   NaN  0.96
E  0.19  1.98  2.61  0.68
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df[df['W']>0][['Y','X']]
==>
      Y     X
A  0.91  0.63
B -0.85 -0.32
D -0.93 -0.76
E  2.61  1.98
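The two kinds of mask on these cards behave differently: a boolean Series filters whole rows, while a boolean DataFrame keeps the shape and blanks out individual cells. A sketch with three rows of the card's W and Y columns; row_mask, cell_mask, and masked are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [2.71, 0.65, -2.02], 'Y': [0.91, -0.85, 0.53]},
    index=['A', 'B', 'C'],
)

row_mask = df['W'] > 0               # boolean Series, one entry per row
print(df[row_mask].index.tolist())   # ['A', 'B']: rows where the mask is False are dropped

cell_mask = df > 0                   # boolean DataFrame, one entry per cell
masked = df[cell_mask]               # same shape as df; failing cells become NaN
print(masked.shape)                  # (3, 2)
print(int(masked.isna().sum().sum()))  # 2
```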
Python's "and" does not work on boolean Series; use "&" instead, with each condition in parentheses
df[(df['W']>0) & (df['Y'] > 1)]
==>
      W     X     Y     Z
E  0.19  1.98  2.61  0.68
Python's "or" does not work either; use "|" instead
df[(df['W']>0) | (df['Y'] > 1)]
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
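A sketch of both operators on the card's W and Y columns, plus what happens if the plain Python keyword is used instead. The names both and either are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'W': [2.71, 0.65, -2.02, 0.19, 0.19], 'Y': [0.91, -0.85, 0.53, -0.93, 2.61]},
    index='A B C D E'.split(),
)

both = df[(df['W'] > 0) & (df['Y'] > 1)]     # element-wise AND
either = df[(df['W'] > 0) | (df['Y'] > 1)]   # element-wise OR
print(both.index.tolist())    # ['E']
print(either.index.tolist())  # ['A', 'B', 'D', 'E']

# The keyword form asks for a single truth value for a whole Series,
# which pandas refuses to provide
try:
    df[(df['W'] > 0) and (df['Y'] > 1)]
except ValueError as err:
    print('ValueError:', err)
```

The parentheses matter because & and | bind more tightly than the comparison operators.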
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df.reset_index()
==>
  index     W     X     Y     Z
0     A  2.71  0.63  0.91  0.50
1     B  0.65 -0.32 -0.85  0.61
2     C -2.02  0.74  0.53 -0.59
3     D  0.19 -0.76 -0.93  0.96
4     E  0.19  1.98  2.61  0.68
(pass inplace=True if you want the original to change)
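A sketch of what reset_index does to the labels, on a one-column cut of the card's data. The name flat is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'W': [2.71, 0.65, -2.02]}, index=['A', 'B', 'C'])

flat = df.reset_index()        # returns a new frame; df keeps its labels
print(flat.columns.tolist())   # ['index', 'W']: the old labels become a column
print(flat.index.tolist())     # [0, 1, 2]
print(df.index.tolist())       # ['A', 'B', 'C']
```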
newind = 'CA NY WY OR CO'.split()
newind
==>
['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = newind
df
==>
      W     X     Y     Z States
A  2.71  0.63  0.91  0.50     CA
B  0.65 -0.32 -0.85  0.61     NY
C -2.02  0.74  0.53 -0.59     WY
D  0.19 -0.76 -0.93  0.96     OR
E  0.19  1.98  2.61  0.68     CO
df.set_index('States')
==>
           W     X     Y     Z
States
CA      2.71  0.63  0.91  0.50
NY      0.65 -0.32 -0.85  0.61
WY     -2.02  0.74  0.53 -0.59
OR      0.19 -0.76 -0.93  0.96
CO      0.19  1.98  2.61  0.68
(pass inplace=True if you want the change to persist in df)
The States line is blank because States is the name of the index, not a row of data
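A sketch of set_index on a three-row cut of the card's data. Unlike reset_index, it replaces the existing labels instead of keeping them as a column. The name by_state is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'W': [2.71, 0.65, -2.02]}, index=['A', 'B', 'C'])
df['States'] = 'CA NY WY'.split()

by_state = df.set_index('States')   # returns a new frame; df is unchanged
print(by_state.index.name)          # States
print(by_state.columns.tolist())    # ['W']: 'States' is no longer a column
print(by_state.loc['NY', 'W'])      # 0.65
print(df.index.tolist())            # ['A', 'B', 'C']
```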