11 Pandas Udemy Flashcards
Pandas is an open source library built on top of NumPy
Allows for fast analysis, data cleaning, and preparation
It excels in performance and productivity
It also has built-in visualization features
It can work with data from a wide variety of sources
Pandas is the Excel of Python
or
the Python version of R's DataFrames
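The snippets below assume the usual imports (a minimal setup sketch; randn is imported separately because some cards call it without the np. prefix):
import numpy as np
import pandas as pd
from numpy.random import randn   # used to build the random example DataFrames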
# Creating a Series
labels = ['a','b','c']
my_list = [10,20,30]
pd.Series(data=my_list,index=labels)
==>
### pd.Series(dic) with dic = {'a':10,'b':20,'c':30} gives the same result (data and index passed at once)
a 10
b 20
c 30
dtype: int64
Data in a Series
Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])
==>
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
ser1 = pd.Series([1,2,3,4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1,2,5,4], index=['USA', 'Germany', 'Italy', 'Japan'])
ser1 + ser2
==>
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
Note that the country names are the index and the numbers are the data.
NaN - Not a Number
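If you would rather treat a missing label as 0 instead of getting NaN, Series.add accepts a fill_value (a sketch using the same ser1/ser2):
ser1.add(ser2, fill_value=0)
==>
Germany    4.0
Italy      5.0
Japan      8.0
USA        2.0
USSR       3.0
dtype: float64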
df = pd.DataFrame(randn(5,4), index='A B C D E'.split(), columns='W X Y Z'.split())
df.round(2)
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['W'].round(2)
==>
A    2.71
B    0.65
C   -2.02
D    0.19
E    0.19
Name: W, dtype: float64
# SQL Syntax (NOT RECOMMENDED!)
df.W   # same as the previous line, but can be confused with DataFrame methods
type(df['W'])
==>
pandas.core.series.Series
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df['new'] = df['Z'] + 1
df
==>
      W     X     Y     Z   new
A  2.71  0.63  0.91  0.50  1.50
B  0.65 -0.32 -0.85  0.61  1.61
C -2.02  0.74  0.53 -0.59  0.41
D  0.19 -0.76 -0.93  0.96  1.96
E  0.19  1.98  2.61  0.68  1.68
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new', axis=1)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
# Not inplace unless specified!
df
==>
      W     X     Y     Z   new
A  0.30  1.69 -1.71 -1.16 -0.16
B -0.13  0.39  0.17  0.18  1.18
C  0.81  0.07  0.64  0.33  1.33
D -0.50 -0.75 -0.94  0.48  1.48
E -0.12  1.90  0.24  2.00  3.00
df.drop('new', axis=1, inplace=True)
==>
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
With inplace=True the change is applied to the original DataFrame as well.
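Equivalently, instead of inplace=True you can reassign the result (a common idiom; just a sketch):
df = df.drop('new', axis=1)   # same end result as inplace=True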
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.drop('E', axis=0)
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
**Selecting Rows**
df.loc['A']    # by label
df.iloc[0]     # by integer position
==>
W    0.30
X    1.69
Y   -1.71
Z   -1.16
Name: A, dtype: float64
df
==>
      W     X     Y     Z
A  0.30  1.69 -1.71 -1.16
B -0.13  0.39  0.17  0.18
C  0.81  0.07  0.64  0.33
D -0.50 -0.75 -0.94  0.48
E -0.12  1.90  0.24  2.00
df.loc[['A','B'],['W','Y']]
==>
W Y
A 0.30 -1.71
B -0.13 0.17
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df > 0
==>
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True
df[df>0]
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65   NaN   NaN  0.61
C   NaN  0.74  0.53   NaN
D  0.19   NaN   NaN  0.96
E  0.19  1.98  2.61  0.68
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df[df['W']>0][['Y','X']]
==>
      Y     X
A  0.91  0.63
B -0.85 -0.32
D -0.93 -0.76
E  2.61  1.98
There is no "and" for these comparisons; use "&" instead.
df[(df['W']>0) & (df['Y'] > 1)]
==>
W X Y Z new
E 0.19 1.98 2.61 0.68 1.68
There is no "or" either; use "|" instead.
df[(df['W']>0) | (df['Y'] > 1)]
==>
      W     X     Y     Z   new
A  2.71  0.63  0.91  0.50  1.50
B  0.65 -0.32 -0.85  0.61  1.61
D  0.19 -0.76 -0.93  0.96  1.96
E  0.19  1.98  2.61  0.68  1.68
df
==>
      W     X     Y     Z
A  2.71  0.63  0.91  0.50
B  0.65 -0.32 -0.85  0.61
C -2.02  0.74  0.53 -0.59
D  0.19 -0.76 -0.93  0.96
E  0.19  1.98  2.61  0.68
df.reset_index()
==>
  index     W     X     Y     Z
0     A  2.71  0.63  0.91  0.50
1     B  0.65 -0.32 -0.85  0.61
2     C -2.02  0.74  0.53 -0.59
3     D  0.19 -0.76 -0.93  0.96
4     E  0.19  1.98  2.61  0.68
(inplace=True if you want to change the original)
newind = 'CA NY WY OR CO'.split()
newind
==>
['CA', 'NY', 'WY', 'OR', 'CO']
df['States'] = newind
df
==>
inplace=True if you want the change to persist
      W     X     Y     Z States
A  2.71  0.63  0.91  0.50     CA
B  0.65 -0.32 -0.85  0.61     NY
C -2.02  0.74  0.53 -0.59     WY
D  0.19 -0.76 -0.93  0.96     OR
E  0.19  1.98  2.61  0.68     CO
df.set_index('States')
==>
           W     X     Y     Z
States
CA      2.71  0.63  0.91  0.50
NY      0.65 -0.32 -0.85  0.61
WY     -2.02  0.74  0.53 -0.59
OR      0.19 -0.76 -0.93  0.96
CO      0.19  1.98  2.61  0.68
The 'States' line is just the index name; the rest of that row is blank.
Multi-Index and Index Hierarchy
Let's go over how to work with a MultiIndex. First we'll create a quick example of what a multi-indexed DataFrame looks like:
Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
==>
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)])
df = pd.DataFrame(np.random.randn(6,2), index=hier_index, columns=['A','B'])
df = df.round(2)
df
==>
          A     B
G1 1  -0.50 -0.75
   2  -0.94  0.48
   3  -0.12  1.90
G2 1   0.24  2.00
   2  -0.99  0.20
   3  -1.14  0.00
Now let's show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
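For instance, selecting just the outer level gives back the inner sub-DataFrame (a sketch; values taken from the rounded df above):
df.loc['G1']
==>
      A     B
1 -0.50 -0.75
2 -0.94  0.48
3 -0.12  1.90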
df.loc['G1'].loc[1]
==>
A   -0.50
B   -0.75
Name: 1, dtype: float64
df.index.names ==>
FrozenList([None, None])
df.index.names = ['Group','Num']
df
==>
              A     B
Group Num
G1    1   -0.50 -0.75
      2   -0.94  0.48
      3   -0.12  1.90
G2    1    0.24  2.00
      2   -0.99  0.20
      3   -1.14  0.00
df.loc['G2'].loc[2].loc['B']
==>
0.2
# xs = Cross Section
df.xs(1, level='Num')
==>
           A     B
Group
G1     -0.50 -0.75
G2      0.24  2.00
df = pd.DataFrame({'A':[1,2,np.nan],
                   'B':[5,np.nan,np.nan],
                   'C':[1,2,3]})
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
Drops every row containing a NaN:
df.dropna() ==>
A B C
0 1.0 5.0 1
# drops every column containing a NaN
df.dropna(axis=1)
==>
   C
0  1
1  2
2  3
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df.dropna(thresh=2)
==>
A B C
0 1.0 5.0 1
1 2.0 NaN 2
With thresh=2, a row needs at least 2 non-NaN (non-null) values to be kept; rows with fewer are dropped.
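In other words, thresh is the minimum number of non-NaN values a row needs to survive; raising it drops more rows (a sketch on the same df):
df.dropna(thresh=3)   # keep only rows with at least 3 non-NaN values
==>
     A    B  C
0  1.0  5.0  1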
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df.fillna(value='FILL VALUE')
==>
            A           B  C
0           1           5  1
1           2  FILL VALUE  2
2  FILL VALUE  FILL VALUE  3
Replaces each NaN with whatever is passed to value=.
df
==>
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
df['A'].fillna(value=df['A'].mean())
==>
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
The NaN was replaced with the mean of the rest of column A.
Create dataframe
data = {'Company':['GGL','GGL','IBM','IBM','FB','FB'],
        'Person':['Sam','Cass','Amy','Van','Carl','Sara'],
        'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
==>
  Company Person  Sales
0     GGL    Sam    200
1     GGL   Cass    120
2     IBM    Amy    340
3     IBM    Van    124
4      FB   Carl    243
5      FB   Sara    350
**Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group based on Company. This will create a DataFrameGroupBy object:**
df.groupby('Company')
==>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>
# You can save this object as a new variable:
by_comp = df.groupby("Company")
# And then call aggregate methods off the object:
by_comp.mean()   # recent pandas versions may require by_comp.mean(numeric_only=True)
==>
         Sales
Company
FB       296.5
GGL      160.0
IBM      232.0
One-liner:
df.groupby('Company').mean()
==>
(combines the two previous steps)
Other aggregate methods: sum(), std(), min(), max(), count(), describe()
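Any of these can be called on the groupby object, e.g. (a sketch; sums worked out from the table above):
by_comp['Sales'].sum()
==>
Company
FB     593
GGL    320
IBM    464
Name: Sales, dtype: int64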
by_comp.count().loc['FB']
==>
Person 2
Sales 2
Name: FB, dtype: int64
by_comp.describe().transpose()['GGL']
==>
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GGL, dtype: float64
# shows the stats vertically instead of horizontally
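The concatenation examples that follow use small frames like df1 and df2 below (a sketch of how they could be built to match the tables shown):
df1 = pd.DataFrame({'A':['A0','A1','A2','A3'],
                    'B':['B0','B1','B2','B3'],
                    'C':['C0','C1','C2','C3'],
                    'D':['D0','D1','D2','D3']},
                   index=[0,1,2,3])

df2 = pd.DataFrame({'A':['A4','A5','A6','A7'],
                    'B':['B4','B5','B6','B7'],
                    'C':['C4','C5','C6','C7'],
                    'D':['D4','D5','D6','D7']},
                   index=[4,5,6,7])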
df1
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

df2
==>
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

pd.concat([df1,df2])
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7
# default is axis=0 (rows)
df1
==>
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3

df2
==>
    A   B   C   D
4  A4  B4  C4  D4
5  A5  B5  C5  D5
6  A6  B6  C6  D6
7  A7  B7  C7  D7

pd.concat([df1,df2], axis=1)
==>
     A    B    C    D    A    B    C    D
0   A0   B0   C0   D0  NaN  NaN  NaN  NaN
1   A1   B1   C1   D1  NaN  NaN  NaN  NaN
2   A2   B2   C2   D2  NaN  NaN  NaN  NaN
3   A3   B3   C3   D3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN   A4   B4   C4   D4
5  NaN  NaN  NaN  NaN   A5   B5   C5   D5
6  NaN  NaN  NaN  NaN   A6   B6   C6   D6
7  NaN  NaN  NaN  NaN   A7   B7   C7   D7
axis=1 is the columns axis.
left
==>
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3

right
==>
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3
pd.merge(left, right, how='inner', on='key')
==>
    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
# Merging
The merge function allows you to merge DataFrames together using similar logic to merging SQL tables. For example:
Merging matches on columns, not on the index.
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, on=['key1', 'key2'])
==>
  key1 key2   A   B   C   D
0   K0   K0  A0  B0  C0  D0
1   K1   K0  A2  B2  C1  D1
2   K1   K0  A2  B2  C2  D2
Note that it matched L0 with R0, L2 with R1, and L2 with R2 (row 2 of right).
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, how='outer', on=['key1', 'key2'])
==>
  key1 key2    A    B    C    D
0   K0   K0   A0   B0   C0   D0
1   K0   K1   A1   B1  NaN  NaN
2   K1   K0   A2   B2   C1   D1
3   K1   K0   A2   B2   C2   D2
4   K2   K1   A3   B3  NaN  NaN
5   K2   K0  NaN  NaN   C3   D3
left
==>
  key1 key2   A   B
0   K0   K0  A0  B0
1   K0   K1  A1  B1
2   K1   K0  A2  B2
3   K2   K1  A3  B3

right
==>
  key1 key2   C   D
0   K0   K0  C0  D0
1   K1   K0  C1  D1
2   K1   K0  C2  D2
3   K2   K0  C3  D3
pd.merge(left, right, how='left', on=['key1', 'key2'])
==>
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
The row only needs to exist in the left DataFrame; with how='right' it would be the other way around.
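For completeness, a right merge on the same left/right (a sketch; the output values are reasoned from the tables above, and column order may differ slightly):
pd.merge(left, right, how='right', on=['key1', 'key2'])
==>
  key1 key2    A    B   C   D
0   K0   K0   A0   B0  C0  D0
1   K1   K0   A2   B2  C1  D1
2   K1   K0   A2   B2  C2  D2
3   K2   K0  NaN  NaN  C3  D3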
left
==>
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

right
==>
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3

left.join(right)
==>
     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C2   D2
# Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
# The difference between join and merge: join matches on the index, merge matches on columns.
left
==>
     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2

right
==>
     C   D
K0  C0  D0
K2  C2  D2
K3  C3  D3

left.join(right, how='outer')
==>
      A    B    C    D
K0   A0   B0   C0   D0
K1   A1   B1  NaN  NaN
K2   A2   B2   C2   D2
K3  NaN  NaN   C3   D3
df = pd.DataFrame({'col1':[1,2,3,4],
                   'col2':[444,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
df['col2'].unique()
==>
array([444, 555, 666], dtype=int64)
df['col2'].nunique()   # same as len(df['col2'].unique())
==>
3
df['col2'].value_counts()
==>
444    2
555    1
666    1
Name: col2, dtype: int64
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
==>
col1 col2 col3
3 4 444 xyz
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
def times2(x):
    return x*2

df['col1'].apply(times2)
==>
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
df['col1'].sum()
==>
10
df['col2'].apply(lambda x: x*2)
==>
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
** Permanently Removing a Column**
del df['col1']
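An equivalent way, as a sketch, is drop along the columns axis:
df.drop('col1', axis=1, inplace=True)   # same effect as del df['col1']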
df
==>
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
df.sort_values(by='col2')   # same as df.sort_values('col2'); inplace=False by default
==>
   col1  col2 col3
0     1   444  abc
3     4   444  xyz
1     2   555  def
2     3   666  ghi
PIVOT TABLES
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
==>
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1

df.pivot_table(values='D', index=['A','B'], columns=['C'])
==>
C          x    y
A   B
bar one  4.0  1.0
    two  NaN  5.0
foo one  1.0  3.0
    two  2.0  NaN
Pandas can work with a variety of file types: CSV, Excel, HTML, SQL.
df = pd.read_csv('example')
df
==>
Pandas can read from many other file types as well.
df.to_csv('example', index=False)
index=False keeps the DataFrame index from being written out as an extra column.
Excel
Pandas can read and write Excel files. Keep in mind this only imports data, not formulas or images; having images or macros may cause the read_excel method to crash.
It does not import images, charts, or formulas.
Pandas only imports data from Excel.
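A minimal read/write sketch (file, sheet, and URL names here are just placeholders; read_excel also needs an engine such as openpyxl installed):
df = pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')    # hypothetical file name
df.to_excel('Excel_Sample_out.xlsx', sheet_name='Sheet1')

# HTML: read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://example.com/some_table.html')    # hypothetical URL
tables[0].head()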