4 Data Manipulation with pandas Flashcards

ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing

1
Q

1 Print the first 4 rows of the dataframe

import pandas as pd
data = [[‘tom’, 10], [‘nick’, 15],[‘john’, 29]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

A

df.head(2)

Name  Age 0   tom   10 1  nick   15
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

2 What does .info() ?

A

Shows information on each of the columns, such as the data type and number of missing values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

3 Get the number of rows and columns of the df

      Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e
#Output:
(5, 2)
A

df.shape

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

4 Get the mean, count, quartiles and other statistics with one line of code

   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8
A

df.describe()

#Output
Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

5 What does the attribute .values?

A

Gets a two-dimensional NumPy array of values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

6 Get the columns names of df

name  age 0   tom   10 1   nick   15
A

df.columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

7 What does the attribute .index ?

A

An index for the rows: either row numbers or row names.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

8 Sort df according to name

Name  Age 0   tom   10 1  nick   15 2  juli   14
A

print(df.sort_values(by =’name’))

name age
1 nick 15
0 tom 10

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

9 Sort by name and Age (descending)

Name Age
0 tom 10
1 tom 15
2 juli 14

A

print(df.sort_values([“Name”, “Age”],ascending=[True,False]))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

10 Subset column name as dataframe

Name Age
0 tom 10
1 ana 15
2 juli 14

A

df[[‘Name’]]

name
0 tom
1 nick

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

11 Filter age equal or greater than 14

name age
0 tom 10
1 nick 15
2 ana 17

A

df[df[‘Age’]>=14]

name age
1 nick 15
2 ana 17

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

12 Filter age equal or greater than 14 and name not tom

name  age 0   tom   10 1   nick   15 2   ana   17
A

df[(df[‘age’]>=14) & (df[‘name’]!=’tom’)]

name age
1 nick 15

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q
#13 Subsetting rows by categorical variables
#Get tom and ana with a conditional subsetting
name  age 0   tom   10 1   nick   15 2   ana   17
A

names=[‘tom’,’ana’]

condition = df[‘Name’].isin(names)

chosen =df[condition]
print(chosen)

name age
0 tom 10

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

14 Add a column. Populate it doubling values of column age

name age double
0 tom 10 20
1 nick 15 30

A

df[‘double’] = df.age*2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120

name age Age_2
1 nick 15 150

A

df[‘Age_2’] = df.age*10
old_filter = df[‘Age_2’]>120
df_old =df[old_filter]
print(df_old)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

16 What is summary statistics?

A

Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

17 Print the mean of column age

name  age 1  nick   15 2   ana   17
A

print(df.age.mean())

12.5

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

18 Get maximum value and the minimum of column age

Name Age
0 peter 10
1 ana 15
2 tom 14

A

print(‘max:’, df.age.max())

print(‘min:’, df.age.min())

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

19 What is IQR in Statistics?

A

IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), ​first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

20 Create a custom a IQR function for a dataframe

A
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

21 Get the iqr of the column Double:

Output:
   A  B
0  0  0
1  1  2
2  2  4

IQR_B: 2.0

A
def iqr(column):
  return column.quantile(0.75) - column.quantile(0.25)

print(df)
print(‘’)
print(‘IQR_B:’, iqr(df.B))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q
#22 Create a DataFrame using a loop and getting random integers (col A).
#B is double of A.
#C is cumulative sum of A
#D cumulative max of A (seed 123)
#Output:
A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6
A

import numpy as np
import pandas as pd

np.random.seed(123)
a = []
for i in range(3):
a.append(np.random.randint(0,10))

df=pd.DataFrame(a, columns =['A'])
df['B'] =df.A*2
df['C']=df.A.cumsum()
df['D']=df.A.cummax()
print(df)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

23 Drop duplicates

 name  age Uno    tom   10 Dos   nick   15 Tres  nick   15
A

df.drop_duplicates(subset=’name’,inplace = True)
print(df)

 name  age Uno   tom   10 Dos  nick   15
24
Q

24 Count different values in column name

 0   1 0   tom  10 1  nick  15 2  nick  15

nick 2
tom 1
Name: 0, dtype: int64

A

df[0].value_counts()

25
#25 Count the percentage of frequencies in column Age in ascending order name age 0 tom 10 1 nick 15 2 nick 15 #Output: 10 0.3 15 0.7 Name: age, dtype: float64
round(df.age.value_counts(normalize=True, sort=False),1)
26
#26 Subset rows containing tom and sum their ages name age 0 tom 10 1 nick 15 2 nick 19
df[df['name']=='nick']['age'].sum()
27
#26 Subset rows containing tom and sum their ages name age 0 tom 10 1 nick 15 2 nick 19
df[df['name']=='nick']['age'].sum()
28
#27 Group by names and sum their money name money 0 tom 10 1 nick 15 2 nick 19 ``` #Output name nick 34 tom 10 Name: money, dtype: int64 ```
df.groupby(by='name')['money'].sum()
29
#28 Group by name and calculate mean and maximum value for age column name money 0 tom 10 1 nick 15 2 nick 19 #Output mean amax name nick 17 19 tom 10 10
df.groupby(by='name')['money'].agg([np.mean,np.max])
30
#29 Use pivot table to calculate the mean of ages for each name name age death 0 tom 18 90 1 nick 15 75 2 nick 23 115
df.pivot_table(index='Name', values='age',aggfunc='mean') age name nick 19
31
#30 Use pivot table to calculate the mean of ages and death, and get totals name age death 0 nick 15 75 1 nick 23 115
print(round(df.pivot_table(index='name', values=['age','death'], aggfunc='mean',margins =True),1)) ``` #Output: age death name nick 19.0 95.0 tom 18.0 90.0 All 18.7 93.3 ```
32
#31 Transform pivot table ‘xx’ into dataframe ``` age death name nick 19.0 95.0 tom 18.0 90.0 All 18.7 93.3 ```
xx.reset_index(inplace=True) #Output name age death 0 nick 19.0 95.0 1 tom 18.0 90.0 2 All 18.7 93.3
33
#32 Get tom with a boolean name money 0 tom 10 1 nick 15 2 nick 19 ``` #Output 0 True 1 False 2 False Name: name, dtype: bool ```
selection= ['tom'] | df.name.isin(selection)
34
#33 Get tom using a list. Subset data frame name money 0 tom 10 1 nick 15 2 nick 19 #Output Name Age Death 1 tom 12 108 2 tom 10 90
selection = ['tom'] | print(df[df.name.isin(selection)])
35
#34 Multi index (year,month)and subset with loc ``` month year sale 0 1 2012 55 1 4 2015 40 2 4 2016 84 3 10 2014 31 4 9 2013 45 ``` #Output sale year month 2012 1 55 2014 10 31
df.set_index(['year','month'],inplace=True) rows_to_keep = [(2012,1),(2014,10)] print(df.loc[rows_to_keep])
36
#35 Add multiindex and sort ``` month year sale 0 1 2012 55 1 4 2015 40 2 4 2016 84 3 10 2014 31 4 9 2013 45 ``` ``` #Output: sale year month 2016 4 84 2015 4 40 2014 10 31 2013 9 45 2012 1 55 ```
df.set_index(['year','month'],inplace=True) | print(df.sort_index(level=['year','month'], ascending=[False,True]))
37
#36 Set index (year), sort and slice with loc (years,2012 to 2014) ``` month year sale 0 1 2012 55 1 4 2015 40 2 4 2016 84 3 10 2014 31 4 9 2013 45 ``` #Output ``` month sale year 2012 1 55 2013 9 45 2014 10 31 ```
df.set_index('year',inplace=True) df.sort_index(inplace=True) print(df.loc[2012:2014])
38
#37 Create a column with the length of ‘name’ #Output Name len 0 peter 5 1 ford 4 2 santana 7
df['len']=df['Name'].str.len()
39
``` #39 Create a dataframe with shape 3,4 #with random number with mean 0 and variance 1 #time index from January 1 2013 to January 3 ``` np.random.seed(123) #Output 0 1 2013-01-01 2.1 -2.8 2013-01-02 -0.4 0.2 2013-01-03 0.6 -0.2
df = pd.DataFrame(np.random.randn(3, 2), index=pd.date_range('20130101', periods=3)) print(round(df,1))
40
#40 What are offset aliases?
A number of string aliases given to useful common time series frequencies. e.g ‘Y’==year
41
#41 Slice in both directions. 0 1 2013-01-01 2.1 -2.8 2013-01-02 -0.4 0.2 2013-01-03 0.6 -0.2 #Output 0 2013-01-01 0.550524 2013-01-02 -1.108726
print(df.iloc[:2,:1])
42
#42 Slice Time series starting with 2014 0 2013-12-31 -0.539127 2014-12-31 0.055680 2015-12-31 0.136186
print(df.loc['2014':]) #Output: 0 2014-12-31 0.055680 2015-12-31 0.136186
43
#43 Populate column year 0 2013-12-31 -0.539127 2014-12-31 0.055680 2015-12-31 0.136186 #Output 0 year 2013-12-31 -0.167130 2013 2014-12-31 0.025054 2014 2015-12-31 -0.885574 2015
df['year'] = df.index.year
44
#44 Reset index (date)and change its name for ‘date’ using a dictionary 0 2013-12-31 1.069279 2014-12-31 0.598574 2015-12-31 -0.247078
df=df.reset_index().rename(columns={'index': 'bar'}) #Output date 0 0 2013-12-31 0.285032 1 2014-12-31 2.681187 2 2015-12-31 -0.249268
45
#45 Delete column col 'len’ from df Name len 0 peter 5 1 ford 4 2 santana 7
del(df['len']) ``` #Output: Name 0 peter 1 ford 2 santana ```
46
#46 Transform dataframe into pivot table A B C 0 Fer Masters NaN 1 Ger Graduate 22.0 2 John Graduate NaN #Output: ``` B Graduate Masters A Fer NaN NaN Ger 22.0 NaN John NaN NaN ```
df.pivot('A', 'B', 'C') Or df.pivot(index ='A', columns ='B', values ='C')
47
#47 Subsetting pivot tables (pt) ``` B Graduate Masters A Boby 23.0 NaN John NaN 27.0 Mina 21.0 NaN ``` B Graduate Masters A Fer NaN NaN Ger 22.0 NaN
pt_sub = pt.loc['Fer':'Ger'] | print(pt_sub)
48
#48 Calculate the mean of the columns in this pivot table (pt) ``` B Graduate Masters A Boby 23.0 NaN John NaN 27.0 Mina 21.0 NaN ``` ``` #Output: B Graduate 22.0 Masters NaN dtype: float64 ```
pt.mean()
49
#49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib) A B C 0 John Masters 27 1 Boby Graduate 23 2 Mina Graduate 21 #Output (plot in ipynb doc) B Graduate 20.5 Masters 15.0 Name: C, dtype: float64
import matplotlib.pyplot as plt gb = df.groupby('B')['C'].mean() gb.plot(kind ='bar') plt.show()
50
#50 Group by dates and visualize column 0 with a line chart dates 0 0 2013-12-31 0.537526 1 2014-12-31 -0.290812 2 2015-12-31 0.679620 #Output (plot in ipynb doc) ``` dates 2013-12-31 0.537526 2014-12-31 -0.290812 2015-12-31 0.679620 Name: 0, dtype: float64 ```
``` import matplotlib.pyplot as plt df_grouped=df.groupby(by='dates')[0].sum() df_grouped.plot(kind='line') print(df_grouped) plt.show() ```
51
#51 Scatter plot with 0 in the x axis and Scatter as title import pandas as pd df = pd.DataFrame(np.random.randn(3)) print(df)
import matplotlib.pyplot as plt df. reset_index().plot(kind='scatter', x='index', y=0) plt. show() #Output: (plot in ipynb doc)
52
#52 display two histograms showing Titles (C ).Transparency 0.5 and two bins df = pd.DataFrame({'Uni': ['Harvard', 'UOC', 'Stanford','Tsinghua'], 'Title': ['Master', 'MBI', 'Master','MBI'], 'Students': [15, 22, 19,13]})
df[df['Title']=='MBI']['Students'].hist(alpha=0.5,bins=2) df[df['Title']=='Master']['Students'].hist(alpha=0.5,bins=2) plt.legend(['MBI','Master']) plt.show()
53
#53 Replace number 21 in df with NaN values ``` Uni Title Students 0 Harvard Master 21 1 UOC MBI 21 2 Stanford Master 19 3 Tsinghua MBI 13 ```
df.replace({21:np.nan}) ``` #Output: Uni Title Students 0 Harvard Master NaN 1 UOC MBI NaN 2 Stanford Master 19.0 3 Tsinghua MBI 13.0 ```
54
#54 Check if there are NaN values for each column ``` Uni Title Students 0 Harvard Master NaN 1 UOC MBI NaN 2 Stanford Master 19.0 3 Tsinghua MBI 13.0 ```
df.isna().any() #Uni False Title False Students True dtype: bool
55
#55 Count Nan values in each column and in the total ``` Uni Title Students 0 Harvard Master NaN 1 UOC MBI NaN 2 Stanford Master 19.0 3 Tsinghua MBI 13.0 ```
nan_cols=df.isna().sum() nan_total=df.isna().sum().sum() print('each col:\n', nan_cols) print('total:',nan_total) ``` #Output: each col: Uni 0 Title 0 Students 2 dtype: int64 total: 2 ```
56
#56 Delete Nan values Uni Title Students 0 Harvard Master NaN 1 UOC MBI NaN 2 Stanford Master 19.0
df.dropna() #Output: Uni Title Students 2 Stanford Master 19.0
57
#57 Transform NaN values into 0s Uni Title Students 0 Harvard Master NaN 1 UOC MBI NaN 2 Stanford Master 19.0
df.fillna(0) #Output: Uni Title Students 0 Harvard Master 0.0 1 UOC MBI 0.0 2 Stanford Master 19.0