4 Data Manipulation with pandas Flashcards
ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing
1 Print the first 2 rows of the dataframe
import pandas as pd
data = [['tom', 10], ['nick', 15], ['john', 29]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.head(2)
   Name  Age
0   tom   10
1  nick   15
2 What does .info() do?
Shows information on each of the columns, such as the data type and number of missing values.
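A minimal sketch (hypothetical frame with one missing value, assumed for demonstration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame; the NaN makes the non-null count visible
df = pd.DataFrame({'Name': ['tom', 'nick', 'john'],
                   'Age': [10, np.nan, 29]})
df.info()  # prints column names, non-null counts and dtypes
```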
3 Get the number of rows and columns of the df
   Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e
#Output: (5, 2)
df.shape
4 Get the mean, count, quartiles and other statistics with one line of code
   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8
df.describe()
#Output:
       Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0
5 What does the attribute .values do?
Gets a two-dimensional NumPy array of values.
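For example (hypothetical frame), .values returns the underlying 2-D NumPy array:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['tom', 'nick'], 'Age': [10, 15]})
arr = df.values  # 2-D NumPy array of the frame's contents
print(type(arr).__name__, arr.shape)
```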
6 Get the columns names of df
   name  age
0   tom   10
1  nick   15
df.columns
7 What does the attribute .index contain?
An index for the rows: either row numbers or row names.
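A small illustration (hypothetical frame using row names as the index):

```python
import pandas as pd

# Row names as the index (hypothetical frame)
df = pd.DataFrame({'Age': [10, 15]}, index=['tom', 'nick'])
print(df.index)        # the row labels
print(df.loc['nick'])  # .loc selects rows by these labels
```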
8 Sort df according to Name
   Name  Age
0   tom   10
1  nick   15
2  juli   14
print(df.sort_values(by='Name'))
   Name  Age
2  juli   14
1  nick   15
0   tom   10
9 Sort by name and Age (descending)
Name Age
0 tom 10
1 tom 15
2 juli 14
print(df.sort_values(["Name", "Age"], ascending=[True, False]))
   Name  Age
2  juli   14
1   tom   15
0   tom   10
10 Subset column name as dataframe
Name Age
0 tom 10
1 ana 15
2 juli 14
df[['Name']]
   Name
0   tom
1   ana
2  juli
11 Filter age equal or greater than 14
name age
0 tom 10
1 nick 15
2 ana 17
df[df['age']>=14]
name age
1 nick 15
2 ana 17
12 Filter age equal or greater than 14 and name not tom
   name  age
0   tom   10
1  nick   15
2   ana   17
df[(df['age']>=14) & (df['name']!='tom')]
   name  age
1  nick   15
2   ana   17
13 Subsetting rows by categorical variables. Get tom and ana with a conditional subsetting
   name  age
0   tom   10
1  nick   15
2   ana   17
names = ['tom', 'ana']
condition = df['name'].isin(names)
chosen = df[condition]
print(chosen)
   name  age
0   tom   10
2   ana   17
14 Add a column. Populate it doubling values of column age
name age double
0 tom 10 20
1 nick 15 30
df['double'] = df.age*2
15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120
name age Age_2
1 nick 15 150
df['Age_2'] = df.age*10
old_filter = df['Age_2'] > 120
df_old = df[old_filter]
print(df_old)
16 What are summary statistics?
Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc.
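A few of these on a small hypothetical Series:

```python
import pandas as pd

ages = pd.Series([10, 15, 17])  # hypothetical data
print(ages.mean())              # 14.0
print(ages.median())            # 15.0
print(ages.min(), ages.max())
```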
17 Print the mean of column age
   name  age
1  nick   15
2   ana   17
print(df.age.mean())
16.0
18 Get maximum value and the minimum of column age
Name Age
0 peter 10
1 ana 15
2 tom 14
print('max:', df.Age.max())
print('min:', df.Age.min())
19 What is IQR in Statistics?
IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
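A quick numeric check of this definition (hypothetical data; np.percentile is one way to get the quartiles, Series.quantile is another):

```python
import numpy as np

data = [1, 3, 5, 7, 9]
q1, q3 = np.percentile(data, [25, 75])  # Q1 = 3.0, Q3 = 7.0
print(q3 - q1)                          # IQR = 4.0
```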
20 Create a custom IQR function for a dataframe
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
21 Get the IQR of the column B
   A  B
0  0  0
1  1  2
2  2  4
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
print(df)
print('')
print('IQR_B:', iqr(df.B))
IQR_B: 2.0
22 Create a DataFrame using a loop and random integers (col A). B is double of A. C is the cumulative sum of A. D is the cumulative max of A (seed 123)
#Output:
   A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6
import numpy as np
import pandas as pd
np.random.seed(123)
a = []
for i in range(3):
    a.append(np.random.randint(0, 10))
df = pd.DataFrame(a, columns=['A'])
df['B'] = df.A*2
df['C'] = df.A.cumsum()
df['D'] = df.A.cummax()
print(df)
23 Drop duplicates
      name  age
Uno    tom   10
Dos   nick   15
Tres  nick   15
df.drop_duplicates(subset='name', inplace=True)
print(df)
     name  age
Uno   tom   10
Dos  nick   15
24 Count different values in the first column (0)
      0   1
0   tom  10
1  nick  15
2  nick  15
nick 2
tom 1
Name: 0, dtype: int64
df[0].value_counts()
25 Count the percentage of frequencies in column Age in ascending order
   name  age
0   tom   10
1  nick   15
2  nick   15
10 0.3
15 0.7
Name: age, dtype: float64
round(df.age.value_counts(normalize=True, sort=False),1)
26 Subset rows containing tom and sum their ages
   name  age
0   tom   10
1  nick   15
2  nick   19
df[df['name']=='tom']['age'].sum()
27 Group by names and sum their money
   name  money
0   tom     10
1  nick     15
2  nick     19
#Output:
name
nick    34
tom     10
Name: money, dtype: int64
df.groupby(by='name')['money'].sum()
28 Group by name and calculate mean and maximum value for the money column
   name  money
0   tom     10
1  nick     15
2  nick     19
      mean  amax
name
nick    17    19
tom     10    10
df.groupby(by='name')['money'].agg([np.mean, np.max])
29 Use pivot table to calculate the mean of ages for each name
name age death
0 tom 18 90
1 nick 15 75
2 nick 23 115
df.pivot_table(index='name', values='age', aggfunc='mean')
      age
name
nick   19
tom    18
30 Use pivot table to calculate the mean of ages and death, and get totals
   name  age  death
0  nick   15     75
1  nick   23    115
2   tom   18     90
print(round(df.pivot_table(index='name', values=['age','death'],
aggfunc='mean', margins=True), 1))
#Output:
       age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
31 Transform pivot table 'xx' into a dataframe
       age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
xx.reset_index(inplace=True)
Output:
name age death
0 nick 19.0 95.0
1 tom 18.0 90.0
2 All 18.7 93.3
32 Get tom with a boolean
name money
0 tom 10
1 nick 15
2 nick 19
#Output:
0     True
1    False
2    False
Name: name, dtype: bool
selection = ['tom']
df.name.isin(selection)
33 Get tom using a list. Subset data frame
name money
0 tom 10
1 nick 15
2 nick 19
   name  money
0   tom     10
selection = ['tom']
print(df[df.name.isin(selection)])
34 Multi index (year, month) and subset with loc
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
            sale
year month
2012 1        55
2014 10       31
df.set_index(['year','month'], inplace=True)
rows_to_keep = [(2012,1), (2014,10)]
print(df.loc[rows_to_keep])
35 Add multiindex and sort
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
#Output:
            sale
year month
2016 4        84
2015 4        40
2014 10       31
2013 9        45
2012 1        55
df.set_index(['year','month'], inplace=True)
print(df.sort_index(level=['year','month'], ascending=[False,True]))
36 Set index (year), sort and slice with loc (years,2012 to 2014)
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
      month  sale
year
2012      1    55
2013      9    45
2014     10    31
df.set_index('year', inplace=True)
df.sort_index(inplace=True)
print(df.loc[2012:2014])
37 Create a column with the length of ‘name’
df['len'] = df['Name'].str.len()
Output:
      Name  len
0    peter    5
1     ford    4
2  santana    7
39 Create a dataframe with shape (3, 2), with random numbers with mean 0 and variance 1, and a time index from January 1 2013 to January 3
np.random.seed(123)
              0    1
2013-01-01  2.1 -2.8
2013-01-02 -0.4  0.2
2013-01-03  0.6 -0.2
df = pd.DataFrame(np.random.randn(3, 2),
index=pd.date_range('20130101', periods=3))
print(round(df, 1))
40 What are offset aliases?
A number of string aliases given to useful common time series frequencies.
e.g. 'Y' == yearly (year-end) frequency
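Another alias in action (a minimal sketch; 'MS' is the month-start alias):

```python
import pandas as pd

# 'MS' = month start; pd.date_range accepts any offset alias as freq
idx = pd.date_range('2013-01-01', periods=3, freq='MS')
print(idx.tolist())
```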
41 Slice in both directions
              0    1
2013-01-01  2.1 -2.8
2013-01-02 -0.4  0.2
2013-01-03  0.6 -0.2
print(df.iloc[:2, :1])
              0
2013-01-01  2.1
2013-01-02 -0.4
42 Slice Time series starting with 2014
                   0
2013-12-31 -0.539127
2014-12-31  0.055680
2015-12-31  0.136186
print(df.loc['2014':])
Output:
                   0
2014-12-31  0.055680
2015-12-31  0.136186
43 Populate column year
                   0
2013-12-31 -0.539127
2014-12-31  0.055680
2015-12-31  0.136186
                   0  year
2013-12-31 -0.539127  2013
2014-12-31  0.055680  2014
2015-12-31  0.136186  2015
df['year'] = df.index.year
44 Reset index (date)and change its name for ‘date’ using a dictionary
                   0
2013-12-31  1.069279
2014-12-31  0.598574
2015-12-31 -0.247078
df = df.reset_index().rename(columns={'index': 'date'})
Output:
        date         0
0 2013-12-31  1.069279
1 2014-12-31  0.598574
2 2015-12-31 -0.247078
45 Delete column col ‘len’ from df
      Name  len
0    peter    5
1     ford    4
2  santana    7
del df['len']
#Output:
      Name
0    peter
1     ford
2  santana
46 Transform dataframe into pivot table
A B C
0 Fer Masters NaN
1 Ger Graduate 22.0
2 John Graduate NaN
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN
df.pivot('A', 'B', 'C')
Or
df.pivot(index='A', columns='B', values='C')
47 Subsetting pivot tables (pt)
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
pt_sub = pt.loc['Fer':'Ger']
print(pt_sub)
48 Calculate the mean of the columns in this pivot table (pt)
B     Graduate  Masters
A
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN
#Output:
B
Graduate    22.0
Masters     27.0
dtype: float64
pt.mean()
49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib)
      A         B   C
0  John   Masters  27
1  Boby  Graduate  23
2  Mina  Graduate  21
B
Graduate    22.0
Masters     27.0
Name: C, dtype: float64
import matplotlib.pyplot as plt
gb = df.groupby('B')['C'].mean()
gb.plot(kind='bar')
plt.show()
50 Group by dates and visualize column 0 with a line chart
dates 0
0 2013-12-31 0.537526
1 2014-12-31 -0.290812
2 2015-12-31 0.679620
dates
2013-12-31    0.537526
2014-12-31   -0.290812
2015-12-31    0.679620
Name: 0, dtype: float64
import matplotlib.pyplot as plt
df_grouped = df.groupby(by='dates')[0].sum()
df_grouped.plot(kind='line')
print(df_grouped)
plt.show()
51 Scatter plot with the index on the x axis, column 0 on the y axis and 'Scatter' as title
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3))
print(df)
Output: (plot in ipynb doc)
import matplotlib.pyplot as plt
df.reset_index().plot(kind='scatter', x='index', y=0, title='Scatter')
plt.show()
52 Display two histograms of Students, one per Title. Transparency 0.5 and two bins
import matplotlib.pyplot as plt
df = pd.DataFrame({'Uni': ['Harvard', 'UOC', 'Stanford', 'Tsinghua'],
'Title': ['Master', 'MBI', 'Master', 'MBI'],
'Students': [15, 22, 19, 13]})
df[df['Title']=='MBI']['Students'].hist(alpha=0.5, bins=2)
df[df['Title']=='Master']['Students'].hist(alpha=0.5, bins=2)
plt.legend(['MBI', 'Master'])
plt.show()
53 Replace number 21 in df with NaN values
        Uni   Title  Students
0   Harvard  Master        21
1       UOC     MBI        21
2  Stanford  Master        19
3  Tsinghua     MBI        13
df.replace({21:np.nan})
#Output:
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
54 Check if there are NaN values for each column
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
df.isna().any()
Uni         False
Title       False
Students     True
dtype: bool
55 Count Nan values in each column and in the total
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
nan_cols = df.isna().sum()
nan_total = df.isna().sum().sum()
print('each col:\n', nan_cols)
print('total:', nan_total)
#Output:
each col:
Uni         0
Title       0
Students    2
dtype: int64
total: 2
56 Delete rows with NaN values
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
df.dropna()
Output:
        Uni   Title  Students
2  Stanford  Master      19.0
57 Transform NaN values into 0s
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
df.fillna(0)
Output:
Uni Title Students
0 Harvard Master 0.0
1 UOC MBI 0.0
2 Stanford Master 19.0