4 Data Manipulation with pandas Flashcards
ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing
1 Print the first 2 rows of the dataframe
import pandas as pd
data = [['tom', 10], ['nick', 15], ['john', 29]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.head(2)
   Name  Age
0   tom   10
1  nick   15
2 What does .info() do?
Shows information on each of the columns, such as the data type and number of missing values.
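A minimal sketch (hypothetical frame with one missing value, assumed for demonstration):

```python
import pandas as pd
import numpy as np

# Hypothetical frame; the NaN makes the non-null count visible
df = pd.DataFrame({'Name': ['tom', 'nick', 'john'],
                   'Age': [10, np.nan, 29]})
df.info()  # prints column names, non-null counts and dtypes
```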
3 Get the number of rows and columns of the df
   Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e
#Output: (5, 2)
df.shape
4 Get the mean, count, quartiles and other statistics with one line of code
   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8
df.describe()
#Output:
       Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0
5 What does the attribute .values do?
Gets a two-dimensional NumPy array of values.
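For example (hypothetical frame), .values returns the underlying 2-D NumPy array:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['tom', 'nick'], 'Age': [10, 15]})
arr = df.values  # 2-D NumPy array of the frame's contents
print(type(arr).__name__, arr.shape)
```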
6 Get the columns names of df
   name  age
0   tom   10
1  nick   15
df.columns
7 What does the attribute .index contain?
An index for the rows: either row numbers or row names.
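A small illustration (hypothetical frame using row names as the index):

```python
import pandas as pd

# Row names as the index (hypothetical frame)
df = pd.DataFrame({'Age': [10, 15]}, index=['tom', 'nick'])
print(df.index)        # the row labels
print(df.loc['nick'])  # .loc selects rows by these labels
```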
8 Sort df according to Name
   Name  Age
0   tom   10
1  nick   15
2  juli   14
print(df.sort_values(by='Name'))
   Name  Age
2  juli   14
1  nick   15
0   tom   10
9 Sort by name and Age (descending)
Name Age
0 tom 10
1 tom 15
2 juli 14
print(df.sort_values(["Name", "Age"], ascending=[True, False]))
   Name  Age
2  juli   14
1   tom   15
0   tom   10
10 Subset column name as dataframe
Name Age
0 tom 10
1 ana 15
2 juli 14
df[['Name']]
   Name
0   tom
1   ana
2  juli
11 Filter age equal or greater than 14
name age
0 tom 10
1 nick 15
2 ana 17
df[df['age']>=14]
name age
1 nick 15
2 ana 17
12 Filter age equal or greater than 14 and name not tom
   name  age
0   tom   10
1  nick   15
2   ana   17
df[(df['age']>=14) & (df['name']!='tom')]
   name  age
1  nick   15
2   ana   17
13 Subsetting rows by categorical variables. Get tom and ana with a conditional subsetting
   name  age
0   tom   10
1  nick   15
2   ana   17
names = ['tom', 'ana']
condition = df['name'].isin(names)
chosen = df[condition]
print(chosen)
   name  age
0   tom   10
2   ana   17
14 Add a column. Populate it doubling values of column age
name age double
0 tom 10 20
1 nick 15 30
df['double'] = df.age*2
15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120
name age Age_2
1 nick 15 150
df['Age_2'] = df.age*10
old_filter = df['Age_2'] > 120
df_old = df[old_filter]
print(df_old)
16 What are summary statistics?
Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc.
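A few of these on a small hypothetical Series:

```python
import pandas as pd

ages = pd.Series([10, 15, 17])  # hypothetical data
print(ages.mean())              # 14.0
print(ages.median())            # 15.0
print(ages.min(), ages.max())
```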
17 Print the mean of column age
   name  age
1  nick   15
2   ana   17
print(df.age.mean())
16.0
18 Get maximum value and the minimum of column age
Name Age
0 peter 10
1 ana 15
2 tom 14
print('max:', df.Age.max())
print('min:', df.Age.min())
19 What is IQR in Statistics?
IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
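A quick numeric check of this definition (hypothetical data; np.percentile is one way to get the quartiles, Series.quantile is another):

```python
import numpy as np

data = [1, 3, 5, 7, 9]
q1, q3 = np.percentile(data, [25, 75])  # Q1 = 3.0, Q3 = 7.0
print(q3 - q1)                          # IQR = 4.0
```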
20 Create a custom IQR function for a dataframe
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
21 Get the IQR of the column B
   A  B
0  0  0
1  1  2
2  2  4
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
print(df)
print('')
print('IQR_B:', iqr(df.B))
IQR_B: 2.0
22 Create a DataFrame using a loop and random integers (col A). B is double of A. C is the cumulative sum of A. D is the cumulative max of A (seed 123)
#Output:
   A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6
import numpy as np
import pandas as pd
np.random.seed(123)
a = []
for i in range(3):
    a.append(np.random.randint(0, 10))
df = pd.DataFrame(a, columns=['A'])
df['B'] = df.A*2
df['C'] = df.A.cumsum()
df['D'] = df.A.cummax()
print(df)
23 Drop duplicates
      name  age
Uno    tom   10
Dos   nick   15
Tres  nick   15
df.drop_duplicates(subset='name', inplace=True)
print(df)
     name  age
Uno   tom   10
Dos  nick   15
24 Count different values in the first column (0)
      0   1
0   tom  10
1  nick  15
2  nick  15
nick 2
tom 1
Name: 0, dtype: int64
df[0].value_counts()
25 Count the percentage of frequencies in column Age in ascending order
   name  age
0   tom   10
1  nick   15
2  nick   15
10 0.3
15 0.7
Name: age, dtype: float64
round(df.age.value_counts(normalize=True, sort=False),1)
26 Subset rows containing tom and sum their ages
   name  age
0   tom   10
1  nick   15
2  nick   19
df[df['name']=='tom']['age'].sum()
27 Group by names and sum their money
   name  money
0   tom     10
1  nick     15
2  nick     19
#Output:
name
nick    34
tom     10
Name: money, dtype: int64
df.groupby(by='name')['money'].sum()
28 Group by name and calculate mean and maximum value for the money column
   name  money
0   tom     10
1  nick     15
2  nick     19
      mean  amax
name
nick    17    19
tom     10    10
df.groupby(by='name')['money'].agg([np.mean, np.max])
29 Use pivot table to calculate the mean of ages for each name
name age death
0 tom 18 90
1 nick 15 75
2 nick 23 115
df.pivot_table(index='name', values='age', aggfunc='mean')
      age
name
nick   19
tom    18
30 Use pivot table to calculate the mean of ages and death, and get totals
   name  age  death
0  nick   15     75
1  nick   23    115
2   tom   18     90
print(round(df.pivot_table(index='name', values=['age','death'],
aggfunc='mean', margins=True), 1))
#Output:
       age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
31 Transform pivot table 'xx' into a dataframe
       age  death
name
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
xx.reset_index(inplace=True)
Output:
name age death
0 nick 19.0 95.0
1 tom 18.0 90.0
2 All 18.7 93.3
32 Get tom with a boolean
name money
0 tom 10
1 nick 15
2 nick 19
#Output:
0     True
1    False
2    False
Name: name, dtype: bool
selection = ['tom']
df.name.isin(selection)
33 Get tom using a list. Subset data frame
name money
0 tom 10
1 nick 15
2 nick 19
   name  money
0   tom     10
selection = ['tom']
print(df[df.name.isin(selection)])
34 Multi index (year, month) and subset with loc
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
            sale
year month
2012 1        55
2014 10       31
df.set_index(['year','month'], inplace=True)
rows_to_keep = [(2012,1), (2014,10)]
print(df.loc[rows_to_keep])
35 Add multiindex and sort
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
#Output:
            sale
year month
2016 4        84
2015 4        40
2014 10       31
2013 9        45
2012 1        55
df.set_index(['year','month'], inplace=True)
print(df.sort_index(level=['year','month'], ascending=[False,True]))
36 Set index (year), sort and slice with loc (years,2012 to 2014)
   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
      month  sale
year
2012      1    55
2013      9    45
2014     10    31
df.set_index('year', inplace=True)
df.sort_index(inplace=True)
print(df.loc[2012:2014])
37 Create a column with the length of ‘name’
df['len'] = df['Name'].str.len()
Output:
      Name  len
0    peter    5
1     ford    4
2  santana    7
39 Create a dataframe with shape (3, 2), with random numbers with mean 0 and variance 1, and a time index from January 1 2013 to January 3
np.random.seed(123)
              0    1
2013-01-01  2.1 -2.8
2013-01-02 -0.4  0.2
2013-01-03  0.6 -0.2
df = pd.DataFrame(np.random.randn(3, 2),
index=pd.date_range('20130101', periods=3))
print(round(df, 1))
40 What are offset aliases?
A number of string aliases given to useful common time series frequencies.
e.g. 'Y' == yearly (year-end) frequency
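Another alias in action (a minimal sketch; 'MS' is the month-start alias):

```python
import pandas as pd

# 'MS' = month start; pd.date_range accepts any offset alias as freq
idx = pd.date_range('2013-01-01', periods=3, freq='MS')
print(idx.tolist())
```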
41 Slice in both directions
              0    1
2013-01-01  2.1 -2.8
2013-01-02 -0.4  0.2
2013-01-03  0.6 -0.2
print(df.iloc[:2, :1])
              0
2013-01-01  2.1
2013-01-02 -0.4
42 Slice Time series starting with 2014
                   0
2013-12-31 -0.539127
2014-12-31  0.055680
2015-12-31  0.136186
print(df.loc['2014':])
Output:
                   0
2014-12-31  0.055680
2015-12-31  0.136186
43 Populate column year
                   0
2013-12-31 -0.539127
2014-12-31  0.055680
2015-12-31  0.136186
                   0  year
2013-12-31 -0.539127  2013
2014-12-31  0.055680  2014
2015-12-31  0.136186  2015
df['year'] = df.index.year
44 Reset index (date)and change its name for ‘date’ using a dictionary
                   0
2013-12-31  1.069279
2014-12-31  0.598574
2015-12-31 -0.247078
df = df.reset_index().rename(columns={'index': 'date'})
Output:
        date         0
0 2013-12-31  1.069279
1 2014-12-31  0.598574
2 2015-12-31 -0.247078
45 Delete column col ‘len’ from df
      Name  len
0    peter    5
1     ford    4
2  santana    7
del df['len']
#Output:
      Name
0    peter
1     ford
2  santana
46 Transform dataframe into pivot table
A B C
0 Fer Masters NaN
1 Ger Graduate 22.0
2 John Graduate NaN
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN
df.pivot('A', 'B', 'C')
Or
df.pivot(index='A', columns='B', values='C')
47 Subsetting pivot tables (pt)
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN
B     Graduate  Masters
A
Fer        NaN      NaN
Ger       22.0      NaN
pt_sub = pt.loc['Fer':'Ger']
print(pt_sub)
48 Calculate the mean of the columns in this pivot table (pt)
B     Graduate  Masters
A
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN
#Output:
B
Graduate    22.0
Masters     27.0
dtype: float64
pt.mean()
49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib)
      A         B   C
0  John   Masters  27
1  Boby  Graduate  23
2  Mina  Graduate  21
B
Graduate    22.0
Masters     27.0
Name: C, dtype: float64
import matplotlib.pyplot as plt
gb = df.groupby('B')['C'].mean()
gb.plot(kind='bar')
plt.show()
50 Group by dates and visualize column 0 with a line chart
dates 0
0 2013-12-31 0.537526
1 2014-12-31 -0.290812
2 2015-12-31 0.679620
dates
2013-12-31    0.537526
2014-12-31   -0.290812
2015-12-31    0.679620
Name: 0, dtype: float64
import matplotlib.pyplot as plt
df_grouped = df.groupby(by='dates')[0].sum()
df_grouped.plot(kind='line')
print(df_grouped)
plt.show()
51 Scatter plot with the index on the x axis, column 0 on the y axis and 'Scatter' as title
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3))
print(df)
Output: (plot in ipynb doc)
import matplotlib.pyplot as plt
df.reset_index().plot(kind='scatter', x='index', y=0, title='Scatter')
plt.show()
52 Display two histograms of Students, one per Title. Transparency 0.5 and two bins
import matplotlib.pyplot as plt
df = pd.DataFrame({'Uni': ['Harvard', 'UOC', 'Stanford', 'Tsinghua'],
'Title': ['Master', 'MBI', 'Master', 'MBI'],
'Students': [15, 22, 19, 13]})
df[df['Title']=='MBI']['Students'].hist(alpha=0.5, bins=2)
df[df['Title']=='Master']['Students'].hist(alpha=0.5, bins=2)
plt.legend(['MBI', 'Master'])
plt.show()
53 Replace number 21 in df with NaN values
        Uni   Title  Students
0   Harvard  Master        21
1       UOC     MBI        21
2  Stanford  Master        19
3  Tsinghua     MBI        13
df.replace({21:np.nan})
#Output:
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
54 Check if there are NaN values for each column
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
df.isna().any()
Uni         False
Title       False
Students     True
dtype: bool
55 Count Nan values in each column and in the total
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
nan_cols = df.isna().sum()
nan_total = df.isna().sum().sum()
print('each col:\n', nan_cols)
print('total:', nan_total)
#Output:
each col:
Uni         0
Title       0
Students    2
dtype: int64
total: 2
56 Delete rows with NaN values
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
df.dropna()
Output:
        Uni   Title  Students
2  Stanford  Master      19.0
57 Transform NaN values into 0s
        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
df.fillna(0)
Output:
Uni Title Students
0 Harvard Master 0.0
1 UOC MBI 0.0
2 Stanford Master 19.0