4 Data Manipulation with pandas Flashcards

Question 1

Q

1 Print the first 4 rows of the dataframe

import pandas as pd
data = [[‘tom’, 10], [‘nick’, 15],[‘john’, 29]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

Answer

A

df.head(2)

Name  Age 0   tom   10 1  nick   15

Question 2

Q

2 What does .info() ?

Answer

A

Shows information on each of the columns, such as the data type and number of missing values.

Question 3

Q

3 Get the number of rows and columns of the df

      Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e

#Output:
(5, 2)

Question 4

Q

4 Get the mean, count, quartiles and other statistics with one line of code

   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8

Answer

A

df.describe()

#Output
Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0

Question 5

Q

5 What does the attribute .values?

Answer

A

Gets a two-dimensional NumPy array of values.

Question 6

Q

6 Get the columns names of df

name  age 0   tom   10 1   nick   15

Answer

A

df.columns

Question 7

Q

7 What does the attribute .index ?

Answer

A

An index for the rows: either row numbers or row names.

Question 8

Q

8 Sort df according to name

Name  Age 0   tom   10 1  nick   15 2  juli   14

Answer

A

print(df.sort_values(by =’name’))

name age
1 nick 15
0 tom 10

Question 9

Q

9 Sort by name and Age (descending)

Name Age
0 tom 10
1 tom 15
2 juli 14

Answer

A

print(df.sort_values([“Name”, “Age”],ascending=[True,False]))

Question 10

Q

10 Subset column name as dataframe

Name Age
0 tom 10
1 ana 15
2 juli 14

Answer

A

df[[‘Name’]]

name
0 tom
1 nick

Question 11

Q

11 Filter age equal or greater than 14

name age
0 tom 10
1 nick 15
2 ana 17

Answer

A

df[df[‘Age’]>=14]

name age
1 nick 15
2 ana 17

Question 12

Q

12 Filter age equal or greater than 14 and name not tom

name  age 0   tom   10 1   nick   15 2   ana   17

Answer

A

df[(df[‘age’]>=14) & (df[‘name’]!=’tom’)]

name age
1 nick 15

Question 13

Q

#13 Subsetting rows by categorical variables
#Get tom and ana with a conditional subsetting

name  age 0   tom   10 1   nick   15 2   ana   17

Answer

A

names=[‘tom’,’ana’]

condition = df[‘Name’].isin(names)

chosen =df[condition]
print(chosen)

name age
0 tom 10

Question 14

Q

14 Add a column. Populate it doubling values of column age

name age double
0 tom 10 20
1 nick 15 30

Answer

A

df[‘double’] = df.age*2

Question 15

Q

15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120

name age Age_2
1 nick 15 150

Answer

A

df[‘Age_2’] = df.age*10
old_filter = df[‘Age_2’]>120
df_old =df[old_filter]
print(df_old)

Question 16

Q

16 What is summary statistics?

Answer

A

Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc

Question 17

Q

17 Print the mean of column age

name  age 1  nick   15 2   ana   17

Answer

A

print(df.age.mean())

12.5

Question 18

Q

18 Get maximum value and the minimum of column age

Name Age
0 peter 10
1 ana 15
2 tom 14

Answer

A

print(‘max:’, df.age.max())

print(‘min:’, df.age.min())

Question 19

Q

19 What is IQR in Statistics?

Answer

A

IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.

Question 20

Q

20 Create a custom a IQR function for a dataframe

Answer

A

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

Question 21

Q

21 Get the iqr of the column Double:

Output:
   A  B
0  0  0
1  1  2
2  2  4

IQR_B: 2.0

Answer

A

def iqr(column):
  return column.quantile(0.75) - column.quantile(0.25)

print(df)
print(‘’)
print(‘IQR_B:’, iqr(df.B))

Question 22

Q

#22 Create a DataFrame using a loop and getting random integers (col A).
#B is double of A.
#C is cumulative sum of A
#D cumulative max of A (seed 123)

#Output:
A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6

Answer

A

import numpy as np
import pandas as pd

np.random.seed(123)
a = []
for i in range(3):
a.append(np.random.randint(0,10))

df=pd.DataFrame(a, columns =['A'])
df['B'] =df.A*2
df['C']=df.A.cumsum()
df['D']=df.A.cummax()
print(df)

Question 23

Q

23 Drop duplicates

 name  age Uno    tom   10 Dos   nick   15 Tres  nick   15

Answer

A

df.drop_duplicates(subset=’name’,inplace = True)
print(df)

 name  age Uno   tom   10 Dos  nick   15

Question 24

Q

24 Count different values in column name

 0   1 0   tom  10 1  nick  15 2  nick  15

nick 2
tom 1
Name: 0, dtype: int64

Answer

A

df[0].value_counts()

Question 25

Q

25 Count the percentage of frequencies in column Age in ascending order

name  age 0   tom   10 1  nick   15 2  nick   15

10 0.3
15 0.7
Name: age, dtype: float64

Answer

A

round(df.age.value_counts(normalize=True, sort=False),1)

Question 26

Q

26 Subset rows containing tom and sum their ages

name  age 0   tom   10 1  nick   15 2  nick   19

Answer

A

df[df[‘name’]==’nick’][‘age’].sum()

Question 27

Q

26 Subset rows containing tom and sum their ages

name  age 0   tom   10 1  nick   15 2  nick   19

Answer

A

df[df[‘name’]==’nick’][‘age’].sum()

Question 28

Q

27 Group by names and sum their money

name  money 0   tom     10 1  nick     15 2  nick     19

#Output
name
nick    34
tom     10
Name: money, dtype: int64

Answer

A

df.groupby(by=’name’)[‘money’].sum()

Question 29

Q

28 Group by name and calculate mean and maximum value for age column

name  money 0   tom     10 1  nick     15 2  nick     19

   mean  amax name             nick    17    19 tom     10    10

Answer

A

df.groupby(by=’name’)[‘money’].agg([np.mean,np.max])

Question 30

Q

29 Use pivot table to calculate the mean of ages for each name

name age death
0 tom 18 90
1 nick 15 75
2 nick 23 115

Answer

A

df.pivot_table(index=’Name’, values=’age’,aggfunc=’mean’)

age
name
nick 19

Question 31

Q

30 Use pivot table to calculate the mean of ages and death, and get totals

name age death
0 nick 15 75
1 nick 23 115

Answer

A

print(round(df.pivot_table(index=’name’, values=[‘age’,’death’],
aggfunc=’mean’,margins =True),1))

#Output:
     age  death
name             
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3

Question 32

Q

31 Transform pivot table ‘xx’ into dataframe

      age  death
name             
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3

Answer

A

Output

xx.reset_index(inplace=True)

name age death
0 nick 19.0 95.0
1 tom 18.0 90.0
2 All 18.7 93.3

Question 33

Q

32 Get tom with a boolean

name money
0 tom 10
1 nick 15
2 nick 19

#Output
0     True
1    False
2    False
Name: name, dtype: bool

Answer

A

selection= [‘tom’]

df.name.isin(selection)

Question 34

Q

33 Get tom using a list. Subset data frame

name money
0 tom 10
1 nick 15
2 nick 19

Name Age Death
1 tom 12 108
2 tom 10 90

Answer

A

selection = [‘tom’]

print(df[df.name.isin(selection)])

Question 35

Q

34 Multi index (year,month)and subset with loc

    month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45

      sale year month       2012 1        55 2014 10       31

Answer

A

df.set_index([‘year’,’month’],inplace=True)
rows_to_keep = [(2012,1),(2014,10)]
print(df.loc[rows_to_keep])

Question 36

Q

35 Add multiindex and sort

   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45

#Output:
          sale
year month      
2016 4        84
2015 4        40
2014 10       31
2013 9        45
2012 1        55

Answer

A

df.set_index([‘year’,’month’],inplace=True)

print(df.sort_index(level=[‘year’,’month’], ascending=[False,True]))

Question 37

Q

36 Set index (year), sort and slice with loc (years,2012 to 2014)

month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45

       month  sale
year             
2012      1    55
2013      9    45
2014     10    31

Answer

A

df.set_index(‘year’,inplace=True)
df.sort_index(inplace=True)
print(df.loc[2012:2014])

Question 38

Q

37 Create a column with the length of ‘name’

  Name  len 0    peter    5 1     ford    4 2  santana    7

Answer

A

df[‘len’]=df[‘Name’].str.len()

Question 39

Q

Output

#39 Create a dataframe with shape 3,4
#with random number with mean 0 and variance 1
#time index from January 1 2013 to January 3

np.random.seed(123)

         0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2

Answer

A

df = pd.DataFrame(np.random.randn(3, 2),
index=pd.date_range(‘20130101’, periods=3))

print(round(df,1))

Question 40

Q

40 What are offset aliases?

Answer

A

A number of string aliases given to useful common time series frequencies.

e.g ‘Y’==year

Question 41

Q

41 Slice in both directions.

                  0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2

       0 2013-01-01  0.550524 2013-01-02 -1.108726

Answer

A

print(df.iloc[:2,:1])

Question 42

Q

42 Slice Time series starting with 2014

                  0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186

Answer

A

Output:

print(df.loc[‘2014’:])

            0 2014-12-31  0.055680 2015-12-31  0.136186

Question 43

Q

43 Populate column year

        0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186

            0  year 2013-12-31 -0.167130  2013 2014-12-31  0.025054  2014 2015-12-31 -0.885574  2015

Answer

A

df[‘year’] = df.index.year

Question 44

Q

44 Reset index (date)and change its name for ‘date’ using a dictionary

   0 2013-12-31  1.069279 2014-12-31  0.598574 2015-12-31 -0.247078

Answer

A

Output

df=df.reset_index().rename(columns={‘index’: ‘bar’})

date 0
0 2013-12-31 0.285032
1 2014-12-31 2.681187
2 2015-12-31 -0.249268

Question 45

Q

45 Delete column col ‘len’ from df

 Name  len 0    peter    5 1     ford    4 2  santana    7

Answer

A

del(df[‘len’])

#Output:
Name
0    peter
1     ford
2  santana

Question 46

Q

46 Transform dataframe into pivot table

A B C
0 Fer Masters NaN
1 Ger Graduate 22.0
2 John Graduate NaN

B     Graduate  Masters
A                      
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN

Answer

A

df.pivot(‘A’, ‘B’, ‘C’)

Or

df.pivot(index =’A’, columns =’B’, values =’C’)

Question 47

Q

47 Subsetting pivot tables (pt)

B     Graduate  Masters
A                      
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN

B Graduate Masters
A
Fer NaN NaN
Ger 22.0 NaN

Answer

A

pt_sub = pt.loc[‘Fer’:’Ger’]

print(pt_sub)

Question 48

Q

48 Calculate the mean of the columns in this pivot table (pt)

B     Graduate  Masters
A                      
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN

#Output:
B
Graduate    22.0
Masters      NaN
dtype: float64

Answer

A

pt.mean()

Question 49

Q

49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib)

  A         B   C 0  John   Masters  27 1  Boby  Graduate  23 2  Mina  Graduate  21

B
Graduate 20.5
Masters 15.0
Name: C, dtype: float64

Answer

A

import matplotlib.pyplot as plt
gb = df.groupby(‘B’)[‘C’].mean()
gb.plot(kind =’bar’)
plt.show()

Question 50

Q

50 Group by dates and visualize column 0 with a line chart

dates 0
0 2013-12-31 0.537526
1 2014-12-31 -0.290812
2 2015-12-31 0.679620

dates
2013-12-31    0.537526
2014-12-31   -0.290812
2015-12-31    0.679620
Name: 0, dtype: float64

Answer

A

import matplotlib.pyplot as plt
df_grouped=df.groupby(by='dates')[0].sum()
df_grouped.plot(kind='line')
print(df_grouped)
plt.show()

Question 51

Q

51 Scatter plot with 0 in the x axis and Scatter as title

import pandas as pd
df = pd.DataFrame(np.random.randn(3))
print(df)

Answer

A

Output: (plot in ipynb doc)

import matplotlib.pyplot as plt

df. reset_index().plot(kind=’scatter’, x=’index’, y=0)
plt. show()

Question 52

Q

52 display two histograms showing Titles (C ).Transparency 0.5 and two bins

df = pd.DataFrame({‘Uni’: [‘Harvard’, ‘UOC’, ‘Stanford’,’Tsinghua’],
‘Title’: [‘Master’, ‘MBI’, ‘Master’,’MBI’],
‘Students’: [15, 22, 19,13]})

Answer

A

df[df[‘Title’]==’MBI’][‘Students’].hist(alpha=0.5,bins=2)
df[df[‘Title’]==’Master’][‘Students’].hist(alpha=0.5,bins=2)
plt.legend([‘MBI’,’Master’])
plt.show()

Question 53

Q

53 Replace number 21 in df with NaN values

    Uni   Title  Students
0   Harvard  Master        21
1       UOC     MBI        21
2  Stanford  Master        19
3  Tsinghua     MBI        13

Answer

A

df.replace({21:np.nan})

#Output:
Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0

Question 54

Q

54 Check if there are NaN values for each column

   Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0

Answer

A

Uni False

df.isna().any()

Title False
Students True
dtype: bool

Question 55

Q

55 Count Nan values in each column and in the total

        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0

Answer

A

nan_cols=df.isna().sum()
nan_total=df.isna().sum().sum()
print(‘each col:\n’, nan_cols)
print(‘total:’,nan_total)

#Output:
each col:
 Uni         0
Title       0
Students    2
dtype: int64
total: 2

Question 56

Q

56 Delete Nan values

     Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0

Answer

A

Output:

df.dropna()

  Uni   Title  Students 2  Stanford  Master      19.0

Question 57

Q

57 Transform NaN values into 0s

   Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0

Answer

A

Output:

df.fillna(0)

Uni Title Students
0 Harvard Master 0.0
1 UOC MBI 0.0
2 Stanford Master 19.0

Brainscape's Knowledge GenomeTM

4 Data Manipulation with pandas Flashcards

ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing

Brainscape's Knowledge Genome^TM