4 Data Manipulation with pandas Flashcards

ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing

1
Q

1 Print the first 4 rows of the dataframe

import pandas as pd
data = [[‘tom’, 10], [‘nick’, 15],[‘john’, 29]]
df = pd.DataFrame(data, columns = [‘Name’, ‘Age’])

A

df.head(2)

Name  Age 0   tom   10 1  nick   15
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

2 What does .info() ?

A

Shows information on each of the columns, such as the data type and number of missing values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

3 Get the number of rows and columns of the df

      Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e
#Output:
(5, 2)
A

df.shape

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

4 Get the mean, count, quartiles and other statistics with one line of code

   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8
A

df.describe()

#Output
Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

5 What does the attribute .values?

A

Gets a two-dimensional NumPy array of values.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

6 Get the columns names of df

name  age 0   tom   10 1   nick   15
A

df.columns

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

7 What does the attribute .index ?

A

An index for the rows: either row numbers or row names.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

8 Sort df according to name

Name  Age 0   tom   10 1  nick   15 2  juli   14
A

print(df.sort_values(by =’name’))

name age
1 nick 15
0 tom 10

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

9 Sort by name and Age (descending)

Name Age
0 tom 10
1 tom 15
2 juli 14

A

print(df.sort_values([“Name”, “Age”],ascending=[True,False]))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

10 Subset column name as dataframe

Name Age
0 tom 10
1 ana 15
2 juli 14

A

df[[‘Name’]]

name
0 tom
1 nick

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

11 Filter age equal or greater than 14

name age
0 tom 10
1 nick 15
2 ana 17

A

df[df[‘Age’]>=14]

name age
1 nick 15
2 ana 17

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

12 Filter age equal or greater than 14 and name not tom

name  age 0   tom   10 1   nick   15 2   ana   17
A

df[(df[‘age’]>=14) & (df[‘name’]!=’tom’)]

name age
1 nick 15

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q
#13 Subsetting rows by categorical variables
#Get tom and ana with a conditional subsetting
name  age 0   tom   10 1   nick   15 2   ana   17
A

names=[‘tom’,’ana’]

condition = df[‘Name’].isin(names)

chosen =df[condition]
print(chosen)

name age
0 tom 10

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

14 Add a column. Populate it doubling values of column age

name age double
0 tom 10 20
1 nick 15 30

A

df[‘double’] = df.age*2

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

15 Add a column called age_2 and subset it using a conditional filter with a cutoff of 120

name age Age_2
1 nick 15 150

A

df[‘Age_2’] = df.age*10
old_filter = df[‘Age_2’]>120
df_old =df[old_filter]
print(df_old)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

16 What is summary statistics?

A

Information that gives a quick and simple description of the data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation, etc

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

17 Print the mean of column age

name  age 1  nick   15 2   ana   17
A

print(df.age.mean())

12.5

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

18 Get maximum value and the minimum of column age

Name Age
0 peter 10
1 ana 15
2 tom 14

A

print(‘max:’, df.age.max())

print(‘min:’, df.age.min())

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

19 What is IQR in Statistics?

A

IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), ​first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

20 Create a custom a IQR function for a dataframe

A
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

21 Get the iqr of the column Double:

Output:
   A  B
0  0  0
1  1  2
2  2  4

IQR_B: 2.0

A
def iqr(column):
  return column.quantile(0.75) - column.quantile(0.25)

print(df)
print(‘’)
print(‘IQR_B:’, iqr(df.B))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q
#22 Create a DataFrame using a loop and getting random integers (col A).
#B is double of A.
#C is cumulative sum of A
#D cumulative max of A (seed 123)
#Output:
A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6
A

import numpy as np
import pandas as pd

np.random.seed(123)
a = []
for i in range(3):
a.append(np.random.randint(0,10))

df=pd.DataFrame(a, columns =['A'])
df['B'] =df.A*2
df['C']=df.A.cumsum()
df['D']=df.A.cummax()
print(df)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

23 Drop duplicates

 name  age Uno    tom   10 Dos   nick   15 Tres  nick   15
A

df.drop_duplicates(subset=’name’,inplace = True)
print(df)

 name  age Uno   tom   10 Dos  nick   15
24
Q

24 Count different values in column name

 0   1 0   tom  10 1  nick  15 2  nick  15

nick 2
tom 1
Name: 0, dtype: int64

A

df[0].value_counts()

25
Q

25 Count the percentage of frequencies in column Age in ascending order

name  age 0   tom   10 1  nick   15 2  nick   15

10 0.3
15 0.7
Name: age, dtype: float64

A

round(df.age.value_counts(normalize=True, sort=False),1)

26
Q

26 Subset rows containing tom and sum their ages

name  age 0   tom   10 1  nick   15 2  nick   19
A

df[df[‘name’]==’nick’][‘age’].sum()

27
Q

26 Subset rows containing tom and sum their ages

name  age 0   tom   10 1  nick   15 2  nick   19
A

df[df[‘name’]==’nick’][‘age’].sum()

28
Q

27 Group by names and sum their money

name  money 0   tom     10 1  nick     15 2  nick     19
#Output
name
nick    34
tom     10
Name: money, dtype: int64
A

df.groupby(by=’name’)[‘money’].sum()

29
Q

28 Group by name and calculate mean and maximum value for age column

name  money 0   tom     10 1  nick     15 2  nick     19
   mean  amax name             nick    17    19 tom     10    10
A

df.groupby(by=’name’)[‘money’].agg([np.mean,np.max])

30
Q

29 Use pivot table to calculate the mean of ages for each name

name age death
0 tom 18 90
1 nick 15 75
2 nick 23 115

A

df.pivot_table(index=’Name’, values=’age’,aggfunc=’mean’)

age
name
nick 19

31
Q

30 Use pivot table to calculate the mean of ages and death, and get totals

name age death
0 nick 15 75
1 nick 23 115

A

print(round(df.pivot_table(index=’name’, values=[‘age’,’death’],
aggfunc=’mean’,margins =True),1))

#Output:
     age  death
name             
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
32
Q

31 Transform pivot table ‘xx’ into dataframe

      age  death
name             
nick  19.0   95.0
tom   18.0   90.0
All   18.7   93.3
A

Output

xx.reset_index(inplace=True)

name age death
0 nick 19.0 95.0
1 tom 18.0 90.0
2 All 18.7 93.3

33
Q

32 Get tom with a boolean

name money
0 tom 10
1 nick 15
2 nick 19

#Output
0     True
1    False
2    False
Name: name, dtype: bool
A

selection= [‘tom’]

df.name.isin(selection)

34
Q

33 Get tom using a list. Subset data frame

name money
0 tom 10
1 nick 15
2 nick 19

Name Age Death
1 tom 12 108
2 tom 10 90

A

selection = [‘tom’]

print(df[df.name.isin(selection)])

35
Q

34 Multi index (year,month)and subset with loc

    month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
      sale year month       2012 1        55 2014 10       31
A

df.set_index([‘year’,’month’],inplace=True)
rows_to_keep = [(2012,1),(2014,10)]
print(df.loc[rows_to_keep])

36
Q

35 Add multiindex and sort

   month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
#Output:
          sale
year month      
2016 4        84
2015 4        40
2014 10       31
2013 9        45
2012 1        55
A

df.set_index([‘year’,’month’],inplace=True)

print(df.sort_index(level=[‘year’,’month’], ascending=[False,True]))

37
Q

36 Set index (year), sort and slice with loc (years,2012 to 2014)

month  year  sale
0      1  2012    55
1      4  2015    40
2      4  2016    84
3     10  2014    31
4      9  2013    45
       month  sale
year             
2012      1    55
2013      9    45
2014     10    31
A

df.set_index(‘year’,inplace=True)
df.sort_index(inplace=True)
print(df.loc[2012:2014])

38
Q

37 Create a column with the length of ‘name’

  Name  len 0    peter    5 1     ford    4 2  santana    7
A

df[‘len’]=df[‘Name’].str.len()

39
Q

Output

#39 Create a dataframe with shape 3,4
#with random number with mean 0 and variance 1
#time index from January 1 2013 to January 3

np.random.seed(123)

         0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2
A

df = pd.DataFrame(np.random.randn(3, 2),
index=pd.date_range(‘20130101’, periods=3))

print(round(df,1))

40
Q

40 What are offset aliases?

A

A number of string aliases given to useful common time series frequencies.

e.g ‘Y’==year

41
Q

41 Slice in both directions.

                  0    1 2013-01-01  2.1 -2.8 2013-01-02 -0.4  0.2 2013-01-03  0.6 -0.2
       0 2013-01-01  0.550524 2013-01-02 -1.108726
A

print(df.iloc[:2,:1])

42
Q

42 Slice Time series starting with 2014

                  0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186
A

Output:

print(df.loc[‘2014’:])

            0 2014-12-31  0.055680 2015-12-31  0.136186
43
Q

43 Populate column year

        0 2013-12-31 -0.539127 2014-12-31  0.055680 2015-12-31  0.136186
            0  year 2013-12-31 -0.167130  2013 2014-12-31  0.025054  2014 2015-12-31 -0.885574  2015
A

df[‘year’] = df.index.year

44
Q

44 Reset index (date)and change its name for ‘date’ using a dictionary

   0 2013-12-31  1.069279 2014-12-31  0.598574 2015-12-31 -0.247078
A

Output

df=df.reset_index().rename(columns={‘index’: ‘bar’})

date 0
0 2013-12-31 0.285032
1 2014-12-31 2.681187
2 2015-12-31 -0.249268

45
Q

45 Delete column col ‘len’ from df

 Name  len 0    peter    5 1     ford    4 2  santana    7
A

del(df[‘len’])

#Output:
Name
0    peter
1     ford
2  santana
46
Q

46 Transform dataframe into pivot table

A B C
0 Fer Masters NaN
1 Ger Graduate 22.0
2 John Graduate NaN

B     Graduate  Masters
A                      
Fer        NaN      NaN
Ger       22.0      NaN
John       NaN      NaN
A

df.pivot(‘A’, ‘B’, ‘C’)

Or

df.pivot(index =’A’, columns =’B’, values =’C’)

47
Q

47 Subsetting pivot tables (pt)

B     Graduate  Masters
A                      
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN

B Graduate Masters
A
Fer NaN NaN
Ger 22.0 NaN

A

pt_sub = pt.loc[‘Fer’:’Ger’]

print(pt_sub)

48
Q

48 Calculate the mean of the columns in this pivot table (pt)

B     Graduate  Masters
A                      
Boby      23.0      NaN
John       NaN     27.0
Mina      21.0      NaN
#Output:
B
Graduate    22.0
Masters      NaN
dtype: float64
A

pt.mean()

49
Q

49 Group by titles, calculate the age mean and visualize it with a bar plot (matplotlib)

  A         B   C 0  John   Masters  27 1  Boby  Graduate  23 2  Mina  Graduate  21

B
Graduate 20.5
Masters 15.0
Name: C, dtype: float64

A

import matplotlib.pyplot as plt
gb = df.groupby(‘B’)[‘C’].mean()
gb.plot(kind =’bar’)
plt.show()

50
Q

50 Group by dates and visualize column 0 with a line chart

dates 0
0 2013-12-31 0.537526
1 2014-12-31 -0.290812
2 2015-12-31 0.679620

dates
2013-12-31    0.537526
2014-12-31   -0.290812
2015-12-31    0.679620
Name: 0, dtype: float64
A
import matplotlib.pyplot as plt
df_grouped=df.groupby(by='dates')[0].sum()
df_grouped.plot(kind='line')
print(df_grouped)
plt.show()
51
Q

51 Scatter plot with 0 in the x axis and Scatter as title

import pandas as pd
df = pd.DataFrame(np.random.randn(3))
print(df)

A

Output: (plot in ipynb doc)

import matplotlib.pyplot as plt

df. reset_index().plot(kind=’scatter’, x=’index’, y=0)
plt. show()

52
Q

52 display two histograms showing Titles (C ).Transparency 0.5 and two bins

df = pd.DataFrame({‘Uni’: [‘Harvard’, ‘UOC’, ‘Stanford’,’Tsinghua’],
‘Title’: [‘Master’, ‘MBI’, ‘Master’,’MBI’],
‘Students’: [15, 22, 19,13]})

A

df[df[‘Title’]==’MBI’][‘Students’].hist(alpha=0.5,bins=2)
df[df[‘Title’]==’Master’][‘Students’].hist(alpha=0.5,bins=2)
plt.legend([‘MBI’,’Master’])
plt.show()

53
Q

53 Replace number 21 in df with NaN values

    Uni   Title  Students
0   Harvard  Master        21
1       UOC     MBI        21
2  Stanford  Master        19
3  Tsinghua     MBI        13
A

df.replace({21:np.nan})

#Output:
Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
54
Q

54 Check if there are NaN values for each column

   Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
A

Uni False

df.isna().any()

Title False
Students True
dtype: bool

55
Q

55 Count Nan values in each column and in the total

        Uni   Title  Students
0   Harvard  Master       NaN
1       UOC     MBI       NaN
2  Stanford  Master      19.0
3  Tsinghua     MBI      13.0
A

nan_cols=df.isna().sum()
nan_total=df.isna().sum().sum()
print(‘each col:\n’, nan_cols)
print(‘total:’,nan_total)

#Output:
each col:
 Uni         0
Title       0
Students    2
dtype: int64
total: 2
56
Q

56 Delete Nan values

     Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0
A

Output:

df.dropna()

  Uni   Title  Students 2  Stanford  Master      19.0
57
Q

57 Transform NaN values into 0s

   Uni   Title  Students 0   Harvard  Master       NaN 1       UOC     MBI       NaN 2  Stanford  Master      19.0
A

Output:

df.fillna(0)

Uni Title Students
0 Harvard Master 0.0
1 UOC MBI 0.0
2 Stanford Master 19.0