4 Data Manipulation with pandas Flashcards
ipynb https://colab.research.google.com/drive/1sl64J9JdozMl39t0PZ3KJrYUqoYXKDgV?usp=sharing
1 Print the first 2 rows of the dataframe
import pandas as pd
data = [['tom', 10], ['nick', 15], ['john', 29]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
df.head(2)
   Name  Age
0   tom   10
1  nick   15
2 What does .info() do?
Shows information on each of the columns, such as the data type and number of missing values.
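A minimal sketch of what this looks like in practice. The data here is hypothetical (one name is deliberately missing so the non-null counts differ), and the `isna().sum()` line is an extra way to read the missing-value counts directly, not part of `.info()` itself.

```python
import pandas as pd

# Hypothetical data with one missing name, so the non-null counts differ
df = pd.DataFrame({"Name": ["tom", "nick", None], "Age": [10, 15, 29]})

# Prints the index range, each column's non-null count and dtype, and memory usage
df.info()

# Missing values per column, counted explicitly
missing = df.isna().sum()
print(missing)
```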
3 Get the number of rows and columns of the df
   Number Letter
0       0      h
1       1      o
2       2      u
3       3      s
4       4      e
df.shape
#Output: (5, 2)
4 Get the mean, count, quartiles and other statistics with one line of code
   Number  Double
0       0       0
1       1       2
2       2       4
3       3       6
4       4       8
df.describe()
#Output
       Number  Double
count     5.0     5.0
mean      2.0     4.0
std       1.6     3.2
min       0.0     0.0
25%       1.0     2.0
50%       2.0     4.0
75%       3.0     6.0
max       4.0     8.0
5 What does the attribute .values return?
A two-dimensional NumPy array of the DataFrame's values.
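As a small illustration, assuming a two-row DataFrame like the ones used in these cards, the attribute can be inspected like this:

```python
import pandas as pd

df = pd.DataFrame({"name": ["tom", "nick"], "age": [10, 15]})

arr = df.values           # the underlying data as a NumPy array
print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2): two rows, two columns
print(arr[1, 0])           # nick
```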
6 Get the column names of df
   name  age
0   tom   10
1  nick   15
df.columns
7 What does the attribute .index return?
An index for the rows: either row numbers or row names.
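A short sketch of both cases, using hypothetical data: a default index of row numbers, and an index of row names passed in when the DataFrame is built.

```python
import pandas as pd

# Default index: row numbers
df = pd.DataFrame({"age": [10, 15]})
print(df.index)  # RangeIndex(start=0, stop=2, step=1)

# Index of row names, set explicitly at construction
named = pd.DataFrame({"age": [10, 15]}, index=["tom", "nick"])
print(list(named.index))  # ['tom', 'nick']
```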
8 Sort df according to name
   name  age
0   tom   10
1  nick   15
2  juli   14
print(df.sort_values(by='name'))
   name  age
2  juli   14
1  nick   15
0   tom   10
9 Sort by Name (ascending) and Age (descending)
Name Age
0 tom 10
1 tom 15
2 juli 14
print(df.sort_values(['Name', 'Age'], ascending=[True, False]))
10 Subset column Name as a dataframe
Name Age
0 tom 10
1 ana 15
2 juli 14
df[['Name']]
   Name
0   tom
1   ana
2  juli
11 Filter rows where age is greater than or equal to 14
name age
0 tom 10
1 nick 15
2 ana 17
df[df['age'] >= 14]
name age
1 nick 15
2 ana 17
12 Filter rows where age is greater than or equal to 14 and name is not tom
   name  age
0   tom   10
1  nick   15
2   ana   17
df[(df['age'] >= 14) & (df['name'] != 'tom')]
   name  age
1  nick   15
2   ana   17
13 Subsetting rows by categorical variables: get tom and ana with conditional subsetting
   name  age
0   tom   10
1  nick   15
2   ana   17
names = ['tom', 'ana']
condition = df['name'].isin(names)
chosen = df[condition]
print(chosen)
  name  age
0  tom   10
2  ana   17
14 Add a column called double, populated by doubling the values of column age
name age double
0 tom 10 20
1 nick 15 30
df['double'] = df.age * 2
15 Add a column called Age_2 (age times 10) and keep only the rows where it is greater than 120
name age Age_2
1 nick 15 150
df['Age_2'] = df.age * 10
old_filter = df['Age_2'] > 120
df_old = df[old_filter]
print(df_old)
16 What are summary statistics?
Numbers that give a quick and simple description of the data. They can include the mean, median, mode, minimum value, maximum value, range, standard deviation, etc.
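For instance, a few of these can be computed on a small hypothetical column of ages. The `summary` dictionary is only an illustrative way to collect them in one place.

```python
import pandas as pd

ages = pd.Series([10, 15, 17], name="age")  # hypothetical data

summary = {
    "mean": ages.mean(),               # 14.0
    "median": ages.median(),           # 15.0
    "min": ages.min(),                 # 10
    "max": ages.max(),                 # 17
    "range": ages.max() - ages.min(),  # 7
}
print(summary)
```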
17 Print the mean of column age
   name  age
1  nick   15
2   ana   17
print(df.age.mean())
16.0
18 Get maximum value and the minimum of column age
Name Age
0 peter 10
1 ana 15
2 tom 14
print('max:', df.Age.max())
print('min:', df.Age.min())
19 What is IQR in Statistics?
IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
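The steps above can be followed by hand on a small, hypothetical list of values. This sketch uses the median-of-halves method exactly as described; pandas' `quantile` interpolates between points by default, so it can give a slightly different number on the same data.

```python
from statistics import median

values = sorted([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical data

half = len(values) // 2
lower = values[:half]    # lower half: [1, 2, 3, 4]
upper = values[-half:]   # upper half: [5, 6, 7, 8]

q1 = median(lower)       # 2.5
q3 = median(upper)       # 6.5
iqr = q3 - q1            # 4.0
print(q1, q3, iqr)
```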
20 Create a custom IQR function for a dataframe column
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
21 Get the IQR of column B:
Output:
   A  B
0  0  0
1  1  2
2  2  4

IQR_B: 2.0
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
print(df)
print('')
print('IQR_B:', iqr(df.B))
22 Create a DataFrame using a loop that draws random integers for column A (seed 123). B is double A, C is the cumulative sum of A, and D is the cumulative max of A
#Output:
   A   B   C  D
0  2   4   2  2
1  2   4   4  2
2  6  12  10  6
import numpy as np
import pandas as pd
np.random.seed(123)
a = []
for i in range(3):
    a.append(np.random.randint(0, 10))
df = pd.DataFrame(a, columns=['A'])
df['B'] = df.A * 2
df['C'] = df.A.cumsum()
df['D'] = df.A.cummax()
print(df)