Second Flashcards
Build df from dictionary
pd.DataFrame()
data = {"country": ["Nepal", "India", "China"], "Capital": ["Kathmandu", "Delhi", "Beijing"]}
SA = pd.DataFrame(data)
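A minimal runnable sketch of the card above. The variable is named data rather than dict so the Python builtin is not shadowed; the shape and column checks are additions for illustration.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
data = {
    "country": ["Nepal", "India", "China"],
    "Capital": ["Kathmandu", "Delhi", "Beijing"],
}
SA = pd.DataFrame(data)

print(SA.shape)          # (number of rows, number of columns)
print(list(SA.columns))  # column names come from the dict keys
```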
CSV to DataFrame
df = pd.read_csv('path/to/dataframe.csv')
CSV to DataFrame (with index)
1. Index the df by its first column
index_col = 0
df = pd.read_csv('path/to/dataframe.csv', index_col=0)
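A small sketch of what index_col=0 changes. The CSV text is hypothetical and read from an in-memory string (io.StringIO) so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a real file; values are made up.
csv_text = "name,breed,weight_kg\nBella,Labrador,25\nMax,Labrador,29\n"

# Without index_col, pandas adds a default 0, 1, 2, ... index.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first CSV column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.index.tolist())
print(indexed.index.tolist())
```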
Sort from highest to lowest
df.sort_values('col', ascending=False)
Sort by multiple variables
df.sort_values(['col1', 'col2'])
- Sort by col1 first
- Then break ties by col2
dogs.sort_values(['weight_kg', 'height_cm'], ascending=[True, False])
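A sketch of the two sorting cards on a tiny frame. The dogs values here are made up for illustration; only the column names come from the card.

```python
import pandas as pd

# Made-up values, for illustration only.
dogs = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "weight_kg": [24, 24, 17, 29],
    "height_cm": [56, 59, 43, 49],
})

# Highest to lowest on one column.
heaviest_first = dogs.sort_values("weight_kg", ascending=False)

# Lightest first; among equal weights, tallest first.
by_both = dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

print(by_both)
```

sort_values returns a new DataFrame, so dogs itself keeps its original order.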
Subset rows
df['col'] > #   (# stands for any number)
0 True
1 False
2 True
Subset rows and get data
df[df['col'] > #]
0 Bella Labrador Brown
4 Max Labrador Black
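The two cards above are one idea in two steps: the comparison builds a True/False mask, and passing the mask back to the df keeps only the True rows. A sketch with made-up heights and an assumed threshold of 50:

```python
import pandas as pd

# Made-up data, for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 59],
})

# Step 1: the comparison returns a boolean Series, one value per row.
mask = dogs["height_cm"] > 50

# Step 2: indexing with the mask keeps the rows where it is True.
tall = dogs[mask]

print(mask)
print(tall)
```

The original row labels are kept in the result, which is why a subset can show index values such as 0 and 4 with gaps in between.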
Subset based on a string
df[df['col1'] == 'string']
Subset based on date
df[df['col_date'] > '2015-01-01']
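A sketch of the date card, assuming col_date holds real datetimes (converted here with pd.to_datetime). The dates and names are made up. pandas parses the ISO-format string on the right-hand side before comparing.

```python
import pandas as pd

# Made-up dates, for illustration only.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "col_date": pd.to_datetime(["2013-07-01", "2016-09-16", "2015-01-01"]),
})

# Strictly after 1 January 2015; the row dated exactly 2015-01-01 is excluded.
recent = df[df["col_date"] > "2015-01-01"]

print(recent)
```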
Subset on multiple conditions
is_x = df['col1'] == 'x'
is_y = df['col2'] == 'y'
df[is_x & is_y]
  col1 col2  number
0    x    y      56
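A sketch of combining conditions. The rows are made up so that only one matches both tests. The one-line form needs parentheses around each comparison because & binds more tightly than == in Python.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "col1": ["x", "x", "z"],
    "col2": ["y", "w", "y"],
    "number": [56, 12, 30],
})

is_x = df["col1"] == "x"
is_y = df["col2"] == "y"

# Element-wise AND of the two masks.
both = df[is_x & is_y]

# Same filter in one line; each comparison is wrapped in parentheses.
same = df[(df["col1"] == "x") & (df["col2"] == "y")]

print(both)
```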
Subset using isin()
is_x_or_y = df['col1'].isin(['x', 'y'])
df[is_x_or_y]
is_black_or_brown = dogs['color'].isin(['black', 'brown'])
dogs[is_black_or_brown]
  name  color  height
0  Bel  brown      88
4  Max  black      55
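A sketch showing that isin() is shorthand for chaining == tests with | (element-wise OR). The frame is made up; note that the match is case-sensitive, so 'Brown' would not match 'brown'.

```python
import pandas as pd

# Made-up data, for illustration only.
dogs = pd.DataFrame({
    "name": ["Bel", "Ace", "Max", "Sky"],
    "color": ["brown", "white", "black", "gray"],
})

# True where the color is any value in the list.
is_black_or_brown = dogs["color"].isin(["black", "brown"])

# Equivalent mask written out with OR.
spelled_out = (dogs["color"] == "black") | (dogs["color"] == "brown")

print(dogs[is_black_or_brown])
```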
Adding new column + mutating df
df['new_col'] = df['col'] / 100
Individuals per 10K
df['per_10k'] = 10000 * df['number'] / df['total']
homelessness['indv_per_10k'] = 10000 * homelessness['individuals'] / homelessness['state_pop']
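A sketch of the per-10K card. The state labels and counts are made up; only the column names come from the card. Assigning to a column name that does not exist yet adds the column to the df in place.

```python
import pandas as pd

# Made-up counts, for illustration only.
homelessness = pd.DataFrame({
    "state": ["S1", "S2"],
    "individuals": [500, 1200],
    "state_pop": [1_000_000, 3_000_000],
})

# The arithmetic is applied row by row across whole columns.
homelessness["indv_per_10k"] = (
    10000 * homelessness["individuals"] / homelessness["state_pop"]
)

print(homelessness)
```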
Summary Stats
df['col'].mean()
df['col'].median()
df['col'].mode()
df['col'].min()
df['col'].max()
df['col'].var()
df['col'].std()
df['col'].sum()
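A sketch running the summary methods on a small made-up column. Two details worth remembering: mode() returns a Series (there can be ties), and var() and std() default to the sample versions (ddof=1).

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"col": [1, 2, 2, 3, 7]})
col = df["col"]

print(col.mean())    # arithmetic average
print(col.median())  # middle value
print(col.mode())    # most frequent value(s), returned as a Series
print(col.min(), col.max())
print(col.var())     # sample variance (divides by n - 1)
print(col.std())     # sample standard deviation, the square root of var()
print(col.sum())
```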
Quantile
df['col'].quantile()
Quantiles are cut points that divide a sample into equal-sized, adjacent subgroups
With no argument, quantile() returns the 0.5 quantile (the median)
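A sketch of quantile() on a made-up column. Passing a list of fractions returns one cut point per fraction; the three quartile fractions used here are an illustrative choice, not from the card.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"col": [1, 2, 3, 4, 5]})

# Default q=0.5, i.e. the median.
middle = df["col"].quantile()

# Quartiles: cut points that split the sample into four equal-sized groups.
quartiles = df["col"].quantile([0.25, 0.5, 0.75])

print(middle)
print(quartiles)
```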