Second Flashcards

1
Q

Build df from dictionary

A

pd.DataFrame()

dict = { “country” : [“Nepal”, “India”, “China”], “Capital” : [‘Kathmandu”, “Delhi”, “Bejing”]}

SA = pd.DataFrame(dict)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

CSV to DataFrame

A

df = pd.read_csv(‘path/to/dataframe.csv)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

CSV to DataFrame

1. Index the df

A

index_col = 0

df = pd.read_csv(‘path/to/dataframe.csv, index_col = 0)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Sort from highest to lowest

A

df.sort_values(‘col’, ascending = False)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Sort by multiple variables

A

df.sort_values([‘col1’, ‘col2’])

  1. Sort by the col1
  2. Then sort by col2

dogs.sort_values([‘weight_kg’, ‘height_cm’], ascending = [True, False])

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Subset rows

A

df[col] > #

0 True
1 False
2 True

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Subset rows and get data

A

df[df[‘col’] > #]

0 Bella Labrador Brown
4 Max Labrador Black

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Subset based on a string

A

df[df[‘col1’] == ‘string’]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Subset based on date

A

df[df[‘col_date’] > “2015-01-01”]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Subset on multiple condiditons

A

is_x = df[‘col1’] == ‘x’
is_y = df[‘col2’] == ‘y’
df[is_x & is_y]

 col1   col2   number 0    x        y          56
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Subset using isin()

A

is_x_or_y = df[‘col1].isin([‘x’, ‘y’])
df[is_x_or_y]

is_black_or_brown = dogs[‘color’].isin([‘black’, ‘brown’])
df[is_black_or_brown]

  name    color       height   0     Bel       brown       88 4     Max      black        55
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Adding new column + mutating df

A

df[‘new_col’] = df[‘col’] / 100

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Individuals per 10K

A

df[‘per_10k’] = 10000 * df[‘number’] / df[‘total’]

homelessness[‘indv_per_10k’] = 10000 * homelessness[‘individuals’] / homelessness[‘state_pop’]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Summary Stats

A

df[‘col’].mean()

df[‘col’].median()

df[‘col’].mode()

df[‘col’].min()

df[‘col’].max()

df[‘col’].var()

df[‘col’].std()

df[‘col’].sum()

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Quantile

A

df[‘col’].quantile()

Where a sample is divided into equal-sized, adjacent subgroups

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Aggregate summary stats from one column

A

df[‘col’].agg(‘function’)

df[‘ext price’].agg(‘sum’)

             ext price sum          2.1
17
Q

Aggregate multiple summary stats from multiple columns

A

df[[‘col1’, ‘col2’]].agg([‘func1’, ‘func2’])

df[[‘price’, ‘quantity’]].agg([‘sum’, ‘mean’])

           price       quantity sum          2.1            25 mean        1                12
18
Q

Cumulative Sum

A

df[‘col1’].cumsum()

19
Q

Drop duplicates

A

df.drop_duplicates(subset = ‘col’)

vet_visits.drop_duplicates(subset = ‘breed’)

20
Q

Drop duplicates multiple columns

A

df.drop_duplicates(subset = [‘col1’, ‘col2’])

vet_visits.drop_duplicates(subset = [‘name’, ‘breed’])

21
Q

Count number of occurrences

A

df[‘col’].value_counts()

unique_dogs[‘breed’].value_counts()

labrador 4
Chow Chow 3
Poodle 2

22
Q

Count number of occurrences & sort values

A

df[‘col’].value_counts(sort = True)

23
Q

Count proportions of Total

A

df[‘col’].value_counts(normalize = True)

Lab 0.5
Chow 0.25
Poodle 0.25

24
Q

Grouped Summary Statistics by attribute Pt2

A

df[df[‘col’] == ‘attribute’][‘measure col’].stat()

dogs[dogs[‘color’] == ‘black’][‘weight’].mean()

dogs[dogs[‘color’] == ‘black’][‘weight’].sum()
dogs[dogs[‘color’] == ‘brown’][‘weight’].sum()

26
24

25
Grouped Summary Statistics (groupby) Pt2
df. groupby('col')['measure col'].stat() dogs. groupby('color')['weight'].mean() color black 26 brown 24
26
Multiple stats groupby summary
df. groupby('col')['measure col'].agg([min, max, sum]) dogs. groupby('color')['weigth'].agg([min, max, sum]) min max sum color black 24 29 53 brown 24 24 48
27
Group by multiple variables
df. groupby(['col1', 'col2'])['measure col'].stat() dogs. groupby(['color', 'breed'])['weight'].mean() ``` color breed black chow 25 lab 29 poodle 24 Brown chow 24 lab 25 ```
28
Group by multiple cols and multiple measures
df.groupby(['col1', 'col2'])[['measure col1'], ['measure col2']].stat()
29
Pivot Table
df.pivot_table(values = 'measure col', index = 'group_col') By default it takes the mean value of each group dogs.pivot_table(values = 'weight', index = 'color') weight color black 26 brown 24
30
Pivot Table: Change the statistic
aggfunc = df.pivot_table(values = 'measure col', index = 'group_col', aggfunc = np.median)
31
Pivot Table: Multiple statistics
aggfunc = [] df.pivot_table(values = 'measure col', index = 'group_col', aggfunc = [np.median, np.mean])
32
Pivot Table: 2 Variables
columns = df. pivot_table(values = 'measure col', index = 'group_col', columns = 'group_col2') dogs. pivot_table(values = 'weight', index = 'color', columns = 'breed') breed chow lab poodle color black NA 29 24 brown 24 24 NA
33
Pivot table: Fill Missing Values
fill_value = df. pivot_table(values = 'measure col', index = 'group_col', fill_value = 0) dogs. pivot_table(values = 'weight', index = 'color', columns = 'breed', fill_value = 0) breed chow lab poodle color black 0 29 24 brown 24 24 0
34
Pivot Table: Sum All
df. pivot_table(values = 'measure col', index = 'group_col', fill_value = 0, margins = True) dogs. pivot_table(values = 'weight', index = 'color', columns = 'breed', fill_value = 0, margins = True) ``` breed chow lab poodle All color black 0 29 24 53 brown 24 24 0 48 All 24 53 24 ```