Pandas Flashcards

1
Q

Create a Pandas dataframe with dictionaries

A

keys as column labels, values as lists of column data, then Df = pd.DataFrame(dict)

e.g.:
countries = {'Name': ['UK', 'Germany'],
             'Capital': ['London', 'Berlin']}

Countriesdf = pd.DataFrame(countries)

2
Q

set index labels for the rows of a pandas dataframe

A

pass a list of values to df.index, e.g.:

Countriesdf.index = ['UK', 'GER']
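
Worked example (a minimal sketch combining the dictionary constructor with setting row labels; the name countries_df is only an illustrative choice):

```python
import pandas as pd

# Keys become column labels, lists become the column values.
countries = {"Name": ["UK", "Germany"], "Capital": ["London", "Berlin"]}
countries_df = pd.DataFrame(countries)

# Replace the default 0..n-1 row labels with custom ones.
countries_df.index = ["UK", "GER"]

print(countries_df)
```

The list assigned to .index must have exactly one label per row.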

3
Q

import a csv file as a dataframe

A

Countriesdf = pd.read_csv('file/location/countries.csv')

4
Q

when importing a csv, make its first column the index of the dataframe (rather than a column in its own right)

A

Countriesdf = pd.read_csv('file/location/countries.csv', index_col = 0)

5
Q

Get a Pandas Series from a Dataframe

A

df['column name']

6
Q

Get a single column of a Pandas Dataframe as a Dataframe (rather than as a Series)

A

df[['column name']]

7
Q

Get multiple columns from a Pandas Dataframe

A

df[['column name1', 'column name2']]
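
Worked example (a small sketch of the difference between single and double brackets; the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["UK", "Germany"],
    "capital": ["London", "Berlin"],
    "population_m": [67, 83],
})

as_series = df["capital"]            # single brackets: a Series
as_frame = df[["capital"]]           # double brackets: a one-column DataFrame
two_cols = df[["name", "capital"]]   # list of labels: a DataFrame with those columns

print(type(as_series).__name__, type(as_frame).__name__)
```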

8
Q

Get the 2nd - 4th (inclusive) rows of a Pandas Dataframe

A

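df[1:4]

(rows are numbered from 0 and the end of the slice is excluded, so this gives rows 1, 2 and 3, i.e. the 2nd, 3rd and 4th rows)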

9
Q

call a row of a Pandas Dataframe with loc, as a Pandas Series

A

df.loc['row index']

10
Q

Call 2 rows of a Pandas Dataframe with loc, as a dataframe

A

df.loc[['row index', 'row index 2']]

11
Q

call the intersection of 2 rows and 2 columns of a pandas dataframe by name

A

df.loc[['row index 1', 'row index 2'], ['column name1', 'column name 2']]

12
Q

get whole columns of a dataframe by name, using loc

A

df.loc[:, ['column name1', 'column name 2']]
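
Worked example (a sketch of the loc patterns on a made-up dataframe; rows go first, columns second):

```python
import pandas as pd

df = pd.DataFrame(
    {"capital": ["London", "Berlin", "Paris"], "population_m": [67, 83, 68]},
    index=["UK", "GER", "FR"],
)

one_row = df.loc["UK"]                          # one label: a Series
two_rows = df.loc[["UK", "GER"]]                # list of labels: a DataFrame
block = df.loc[["UK", "GER"], ["capital"]]      # rows and columns together
all_rows = df.loc[:, ["capital", "population_m"]]  # ':' keeps every row

print(block)
```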

13
Q

Let’s say you have numpy arrays x = [1,2,3,4,5] and y = [5,4,3,2,1]. How do you use operators with them to get an array of bools corresponding to x/y > 2 or y - x == 2?

A

np.logical_or(x/y > 2, y - x == 2)

(this gives the array [False, True, False, False, True], just out of interest.)
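
Worked example (a quick check of the result; the `|` form is an equivalent alternative for boolean arrays and needs parentheses around each comparison):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Element-wise OR of the two comparisons.
mask = np.logical_or(x / y > 2, y - x == 2)

# Same result with the | operator.
same = (x / y > 2) | (y - x == 2)

print(mask)
```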

14
Q

create a numpy array with a list of numbers 1,2,3,4,5

A

np.array([1,2,3,4,5])

15
Q

Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour of meditation was done?

A
  1. select the meditation length column as a Pandas Series and do a comparison on that column:
    hourPlus = meditationDF['duration'] > 60
  2. apply the results of that comparison to the dataframe:
    meditationDF[hourPlus]
16
Q

Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour BUT less than two hours of meditation was done?

A

bt1and2 = np.logical_and(meditationDF['duration'] > 60, meditationDF['duration'] < 120)

meditationDF[bt1and2]
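
Worked example (a sketch of both filters on made-up data; the column names 'day' and 'duration' are assumptions):

```python
import numpy as np
import pandas as pd

meditation_df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu"],
    "duration": [45, 75, 130, 90],   # minutes
})

# Step 1: a comparison on a column gives a boolean Series.
hour_plus = meditation_df["duration"] > 60
between = np.logical_and(meditation_df["duration"] > 60,
                         meditation_df["duration"] < 120)

# Step 2: indexing with the boolean Series keeps only the True rows.
over_an_hour = meditation_df[hour_plus]
one_to_two_hours = meditation_df[between]

print(one_to_two_hours)
```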

17
Q

iterate over the rows of a dataframe and print out every row and label

A

for label, row in dataframe.iterrows():
    print(label, '\n', row)

18
Q

let’s say you’ve got a pandas dataframe with a bunch of columns, and one of them is ‘Day of week’. Iterate over the rows of the dataframe and print out only the value of that column for each row.

A

for label, row in dataframe.iterrows():
    print(row['Day of week'])
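
Worked example (a sketch with made-up data; iterrows is a DataFrame method that yields a (label, row) pair per row, where each row is a Series):

```python
import pandas as pd

df = pd.DataFrame(
    {"Day of week": ["Mon", "Tue"], "duration": [45, 75]},
    index=["a", "b"],
)

days = []
for label, row in df.iterrows():
    # row behaves like a Series, so columns are looked up by name.
    print(label, row["Day of week"])
    days.append(row["Day of week"])
```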

19
Q

create a new column in a dataframe by applying a function to an existing column

A

dataframe['column length'] = dataframe['existing column'].apply(len)

20
Q

create a new column in a dataframe by applying a method to an existing column

A

dataframe['NEW COLUMN'] = dataframe['existing column'].apply(str.upper)
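
Worked example (a sketch with made-up column names; the .str accessor in the last line is an alternative to apply for string methods):

```python
import pandas as pd

df = pd.DataFrame({"name": ["uk", "germany"]})

# apply calls the function once per value in the column.
df["name length"] = df["name"].apply(len)
df["NAME"] = df["name"].apply(str.upper)

# Vectorised string methods give the same result.
df["NAME via str"] = df["name"].str.upper()

print(df)
```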

21
Q

get some summary statistics for a dataframe

A

dataframe.describe()

22
Q

get all values of a dataframe as a 2d numpy array

A

df.to_numpy()

23
Q

get column names of a dataframe

A

dataframe.columns

24
Q

get the row numbers or names of a dataframe

A

dataframe.index

25
get the dimensions of a dataframe
dataframe.shape
26
find out if there are any missing entries in a dataframe
dataframe.info()
27
sort a dataframe by value to get the largest values of "column1" at the top.
dataframe.sort_values("column1", ascending = False)
28
sort a dataframe by values to get the smallest values of "column1" at the top
dataframe.sort_values("column1")
29
sort by one value(column1), then another (column2)
dataframe.sort_values(["column1", "column2"])
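Worked example (a sketch on made-up data; sort_values sorts in ascending order unless told otherwise, and ascending can also take one flag per column):

```python
import pandas as pd

df = pd.DataFrame({"column1": [2, 9, 2, 5], "column2": ["b", "a", "a", "c"]})

smallest_first = df.sort_values("column1")                  # default: ascending
largest_first = df.sort_values("column1", ascending=False)  # largest at the top

# Sort by column1 ascending, then break ties by column2 descending.
by_both = df.sort_values(["column1", "column2"], ascending=[True, False])

print(by_both)
```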
30
get rows from a dataframe with a column1 value of greater than 5
overFive = dataframe["column1"] > 5
dataframe[overFive]
This can also be done on one line:
dataframe[dataframe['column1'] > 5]
31
Let's say you've got a dataframe of fox species: 'foxes'. Get all the rows with an 'Ear goodness' of above 5 and with 'colour' of 'standard red'
goodEars = foxes['Ear goodness'] > 5
redOnes = foxes['colour'] == 'standard red'
foxes[goodEars & redOnes]
Can also be done in one line with parentheses around each condition:
foxes[(foxes['Ear goodness'] > 5) & (foxes['colour'] == 'standard red')]
32
from your dataframe of foxes, get all the fox species that are either of the 'colour': 'standard red', or 'black'
redOrBlack = foxes['colour'].isin(['standard red', 'black'])
foxes[redOrBlack]
33
Show the mean of a dataframe (meditation) column called "length in minutes"
meditation["length in minutes"].mean()
34
Show the standard deviation of a dataframe (meditation) column called "length in minutes"
meditation["length in minutes"].std()
35
Get the maximum value of a column called "length in minutes" from a dataframe called "meditation"
meditation["length in minutes"].max()
36
Get the minimum value of a column called "length in minutes" from a dataframe called "meditation"
meditation["length in minutes"].min()
37
create a function that gets the 30th percentile of a column called length in minutes in a dataset called meditation
This is the aggregate method, in two steps.
Step 1: define a function that does what you want:
```
def perc30(column):
    return column.quantile(0.3)
```
Step 2: select the column and apply the function with the .agg() method:
meditation["length in minutes"].agg(perc30)
You can also pass lists of columns and lists of functions to get multiple summary statistics.
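Worked example (a sketch of both the single-column and the multi-column form; the data and the 'quality' column are made up):

```python
import pandas as pd

meditation = pd.DataFrame({
    "length in minutes": [10, 20, 30, 40, 50],
    "quality": [1, 2, 3, 4, 5],
})

def perc30(column):
    # 30th percentile of whatever column is passed in.
    return column.quantile(0.3)

# One column, one custom function: a single number.
single = meditation["length in minutes"].agg(perc30)

# Several columns and several functions: a table with one row per function.
summary = meditation[["length in minutes", "quality"]].agg([perc30, "mean"])

print(single)
print(summary)
```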
38
get the cumulative sum of a dataframe column
df['column'].cumsum()
39
get rid of duplicates from a column in a dataframe
df.drop_duplicates(subset = 'column')
40
get rid of duplicates from a dataframe - identify duplicates as matching on data from both col1 and col2
df.drop_duplicates(subset = ['col1', 'col2'])
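Worked example (a sketch on made-up data; by default the first occurrence of each duplicate is the one that is kept):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "b", "a"], "col2": [1, 1, 2, 3]})

# Rows count as duplicates if col1 alone matches.
by_one = df.drop_duplicates(subset="col1")

# Rows count as duplicates only if both col1 and col2 match.
by_both = df.drop_duplicates(subset=["col1", "col2"])

print(by_both)
```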
41
get the number of each value in a dataframe column
df['column1'].value_counts()
42
get the number of each value in a dataframe column, and sort them
df['column1'].value_counts(sort=True)
43
get the proportion of each value in a dataframe column (e.g. 0.25 if instances of a given value constitutes a quarter of the total number of values)
df['column1'].value_counts(normalize = True)
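Worked example (a sketch on made-up data showing counts next to proportions):

```python
import pandas as pd

df = pd.DataFrame({"column1": ["a", "b", "a", "a"]})

counts = df["column1"].value_counts()                      # how many of each value
proportions = df["column1"].value_counts(normalize=True)   # share of the total

print(counts)
print(proportions)
```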
44
Let's say you've got a dataframe called meditation - get the mean 'quality' score for each 'day of the week'.
meditation.groupby('day of the week')['quality'].mean()
Note: you can also use the agg() method in place of mean() here to apply custom functions, or several functions at once, such as min, max and sum.
45
Let's say you've got a dataframe called meditation - get the mean, minimum and maximum 'quality' scores grouped by each 'day of the week' and 'retreat' (yes/no)
meditation.groupby(['day of the week', 'retreat'])['quality'].agg(['min', 'max', 'mean'])
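Worked example (a sketch on made-up data; function names are passed as strings here, which pandas accepts in agg):

```python
import pandas as pd

meditation = pd.DataFrame({
    "day of the week": ["Mon", "Mon", "Tue", "Tue"],
    "retreat": ["yes", "no", "yes", "yes"],
    "quality": [4, 2, 5, 3],
})

# One grouping column, one statistic.
by_day = meditation.groupby("day of the week")["quality"].mean()

# Two grouping columns, several statistics: the result has a two-level index.
stats = meditation.groupby(["day of the week", "retreat"])["quality"].agg(
    ["min", "max", "mean"]
)

print(stats)
```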
46
in a dataframe called meditation, use a pivot table to get the mean 'quality' for each value of the column 'days of the week'
meditation.pivot_table(values= "quality", index = "days of the week")
47
in a dataframe called meditation, use a pivot table to get the median 'quality' for each value of the column 'days of the week'
meditation.pivot_table(values= "quality", index = "days of the week", aggfunc = np.median)
48
with the dataframe meditation, create a pivot table to show the mean quality of meditation and standard deviation on each of the days of the week when it is and isn't a retreat, with margins giving averages for all days, and for on and off retreat
meditation.pivot_table(values = 'quality', index = ['day of the week', 'retreat'], aggfunc = [np.mean, np.std], margins = True)
49
Set the date column of your meditation dataframe as the index of the dataframe
meditation.set_index("date")
50
reset the index of a dataframe to standard numbers
meditation.reset_index()
51
create a hierarchical (or multi-level) index of your meditation dataframe using the columns "date" and "on_retreat"
meditation.set_index(["date", "on_retreat"])
52
let's say you've got a meditation dataframe hierarchically indexed by 1. on_retreat and 2. amount_of_LSD. Sort the dataframe first by LSD, in ascending order, and then on_retreat, in descending order.
meditation.sort_index(level=["amount_of_LSD", "on_retreat"], ascending = [True, False])
53
slice a dataframe called Ollie that's indexed by date (format dd/mm/yy) from 18/03/94 to 18/03/2019 (inclusive)
Ollie.loc["18/03/94": "18/03/19"]
54
slice a dataframe called Ollie that's been hierarchically indexed by both date (mm/yy) and day of week(mon, tues, etc.) from the first tuesday march 1994 to the last wednesday march 2019
Ollie.loc[("03/94", "Tues"):("03/19", "Weds")]
Slice between two tuples, each giving the outer and then the inner index value. The index needs to be sorted first.
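Worked example (a sketch of tuple slicing on a small made-up two-level index; the level names 'outer' and 'inner' are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "outer": ["a", "a", "b", "b", "c"],
    "inner": [1, 2, 1, 2, 1],
    "value": [10, 20, 30, 40, 50],
}).set_index(["outer", "inner"]).sort_index()

# Slice from (outer='a', inner=2) to (outer='b', inner=1), both ends included.
sliced = df.loc[("a", 2):("b", 1)]

# Slicing on the outer level alone only needs plain labels.
outer_only = df.loc["a":"b"]

print(sliced)
```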
55
precondition to slicing a dataframe using indexes?
need to index using a column and also sort the index. eg: df.set_index("date of birth").sort_index()
56
slice a dataframe that's been indexed by date, formatted yyyy-mm-dd, including all the dates from 1994 to 2019.
df.loc['1994':'2019']
The point is that, when the index holds sorted dates, you only need partial date strings to slice it: this covers everything from the start of 1994 to the end of 2019.
57
get the 1st to 5th rows, and 3rd to 9th columns of a dataframe.
df.iloc[:5, 2:9]
Of course the 5th row is number 4 and the 9th column is number 8, and slicing is end-exclusive.
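Worked example (a sketch on a made-up 6 x 10 grid of numbers, so the selected positions are easy to check):

```python
import numpy as np
import pandas as pd

# Row r, column c holds the value 10 * r + c.
df = pd.DataFrame(np.arange(60).reshape(6, 10))

# Rows 0-4 and columns 2-8 by position; the end of each slice is excluded.
sub = df.iloc[:5, 2:9]

print(sub.shape)
```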
58
create a pivot table with dataframe meditation that has day of week as rows and retreat as columns, displaying meditation 'quality'
meditation.pivot_table('quality', index = 'day of week', columns = 'retreat')
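Worked example (a sketch on made-up data; the second call adds margins = True to show the 'All' row and column of overall means):

```python
import pandas as pd

meditation = pd.DataFrame({
    "day of week": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "retreat": ["yes", "no", "yes", "yes", "no"],
    "quality": [4, 2, 5, 3, 1],
})

# Rows are days, columns are retreat yes/no, cells are mean quality (the default).
pivot = meditation.pivot_table("quality", index="day of week", columns="retreat")

# margins=True appends an 'All' row and column computed from the underlying data.
with_margins = meditation.pivot_table(
    "quality", index="day of week", columns="retreat", margins=True
)

print(with_margins)
```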
59
working on a pivot table (meditation_pivot) with retreat y/n as columns and day of week as rows, get the mean values for the columns and then get the mean values for rows.
meditation_pivot.mean() gives the mean for retreat y/n (each column)
meditation_pivot.mean(axis = 'columns') gives the mean for each row
60
get the means for the rows of a pivot table and then filter it by the max mean value of the rows
```
pivot = df.pivot_table(etc.)
pivot_means = pivot.mean(axis = 'columns')
print(pivot_means[pivot_means == pivot_means.max()])
```
61
create a histogram of meditation length
meditation['length'].hist()
62
create a histogram of meditation length with 15 bins
meditation['length'].hist(bins = 15)
63
group meditation mean quality data by day of week and create a bar plot to show it
dayMeans = meditation.groupby('day of week')['quality'].mean()
dayMeans.plot(kind = "bar", title = "whatever - a good title")
plt.show()
64
Plot a line graph of quality over dates from the meditation dataframe.
meditation.plot(x = "date", y = "quality", kind = "line")
plt.show()
65
you're creating a line plot from a dataframe - rotate the x labels by 45 degrees
df.plot(x = 'col1', y = 'col2', kind = 'line', rot = 45)
66
create an arbitrary scatterplot from a dataframe
df.plot(x = 'col1', y = 'col2', kind= 'scatter')
67
create a histogram of meditation quality just with the data from retreats
meditation[meditation['retreat'] == True]['quality'].hist()
68
make 2 arbitrary histograms and a legend
df['col1'].hist()
df['col2'].hist()
plt.legend(['col1', 'col2'])
plt.show()
69
make a translucent histogram from a dataframe
df['col1'].hist(alpha = 0.7)
70
make a bar plot of the total amount of time spent meditating on each day of the week
totalDayTime = meditation.groupby('day of the week')['length'].sum()
totalDayTime.plot(kind = "bar")
71
find whether any columns of a df have missing values
df.isna().any()
72
get the number of missing values in each column of a dataframe
df.isna().sum()
73
delete any rows with missing data
df.dropna()
74
replace any missing values with 0
df.fillna(0)
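Worked example (a sketch of the four missing-value cards on made-up numeric data; dropna and fillna return new dataframes rather than changing df in place):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

has_missing = df.isna().any()    # True/False per column
missing_counts = df.isna().sum() # number of missing values per column
dropped = df.dropna()            # keeps only rows with no missing values
filled = df.fillna(0)            # replaces missing values with 0

print(missing_counts)
```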
75
create a dataframe row by row
do this with a list of dictionaries, one dictionary per row, with the column names as keys, e.g.:
dataframe1 = [
    {'col1': 'row1', 'col2': 'row1'},
    {'col1': 'row2', 'col2': 'row2'}
]
row_by_row_df = pd.DataFrame(dataframe1)
76
create a dataframe column by column
do this with a dictionary of lists, e.g.:
dataframe1 = {
    'col1': ['row1', 'row2'],
    'col2': ['row1', 'row2']
}
col_by_col_df = pd.DataFrame(dataframe1)
77
Open a csv file in Pandas
dataframe = pd.read_csv("path.csv")
78
save a csv file from pandas df
dataframe.to_csv("path.csv")
79
plot a simple histogram of meditation quality
```
_ = plt.hist(meditation['quality'])
_ = plt.xlabel('meditation Quality')
_ = plt.ylabel('Frequency')
```
80
set the bin edges for a histogram, and create a histogram using them
```
bin_edges = [10,20,30,40,50,60,70,80]
_ = plt.hist(df['col1'], bins = bin_edges)
```
81
Create a histogram with 13 bins
_ = plt.hist(df['col1'], bins = 13)
82
set up seaborn
import seaborn as sns
sns.set()
83
what's the histogram bin number rule for data sets?
the "square root rule": set the number of bins to the square root of the number of samples
84
create a bee swarm plot of meditation quality on each day of the week from df 'meditation'
_ = sns.swarmplot(x = 'day of week', y = 'quality', data = meditation)
_ = plt.xlabel('Day of Week')
_ = plt.ylabel('Meditation Quality')
plt.show()
85
create an ecdf with dataframe data
```
x = np.sort(df[data])
n = x.size
y = np.arange(1, n+1) / n
plt.plot(x, y, marker = '.', linestyle = 'none')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
```
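Worked example (a sketch that wraps the x/y computation in a reusable helper; the name ecdf is an assumption, and plotting is left out so the numbers can be checked directly):

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and, for each, the fraction of points <= x."""
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y

x, y = ecdf(np.array([3.0, 1.0, 2.0, 4.0]))
print(x, y)
```

The y values always end at 1.0, because every data point is less than or equal to the largest one.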
86
what the fuck does ecdf stand for??
empirical cumulative distribution function
87
create multiple plots on the same graph
call plt.plot() once per data set, then plt.show() once at the end:
_ = plt.plot(x1, y1)
_ = plt.plot(x2, y2)
...etc.
plt.show()
88
create a legend for a graph showing multiple data sets
plt.legend(('dataSet1', 'dataSet2', etc.), loc = 'upper right')
The labels are strings, given in the same order as the data sets were plotted.
89
print the 25th, 50th, and 75th percentile for a dataframe column
np.percentile(df['col1'], [25,50,75])
90
plot some percentiles of a dataset on the ECDF
[create ecdf]
```
percentiles = np.array([2.5, 25, 50, 75, 97.5])
data_ptiles = np.percentile(df['col1'], percentiles)
_ = plt.plot(data_ptiles, percentiles/100, marker = 'D', color = 'red', linestyle = 'none')  # 'D' is the diamond marker
```
91
create a box plot comparing meditation quality on retreat vs off retreat (you gotta use seaborn)
```
_ = sns.boxplot(x = 'retreat', y = 'quality', data = meditation)
_ = plt.xlabel('on/off retreat')
_ = plt.ylabel('meditation quality /5')
plt.show()
```
92
get the covariance matrix for two variables, col1 & col2
np.cov(df['col1'], df['col2'])
93
get the correlation coefficient for two variables
np.corrcoef(df['col1'], df['col2'])[0, 1]
np.corrcoef returns a 2x2 matrix; the coefficient is the [0, 1] entry (the [1, 0] entry is the same).
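Worked example (a sketch on made-up, perfectly linear data, so the correlation comes out as 1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0], "col2": [2.0, 4.0, 6.0, 8.0]})

# 2x2 matrix: variances on the diagonal, covariance off the diagonal.
cov_matrix = np.cov(df["col1"], df["col2"])
covariance = cov_matrix[0, 1]

# 2x2 matrix of correlations; the off-diagonal entry is the coefficient.
r = np.corrcoef(df["col1"], df["col2"])[0, 1]

print(covariance, r)
```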
94
get 4 random numbers
np.random.random(size = 4)
95
create an empty array of x length
np.empty(x)
96
create a simulation of 100 trials of 10 coin flips
np.random.binomial(10, 0.5, size = 100)
97
compute the bin edges for a histogram such that the bars center over each integer from 0 to 10.
bins = np.arange(0, 12) - 0.5
(this gives edges from -0.5 to 10.5, i.e. 11 bins, one centred on each integer from 0 to 10)
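Worked example (a sketch that checks the bin centres with np.histogram instead of plotting; the sample values are made up). Centring a bar on each integer from 0 to 10 takes 12 edges, running from -0.5 to 10.5:

```python
import numpy as np

# 12 edges from -0.5 to 10.5 give 11 bins of width 1.
bins = np.arange(0, 12) - 0.5

samples = np.array([0, 3, 3, 10])
counts, edges = np.histogram(samples, bins=bins)

# The midpoint of each bin is an integer from 0 to 10.
centers = (edges[:-1] + edges[1:]) / 2

print(centers)
print(counts)
```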
98
create a poisson distribution for an event with a mean of 8 occurrences, over 10,000 trials
np.random.poisson(8, size= 10000)
99
Create a graph to demonstrate/check the normality of a data set with ecdf plots
First compute the mean and std of your data set:
setMean = np.mean(df['col1'])
setStd = np.std(df['col1'])
Then draw a big sample from a normal distribution with those parameters:
theoreticalSamples = np.random.normal(setMean, setStd, size = 10000)
Compute the ECDF of both (ecdf() here is your own helper that returns the sorted x values and the cumulative fractions y):
x, y = ecdf(df['col1'])
xtheor, ytheor = ecdf(theoreticalSamples)
```
_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.plot(xtheor, ytheor, marker = '.', linestyle = 'none')
plt.show()
```
If the data are roughly normal, the two curves lie on top of each other.