Pandas Flashcards

1
Q

Create a Pandas dataframe with dictionaries

A

use the keys as column labels and the values as lists of column data, then call df = pd.DataFrame(dict)

e.g.:
countries = {'Name': ['UK', 'Germany'],
             'Capital': ['London', 'Berlin']}

Countriesdf = pd.DataFrame(countries)

2
Q

set index labels for the rows of a pandas dataframe

A

assign a list of values to df.index, e.g.:

Countriesdf.index = ['UK', 'GER']

3
Q

import a csv file as a dataframe

A

Countriesdf = pd.read_csv('file/location/countries.csv')

4
Q

make the first column of a csv the index of the dataframe (rather than a column in its own right)

A

Countriesdf = pd.read_csv('file/location/countries.csv', index_col = 0)

5
Q

Get a Pandas Series from a Dataframe

A

df['column name']

6
Q

Get a single column of a Pandas Dataframe as a one-column Dataframe (rather than a Series)

A

df[['column name']]
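
The difference between single and double brackets is easy to check for yourself. A minimal sketch, assuming pandas is imported as pd; the data and the second column name are made up:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({'column name': [1, 2, 3], 'other': [4, 5, 6]})

as_series = df['column name']    # single brackets -> pandas Series
as_frame = df[['column name']]   # double brackets -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```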

7
Q

Get multiple columns from a Pandas Dataframe

A

df[['column name1', 'column name2']]

8
Q

Get the 2nd - 4th (inclusive) rows of a Pandas Dataframe

A

df[1:4]

(positions 1, 2 and 3; the end of the slice is excluded)

9
Q

call a row of a Pandas Dataframe with loc, as a Pandas Series

A

df.loc['row index']

10
Q

Call 2 rows of a Pandas Dataframe with loc, as a dataframe

A

df.loc[['row index', 'row index 2']]
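
A small sketch of how loc behaves with one label versus a list of labels. The index labels and data here are hypothetical:

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({'Capital': ['London', 'Berlin', 'Paris']},
                  index=['UK', 'GER', 'FR'])

one_row = df.loc['UK']            # single label -> Series
two_rows = df.loc[['UK', 'GER']]  # list of labels -> DataFrame

print(one_row['Capital'])         # London
print(two_rows.shape)             # (2, 1)
```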

11
Q

get the intersection of 2 rows and 2 columns of a pandas dataframe by name

A

df.loc[['row index 1', 'row index 2'], ['column name1', 'column name 2']]

12
Q

get whole columns of a dataframe by name with loc

A

df.loc[:, ['column name1', 'column name 2']]

13
Q

Let’s say you have numpy arrays x = [1,2,3,4,5] and y = [5,4,3,2,1]. How do you use operators with them to get an array of bools corresponding to x/y > 2 or y-x == 2?

A

np.logical_or(x/y > 2, y-x == 2)

(this gives the array [False, True, False, False, True], just out of interest.)
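
The result quoted on this card can be reproduced directly. A minimal sketch, assuming numpy is imported as np:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Element-wise OR of two boolean arrays.
result = np.logical_or(x / y > 2, y - x == 2)
print(result.tolist())  # [False, True, False, False, True]
```

The same thing can be written with the | operator, as long as each comparison is wrapped in parentheses: (x / y > 2) | (y - x == 2).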

14
Q

create a numpy array with a list of numbers 1,2,3,4,5

A

np.array([1,2,3,4,5])

15
Q

Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour of meditation was done?

A
  1. select the meditation length column as a Pandas Series and do a comparison on that column:
    hourPlus = meditationDF['duration'] > 60
  2. apply the results of that comparison to the dataframe:
    meditationDF[hourPlus]
16
Q

Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour of meditation was done BUT less than 2 hours was done?

A

bt1and2 = np.logical_and(meditationDF['duration'] > 60, meditationDF['duration'] < 120)

meditationDF[bt1and2]
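
Here is the two-step filter run on a tiny data set. This is only a sketch: the 'day' column and all the values are invented, and 'duration' is assumed to be in minutes:

```python
import numpy as np
import pandas as pd

# Invented data for illustration only.
meditationDF = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Wed', 'Thu'],
    'duration': [30, 75, 130, 90],
})

# Step 1: build a boolean Series from the comparisons.
bt1and2 = np.logical_and(meditationDF['duration'] > 60,
                         meditationDF['duration'] < 120)

# Step 2: use it to select the matching rows.
days = meditationDF[bt1and2]['day'].tolist()
print(days)  # ['Tue', 'Thu']
```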

17
Q

iterate over the rows of a dataframe and print out every row and label

A

for label, row in dataframe.iterrows():
    print(label, '\n', row)

18
Q

let’s say you’ve got a pandas dataframe with a bunch of columns, and one of them is ‘Day of week’. Iterate over the rows of the dataframe and print out only the value of that column for each row.

A

for label, row in dataframe.iterrows():
    print(row['Day of week'])
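
A runnable sketch of row iteration, using iterrows() as a method of the dataframe. The data, the index labels and the 'minutes' column are hypothetical:

```python
import pandas as pd

# Hypothetical data with custom row labels.
dataframe = pd.DataFrame({'Day of week': ['Mon', 'Tue'], 'minutes': [20, 45]},
                         index=['a', 'b'])

seen = []
for label, row in dataframe.iterrows():
    # label is the index value; row is a Series holding that row's values
    seen.append((label, row['Day of week']))
    print(label, row['Day of week'])
```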

19
Q

create a new column in a dataframe by applying a function to an existing column

A

dataframe['column length'] = dataframe['existing column'].apply(len)

20
Q

create a new column in a dataframe by applying a method to an existing column

A

dataframe['NEW COLUMN'] = dataframe['existing column'].apply(str.upper)
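
Both kinds of apply, with a function (len) and with a method (str.upper), can be tried on a toy column. A sketch with made-up strings:

```python
import pandas as pd

# Made-up data for illustration only.
dataframe = pd.DataFrame({'existing column': ['fox', 'badger']})

# apply() calls the given function once per value in the column.
dataframe['column length'] = dataframe['existing column'].apply(len)
dataframe['NEW COLUMN'] = dataframe['existing column'].apply(str.upper)

print(dataframe['column length'].tolist())  # [3, 6]
print(dataframe['NEW COLUMN'].tolist())     # ['FOX', 'BADGER']
```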

21
Q

get some summary statistics for a dataframe

A

dataframe.describe()

22
Q

get all values of a dataframe as a 2d numpy array

A

df.to_numpy()

23
Q

get column names of a dataframe

A

dataframe.columns

24
Q

get the row numbers or names of a dataframe

A

dataframe.index

25
Q

get the dimensions of a dataframe

A

dataframe.shape

26
Q

find out if there are any missing entries in a dataframe

A

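dataframe.info()

(this lists the non-null count for each column; a count below the number of rows means there are missing entries)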

27
Q

sort a dataframe by value to get the largest values of “column1” at the top.

A

dataframe.sort_values('column1', ascending = False)

28
Q

sort a dataframe by values to get the smallest values of “column1” at the top

A

dataframe.sort_values('column1')

(ascending order is the default)

29
Q

sort by one value(column1), then another (column2)

A

dataframe.sort_values(['column1', 'column2'])

30
Q

get rows from a dataframe with a column1 value of greater than 5

A

overFive = dataframe['column1'] > 5
dataframe[overFive]

This can also be done on one line:
dataframe[dataframe['column1'] > 5]

31
Q

Let’s say you’ve got a dataframe of fox species: ‘foxes’. Get all the rows with an ‘Ear goodness’ of above 5 and with ‘colour’ of ‘standard red’

A

goodEars = foxes['Ear goodness'] > 5
redOnes = foxes['colour'] == 'standard red'
foxes[goodEars & redOnes]

Can also be done in one line with parentheses around each condition:

foxes[(foxes['Ear goodness'] > 5) & (foxes['colour'] == 'standard red')]
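
The parentheses matter because & binds more tightly than > and == in Python. A small sketch with invented fox data; the 'species' column is an assumption:

```python
import pandas as pd

# Invented data for illustration only.
foxes = pd.DataFrame({
    'species': ['red fox', 'arctic fox', 'silver fox'],
    'Ear goodness': [7, 9, 4],
    'colour': ['standard red', 'white', 'black'],
})

# Each condition is parenthesised before combining with &.
good_red = foxes[(foxes['Ear goodness'] > 5) & (foxes['colour'] == 'standard red')]
print(good_red['species'].tolist())  # ['red fox']
```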

32
Q

from your dataframe of foxes, get all the fox species that are either of the ‘colour’: ‘standard red’, or ‘black’

A

redOrBlack = foxes['colour'].isin(['standard red', 'black'])

foxes[redOrBlack]
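
A quick sketch of isin() on a toy dataframe. The 'species' column and all values are made up:

```python
import pandas as pd

# Made-up data for illustration only.
foxes = pd.DataFrame({
    'species': ['red fox', 'arctic fox', 'silver fox'],
    'colour': ['standard red', 'white', 'black'],
})

# isin() returns True wherever the value is one of the listed options.
redOrBlack = foxes['colour'].isin(['standard red', 'black'])
picked = foxes[redOrBlack]['species'].tolist()
print(picked)  # ['red fox', 'silver fox']
```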

33
Q

Show the mean of a dataframe (meditation) column called “length in minutes”

A

meditation['length in minutes'].mean()

34
Q

Show the standard deviation of a dataframe (meditation) column called “length in minutes”

A

meditation['length in minutes'].std()

35
Q

Get the maximum value of a column called “length in minutes” from a dataframe called “meditation”

A

meditation['length in minutes'].max()

36
Q

Get the minimum value of a column called “length in minutes” from a dataframe called “meditation”

A

meditation['length in minutes'].min()

37
Q

create a function that gets the 30th percentile of a column called 'length in minutes' in a dataframe called meditation

A
this is the aggregate method, in two steps:
step 1 - define a function that does what you want, like this:
def perc30(column):
    return column.quantile(0.3)

step 2 - select the column and use the .agg() method to apply the function, like this:
meditation['length in minutes'].agg(perc30)

you can also pass lists of columns and lists of functions to get multiple summary statistics.
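
A sketch of both uses of .agg(): a single custom function, and a list mixing the custom function with built-in names. The data is invented, and the string names 'mean' and 'max' are just one way of naming built-in aggregations:

```python
import pandas as pd

# Invented data for illustration only.
meditation = pd.DataFrame({'length in minutes': [10, 20, 30, 40, 50]})

def perc30(column):
    # 30th percentile of the column (linear interpolation by default)
    return column.quantile(0.3)

single = meditation['length in minutes'].agg(perc30)
several = meditation['length in minutes'].agg([perc30, 'mean', 'max'])

print(round(single, 6))  # 22.0
print(several.index.tolist())  # ['perc30', 'mean', 'max']
```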

38
Q

get the cumulative sum of a dataframe column

A

df['column'].cumsum()

39
Q

get rid of duplicates from a column in a dataframe

A

df.drop_duplicates(subset = 'column')

40
Q

get rid of duplicates from a dataframe - identify duplicates as matching on data from both col1 and col2

A

df.drop_duplicates(subset = ['col1', 'col2'])

41
Q

get the number of each value in a dataframe column

A

df['column1'].value_counts()

42
Q

get the number of each value in a dataframe column, and sort them

A

df['column1'].value_counts(sort=True)

43
Q

get the proportion of each value in a dataframe column (e.g. 0.25 if instances of a given value constitutes a quarter of the total number of values)

A

df['column1'].value_counts(normalize = True)
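
Counts and proportions side by side, as a minimal sketch on made-up values:

```python
import pandas as pd

# Made-up data: 'a' appears three times out of four.
df = pd.DataFrame({'column1': ['a', 'b', 'a', 'a']})

counts = df['column1'].value_counts()               # raw counts
props = df['column1'].value_counts(normalize=True)  # fractions of the total

print(counts['a'], counts['b'])  # 3 1
print(props['a'], props['b'])    # 0.75 0.25
```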

44
Q

Let’s say you’ve got a dataframe called meditation - get the mean ‘quality’ score for each ‘day of the week’.

A

meditation.groupby('day of the week')['quality'].mean()
note: you can also use the agg() method in place of the mean() method here to apply custom functions, or several functions at once, like min, max, sum, etc.

45
Q

Let’s say you’ve got a dataframe called meditation - get the mean, minimum and maximum ‘quality’ scores grouped by each ‘day of the week’ and ‘retreat’ (yes/no)

A

meditation.groupby(['day of the week', 'retreat'])['quality'].agg([min, max, np.mean])
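
A sketch of a grouped multi-statistic summary on invented data. Note that this version passes the aggregation names as strings ('min', 'max', 'mean'), which is a choice made for this example rather than what the card uses:

```python
import pandas as pd

# Invented data for illustration only.
meditation = pd.DataFrame({
    'day of the week': ['Mon', 'Mon', 'Tue', 'Tue'],
    'retreat': ['no', 'no', 'yes', 'yes'],
    'quality': [2, 4, 3, 5],
})

# One row per (day, retreat) group, one column per statistic.
summary = (meditation
           .groupby(['day of the week', 'retreat'])['quality']
           .agg(['min', 'max', 'mean']))

print(summary.loc[('Mon', 'no'), 'mean'])  # 3.0
```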

46
Q

in a dataframe called meditation, use a pivot table to get the mean 'quality' for each value of the column 'days of the week'

A

meditation.pivot_table(values = 'quality', index = 'days of the week')

(mean is the default aggregation for pivot_table)

47
Q

in a dataframe called meditation, use a pivot table to get the median 'quality' for each value of the column 'days of the week'

A

meditation.pivot_table(values = 'quality', index = 'days of the week', aggfunc = np.median)

48
Q

with the dataframe meditation, create a pivot table to show the mean quality of meditation and standard deviation on each of the days of the week when it is and isn’t a retreat, with margins giving averages for all days, and for on and off retreat

A

meditation.pivot_table(values = 'quality', index = ['day of the week', 'retreat'], aggfunc = [np.mean, np.std], margins = True)

49
Q

Set the date column of your meditation dataframe as the index of the dataframe

A

meditation.set_index('date')

50
Q

reset the index of a dataframe to standard numbers

A

meditation.reset_index()

51
Q

create a hierarchical (or multi-level) index of your meditation dataframe using the columns “date” and “on_retreat”

A

meditation.set_index(['date', 'on_retreat'])

52
Q

let’s say you’ve got a meditation dataframe hierarchically indexed by 1. on_retreat and 2. amount_of_LSD. Sort the dataframe first by amount_of_LSD, in ascending order, and then by on_retreat, in descending order.

A

meditation.sort_index(level = ['amount_of_LSD', 'on_retreat'], ascending = [True, False])

53
Q

slice a dataframe called Ollie that’s indexed by date (format dd/mm/yy) from 18/03/94 to 18/03/19 (inclusive)

A

Ollie.loc['18/03/94':'18/03/19']

54
Q

slice a dataframe called Ollie that’s been hierarchically indexed by both date (mm/yy) and day of week (mon, tues, etc.) from the first Tuesday of March 1994 to the last Wednesday of March 2019

A

Ollie.loc[('03/94', 'Tues'):('03/19', 'Weds')]

slice between two tuples, each giving the outer and then the inner index value

55
Q

precondition to slicing a dataframe using indexes?

A

you need to set a column as the index and also sort the index, e.g.:

df.set_index('date of birth').sort_index()

56
Q

slice a dataframe that’s been indexed by date, formatted yyyy-mm-dd, including all the dates from 1994 to 2019.

A

df.loc['1994':'2019']

the point is that partial date strings are enough when you slice on a sorted date index
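
A sketch of partial-string slicing, assuming the index really is a pandas datetime index (built here with pd.to_datetime) and has been sorted. The dates and values are invented:

```python
import pandas as pd

# Invented data for illustration only.
df = pd.DataFrame({
    'date': pd.to_datetime(['1990-05-01', '1994-03-18', '2005-07-04',
                            '2019-03-18', '2021-01-01']),
    'value': [1, 2, 3, 4, 5],
})
df = df.set_index('date').sort_index()

# '1994' and '2019' each stand for the whole year, and both ends are included.
subset = df.loc['1994':'2019']
print(subset['value'].tolist())  # [2, 3, 4]
```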

57
Q

get the 1st to 5th rows, and 3rd to 9th columns of a dataframe.

A

df.iloc[:5, 2:9]

of course the 5th row is number 4 and the 9th column is number 8, and slicing is end-exclusive.
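
The positions are easy to verify on a grid of numbers. A minimal sketch using a made-up 10 x 10 dataframe:

```python
import numpy as np
import pandas as pd

# 10 x 10 grid holding 0..99, so the value at (row r, column c) is 10*r + c.
df = pd.DataFrame(np.arange(100).reshape(10, 10))

block = df.iloc[:5, 2:9]   # rows 0-4, columns 2-8
print(block.shape)         # (5, 7)
print(block.iloc[0, 0])    # 2
print(block.iloc[-1, -1])  # 48
```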

58
Q

create a pivot table with dataframe meditation that has day of week as rows and retreat as columns, displaying meditation ‘quality’

A

meditation.pivot_table('quality', index = 'day of week', columns = 'retreat')

59
Q

working on a pivot table (meditation_pivot) with retreat y/n as columns and day of week as rows, get the mean values for the columns and then get the mean values for rows.

A

meditation_pivot.mean() - gives the mean for each column (each retreat value)

meditation_pivot.mean(axis = 'columns') - gives the mean for each row

60
Q

get the means for the rows of a pivot table and then filter it by the max mean value of the rows

A
pivot = df.pivot_table(etc.)
pivot_means = pivot.mean(axis = 'columns')

print(pivot_means[pivot_means == pivot_means.max()])
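
The pivot table cards can be tried end to end on a toy data set. This is a sketch only: the column names follow the cards, but every value is invented:

```python
import pandas as pd

# Invented data for illustration only.
meditation = pd.DataFrame({
    'day of week': ['Mon', 'Mon', 'Tue', 'Tue'],
    'retreat': ['yes', 'no', 'yes', 'no'],
    'quality': [5, 3, 4, 2],
})

# Rows are days, columns are retreat values, cells are mean quality.
meditation_pivot = meditation.pivot_table('quality', index='day of week',
                                          columns='retreat')

col_means = meditation_pivot.mean()                # one mean per column
row_means = meditation_pivot.mean(axis='columns')  # one mean per row

# Keep only the row(s) whose mean equals the largest row mean.
best = row_means[row_means == row_means.max()]
print(best.index.tolist())  # ['Mon']
```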

61
Q

create a histogram of meditation length

A

meditation['length'].hist()

62
Q

create a histogram of meditation length with 15 bins

A

meditation['length'].hist(bins = 15)

63
Q

group meditation mean quality data by day of week and create a bar plot to show it

A

dayMeans = meditation.groupby('day of week')['quality'].mean()

dayMeans.plot(kind = 'bar', title = 'whatever - a good title')

plt.show()

64
Q

Plot a line graph of quality over dates from the meditation dataframe.

A

meditation.plot(x = 'date', y = 'quality', kind = 'line')

plt.show()

65
Q

you’re creating a line plot from a dataframe - rotate the x labels by 45 degrees

A

df.plot(x = 'col1', y = 'col2', kind = 'line', rot = 45)

66
Q

create an arbitrary scatterplot from a dataframe

A

df.plot(x = 'col1', y = 'col2', kind = 'scatter')

67
Q

create a histogram of meditation quality just with the data from retreats

A

meditation[meditation['retreat'] == True]['quality'].hist()

68
Q

make 2 arbitrary histograms and a legend

A

df['col1'].hist()
df['col2'].hist()
plt.legend(['col1', 'col2'])
plt.show()

69
Q

make a translucent histogram from a dataframe

A

df['col1'].hist(alpha = 0.7)

70
Q

make a bar plot of the total amount of time spent meditating on each day of the week

A

totalDayTime = meditation.groupby('day of the week')['length'].sum()

totalDayTime.plot(kind = 'bar')

71
Q

find whether any columns of a df have missing values

A

df.isna().any()

72
Q

get the number of missing values in each column of a dataframe

A

df.isna().sum()

73
Q

delete any rows with missing data

A

df.dropna()

74
Q

replace any missing values with 0

A

df.fillna(0)
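
The missing-value cards can be run together on a toy dataframe. A sketch with made-up data containing a single missing entry:

```python
import numpy as np
import pandas as pd

# Made-up data: column 'a' has one missing value, column 'b' has none.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': ['x', 'y', 'z']})

has_missing = df.isna().any()  # True/False per column
n_missing = df.isna().sum()    # count of missing values per column
dropped = df.dropna()          # rows containing any missing value removed
filled = df.fillna(0)          # missing values replaced with 0

print(n_missing['a'], n_missing['b'])  # 1 0
print(len(dropped))                    # 2
print(filled['a'].tolist())            # [1.0, 0.0, 3.0]
```

Note that dropna() and fillna() return new dataframes here; df itself is left unchanged.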

75
Q

create a dataframe row by row

A

do this with a list of dictionaries, one dictionary per row
e.g.:
dataframe1 = [
{'col1': 'row1', 'col2': 'row1'},
{'col1': 'row2', 'col2': 'row2'}
]

row_by_row_df = pd.DataFrame(dataframe1)

76
Q

create a dataframe column by column

A

do this with a dictionary of lists
e.g.:
dataframe1 = {
'col1': ['row1', 'row2'],
'col2': ['row1', 'row2']
}

col_by_col_df = pd.DataFrame(dataframe1)
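
The two construction styles give the same dataframe when they describe the same data. A sketch, with the variable names rows and cols chosen just for this example:

```python
import pandas as pd

# Row by row: a list of dictionaries, one per row.
rows = [
    {'col1': 'row1', 'col2': 'row1'},
    {'col1': 'row2', 'col2': 'row2'},
]

# Column by column: a dictionary of lists, one per column.
cols = {
    'col1': ['row1', 'row2'],
    'col2': ['row1', 'row2'],
}

row_by_row_df = pd.DataFrame(rows)
col_by_col_df = pd.DataFrame(cols)

print(row_by_row_df.equals(col_by_col_df))  # True
```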

77
Q

Open a csv file in Pandas

A

dataframe = pd.read_csv('path.csv')

78
Q

save a csv file from pandas df

A

dataframe.to_csv('path.csv')

79
Q

plot a simple histogram of meditation quality

A
_ = plt.hist(meditation['quality'])
_ = plt.xlabel('meditation Quality')
_ = plt.ylabel('Frequency')
80
Q

set the bin edges for a histogram, and create a histogram using them

A
bin_edges = [10,20,30,40,50,60,70,80]
_ = plt.hist(df['col1'], bins = bin_edges)
81
Q

Create a histogram with 13 bins

A

_ = plt.hist(df['col1'], bins = 13)

82
Q

set up seaborn

A

import seaborn as sns

sns.set()

83
Q

what’s the square root rule for choosing the number of histogram bins?

A

set the number of bins equal to the square root of the number of samples

84
Q

create a bee swarm plot of meditation quality on each day of the week from df ‘meditation’

A

_ = sns.swarmplot(x = 'day of week', y = 'quality', data = meditation)
_ = plt.xlabel('Day of Week')
_ = plt.ylabel('Meditation Quality')
plt.show()

85
Q

create an ecdf with dataframe data

A
x = np.sort(df['data'])
n = x.size
y = np.arange(1, n+1)/n
plt.plot(x, y, marker = '.', linestyle = 'none')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
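
The x and y computation can be wrapped in a small reusable function. This is a sketch: the name ecdf and its signature are a choice made here, not a library function, and the plotting step is left out:

```python
import numpy as np

def ecdf(data):
    """Return sorted values x and cumulative fractions y for 1-D data."""
    x = np.sort(np.asarray(data))
    n = x.size
    y = np.arange(1, n + 1) / n  # fraction of points <= each x value
    return x, y

x, y = ecdf([3, 1, 2, 4])
print(x.tolist())  # [1, 2, 3, 4]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]
```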
86
Q

what the fuck does ecdf stand for??

A

empirical cumulative distribution function

87
Q

create multiple plots on the same graph

A

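_ = plt.plot(x1, y1)
_ = plt.plot(x2, y2)
…etc.

plt.show()

(call the plotting function once per data set, then plt.show() once at the end; x1, y1, x2, y2 are placeholders)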

88
Q

create a legend for a graph showing multiple data sets

A

plt.legend(('dataSet1', 'dataSet2'), loc = 'upper right')

(one label per data set, in the order they were plotted)

89
Q

print the 25th, 50th, and 75th percentile for a dataframe column

A

np.percentile(df['col1'], [25, 50, 75])
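
A worked example on a tiny made-up column, where the percentiles are easy to check by eye:

```python
import numpy as np
import pandas as pd

# Made-up data: five evenly spaced values.
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})

# One call returns all three percentiles as an array.
q25, q50, q75 = np.percentile(df['col1'], [25, 50, 75])
print(q25, q50, q75)  # 2.0 3.0 4.0
```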

90
Q

plot some percentiles of a dataset on the ECDF

A

[create ecdf]

percentiles = np.array([2.5, 25, 50, 75, 97.5])
data_ptiles = np.percentile(df['col1'], percentiles)

_ = plt.plot(data_ptiles, percentiles/100, marker = 'D', color = 'red', linestyle = 'none')

the 'D' marker stands for diamond

91
Q

create a box plot comparing meditation quality on retreat vs off retreat (you gotta use seaborn)

A

_ = sns.boxplot(x = 'retreat', y = 'quality', data = meditation)

_ = plt.xlabel('on/off retreat')
_ = plt.ylabel('meditation quality /5')

plt.show()

92
Q

get the covariance matrix for two variables, col1 & col2

A

np.cov(df['col1'], df['col2'])

93
Q

get the correlation coefficient for two variables

A

np.corrcoef(df['col1'], df['col2'])

this returns a 2x2 matrix; the correlation coefficient is the entry at [0,1] (or equally [1,0])
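
A sketch showing how the covariance matrix and the correlation coefficient relate, using made-up numbers. The coefficient is the covariance divided by the product of the two standard deviations:

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [2, 4, 6, 9]})

cov_matrix = np.cov(df['col1'], df['col2'])      # 2x2 covariance matrix
r = np.corrcoef(df['col1'], df['col2'])[0, 1]    # pick out the coefficient

# Same value computed from the covariance matrix by hand.
r_by_hand = cov_matrix[0, 1] / np.sqrt(cov_matrix[0, 0] * cov_matrix[1, 1])
print(np.isclose(r, r_by_hand))  # True
```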

94
Q

get 4 random numbers

A

np.random.random(size = 4)

95
Q

create an empty array of x length

A

np.empty(x)

96
Q

create a simulation of 100 trials of 10 coin flips

A

np.random.binomial(10, 0.5, size = 100)

97
Q

compute the bin edges for a histogram such that the bars center over each integer from 0 to 10.

A

bins = np.arange(0, 12) - 0.5

(this gives 12 edges from -0.5 to 10.5, so each of the 11 integers gets its own bar)

98
Q

create a poisson distribution for an event with a mean of 8 occurrences, over 10,000 trials

A

np.random.poisson(8, size= 10000)

99
Q

Create a graph to demonstrate/check the normality of a data set with ecdf plots

A

first compute the mean and std of your data set:
setMean = np.mean(df['col1'])
setStd = np.std(df['col1'])

then draw a large sample from a normal distribution with that mean and std:

theoreticalSamples = np.random.normal(setMean, setStd, size = 10000)

then compute both ECDFs and plot them together (ecdf() here is a helper you define yourself that returns the sorted values and their cumulative fractions):

x, y = ecdf(df['col1'])
xtheor, ytheor = ecdf(theoreticalSamples)

_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.plot(xtheor, ytheor, marker = '.', linestyle = 'none')

plt.show()