Pandas Flashcards

Question

get the dimensions of a dataframe

Answer 1

dataframe.shape

Answer 2

dataframe.info()

Answer 3

dataframe.sort_values("column1")

Answer 4

dataframe.sort_values("column1", ascending = False)

Answer 5

dataframe.sort_values(["column1", "column2"])

Answer 6

overFive = dataframe["column1"] > 5
dataframe[overFive]
This can also be done on one line: dataframe[dataframe["column1"] > 5]

Answer 7

goodEars = foxes['Ear goodness'] > 5
redOnes = foxes['colour'] == 'standard red'
foxes[goodEars & redOnes]
This can also be done on one line, with parentheses around each condition: foxes[(foxes['Ear goodness'] > 5) & (foxes['colour'] == 'standard red')]
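As a minimal sketch of the filtering answers above, the following applies both the two-step and the one-line form to a small foxes table. The rows are invented purely for illustration. In the one-line form each condition needs its own parentheses because & binds more tightly than > and ==.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
foxes = pd.DataFrame({
    "Ear goodness": [3, 7, 9, 6],
    "colour": ["standard red", "standard red", "black", "standard red"],
})

# Two-step form: build each boolean mask, then combine them with &.
goodEars = foxes["Ear goodness"] > 5
redOnes = foxes["colour"] == "standard red"
two_step = foxes[goodEars & redOnes]

# One-line form: parentheses around each condition are required.
one_line = foxes[(foxes["Ear goodness"] > 5) & (foxes["colour"] == "standard red")]
```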

Answer 8

redOrBlack = foxes['colour'].isin(['standard red', 'black'])
foxes[redOrBlack]

Answer 9

meditation["length in minutes"].mean()

Answer 10

meditation["length in minutes"].std()

Answer 11

meditation["length in minutes"].max()

Answer 12

meditation["length in minutes"].min()

Answer 13

This is the aggregate method, in two steps.
Step 1: define a function that does what you want, like this:
```
def perc30(column):
    return column.quantile(0.3)
```
Step 2: select the column and apply the function with the .agg() method, like this:
meditation["length in minutes"].agg(perc30)
You can also select a list of columns and pass a list of functions to get multiple summary statistics.
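A small sketch of the same idea, assuming a made-up column of session lengths. The second helper, perc70, is hypothetical and only there to show a list of functions being passed to .agg().

```python
import pandas as pd

# Hypothetical session lengths, invented for illustration.
meditation = pd.DataFrame({"length in minutes": [10, 20, 30, 40, 50]})

def perc30(column):
    # 30th percentile of the column
    return column.quantile(0.3)

def perc70(column):
    # 70th percentile of the column (hypothetical second helper)
    return column.quantile(0.7)

# A single custom function returns a single number.
single = meditation["length in minutes"].agg(perc30)

# A list of functions (custom or named by string) returns one value per function.
several = meditation["length in minutes"].agg([perc30, perc70, "median"])
```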

Answer 14

df['column'].cumsum()

Answer 15

df.drop_duplicates(subset = 'column')

Answer 16

df.drop_duplicates(subset = ['col1', 'col2'])
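To see the difference between de-duplicating on one column and on a pair of columns, here is a short sketch on invented data. By default the first occurrence of each duplicate is the one that is kept.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "col1": ["x", "x", "y", "y"],
    "col2": [1, 1, 2, 3],
})

# Keep the first row for each distinct value of col1.
by_one = df.drop_duplicates(subset="col1")

# Keep the first row for each distinct (col1, col2) pair.
by_two = df.drop_duplicates(subset=["col1", "col2"])
```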

Answer 17

df['column1'].value_counts()

Answer 18

df['column1'].value_counts(sort=True)

Answer 19

df['column1'].value_counts(normalize = True)
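A quick sketch of the value_counts() answers on invented data: the plain call returns counts sorted from most to least frequent, and normalize=True turns them into proportions.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({"column1": ["a", "b", "a", "a"]})

counts = df["column1"].value_counts()                     # raw counts
proportions = df["column1"].value_counts(normalize=True)  # fractions of the total
```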

Answer 20

meditation.groupby('day of the week')['quality'].mean()
Note: you can use the agg() method in place of mean() here to apply custom functions, or several functions at once, such as min, max, and sum.

Answer 21

meditation.groupby(['day of the week', 'retreat'])['quality'].agg([min,max,np.mean])
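A hedged sketch of grouping by two columns on invented data. It passes the function names as strings, which newer pandas releases tend to recommend over the bare min, max, and np.mean objects; the result should be the same summary.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
meditation = pd.DataFrame({
    "day of the week": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "retreat": [True, True, False, False, True],
    "quality": [3, 5, 2, 4, 5],
})

# One row per (day, retreat) combination, one column per statistic.
summary = (
    meditation
    .groupby(["day of the week", "retreat"])["quality"]
    .agg(["min", "max", "mean"])
)
```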

Answer 22

meditation.pivot_table(values = "quality", index = "day of the week")

Answer 23

meditation.pivot_table(values = "quality", index = "day of the week", aggfunc = np.median)

Answer 24

meditation.pivot_table(values = 'quality', index = ['day of the week', 'retreat'], aggfunc = [np.mean, np.std], margins = True)
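To show what margins = True adds, here is a deliberately simplified sketch with one index column and one aggregation on invented data. The extra "All" row holds the statistic computed over the whole table.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
meditation = pd.DataFrame({
    "day of the week": ["Mon", "Mon", "Tue", "Tue", "Tue"],
    "quality": [3, 5, 2, 4, 5],
})

# margins=True appends an "All" row with the overall mean.
table = meditation.pivot_table(
    values="quality",
    index="day of the week",
    aggfunc="mean",
    margins=True,
)
```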

Answer 25

meditation.set_index("date")

Answer 26

meditation.reset_index()

Answer 27

meditation.set_index(["date", "on_retreat"])

Answer 28

meditation.sort_index(level=["amount_of_LSD", "on_retreat"], ascending = [True, False])

Answer 29

Ollie.loc["18/03/94": "18/03/19"]

Answer 30

Ollie.loc[[("03/94", "Tues"), ("03/19", "Weds")]]
Pass a list of tuples, each giving the outer and then the inner index value.
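A minimal sketch of selecting specific rows from a two-level index with a list of tuples. The level names month and day, and the mood column, are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data; "month", "day" and "mood" are invented names.
Ollie = (
    pd.DataFrame({
        "month": ["03/94", "03/94", "03/19", "03/19"],
        "day": ["Tues", "Weds", "Tues", "Weds"],
        "mood": [1, 2, 3, 4],
    })
    .set_index(["month", "day"])
    .sort_index()
)

# Each tuple is (outer level value, inner level value).
rows_to_keep = [("03/94", "Tues"), ("03/19", "Weds")]
subset = Ollie.loc[rows_to_keep]
```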

Answer 31

You need to set a column as the index and also sort the index, e.g.: df.set_index("date of birth").sort_index()

Answer 32

df.loc['1994':'2019']
The point is that partial date strings are enough when you slice on a sorted date index.
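The two answers above can be sketched together on invented data: set a date column as the index, sort it, then slice with year-only strings. Both ends of the slice are included, so every date in 2019 falls inside it.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = (
    pd.DataFrame({
        "date of birth": pd.to_datetime(
            ["1990-05-01", "1994-03-18", "2005-07-30", "2019-03-18", "2021-01-01"]
        ),
        "name": ["a", "b", "c", "d", "e"],
    })
    .set_index("date of birth")
    .sort_index()
)

# Partial strings: everything from the start of 1994 to the end of 2019.
subset = df.loc["1994":"2019"]
```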

Answer 33

df.iloc[:5, 2:9]
Positions start at 0 and slicing is end-exclusive, so this returns rows 0 to 4 (the 5th row is number 4) and columns 2 to 8 (the 9th column is number 8).
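A tiny check of the end-exclusive behaviour, assuming a made-up 6 by 10 table whose row and column labels are simply their positions.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x10 table; labels equal positions.
df = pd.DataFrame(np.arange(60).reshape(6, 10))

# First five rows, and the 3rd through 9th columns.
subset = df.iloc[:5, 2:9]
```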

Answer 34

meditation.pivot_table('quality', index = 'day of week', columns = 'retreat')

Answer 35

meditation_pivot.mean() gives the mean of each column (here, each retreat value).
meditation_pivot.mean(axis = 'columns') gives the mean of each row.

Answer 36

```
pivot = df.pivot_table(etc.)
pivot_means = pivot.mean(axis = 'columns')
print(pivot_means[pivot_means == pivot_means.max()])
```
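A worked sketch of the pivot answers on invented data, assuming retreat is stored as "yes"/"no" strings: build the pivot, average across each row, then keep the row whose mean is highest.

```python
import pandas as pd

# Hypothetical data; the "yes"/"no" encoding of retreat is an assumption.
meditation = pd.DataFrame({
    "day of week": ["Mon", "Mon", "Tue", "Tue", "Wed", "Wed"],
    "retreat": ["yes", "no", "yes", "no", "yes", "no"],
    "quality": [4, 2, 5, 3, 3, 1],
})

pivot = meditation.pivot_table("quality", index="day of week", columns="retreat")

column_means = pivot.mean()                # one mean per retreat value
pivot_means = pivot.mean(axis="columns")   # one mean per day

# Keep only the day(s) with the highest row mean.
best = pivot_means[pivot_means == pivot_means.max()]
```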

Answer 37

meditation['length'].hist()

Answer 38

meditation['length'].hist(bins = 15)

Answer 39

dayMeans = meditation.groupby('day of week')['quality'].mean()
dayMeans.plot(kind = "bar", title = "whatever - a good title")
plt.show()

Answer 40

meditation.plot(x = "date", y = "quality", kind = "line")
plt.show()

Answer 41

df.plot(x = 'col1', y = 'col2', kind = 'line', rot = 45)

Answer 42

df.plot(x = 'col1', y = 'col2', kind= 'scatter')

Answer 43

meditation[meditation['retreat'] == True]['quality'].hist()

Answer 44

df['col1'].hist()
df['col2'].hist()
plt.legend(['col1', 'col2'])
plt.show()

Answer 45

df['col1'].hist(alpha = 0.7)

Answer 46

totalDayTime = meditation.groupby('day of the week')['length'].sum()
totalDayTime.plot(kind = "bar")

Answer 47

df.isna().any()

Answer 48

df.isna().sum()

Answer 49

df.dropna()

Answer 50

df.fillna(0)
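The four missing-value answers can be tried together on a small invented table. Note that the pandas method is spelled isna(), with no underscore.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in col1.
df = pd.DataFrame({"col1": [1.0, np.nan, 3.0], "col2": [4.0, 5.0, 6.0]})

has_missing = df.isna().any()     # per column: is anything missing?
missing_counts = df.isna().sum()  # per column: how many values are missing?
dropped = df.dropna()             # drop rows that contain a missing value
filled = df.fillna(0)             # replace missing values with 0
```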

Answer 51

Do this with a list of dictionaries, one dictionary per row, e.g.:
dataframe1 = [
    {'col1': 'row1', 'col2': 'row1'},
    {'col1': 'row2', 'col2': 'row2'}
]
row_by_row_df = pd.DataFrame(dataframe1)

Answer 52

Do this with a dictionary of lists, one list per column, e.g.:
dataframe1 = {
    'col1': ['row1', 'row2'],
    'col2': ['row1', 'row2']
}
col_by_col_df = pd.DataFrame(dataframe1)
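As a quick illustration with invented values, the row-by-row and column-by-column constructions produce the same table when they describe the same data.

```python
import pandas as pd

# Row by row: a list of dictionaries, one per row, keyed by column name.
rows = [
    {"col1": "a", "col2": 1},
    {"col1": "b", "col2": 2},
]
row_by_row_df = pd.DataFrame(rows)

# Column by column: a dictionary of lists, one list per column.
columns = {
    "col1": ["a", "b"],
    "col2": [1, 2],
}
col_by_col_df = pd.DataFrame(columns)
```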

Answer 53

dataframe = pd.read_csv("path.csv")

Answer 54

dataframe.to_csv("path.csv")

Answer 55

```
_ = plt.hist(meditation['quality'])
_ = plt.xlabel('meditation Quality')
_ = plt.ylabel('Frequency')
```

Answer 56

```
bin_edges = [10,20,30,40,50,60,70,80]
_ = plt.hist(df['col1'], bins = bin_edges)
```

Answer 57

_ = plt.hist(df['col1'], bins = 13)

Answer 58

import seaborn as sns
sns.set()

Answer 59

Set the number of bins equal to the square root of the number of samples.

Answer 60

_ = sns.swarmplot(x = 'day of week', y = 'quality', data = meditation)
_ = plt.xlabel('Day of Week')
_ = plt.ylabel('Meditation Quality')
plt.show()

Answer 61

```
x = np.sort(df[data])
n = x.size
y = np.arange(1, n+1)/n
plt.plot(x, y, marker = '.', linestyle = 'none')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
```
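Answer 75 calls an ecdf() helper without defining it. One way it could be written, as a sketch that wraps the same sort-and-rank steps in a function and leaves the plotting out:

```python
import numpy as np

def ecdf(data):
    """Return x (sorted data) and y (fraction of points <= x) for an ECDF."""
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n + 1) / n
    return x, y

# Hypothetical input, invented for illustration.
x, y = ecdf(np.array([3.0, 1.0, 2.0, 4.0]))
```

The returned arrays can be passed straight to plt.plot(x, y, marker = '.', linestyle = 'none').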

Answer 62

empirical cumulative distribution function (ECDF)

Answer 63

_ = plt.show() _ = plt.show() ...etc.

Answer 64

plt.legend((dataSet1, dataSet2, etc.), loc = 'upper right')

Answer 65

np.percentile(df['col1'], [25,50,75])

Answer 66

[create ecdf]
```
percentiles = np.array([2.5, 25, 50, 75, 97.5])
data_ptiles = np.percentile(df['col1'], percentiles)
# the 'D' marker below stands for diamond
_ = plt.plot(data_ptiles, percentiles/100, marker = 'D', color = 'red', linestyle = 'none')
```
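A sketch of the numbers behind that overlay, using the integers 1 to 100 as stand-in data. The percentiles are kept in a NumPy array so that dividing by 100 works element-wise; a plain Python list would raise a TypeError.

```python
import numpy as np

# Hypothetical stand-in data: the integers 1..100.
data = np.arange(1, 101)

percentiles = np.array([2.5, 25, 50, 75, 97.5])
data_ptiles = np.percentile(data, percentiles)  # x positions of the markers
y_positions = percentiles / 100                 # matching heights on the ECDF
```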

Answer 67

```
_ = sns.boxplot(x = 'retreat', y = 'quality', data = meditation)
_ = plt.xlabel('on/off retreat')
_ = plt.ylabel('meditation quality /5')
plt.show()
```

Answer 68

np.cov(df['col1'], df['col2'])

Answer 69

np.corrcoef(df['col1'], df['col2'])
This returns a 2x2 matrix; the correlation coefficient is the entry at [0,1] or, equivalently, [1,0].
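Both np.cov and np.corrcoef return a 2x2 matrix for two inputs, so the value of interest is an off-diagonal entry. A small sketch on invented, perfectly linear data:

```python
import numpy as np
import pandas as pd

# Hypothetical data where col2 is exactly twice col1.
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0, 4.0], "col2": [2.0, 4.0, 6.0, 8.0]})

cov_matrix = np.cov(df["col1"], df["col2"])
covariance = cov_matrix[0, 1]  # off-diagonal entry: covariance of col1 and col2

r = np.corrcoef(df["col1"], df["col2"])[0, 1]  # Pearson correlation coefficient
```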

Answer 70

np.random.random(size = 4)

Answer 71

np.empty(x)

Answer 72

np.random.binomial(10, 0.5, size = 100)

Answer 73

bins = np.arange(0, 11) - 0.5
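Shifting integer edges by 0.5 centres each bin on a whole number, which suits discrete data. The sketch below extends the same idea so that every outcome from 0 to n gets its own bin; the variable n and the use of np.histogram in place of plt.hist are assumptions made for the example.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

n = 10  # number of trials per sample (assumed for this example)
samples = np.random.binomial(n, 0.5, size=100)

# Edges at -0.5, 0.5, ..., n + 0.5: one bin centred on each integer 0..n.
bins = np.arange(0, n + 2) - 0.5

counts, edges = np.histogram(samples, bins=bins)
```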

Answer 74

np.random.poisson(8, size= 10000)

Answer 75

Compute the mean and std of your data set, draw a large sample from a normal distribution with those parameters, then plot the ECDF of the data and of the theoretical sample together:
```
setMean = np.mean(df['col1'])
setStd = np.std(df['col1'])
theoreticalSamples = np.random.normal(setMean, setStd, size = 10000)
x, y = ecdf(df['col1'])
xtheor, ytheor = ecdf(theoreticalSamples)
_ = plt.plot(x, y, marker = '.', linestyle = 'none')
_ = plt.plot(xtheor, ytheor, marker = '.', linestyle = 'none')
plt.show()
```

(99 cards)