Pandas Flashcards
Create a Pandas dataframe from a dictionary
Use keys as column labels and values as lists of column data, then df = pd.DataFrame(dict)
e.g.:
countries = {'Name': ['UK', 'Germany'],
             'Capital': ['London', 'Berlin']}
Countriesdf = pd.DataFrame(countries)
Set index labels for the rows of a pandas dataframe
pass a list of values to df.index, e.g.:
Countriesdf.index = ['UK', 'GER']
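A minimal runnable sketch of the two cards above, using the same country data. The variable name countries_df is just a snake_case stand-in for Countriesdf.

```python
import pandas as pd

# Keys become column labels, values become the column data.
countries = {"Name": ["UK", "Germany"],
             "Capital": ["London", "Berlin"]}
countries_df = pd.DataFrame(countries)

# Replace the default 0, 1 row labels with custom ones.
countries_df.index = ["UK", "GER"]
print(countries_df)
```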
Import a csv file as a dataframe
Countriesdf = pd.read_csv('file/location/countries.csv')
Make the first column of the csv the index of the dataframe (rather than a column in its own right)
Countriesdf = pd.read_csv('file/location/countries.csv', index_col=0)
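To try index_col without a file on disk, one option is to read the csv from an in-memory string. The csv contents and the "Code" column below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical csv contents; a real call would pass a file path instead.
csv_text = "Code,Name,Capital\nUK,UK,London\nGER,Germany,Berlin\n"

# Default: a fresh 0, 1, ... index and "Code" stays an ordinary column.
with_default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first csv column becomes the row index.
with_code_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_code_index)
```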
Get a single column of a Pandas Dataframe as a Pandas Series
df['column name']
Get a single column of a Pandas Dataframe as a Dataframe (double brackets)
df[['column name']]
Get multiple columns from a Pandas Dataframe
df[['column name1', 'column name2']]
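A short sketch of the difference between single and double brackets, assuming a small two-column dataframe like the countries example: single brackets give a Series, double brackets give a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["UK", "Germany"],
                   "Capital": ["London", "Berlin"]})

as_series = df["Name"]               # one column as a Series
as_frame = df[["Name"]]              # one column, still a DataFrame
two_columns = df[["Name", "Capital"]]

print(type(as_series).__name__, type(as_frame).__name__)
```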
Get the 2nd - 4th (inclusive) rows of a Pandas Dataframe
df[1:4]
(positions start at 0 and the end of the slice is excluded)
Call a row of a Pandas Dataframe with loc, as a Pandas Series
df.loc['row index']
Call 2 rows of a Pandas Dataframe with loc, as a dataframe
df.loc[['row index', 'row index 2']]
Call the intersection of 2 rows and 2 columns of a pandas dataframe by name
df.loc[['row index 1', 'row index 2'], ['column name1', 'column name 2']]
Get whole columns of a dataframe by name with loc
df.loc[:, ['column name1', 'column name 2']]
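The row and column selections above can be checked on a toy dataframe. The column names x and y and the row labels a to e are placeholders chosen for this sketch. Note that loc takes rows first, then columns.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1]},
                  index=["a", "b", "c", "d", "e"])

middle_rows = df[1:4]                    # 2nd to 4th rows (end is excluded)
one_row = df.loc["a"]                    # a single row as a Series
two_rows = df.loc[["a", "b"]]            # two rows as a DataFrame
block = df.loc[["a", "b"], ["x", "y"]]   # rows first, then columns
whole_column = df.loc[:, ["x"]]          # every row of the chosen column(s)

print(middle_rows)
```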
Let’s say you have numpy arrays x = [1,2,3,4,5] and y = [5,4,3,2,1]. How do you use operators with them to get an array of bools corresponding to x/y > 2 or y - x == 2?
np.logical_or(x/y > 2, y - x == 2)
(this gives the array [False, True, False, False, True], just out of interest)
Create a numpy array from a list of numbers 1,2,3,4,5
np.array([1,2,3,4,5])
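Putting the two NumPy cards together as a runnable check. The | form shown at the end is an equivalent spelling for boolean arrays, provided each comparison is wrapped in parentheses.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Element-wise OR of two boolean arrays.
mask = np.logical_or(x / y > 2, y - x == 2)
print(mask)  # [False  True False False  True]

# Same result with the | operator.
mask_operator = (x / y > 2) | (y - x == 2)
```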
Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour of meditation was done?
- select the meditation length column as a Pandas Series and do a comparison on that column:
hourPlus = meditationDF['duration'] > 60
- apply the results of that comparison to the dataframe:
meditationDF[hourPlus]
Let’s say you have a dataframe with a list of days and the total minutes of meditation done on those days. What are the two steps to create a list of days on which more than one hour of meditation was done BUT less than 2 hours was done?
bt1and2 = np.logical_and(meditationDF['duration'] > 60, meditationDF['duration'] < 120)
meditationDF[bt1and2]
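A sketch of both filters on invented data. The day names and durations are made up, and the "day" column name is an assumption (the cards only name 'duration').

```python
import numpy as np
import pandas as pd

# Made-up example data: minutes of meditation per day.
meditation_df = pd.DataFrame({"day": ["Mon", "Tue", "Wed", "Thu"],
                              "duration": [30, 75, 130, 90]})

# Step 1: the comparison yields a boolean Series.
hour_plus = meditation_df["duration"] > 60
# Step 2: the boolean Series selects the matching rows.
over_an_hour = meditation_df[hour_plus]

# More than one hour but less than two.
bt1and2 = np.logical_and(meditation_df["duration"] > 60,
                         meditation_df["duration"] < 120)
between_days = meditation_df.loc[bt1and2, "day"].tolist()
print(between_days)  # ['Tue', 'Thu']
```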
Iterate over the rows of a dataframe and print out every row and label
for label, row in dataframe.iterrows():
    print(label, '\n', row)
Let’s say you’ve got a pandas dataframe with a bunch of columns, and one of them is ‘Day of week’. Iterate over the rows of the dataframe and print out only the value of that column for each row.
for label, row in dataframe.iterrows():
    print(row['Day of week'])
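iterrows is a method of the dataframe itself (not a NumPy function) and yields (label, row) pairs, where each row is a Series. A small sketch with made-up data and placeholder row labels:

```python
import pandas as pd

dataframe = pd.DataFrame({"Day of week": ["Mon", "Tue"],
                          "duration": [30, 75]},
                         index=["first", "second"])

labels = []
days = []
for label, row in dataframe.iterrows():
    labels.append(label)             # the row's index label
    days.append(row["Day of week"])  # one value from the row Series
    print(label, row["Day of week"])
```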
Create a new column in a dataframe by applying a function to an existing column
dataframe["column length"] = dataframe['existing column'].apply(len)
Create a new column in a dataframe by applying a method to an existing column
dataframe["NEW COLUMN"] = dataframe['existing column'].apply(str.upper)
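Both apply cards, run on a made-up string column. For string columns the .str accessor (shown last) is a common alternative that gives the same result here.

```python
import pandas as pd

dataframe = pd.DataFrame({"existing column": ["fox", "meditation"]})

# apply calls the function once per value in the column.
dataframe["column length"] = dataframe["existing column"].apply(len)
dataframe["NEW COLUMN"] = dataframe["existing column"].apply(str.upper)

# Alternative using the string accessor.
upper_via_str = dataframe["existing column"].str.upper()
print(dataframe)
```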
get some summary statistics for a dataframe
dataframe.describe()
Get all values of a dataframe as a 2d numpy array
df.to_numpy()
get column names of a dataframe
dataframe.columns
get the row numbers or names of a dataframe
dataframe.index
get the dimensions of a dataframe
dataframe.shape
Find out if there are any missing entries in a dataframe
dataframe.info()
(compare the non-null count of each column with the total number of entries)
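The inspection cards above, gathered into one sketch on a made-up dataframe with one missing value. The isna() calls are an addition not on the cards: they count missing entries directly rather than reading them off info().

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"day": ["Mon", "Tue", "Wed"],
                   "duration": [30.0, np.nan, 90.0]})

summary = df.describe()            # statistics for the numeric columns
values = df.to_numpy()             # 2d NumPy array of all values
column_names = list(df.columns)
row_labels = list(df.index)
n_rows, n_cols = df.shape

df.info()                          # prints non-null counts per column
missing_per_column = df.isna().sum()
has_missing = bool(df.isna().any().any())
```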
Sort a dataframe by value to get the largest values of "column1" at the top
dataframe.sort_values("column1", ascending=False)
Sort a dataframe by values to get the smallest values of "column1" at the top
dataframe.sort_values("column1")
Sort by one value (column1), then another (column2)
dataframe.sort_values(["column1", "column2"])
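sort_values sorts in ascending order by default, so the smallest values come first unless ascending=False is passed. A quick check on made-up numbers:

```python
import pandas as pd

dataframe = pd.DataFrame({"column1": [2, 1, 2, 3],
                          "column2": [9, 8, 7, 6]})

smallest_first = dataframe.sort_values("column1")
largest_first = dataframe.sort_values("column1", ascending=False)

# Ties in column1 are broken by column2.
by_both = dataframe.sort_values(["column1", "column2"])
print(by_both)
```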
get rows from a dataframe with a column1 value of greater than 5
overFive = dataframe["column1"] > 5
dataframe[overFive]
This can also be done on one line:
dataframe[dataframe['column1'] > 5]
Let’s say you’ve got a dataframe of fox species: ‘foxes’. Get all the rows with an ‘Ear goodness’ of above 5 and with ‘colour’ of ‘standard red’
goodEars = foxes['Ear goodness'] > 5
redOnes = foxes['colour'] == 'standard red'
foxes[goodEars & redOnes]
Can also be done in one line with parentheses around each condition:
foxes[(foxes['Ear goodness'] > 5) & (foxes['colour'] == 'standard red')]
From your dataframe of foxes, get all the fox species that are either of the ‘colour’: ‘standard red’, or ‘black’
redOrBlack = foxes['colour'].isin(['standard red', 'black'])
foxes[redOrBlack]
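A sketch of the combined filters on an invented foxes table. The species names, scores and the "species" column are all made up for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
foxes = pd.DataFrame({
    "species": ["red fox", "arctic fox", "silver fox", "fennec fox"],
    "Ear goodness": [6, 4, 7, 10],
    "colour": ["standard red", "white", "black", "cream"],
})

# & combines conditions element-wise; each condition needs parentheses.
good_red = foxes[(foxes["Ear goodness"] > 5)
                 & (foxes["colour"] == "standard red")]

# isin tests membership in a list of allowed values.
red_or_black = foxes[foxes["colour"].isin(["standard red", "black"])]
print(red_or_black)
```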
Show the mean of a dataframe (meditation) column called "length in minutes"
meditation["length in minutes"].mean()
Show the standard deviation of a dataframe (meditation) column called "length in minutes"
meditation["length in minutes"].std()
Get the maximum value of a column called "length in minutes" from a dataframe called "meditation"
meditation["length in minutes"].max()
Get the minimum value of a column called "length in minutes" from a dataframe called "meditation"
meditation["length in minutes"].min()
Create a function that gets the 30th percentile of a column called "length in minutes" in a dataset called meditation
this is the aggregate method - two steps:
step 1 - define a function to do what you want like this:
def perc30(column):
    return column.quantile(0.3)
step 2 - get the column and use the .agg() method to apply the function like this:
meditation["length in minutes"].agg(perc30)
you can also pass lists of columns and lists of functions for multiple summary statistics.
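The two steps as a runnable sketch, with made-up durations. The last call passes a list of functions to .agg(), mixing built-in names with the custom function. The result is labelled by function name.

```python
import pandas as pd

# Made-up durations for illustration.
meditation = pd.DataFrame({"length in minutes": [10, 20, 30, 40, 50]})

# Step 1: a function that takes a column and returns one number.
def perc30(column):
    return column.quantile(0.3)

# Step 2: apply it with .agg().
p30 = meditation["length in minutes"].agg(perc30)

# Several statistics at once.
summary = meditation["length in minutes"].agg(["mean", "max", perc30])
print(summary)
```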
Get the cumulative sum of a dataframe column
df['column'].cumsum()
Get rid of rows with duplicate values in a column of a dataframe
df.drop_duplicates(subset='column')
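A quick check of both cards on made-up numbers. drop_duplicates keeps the first row for each repeated value and returns a new dataframe, leaving df unchanged.

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 2, 3],
                   "other": ["a", "b", "c", "d"]})

running_total = df["column"].cumsum()

# Rows whose "column" value was already seen are dropped.
deduplicated = df.drop_duplicates(subset="column")
print(deduplicated)
```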