Pandas Flashcards
DataFrame:
Create a dataframe from scratch
First create lists age = [25,20,26,30] height = [150,160,180,170] names = ['Mārtiņš','Aiga','Kristiāns','Valters'] column_names = ['Age','Height','Names']
list_cols = [name, age, height]
zipped = list(zip(column_names,list_cols))
data = dict(zipped)
df = pd.DataFrame(data)
DataFrame:
Add a new column with values 0
df[‘Salary’] = 0
DataFrame:
Rename columns
df.columns = [‘name’,’school’,age’]
DataFrame:
Rename indexes
df.index = [‘A’, ‘B’, ‘C’]
DataFrame:
Create a dictionary from lists
list1 = [ ] list2 = [ ] column_names = [ ]
columns = [list1, list2]
created_tuples = list(zip(column_names,columns]
created_dict = dict(created_tuples)
Get how many rows and columns the DataFrame has
my_dataframe.shape
Get the column names
my_dataframe.columns
DataFrame: Slice the dataframe. 1) first 5 rows 2) last 5 rows 3) columns 3 to 5 included 4) each 3rd row
my_data.iloc[:6,:]
my_data.iloc[-5:,:]
my_data.iloc[:,3:6]
my_data.iloc[::3,:]
DataFrame:
see the first 10 rows quickly
my_data.head(10)
DataFrame:
see the last 8 rows quickly
my_data.tail(8)
DataFrame:
see the column names, their types and count quickly
my_data.info()
DataFrame:
assign a value to some element in DataFrame
my_data.iloc[5,10] = 29
DataFrame:
assign NaN to every 3rd row in the last column.
Which rows will be affected?
import numpy as np
my_data.iloc[::3,-1] = np.nan
nan unchanged unchanged nan unchanged unchanged nan ...
DataFrame:
transform DataFrame to numpy array
my_data.values
DataFrame:
get the index column
my_data.index
DataFrame:
Create a new column with some values
df[‘new_col’] = 0
DataFrame:
Assign new names for the columns
df.columns = [‘name’,’surname’,’age’]
DataFrame: Fully define pandas csv import with 1) custom column names, 2) what multiple values indicate invalid data, 3) what symbol separates values, 4) transform a column to datetime, 5) how not to show first few rows.
path = /folder/file.csv
col_names = [‘name’, ‘surname’, ‘age’]
pd.read_csv(path, header = none, names = col_names, na_values = ‘-1’)
DataFrame:
If csv has year, month, day in separate columns, how to read_csv so tha it combines the 3 columns in one.
pd.read_csv(path, parse_dates = [[0 , 1 , 2]] )
DataFrame:
Get rid of one column in dataframe. How would you do it?
If the df has more columns than needed, define meaningul cols names
meaningful_columns = [‘name’, ‘surname’]
assign the columns to itself
df = df[meaningful_columns]
DataFrame:
Export the the dataframe to csv or excel
path = ‘my_file.csv’
df.to_csv(path)
path2 = ‘my_file2.xlsx
df.to_excel(path2)
DataFrame:
If data has a column with dates, that you want to have as index how would you import that csv
pd.read_csv(path, index_col = ‘dates’, parse_dates = True)
DataFrame:
Have the name of the plot lines on the plot.
df[‘open’].plot(legend = True)
df[‘close’].plot(legend = True)
plt.show()
DataFrame:
Plot specific columns (not the index).
df. plot( x = ‘Month’, y = [‘salary’, ‘overhead’]
plt. show()
DataFrame:
Create a scatter plot with different sizes of dots.
df. plot(kind = ‘scatter’, x = ‘year’, y = ‘age’, s = df[‘size’])
plt. show()
DataFrame:
Plot two columns separately. Create a box plot.
df[‘first’, ‘second’].plot(kind = ‘box’, subplots = True)
plt.show()
DataFrame:
Create a CDF and PDF plots in two rows. Scale the vertical axis. Change the horizontal division - make it finer. State the horizontal from … to… values.
fig, axes = plt.subplots(nrows=2, ncols=1)
Plot the PDF
df. fraction.plot(ax=axes[0], kind=’hist’, normed = True, bins = 30, range=(0,.3))
plt. show()
Plot the CDF
df. fraction.plot(ax = axes[1], kind = ‘hist’, normed=True, cumulative = True, bins = 30, range=(0,.3))
plt. show()
DataFrame:
Get the statistical information about the dataset.
df.describe()
DataFrame:
How many non nul entries in a DataFrame column?
df[‘my_col’].count()
DataFrame:
Get the mean of a DataFrame column. What does it do to null values?
df[‘my_col’].mean()
DataFrame:
Get the standart deviation of a DataFrame column. What does it do to null values?
df[‘my_col’].std()
DataFrame:
What is a median? How to get it for a DataFrame column?
df[‘my_col’].median()
DataFrame: First quartile : 25% Second quartile : ....? Third quartile : ....? Get the value for 20 % percentile.
df.quantile(0.2)
DataFrame:
Get the interquantile range between 0.2 and 0.8
df.quantile([0.2, 0.8])
DataFrame:
Get the minimum and maximum values.
df. min()
df. max()
DataFrame:
Get the mean value calculated over
1) each row
2) each column
#mean calculated for each row df.mean(axis = 1)
#mean calculated for each column df.mean(axis = 0)
DataFrame:
What information can you get with df[‘species’].describe() Species column has categorical values.
df['species'].describe() give count top unique frequency - occurrences of top
DataFrame:
Get list of values in a column without repeating.
df[‘species’].unique()
DataFrame:
Add transparency to a plot
df. plot(alpha = 0.8)
plt. show()
DataFrame:
Create two separate DataFrames from one by subsetting- one DF has rows about ‘Dog’ and the other one about ‘Cat’. Main DF has a column animal.
dog_indices = df['Animal'] == 'Dog' cat_indices = df['Animal'] == 'Cat' dog_df = df.loc[dog_indice] cat_df = df.loc[cat_indice]
DataFrame:
Df has a index column with datetime.
1) Get rows in a specific datetime range from a specific column.
2) Get rows on a specific day from a specific column.
df. loc[‘2019-02-25’:’2019-03-25’, ‘product’]
df. loc[‘2019-02-25’, ‘product’]
DataFrame:
Df has a index column with datetime.
1) Get rows in a specific datetime range from a specific column.
2) Get rows on a specific day from a specific column.
df. loc[‘2019-02-25’:’2019-03-25’, ‘product’]
df. loc[‘2019-02-25’, ‘product’]
DataFrame:
Transform a list of datetime strings to df datetime
string = [‘2019-02-25’,’2019-03-25’]
pd.to_datetime(string)
DataFrame:
Have a dataframe df1 adapt to a new datetime index from another dataframe df2.
What will be the result if df1 doesn’t have data on a dates in df2?
df.reindex(df2.index)
DataFrame:
Have a dataframe df1 adapt to a new datetime index from another dataframe df2. If df1 doesn’t have data on all df2 dates, have the empty places be the nearest preceeding entry in df1.
df.reindex(df2.index, method = ‘ffill’)
DataFrame:
Have a dataframe df1 adapt to a new datetime index from another dataframe df2. If df1 doesn’t have data on all df2 dates, have the empty places be the nearest next entry in df1.
df.reindex(df2.index, method = ‘bfill’)
DataFrame:
List contains this:
time_list = [‘20100101 00:00’, ‘20100101 01:00’, ‘20100101 02:00’]
How to transform it to datetime?
pd.to_datetime(time_list, format = ‘%Y-%m-%d %H:%M’)
DataFrame:
We have a dataframe df1 that has hourly entries.
Create a df2 that show daily mean, sum, max, etc.
How to get second, minute, hourly, daily, weekly, monthly, yearly calculations?
How do you get calculations over say 3 hour periods?
df2 = df1.resample(‘D’).mean()
Seconds S mins T or min hours H days D weeks W Months M Yearly A
To get 3 hour periods use 3H.
DataFrame:
How to get the maximum value of mean values in a month if the df has daily entries?
df.resample(‘M’).mean().max()
DataFrame:
Get the rolling mean values over defined period of time.
df.rolling(window = 24).mean()
DataFrame:
read_csv and specify which column should be transformed to datetime objects.
pd.read_csv(path, parse_dates = [‘my dates’])
DataFrame:
Create a df series that has it’s text in upper or lower case.
df[‘company’].str.upper()
DataFrame:
Create a df series that has True False for rows that contain a word ‘ware’.
df[‘company’].str.contains(‘ware’)
DataFrame:
Count the number of rows that has the letter combination ‘ware’ in a column.
df[‘product’].str.contrains(‘ware’).sum()
the first par return series of True / False. True + True = 2 etc.
DataFrame:
Extract the value of hours from df datetime column.
df[‘my date’].dt.hour
DataFrame:
df has the population of the world for every decade. Upsample the df to every year and fill in the blanks with linearly changing values.
df.resample(‘A’).first().interpolate(‘linear’)
DataFrame:
Strip the whitespaces from a df header.
df.columns = df.columns.str.strip()
DataFrame:
Two columns contain date and time. Create series to combine both and have the type of series as Datetime.
pd.to_datetime( la[‘Date (MM/DD/YYYY)’] + ‘ ‘ + la[Wheels-off Time’] )
DataFrame:
How to define line style in plots?
What are the options?
df.plot(style = ‘k.-‘)
colors: k - black b - blue c - cyan r - red
marker type: . dots o circles s squares * stars
line type:
: dotted
- solid
DataFrame:
Have everything bellow a plot line in a color.
df.plot(kind = ‘area’)
DataFrame:
You have a df with columns:
index date temperature
make the date column become the index column.
df.set_index(‘date’, inplace = True)
DataFrame:
You have a string, that contains all names for the columns.
String looks like this: cols = ‘Wban,date,Time,StationType,sky_condition,sky_conditionFlag’
How would you get this become the header of a datafram?
df_cols = cols.split(',') df.columns = df_cols
DataFrame:
You have a list of column names that you want to remove from a df. How to do it?
df_dropped = df.drop(list_to_drop, axis = 1)
DataFrame:
Convert a column to a different data type - a string
df_dropped[‘date’] = df_dropped[‘date’].astype(str)
DataFrame:
If you want to have df indexes to be regular 1, 2 , 3 … etc.
my_series.reset_index()
DataFrame:
Find if there is a relationship between columns!
df.corr()
DataFrame:
extract df2 from df1 such that df2 contains only the rows where df1 column’s values are above, say, 10.
df2 = df1.loc[df1[‘column name’] > 10]
Create a DataFrame with 3 columns:
names = [‘United States’, ‘Australia’, ‘Japan’, ‘India’, ‘Russia’, ‘Morocco’, ‘Egypt’]
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'] dr = [True, False, False, False, True, True, True] cpc = [809, 731, 588, 18, 200, 70, 45]
my_dict = {‘country’: names, ‘drives_right’: dr, ‘cars_per_cap’: cpc}
cars = pd.DataFrame(my_dict)
Extract a row from dataframe in index column as dataframe
1) by a label
2) by number
1) df.loc[[‘ru’]]
2) df.iloc[[2]]
Extract multiple rows and columns by their column and row labels.
1a) specific columns and rows by label
1b) all rows and specific columns by label
1a) specific columns and rows by number
1b) all rows and specific columns by number
1a) df.loc[[‘ru’,’de’],[‘country’,’capital’]]
1b) df. loc[:,[‘country’,’capital’]]
2a) df.iloc[[1,2],[2,3]]
2b) df.iloc[:,[2.3]]
given:
dataframe with index | country | cars_per_cap
output:
for each row print:
“country: cars_per_cap”
for lab, row in cars.iterrows() :
print(lab + ‘: ‘ + str(row[‘cars_per_cap’]))
given:
dataframe with index | country
output:
append the dataframe with a new column COUNTRY that has country in upper case
cars[‘COUNTRY’] = cars[‘country’].apply(str.upper)