Pandas Flashcards
DataFrame:
Create a dataframe from scratch
import pandas as pd
First create the lists:
age = [25, 20, 26, 30]
height = [150, 160, 180, 170]
names = ['Mārtiņš', 'Aiga', 'Kristiāns', 'Valters']
column_names = ['Age', 'Height', 'Names']
Keep the lists in the same order as column_names:
list_cols = [age, height, names]
zipped = list(zip(column_names,list_cols))
data = dict(zipped)
df = pd.DataFrame(data)
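The steps on this card can be put together as one runnable sketch. The values are the ones from the card; the final print is only there to check the result.

```python
import pandas as pd

# Sample values taken from the card.
age = [25, 20, 26, 30]
height = [150, 160, 180, 170]
names = ['Mārtiņš', 'Aiga', 'Kristiāns', 'Valters']

column_names = ['Age', 'Height', 'Names']
list_cols = [age, height, names]  # same order as column_names

# Pair each column name with its list, then turn the pairs into a dict.
zipped = list(zip(column_names, list_cols))
data = dict(zipped)

df = pd.DataFrame(data)
print(df.shape)  # (4, 3)
```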
DataFrame:
Add a new column with values 0
df['Salary'] = 0
DataFrame:
Rename columns
df.columns = ['name', 'school', 'age']
DataFrame:
Rename indexes
df.index = ['A', 'B', 'C']
DataFrame:
Create a dictionary from lists
list1 = [ ]
list2 = [ ]
column_names = [ ]
columns = [list1, list2]
created_tuples = list(zip(column_names, columns))
created_dict = dict(created_tuples)
DataFrame:
Get how many rows and columns the DataFrame has
my_dataframe.shape
DataFrame:
Get the column names
my_dataframe.columns
DataFrame:
Slice the dataframe: 1) first 5 rows 2) last 5 rows 3) columns 3 to 5 included 4) every 3rd row
my_data.iloc[:5,:]
my_data.iloc[-5:,:]
my_data.iloc[:,3:6]
my_data.iloc[::3,:]
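A small sketch to check the four slices. The 12 x 8 frame is made up for illustration. Note that iloc is position based and the end of a slice is exclusive, so the first 5 rows are [:5] and columns 3 to 5 are [3:6].

```python
import numpy as np
import pandas as pd

# Hypothetical 12 x 8 frame filled with 0..95, just to have something to slice.
my_data = pd.DataFrame(np.arange(96).reshape(12, 8))

first_five = my_data.iloc[:5, :]     # rows at positions 0-4
last_five = my_data.iloc[-5:, :]     # rows at positions 7-11
cols_3_to_5 = my_data.iloc[:, 3:6]   # columns at positions 3, 4, 5
every_third = my_data.iloc[::3, :]   # rows at positions 0, 3, 6, 9

print(first_five.shape)              # (5, 8)
print(list(every_third.index))       # [0, 3, 6, 9]
```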
DataFrame:
see the first 10 rows quickly
my_data.head(10)
DataFrame:
see the last 8 rows quickly
my_data.tail(8)
DataFrame:
see the column names, their types and count quickly
my_data.info()
DataFrame:
assign a value to some element in DataFrame
my_data.iloc[5,10] = 29
DataFrame:
assign NaN to every 3rd row in the last column.
Which rows will be affected?
import numpy as np
my_data.iloc[::3,-1] = np.nan
Rows 0, 3, 6, ... are affected: nan unchanged unchanged nan unchanged unchanged nan ...
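To see which rows are hit, a quick check on a made-up frame (7 rows, 3 columns of ones; the frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical 7 x 3 frame of floats, so NaN can be stored without a dtype change.
my_data = pd.DataFrame(np.ones((7, 3)))

# Every 3rd row (starting at row 0) of the last column becomes NaN.
my_data.iloc[::3, -1] = np.nan

# Index labels of the rows that now hold NaN in the last column.
affected = my_data.index[my_data.iloc[:, -1].isna()].tolist()
print(affected)  # [0, 3, 6]
```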
DataFrame:
transform DataFrame to numpy array
my_data.values (or my_data.to_numpy(), which the pandas documentation recommends)
DataFrame:
get the index column
my_data.index
DataFrame:
Create a new column with some values
df['new_col'] = 0
DataFrame:
Assign new names for the columns
df.columns = ['name', 'surname', 'age']
DataFrame:
Fully define a pandas csv import with 1) custom column names, 2) multiple values that indicate invalid data, 3) the symbol that separates values, 4) a column converted to datetime, 5) the first few rows skipped.
path = '/folder/file.csv'
col_names = ['name', 'surname', 'age']
pd.read_csv(path, header=None, names=col_names, na_values='-1')
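The answer above covers only points 1 and 2. A sketch covering all five points might look like this. The sample data, the ';' separator, the extra 'joined' column and the '?' marker are assumptions made up for illustration, and point 5 is read here as "skip the first rows of the file" (skiprows).

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines, then ';'-separated data with no header row.
raw = (
    "exported by some tool\n"
    "second junk line\n"
    "Anna;Ozola;31;2020-01-15\n"
    "Janis;Berzins;-1;2021-06-30\n"
    "Ilze;Kalnina;?;2019-11-02\n"
)

col_names = ['name', 'surname', 'age', 'joined']

df = pd.read_csv(
    io.StringIO(raw),          # a real file path is passed the same way
    header=None,               # the file has no header row
    names=col_names,           # 1) custom column names
    na_values=['-1', '?'],     # 2) several values that mean "invalid"
    sep=';',                   # 3) symbol that separates values
    parse_dates=['joined'],    # 4) convert this column to datetime
    skiprows=2,                # 5) skip the first two lines of the file
)

print(df.dtypes)
```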
DataFrame:
If the csv has year, month and day in separate columns, how do you call read_csv so that it combines the 3 columns into one?
pd.read_csv(path, parse_dates=[[0, 1, 2]])
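As far as I know, recent pandas releases (2.2 onward) deprecate the nested-list form of parse_dates, so an alternative is to read the columns as they are and combine them afterwards with pd.to_datetime. A minimal sketch, assuming the three columns are named year, month and day (the sample data is made up):

```python
import io
import pandas as pd

# Hypothetical csv with the date split over three columns.
raw = "year,month,day,value\n2020,1,15,10\n2021,6,30,20\n"
df = pd.read_csv(io.StringIO(raw))

# to_datetime assembles one datetime column from columns named year, month, day.
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# The three source columns are no longer needed.
df = df.drop(columns=['year', 'month', 'day'])
print(df)
```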
DataFrame:
Get rid of one column in a DataFrame. How would you do it?
If the df has more columns than needed, list the columns worth keeping
meaningful_columns = ['name', 'surname']
then reassign df to only those columns
df = df[meaningful_columns]
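Selecting the columns to keep works well when only a few are needed. When only one column has to go, DataFrame.drop is a common alternative. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical one-row frame with one column too many.
df = pd.DataFrame({'name': ['Anna'], 'surname': ['Ozola'], 'age': [31]})

# Option 1: keep only the columns you want.
kept = df[['name', 'surname']]

# Option 2: drop the unwanted column by name; df itself is left unchanged.
dropped = df.drop(columns=['age'])

print(list(dropped.columns))  # ['name', 'surname']
```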
DataFrame:
Export the dataframe to csv or excel
path = 'my_file.csv'
df.to_csv(path)
path2 = 'my_file2.xlsx'
df.to_excel(path2)
DataFrame:
If the data has a column with dates that you want to use as the index, how would you import that csv?
pd.read_csv(path, index_col='dates', parse_dates=True)
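A quick way to confirm that the index really is a DatetimeIndex after the import. The csv contents below are made up for illustration; only the column name dates comes from the card.

```python
import io
import pandas as pd

# Hypothetical csv with a 'dates' column.
raw = "dates,open,close\n2021-01-04,10.0,10.5\n2021-01-05,10.5,10.2\n"

# parse_dates=True asks pandas to parse the index column as dates.
df = pd.read_csv(io.StringIO(raw), index_col='dates', parse_dates=True)

print(type(df.index).__name__)  # DatetimeIndex

# Rows can now be looked up by timestamp.
close_jan_5 = df.loc[pd.Timestamp('2021-01-05'), 'close']
```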
DataFrame:
Show the names of the plotted lines in a legend on the plot.
import matplotlib.pyplot as plt
df['open'].plot(legend=True)
df['close'].plot(legend=True)
plt.show()
DataFrame:
Plot specific columns (not the index).
df.plot(x='Month', y=['salary', 'overhead'])
plt.show()
DataFrame:
Create a scatter plot with different sizes of dots.
df.plot(kind='scatter', x='year', y='age', s=df['size'])
plt.show()
DataFrame:
Plot two columns separately. Create a box plot.
df[['first', 'second']].plot(kind='box', subplots=True)
plt.show()
DataFrame:
Create a CDF and PDF plots in two rows. Scale the vertical axis. Change the horizontal division - make it finer. State the horizontal from … to… values.
fig, axes = plt.subplots(nrows=2, ncols=1)
Plot the PDF
df. fraction.plot(ax=axes[0], kind=’hist’, normed = True, bins = 30, range=(0,.3))
plt. show()
Plot the CDF
df. fraction.plot(ax = axes[1], kind = ‘hist’, normed=True, cumulative = True, bins = 30, range=(0,.3))
plt. show()
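A self-contained sketch of this card. The data is random and made up for illustration, and the Agg backend is an assumption chosen so the code runs without a display; with an interactive backend, drop that line and call plt.show() at the end.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 500 values spread uniformly between 0 and 0.3.
rng = np.random.default_rng(0)
df = pd.DataFrame({'fraction': rng.uniform(0, 0.3, size=500)})

fig, axes = plt.subplots(nrows=2, ncols=1)

# PDF: density=True scales the bars so that their total area is 1.
df['fraction'].plot(ax=axes[0], kind='hist', density=True,
                    bins=30, range=(0, 0.3))

# CDF: cumulative=True accumulates the bars, so the last one reaches 1.
df['fraction'].plot(ax=axes[1], kind='hist', density=True, cumulative=True,
                    bins=30, range=(0, 0.3))

# With an interactive backend, call plt.show() here.
```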
DataFrame:
Get the statistical information about the dataset.
df.describe()