Pandas Flashcards
Filter df for when Column is null
df[ df.Column.isnull() ]
Filter df for when Column is not null
df[ df.Column.notna() ]
Create a boolean series for when colA >100 AND colB <0
(df.colA >100 & df.colB <0)
Create a boolean series for when colA >100 OR colB <0
(df.colA >100 | df.colB <0)
Return a dataframe’s data types
df.dtypes
Return the dimensions of a dataframe
df.shape
Rename column MANUfacturer as ‘manufacturer’
df.rename( columns= {‘MANUfacturer’ : ‘manufacturer’}, inplace=True)
Convert a string column to a float
df[‘column’] = df[‘column’].astype(float)
Extract first prefix when string column is split by a dash
df.column.str.split(‘-‘).str[0]
Replace values in Column using a mapping dictionary
df.column = df.column.map( {‘Key’ : ‘newkey’, ‘Key1’ : ‘newkey1’ } )
Export dataframe to csv file without index values
df.to_csv(‘filename.csv’, index=False)
Get meta-data information for the columns of a dataframe
df.info()
Get the name of the columns in a dataframe
df.columns
Get descriptive statistics for a column
df.column.describe()
Get frequencies for each unique value in a column
df.column.value_counts()
Get the averages of col_B grouped by col_A
df.groupby(df.col_A).col_B.mean()
Apply the size, min, and max functions to the dataframe grouped by col_A
df.groupby(df.col_A).agg( [‘size’, min, max] )
Create a pivot table where col_V1 is sumed and col_V2 shows the min and max. Have col_1 and col_2 as rows and col_A and col_B as columns. Include grand totals.
df.pivot_table( values=['col_V1', col_V2'], index=['col_1', 'col_2'], columns=['col_A', 'col_B'], aggfunc={ 'col_v1': sum, 'col_V2': [min, max] }, margins=True)
List the index (aka row labels) of a dataframe
df.index
Convert a dataframe (or series) into a numpy array
df.to_numpy()
Assign colA and colB as the index (multiIndex) of the dataframe
df.set_index( [‘colA’, ‘colB’], inplace=True)
Vertically append two dataframes and assign an additional index indicating which df the row came from
pd.concat( [df1,df2], keys=[1, 2])
Do an inner join on the indexes two dataframes and add a suffix to duplicated column names
df1.merge(df2,
left_index=True,
right_index=True,
suffixes= (‘_df1’,’_df2’))
Apply a function element-wise to a series
df. col_name.apply(function_name)
- - OR –
df. col_name.map(function_name)
Apply a function element-wise to a dataframe
df.applymap(function_name)
Apply a function along the columns of a dataframe
df.apply(function_name)
Unpivot a dataframe and rename the variables as ID and the values as ‘fact’
df.melt( id_vars=[col1, col2],
value_vars=[col3, col4], #defaults to all non-id_vars
var_name= ‘ID’,
value_name = ‘fact’)
Return a boolean mask if a regex pattern is found in a certain column
df[col_name].str.contains(pattern)
Extract a regex capture group from a column
df[col_name].str.extract(pattern)
Extract more than one group of patterns from a column
df[col_name].str.extract(pattern_with_multiple_capture_groups)
Replace a regex or string in a column with another string
df[col_name].str.replace(pattern, replacement_string)
Calculate the number of missing values in each column
df.isnull().sum()
Drop rows with any missing values
df.dropna()
Drop specific columns
df.drop(columns_to_drop, axis=1)
Drop columns with less than a certain number of non-null value
df.dropna(thresh = min_nonnull, axis=1)
Replace missing values in a column with another value
df[col_name].fillna(replacement_value)
Show all duplicate rows in a dataframe
df[ df.duplicated( keep=False ) ]
Drop rows with duplicate values in only certain columns. Keep the last duplicate row
df.drop_duplicates( [col_1, col_2], keep=’last’)
Replace values of column_A with values of column B when column_A is less than zero
df. column_A = df.column_A.mask(
df. column_A < 0,
df. column_b)
Resetting the index
df.reset_index(inplace=True)
Do a left join on a shared column named ‘ID;
df1.merge( df2, on=’ID’, how-‘left’)
Fill in missing values of a datafram with zeros
df.fillna(0, inplace=True)
Find the correlations between columns in a dataset
df.corr()
Convert a column if a dataframe into a list
new_list = df.column.tolist()
Get all rows of a dataframe where the value of a column is not in the elements of a list
df[ ~df.column_name.isna( [list_values] ) ]
Sort a dataframe by a col_A descinding and col_B ascending and reset the index
df.sort_values( [‘col_A’, ‘col_B’],
ascending=[False, True],
ignore_index=True,
inplace=True)