pandas2 Flashcards
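basic info about a df. Give two methods. Which one also summarizes categorical columns (unique values, top value, and its count)?
df.describe(include='all') and df.info(). describe(include='all') is the one that adds unique/top/freq for categorical columns.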
sort by col1 first, then by col2, with col1 ascending and col2 descending.
df = df.sort_values(by=['col1', 'col2'], ascending=[True, False]).reset_index(drop=True)
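A minimal sketch of the mixed-direction sort on made-up data (the values are illustrative only). It shows that ties in col1 are broken by col2 in descending order, and that reset_index(drop=True) renumbers the rows afterwards.

```python
import pandas as pd

# Toy data: two rows share each col1 value, so col2 decides their order.
df = pd.DataFrame({
    "col1": [2, 1, 2, 1],
    "col2": [10, 20, 30, 40],
})

# col1 ascending, col2 descending; drop the old index and renumber from 0.
sorted_df = df.sort_values(by=["col1", "col2"], ascending=[True, False]).reset_index(drop=True)
print(sorted_df)
```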
create dummy variable manually
df['col'] = df['col'].map({'value1': 1, 'value2': 0})
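One thing worth checking after a manual mapping: any value missing from the dictionary becomes NaN. The sketch below uses a made-up column with an extra category ('other') and writes to a hypothetical new column, col_dummy, so the original stays visible.

```python
import pandas as pd

# 'other' is deliberately absent from the mapping dictionary.
df = pd.DataFrame({"col": ["value1", "value2", "value1", "other"]})

# Map the two known categories to 1/0; unmapped values turn into NaN.
df["col_dummy"] = df["col"].map({"value1": 1, "value2": 0})

# Count how many rows were left unmapped.
unmapped = df["col_dummy"].isna().sum()
print(df)
print("unmapped rows:", unmapped)
```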
x is a df. you want its column names as an array (e.g. to copy and reorder them)
x.columns.values
delete columns
df = df.drop(columns=['col1', 'col2'])  # or df.drop(['col1', 'col2'], axis=1); with columns=, axis is redundant
the value of the 99th percentile (to use later to remove outliers)
threshold = df['col'].quantile(0.99)
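A possible follow-up step, sketched on toy data: keep only the rows below the threshold. Whether to filter with < or <= is a judgment call; this example assumes strict <.

```python
import pandas as pd

# Toy column holding the integers 1..100.
df = pd.DataFrame({"col": list(range(1, 101))})

# 99th percentile (pandas interpolates linearly by default).
threshold = df["col"].quantile(0.99)

# Keep rows strictly below the threshold, i.e. drop the top tail.
df_no_outliers = df[df["col"] < threshold]
print(threshold, len(df_no_outliers))
```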
create dummy variables for all categorical columns
new_df = pd.get_dummies(df, drop_first=True)
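A small sketch of what get_dummies does to a mixed df (column names and values are made up). Numeric columns pass through unchanged; each categorical column is expanded, and drop_first=True drops the first category so the dummies are not perfectly collinear.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0],
    "brand": ["Audi", "BMW", "Audi"],
})

# 'brand' has two categories; drop_first=True drops 'Audi' and keeps 'brand_BMW'.
new_df = pd.get_dummies(df, drop_first=True)
print(new_df)
```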
rearrange columns of a df. what order should they be in?
df.columns.values  # copy the output array
cols = [PASTE HERE]  # paste it here and rearrange
new_df = df[cols]
order: dependent variable, numeric independent variables, dummies
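The reordering can be sketched like this, with hypothetical column names standing in for a dependent variable (log_price), a numeric independent variable (mileage), and a dummy (brand_BMW).

```python
import pandas as pd

# Columns start in an arbitrary order.
df = pd.DataFrame({
    "brand_BMW": [0, 1],
    "mileage": [120, 80],
    "log_price": [9.1, 9.8],
})

# Desired order: dependent variable, numeric independent variables, dummies.
cols = ["log_price", "mileage", "brand_BMW"]

# Indexing with a list of names returns the columns in that order.
new_df = df[cols]
print(new_df.columns.values)
```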
create a df after linear regression to check test results (target was log-transformed, so np.exp converts back)
y_test = y_test.reset_index(drop=True)
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf['Target'] = np.exp(y_test)
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
df_pf['Difference_perc'] = np.absolute(df_pf['Residual'] / df_pf['Target'] * 100)
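A runnable sketch of the same steps with stand-in data, since no model is fitted here: y_test and y_hat_test are made-up log-scale values, and the shuffled index imitates what a train/test split leaves behind. The reset_index call matters because pandas aligns on index when assigning a Series to a column; without it, Target would not line up with the 0..n-1 index of the predictions.

```python
import numpy as np
import pandas as pd

# Stand-ins for real model output (assumed log-scale target and predictions).
y_test = pd.Series(np.log([100.0, 200.0, 400.0]), index=[7, 2, 5])
y_hat_test = np.log([110.0, 180.0, 400.0])

# Renumber the targets so they align with the prediction rows.
y_test = y_test.reset_index(drop=True)

df_pf = pd.DataFrame(np.exp(y_hat_test), columns=["Prediction"])
df_pf["Target"] = np.exp(y_test)
df_pf["Residual"] = df_pf["Target"] - df_pf["Prediction"]
df_pf["Difference_perc"] = np.absolute(df_pf["Residual"] / df_pf["Target"] * 100)
print(df_pf)
```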
set number of rows to display to n
pd.options.display.max_rows = n
display floats with 2 decimal places
pd.set_option('display.float_format', lambda x: '%.2f' % x)
check for nulls
df.isnull().sum()
turn the confusion matrix from statsmodels into a pd df. Assume the fitted model is results_log
cm_df = pd.DataFrame(results_log.pred_table())
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0: 'Actual 0', 1: 'Actual 1'})
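The labelling step can be tried without fitting a model. In this sketch a hard-coded 2x2 array (made-up counts) stands in for results_log.pred_table(); in statsmodels, rows of that table are actual classes and columns are predicted classes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for results_log.pred_table(): rows = actual, cols = predicted.
table = np.array([[50.0, 10.0],
                  [5.0, 35.0]])

cm_df = pd.DataFrame(table)
cm_df.columns = ["Predicted 0", "Predicted 1"]
cm_df = cm_df.rename(index={0: "Actual 0", 1: "Actual 1"})

# Cells can now be looked up by label, e.g. actual 0 predicted as 1.
false_positives = cm_df.loc["Actual 0", "Predicted 1"]
print(cm_df)
```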
select all rows and the first and second columns
df.iloc[:, 0:2]  # iloc is zero-based and the end of the slice is excluded
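Because iloc positions start at 0 and the slice end is excluded, the first two columns are positions 0 and 1. A quick check on a toy df with columns a, b, c (names are illustrative) shows the difference between the 0:2 and 1:3 slices.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Positions 0 and 1: the first and second columns.
first_two = df.iloc[:, 0:2]

# Positions 1 and 2: the second and third columns.
second_third = df.iloc[:, 1:3]

print(list(first_two.columns), list(second_third.columns))
```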