Pandas Flashcards
read a CSV file into a dataframe named data
import pandas as pd
data = pd.read_csv('weights_heights.csv', index_col='Index')
plot histogram
data.plot(y='Height', kind='hist',
          color='red', title='Height')
look at the first five records in the dataframe
data.head(5)
use a lambda function to create a new column in the dataframe by applying a function to two other columns
def make_bmi(height_inch, weight_pound):
    # convert inches to meters and pounds to kilograms, then compute kg / m^2
    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462
    return (weight_pound / KILO_TO_POUND) / (height_inch / METER_TO_INCH) ** 2
data['BMI'] = data.apply(lambda row: make_bmi(row['Height'], row['Weight']), axis=1)
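The row-wise apply above calls a Python function once per row, which can be slow on large dataframes. A minimal sketch of a vectorized equivalent follows, assuming the same Height (inches) and Weight (pounds) columns. The small `sample` frame is made up for illustration and stands in for the real file.

```python
import pandas as pd

METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462

# Hypothetical sample data standing in for weights_heights.csv
sample = pd.DataFrame({"Height": [65.0, 70.0, 72.0],
                       "Weight": [120.0, 150.0, 180.0]})

# Column arithmetic works on whole Series at once, no lambda needed
sample["BMI"] = (sample["Weight"] / KILO_TO_POUND) / (sample["Height"] / METER_TO_INCH) ** 2
```

Both forms should give the same values; the apply version is still handy when the function cannot be written as column arithmetic.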
create a new column that assigns categories based on another column
def weight_category(weight):
    if weight < 120:
        return 1
    elif weight > 150:
        return 3
    else:
        return 2
data['weight_category'] = data['Weight'].apply(weight_category)
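The same three-way split can probably be written without a Python-level function. This is a sketch using `numpy.select`, with the thresholds 120 and 150 taken from the card above and a made-up `sample` frame for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the Weight column matters here
sample = pd.DataFrame({"Weight": [110, 120, 135, 150, 160]})
w = sample["Weight"]

# Conditions are checked in order; rows matching none get the default
sample["weight_category"] = np.select([w < 120, w > 150], [1, 3], default=2)
```

Weights of exactly 120 and 150 fall into category 2, matching the if/elif/else version.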
plot scatterplot
data.plot(x='Weight', y='Height', kind='scatter', title='Height/Weight')
view summary statistics for the features in the dataframe
data.describe()
create a new dataframe X_sub from the first three columns (by position) of dataframe data
X_sub = data.iloc[:,[0, 1, 2]]
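Positional selection breaks silently if the column order changes, so selecting by label may be safer. A small sketch, assuming a made-up `sample` frame whose first three columns are Height, Weight and BMI:

```python
import pandas as pd

# Hypothetical sample data for illustration
sample = pd.DataFrame({"Height": [65.0, 70.0],
                       "Weight": [120.0, 150.0],
                       "BMI": [20.0, 21.5],
                       "weight_category": [2, 2]})

X_by_position = sample.iloc[:, [0, 1, 2]]          # first three columns, by position
X_by_label = sample[["Height", "Weight", "BMI"]]   # same columns, by name
```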
create numpy array X_np from pandas dataframe data
X_np = data.values
load the large data file 'checkins.dat' into Python
checkins = pd.read_csv('checkins.dat', header=0, skipinitialspace=True, names=['lat', 'lng'], usecols=[3, 4], engine='python', sep='|', skipfooter=1)
drop rows of a pandas DataFrame whose value in certain columns is NaN
(Cliffs: just keep the rows where the column, all_integer here, is finite)
1) df = df[np.isfinite(df['all_integer'])]  # requires import numpy as np
2) df.dropna()  # drop all rows that have any NaN values
3) df.dropna(how='all')  # drop a row only if ALL of its columns are NaN
4) df[df.all_integer.notnull()]
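Options 2 and 3 look at every column, not just the one of interest. A minimal sketch of restricting the check to a single column with the `subset` parameter of `dropna`, using a made-up frame (the `other` column is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN appears in different rows of each column
df = pd.DataFrame({"all_integer": [1.0, np.nan, 3.0],
                   "other": [np.nan, 5.0, 6.0]})

by_mask = df[df["all_integer"].notna()]         # boolean-mask form, as in option 4
by_subset = df.dropna(subset=["all_integer"])   # only all_integer is checked
any_nan = df.dropna()                           # drops rows with NaN in any column
```

Here `by_mask` and `by_subset` keep the same rows, while `any_nan` also drops the row whose only NaN is in `other`. None of these calls modify `df` in place; assign the result back to keep it.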