Pandas Pt 1 (Wk 4 UCSD) Flashcards
What is pandas?
a Python library built on NumPy that provides flexible, labeled data structures
how to use the pandas library
import pandas as pd
What is a series in pandas
A one-dimensional, dict-like structure (index, values) that can hold different data types and works w/ most NumPy functions
how to declare a series
ser = pd.Series(data=[values], index=[indices]) (the data= and index= keywords are optional if the arguments are passed in this order)
print the indices of a series
print(ser.index)
retrieve data from a series at a given index 'Bob'
ser['Bob'] or ser.loc['Bob']
retrieve multiple data points in a series with index values
ser[['bob', 'nancy']]
retrieve data from a series by indexing on position
ser.iloc[[1, 2, 3]] (plain ser[[1, 2, 3]] looks up labels when the index holds integers, so use iloc for positions)
test if a given index is present in a series
'bob' in ser (returns a boolean)
can you perform operations on a series, like you can with arrays?
yes. ser * 2 multiplies all values in series by 2
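The Series cards above can be tried together in one short session. This is a minimal sketch; the labels and values are made up for illustration and are not from the course material.

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
ser = pd.Series(data=[100, 200, 300], index=["bob", "nancy", "dan"])

by_label = ser.loc["bob"]        # single value looked up by index label
subset = ser[["bob", "nancy"]]   # several labels at once, returns a Series
by_position = ser.iloc[[1, 2]]   # positions 1 and 2, regardless of labels
has_bob = "bob" in ser           # membership tests the index, not the values
doubled = ser * 2                # element-wise, as with NumPy arrays

print(ser.index)
print(doubled)
```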
What is a dataframe
it’s like a 2d series, where indices become row names, and name of each series becomes col names
How do you create a dictionary with multiple sets of series, which you could then assign to a dataframe?
d = {'one': pd.Series([values], index=[indices]),
     'two': pd.Series([values], index=[indices])}
create a dataframe using a dictionary of series
pd_dataframe = pd.DataFrame(dict_of_series)
retrieve the row names from a dataframe
df.index
retrieve the column names from a dataframe
df.columns
print a dataframe in a nice tabular format
df_var (in a Jupyter notebook, a bare variable name on the last line of a cell renders its value as a table)
create a dataframe with a subset of the rows (indices)
pd.DataFrame(df_var, index=[indices])
create a dataframe with a subset of rows and columns
pd.DataFrame(df_var, index=[indices], columns=[column_values])
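The DataFrame construction cards above can be sketched end to end as follows. The data is made up, and the subset is built from the dict of Series `d` rather than from an existing DataFrame, which is an assumption chosen to keep the example simple.

```python
import pandas as pd

# Hypothetical data: two Series whose indices only partly overlap.
d = {
    "one": pd.Series([100.0, 200.0, 300.0], index=["apple", "ball", "clock"]),
    "two": pd.Series([111.0, 222.0, 333.0, 444.0],
                     index=["apple", "ball", "clock", "dancy"]),
}

df = pd.DataFrame(d)            # Series names become columns, indices become rows
row_labels = list(df.index)     # row names
col_labels = list(df.columns)   # column names

# Keep only some rows and columns; labels missing from a Series become NaN.
subset = pd.DataFrame(d, index=["dancy", "ball"], columns=["two"])

print(df)
print(subset)
```

Row "dancy" exists only in Series "two", so its entry under column "one" is NaN in `df`.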
what happens if you create a data frame from a list of dictionaries?
the keys of each dictionary become the column names, and each dict becomes one row (opp. of when you build w/ a dict of series, where each series becomes a column)
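A small sketch of the list-of-dicts case, using made-up keys and values: each dict turns into one row, and keys that a dict lacks show up as NaN.

```python
import pandas as pd

# Hypothetical records; the keys are illustrative only.
records = [
    {"alex": 1, "joe": 2},
    {"ema": 5, "dora": 10, "alice": 20},
]

df = pd.DataFrame(records)  # one row per dict, one column per distinct key

print(df)
```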
retrieve the values of a column from a dataframe, w/ a column name of 'two'
df['two']
create a third col in a df that equals the product of df columns 'one' and 'two'
df['three'] = df['one'] * df['two']
create a bool col 'flag' based on the values of col 'two' that are greater than 100
df['flag'] = df['two'] > 100
retrieve the value and delete a column from a df
three = df.pop('three')
how to delete col 'two' from a dataframe
del df['two']
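insert a new column at a given position in a df
df.insert(2, 'copy_of_one', df['one']) (2 is the zero-based position of the new column; this appends at the end only when the df has exactly two columns, otherwise use len(df.columns))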
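The column cards above, run in sequence on a small made-up frame. The data is hypothetical; `len(df.columns)` is used as the insert position so the new column lands at the end whatever the current width.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"one": [1, 2, 3], "two": [50, 150, 250]})

df["three"] = df["one"] * df["two"]   # element-wise product of two columns
df["flag"] = df["two"] > 100          # boolean column from a comparison
three = df.pop("three")               # removes the column and returns it as a Series

# A position equal to the current column count appends at the end.
df.insert(len(df.columns), "copy_of_one", df["one"])

del df["two"]                         # deletes in place, returns nothing

print(df)
```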
functions to read CSV, JSON, HTML, and SQL data into pandas
pd.read_csv, pd.read_json, pd.read_html, pd.read_sql_query, pd.read_sql_table
function to read json into pandas
read_json
retrieve the values in a dataframe from rows 1 and 2 (by position)
df_var[1:3] or df_var.iloc[[1, 2]] ## the slice end is exclusive, so 1:2 would return row 1 only; without iloc you have to slice, because df_var[1] looks for a column named 1
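A quick check, on made-up data, that the positional slice and `iloc` select the same rows:

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

sliced = df[1:3]          # positions 1 and 2; the end of the slice is exclusive
picked = df.iloc[[1, 2]]  # the same rows, selected explicitly by position

print(sliced)
```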
ingest csv data into a data frame
df_var_csv = pd.read_csv('filename', sep=',') ## comma separated (the default, so sep can be omitted)
slice out column with name 'ratings' from a dataframe
df['ratings']
specify a column as the index, and then retrieve values on that index
movies = movies.set_index("movieId")
movies.loc[1] ## movieId is now the index, so 1 is a label, not a position
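The CSV and indexing cards can be combined into one runnable sketch. To avoid depending on a file on disk, it reads from an in-memory string; the column names other than `movieId` and all of the values are assumptions made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "movieId,title,rating\n1,Movie A,4.0\n2,Movie B,3.5\n"

movies = pd.read_csv(io.StringIO(csv_text), sep=",")
ratings = movies["rating"]            # slice out one column as a Series

movies = movies.set_index("movieId")  # returns a new frame indexed by movieId
first = movies.loc[1]                 # label lookup: the row whose movieId is 1

print(first)
```

`set_index` returns a new DataFrame rather than modifying in place, which is why the result is assigned back to `movies`.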