M04 - Pandas Flashcards
Series
- One-dimensional, labeled array capable of holding any data type
- Data is linear and has an index that acts like a key in a dictionary
Series syntax
list_var = ['list']
series_var = pandas^.Series(list_var)
^ = pandas can be replaced by whatever alias you assigned when importing the dependency (e.g. pd)
Retrieve a series syntax
series_var
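A minimal sketch of creating and retrieving a Series, assuming pandas is imported under the alias pd. The third school name is illustrative sample data, not part of the flashcards.

```python
import pandas as pd

# Illustrative sample data (assumed for this sketch)
list_var = ["Huang High School", "Figueroa High School", "Sample High School"]

# Convert the list into a Series; pandas assigns a default 0..n-1 index
series_var = pd.Series(list_var)

# Look up a value by its index label, much like a dictionary key
first_school = series_var[0]

print(series_var)
print(first_school)
```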
DataFrame
- 2-dimensional labeled data structure w/ rows and columns of potentially different data types, where data is aligned in a table
DataFrame from Dictionary syntax
var_df = pandas^.DataFrame(dict_var)
^ = pandas can be replaced by whatever alias you assigned when importing the dependency (e.g. pd)
Retrieve a DataFrame syntax
var_df
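A small sketch of building a DataFrame from a dictionary, assuming the alias pd. The dictionary contents are illustrative; each key becomes a column header and each list becomes that column's values.

```python
import pandas as pd

# Illustrative dictionary: keys become column names, lists become column values
dict_var = {
    "ID": [0, 1],
    "School": ["Huang High School", "Figueroa High School"],
    "Type": ["District", "District"],
}

# Convert the dictionary into a DataFrame
var_df = pd.DataFrame(dict_var)

print(var_df)
```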
DataFrame naming best practices
Name with “_df” at the end to distinguish DataFrames from Series and other variables
DataFrame from List(s) syntax
# Create empty _df
var_df = pd.DataFrame()
# Add List to _df
var_df['Column Header of my Choosing'] = list_var
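The same pattern with two lists might look like the sketch below. The list names and column headers are hypothetical, and the lists must be the same length.

```python
import pandas as pd

# Hypothetical lists of equal length
school_names = ["Huang High School", "Figueroa High School"]
school_types = ["District", "District"]

# Create an empty DataFrame, then add each list as its own column
schools_df = pd.DataFrame()
schools_df["School"] = school_names
schools_df["Type"] = school_types

print(schools_df)
```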
3 Main Parts of a DataFrame & how you can access them
- Columns: the top/header row
- Index: row labels down the left-hand margin (numbers by default)
- Values: values in the columns (the data)
Can be accessed w/ the columns, index, and values attributes
Columns attribute syntax + Output
var_df.columns
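Index(['Column1', 'Column2', ...], dtype='object')
'object' is the dtype pandas has traditionally shown for string column names; non-string labels show a different dtype (e.g. integer column labels show 'int64')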
Index attribute syntax + output
var_df.index
RangeIndex(start=0, stop=endIndex, step=increment)
e.g. var_df has 5 entries, incremented by 1:
RangeIndex(start=0, stop=5, step=1)
Values attribute syntax + output
var_df.values
Outputs the values without column names (ex. below has 3 columns: ID, School, Type):
array([[0, 'Huang High School', 'District'],
       [1, 'Figueroa High School', 'District'], ...], dtype=object)
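To see the three parts side by side, here is a sketch that reads the columns, index, and values attributes from one small DataFrame. The two-row data is a shortened stand-in for the example above.

```python
import pandas as pd

var_df = pd.DataFrame({
    "ID": [0, 1],
    "School": ["Huang High School", "Figueroa High School"],
    "Type": ["District", "District"],
})

column_labels = var_df.columns  # the header row
row_index = var_df.index        # the labels down the left-hand margin
data_values = var_df.values     # the data as a NumPy array, no column names

print(column_labels)
print(row_index)
print(data_values)
```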
Convert csv file into DataFrame syntax/example
import os
import pandas as pd
# Declare filename variable for csv
file_to_load = os.path.join('path', 'filename.csv')
# Create DataFrame
file_data_df = pd.read_csv(file_to_load)
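A runnable sketch of the same steps. Because the real path and filename are not given, it first writes a tiny CSV to a temporary folder; the filename schools_sample.csv and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Write a tiny illustrative CSV so the example can run anywhere
folder = tempfile.mkdtemp()
file_to_load = os.path.join(folder, "schools_sample.csv")
with open(file_to_load, "w") as csv_file:
    csv_file.write("ID,School,Type\n")
    csv_file.write("0,Huang High School,District\n")
    csv_file.write("1,Figueroa High School,District\n")

# Read the CSV into a DataFrame; the first line becomes the column headers
file_data_df = pd.read_csv(file_to_load)

print(file_data_df)
```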
head( ) and tail( ) methods: syntax + what they do
var_df.head( ) - returns first 5 rows of DF (default)
var_df.tail( ) - returns last 5 rows of DF (default)
Inserting a number in the ( ) returns that many rows from the top/bottom, e.g. var_df.head(10) returns the first 10 rows
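A quick sketch of head( ) and tail( ) on a 20-row DataFrame. The column name "value" and the data are placeholders.

```python
import pandas as pd

# Placeholder DataFrame with 20 rows (values 0 through 19)
var_df = pd.DataFrame({"value": range(20)})

top_five = var_df.head()    # first 5 rows by default
last_five = var_df.tail()   # last 5 rows by default
top_ten = var_df.head(10)   # first 10 rows

print(top_five)
print(last_five)
```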
count( ) method: what it does + syntax
Provides a count of the rows in each column that contain data. “Null” values are not counted.
var_df.count( )
isnull( ) method: what it does + syntax
Checks each value for missing data. Returns a DataFrame of booleans: True if the value is empty (null), False if not.
var_df.isnull( )
sum( ) method w/ isnull() or notnull(): what it does + syntax + output
Gets the total number of empty values (those marked “True”) in each column
var_df.isnull( ).sum( )
Outputs all column names and sum of “True” values in each column
notnull( ) method: what it does + syntax
Returns T/F, w/ “True” if the value is not empty and “False” if it is empty
var_df.notnull( )
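The count( ), isnull( ), notnull( ), and sum( ) cards can be tried together on one small DataFrame. This is a sketch with made-up data: the "Score" column and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with some missing values
var_df = pd.DataFrame({
    "School": ["Huang High School", "Figueroa High School", None],
    "Score": [79.0, np.nan, np.nan],
})

non_null_counts = var_df.count()          # School: 2, Score: 1
null_mask = var_df.isnull()               # True wherever a value is missing
null_totals = var_df.isnull().sum()       # School: 1, Score: 2
not_null_totals = var_df.notnull().sum()  # same numbers as count()

print(non_null_counts)
print(null_mask)
print(null_totals)
```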
NaN in a DataFrame
Means ‘not a number’; marks a missing value and is not equal to zero (or to any other value, including itself)
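A short sketch of how NaN behaves in comparisons, using NumPy's np.nan. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan_value = np.nan

is_zero = nan_value == 0                # False: NaN is not zero
equals_itself = nan_value == nan_value  # False: NaN never equals anything
detected = pd.isnull(nan_value)         # True: use isnull() to find NaN

print(is_zero, equals_itself, detected)
```

Because equality checks fail, isnull( ) or notnull( ) is the usual way to test for NaN.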
Options for Missing Data
- Do Nothing
- Drop the Row
- Fill in the Row
Do Nothing (missing data) considerations
- NaNs are skipped when calculating sums or averages (pandas default)
- If we multiply/divide with a value that is NaN, the answer will be NaN
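Both considerations can be checked with a small sketch. The Series name and numbers are made up.

```python
import numpy as np
import pandas as pd

# Made-up scores with one missing value
scores = pd.Series([10.0, np.nan, 30.0])

total = scores.sum()     # NaN is skipped: 10 + 30 = 40.0
average = scores.mean()  # NaN is skipped: 40 / 2 = 20.0
doubled = scores * 2     # the NaN entry stays NaN

print(total, average)
print(doubled)
```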
Drop the Row (missing data) considerations
- Removing the row removes all the data associated with that row
1. How much data would be removed if NaNs are dropped?
2. How much would this impact the analysis?
Method to drop a row with NaNs + syntax + note about indexes
dropna( )
var_df.dropna( )
- Indexes do not reset automatically: (0, 1, 2, 3) w/ row 2 dropped is now (0, 1, 3)
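A sketch of dropna( ) and the index gap it leaves, using made-up data. The reset_index(drop=True) call at the end is not covered in these flashcards; it is shown here as one common way to renumber the rows. The row count comparison is one way to answer "how much data would be removed?".

```python
import numpy as np
import pandas as pd

# Made-up data: the row at index 2 has a missing score
var_df = pd.DataFrame({
    "School": ["A", "B", "C", "D"],
    "Score": [1.0, 2.0, np.nan, 4.0],
})

# dropna() returns a new DataFrame; var_df itself is unchanged
cleaned_df = var_df.dropna()
rows_removed = len(var_df) - len(cleaned_df)

print(list(cleaned_df.index))  # index keeps its gap: [0, 1, 3]
print(rows_removed)

# Not from the flashcards: renumber the index after dropping rows
renumbered_df = cleaned_df.reset_index(drop=True)
print(list(renumbered_df.index))
```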