M04 - Pandas Flashcards
Series
- One-dimensional, labeled array capable of holding any data type
- Data is linear and has an index that acts as a key in the dictionary
Series syntax
list_var = [‘list’]
series_var = pandas^.Series(list_var)
^ = pandas can be whatever alias you assign it when importing the dependency
Retrieve a series syntax
series_var
DataFrame
-2-dimensional labeled data structure w/ rows and columns of potentially different data types where data is aligned in a table
DataFrame from Dictionary syntax
var_df = pandas^.DataFrame(dict_var)
^ = pandas can be whatever alias you assign when importing the dependency
Retrieve a DataFrame syntax
var_df
DataFrame naming best practices
Name with “_df” at the end to distinguish DataFrames from Series and Variables
DataFrame from List(s) syntax
#Create empty _df var_df = pd.DataFrame( ) #Add List to _df var_df['Column Header of my Choosing'] = list_var
3 Main Parts of a DataFrame & how you can access them
- Columns: the top/header rows
- Index: Numbers down the left-hand margin
- Values: values in the columns (the data)
Can be accessed w/ the columns, index, and values attributes
Columns attribute syntax + Output
var_df.columns
Index( [‘Column1’ , ‘Column2’ , ….] ), dtype = ‘object’
Object may be other data type? tbd
Index attribute syntax + output
var_df.index
RangeIndex(start = 0 , stop = endIndex , step = increment)
i.e. var_df has 5 entries, incremented by 1
RangeIndex(start = 0, stop = 5, step = 1)
Values attribute syntax + output
var_df.values
Outputs the values without column names (ex. below has 3 columns ID, School, Type):
array( [ [ 0, ‘Huang High School’ , ‘District’ ] ,
[1, ‘Figueroa High School’ , ‘District’] , … ] dtype = object)
Convert csv file into DataFrame syntax/example
# Declare filename variable for csv file_to_load = os.path.join('path' , 'filename.csv')
#Create DataFrame file_data_df = pd.read_csv(file_to_load)
head( ) and tail( ) methods: syntax + what they do
var_df.head( ) - returns top 5 rows of DF
var_df.tail( ) - returns last 5 rows of DF
inserting a number in the ( ) will return that many rows from top/bottom i.e. var_df.head(10) will return top 10 rows
count( ) method: what it does + syntax
Provies a count for the rows for each column containing data. “Null” values are not counted by default.
var_df.count( )
isnull( ) method: what it does + syntax
Determines empty rows. Returns boolean T/F. True if empty, False if not.
var_df.isnull( )
sum( ) method w/ isnull() or notnull(): what it does + syntax + output
Gets total number of empty rows that are marked as “True”
var_df.isnull( ).sum( )
Outputs all column names and sum of “True” values in each column
notnull( ) method: what it does + syntax
Returns T/F, w/ “True” for not empty and “False” if it’s empty value
NaN in a DataFrame
Means ‘not a number’ and cannot be equal to zero
Options for Missing Data
- Do Nothing
- Drop the Row
- Fill in the Row
Do Nothing (missing data) considerations
- NaNs will not be considered in the sum or averages
- If we wish to multiply/divide with a row that has a NaN, the answer will be NaN
Drop the Row (missing data) considerations
- Removing the row removes the all data in associated with that row
1. How much data would be removed if NaNs are dropped?
2. How much would this impact the analysis?
Method to drop a row with NaNs + syntax + note about indexes
dropna( )
var_df.dropna( )
-Indexes do not reset automatically (0, 1, 2, 3) w/ 2 dropped is now (0, 1, 3)
Fill in the Row (missing data) considerations
- Must be used with caution
- Must carefully consider the values you insert for every downstream analysis perfomed
Method to fill in rows w/ NaNs + syntax
fillna( )
var_df.fillna(value/’value’)
6 Common Data Types
- Boolean
- Integer (32bit)
- Integer (64bit)
- Float
- Object
- Datetime
Boolean (Pandas Name/Ex.)
Name: bool
Ex: True and False
Integer (Pandas Name/Ex for both)
Name: int32 or int64
Ex. int32: -2,147,483,648 to 2,147,483,647
Ex. int64: -9,223,3720,036,854,775,808 to 9,223,372,036,854,775,807
Float (Pandas Name/Ex)
float64
Floating Decimal
Object (Pandas Name/Ex)
Name: O or object
Ex: Typically strings; often used as a catchall for columns w/ different data types or other Python objects like tuples, lists, an dictionaries
Datetime (Pandas Name/Ex)
datetime64
Ex: Specific moment in time w/ nanosecond precision
2019-06-03 16:04:00.465107
dtypes attribute + syntax + output
Lets you check the data type of each column on a DataFrame
var_df.dtypes
Returns Column headers w/ Pandas Data name
Both syntaxes: dtypes on specific column
If column has NO SPACES in name:
var_df.column.dtypes
If column has SPACES:
var_df[‘column name’].dtypes
Cleaning/Testing code solutions best practices
Create a copy of or new, separate file for cleaning/testing from the source code you are working on.
tolist( ) method - what it does + syntax
-Will add all data in a specified column to a list
tolist_var = var_df[ “Column Name”].tolist( )
split( ) method - what it does + syntax
- Will split a Python string object on whitespace, or where there is no text
var. split( )
Get length of a split syntax
len(var.split( ) )
set( ) method - what it does + syntax
-Returns all unique items/values in a LIST when the list is added inside parentheses
set(list_var)
strip( ) method - what it does + syntax
- Removes any combination of letters and words that are inside the parentheses
var. strip(“value”)
replace( ) method - what it does + syntax
- Replaces and ‘old’ phrase/string with a new one
var. replace(‘Old’ , ‘New’)
merge( ) method - what it does + syntax
Merges two DataFrames on a common column (think Join)
merged_var_df = pandas.merge(var1_df , var2_df, on = [‘var1_df_columnheader’ , ‘var2_df_columnheader’] )
If you have 2 DataFrames with same info but different column titles, you should:
-Rename the columns to match, this helps avoid duplicate columns or merging issues
unique( ) method - what it does (note the output type it provides) + syntax
-Returns an ARRAY or LIST of all unique values in a given column of a DATAFRAME
varX_df = var_df[‘column_name’].unique( )
Get a count of unique values in a DataFrame column syntax
len(var_df[‘column_name’].unique( ) )
Method + syntax for getting the average of a column
mean( )
var_df[‘column_name’].mean( )
map( ) function - what it does + syntax
- Used for substituting each value in a Series with another value. Where the new value is generated from a function, a dictionary, or a Series
- Note, if there are multiples of a current value, you only need to map it once it will change all instances to the new value
series_var.map( { ‘current value1’ : ‘new value1’ , ‘current value2’ : ‘new value 2’ , … } )
Function + 4 Basic Parts
- Smaller, more manageable piece of code
- Good for repetitive tasks
1. The name, which is what we call the function
2. The parameters, which are values we send to the function
3. The code block, which are the statements under the function that perform the task
4. The return value, which is what the function gives back, or ‘returns’ to use when the task is complete
Function Syntax
def fxn_name( ): (tab)instructions
format( ) function - what it does + syntax
- Used to format a value to a specific format
- I.e. decimal places, adding separators, etc.
“{value : format specification}”.format(value)
Ex. format 92.34 held as my_var
print(““.format(my_var)
Output: 92
Reorder Columns syntax
#Set var w/ column order how you want new_column_order = ['col2' , 'col4' , 'col1' , 'col3']
# Assign a new or same DataFrame to the new column order var_df = var_df[new_column_order]
set_index( ) method - what it does + syntax
-Returns a series wit the index set to a specified column
var_name = var_df.set_index( [ ‘column_name1’ ] ) [“column_name2”]
column_name1 will be the index, column_name2 will be the variable
value_counts( ) method - what it does + syntax
-Returns a Series that counted + totals each unique entry in a column
var_Series = var_df[ ‘column_to_count’].value_counts( )
grouby( ) function - what it does + syntax (with mean( ) )
-Splits an object (like a DataFrame), apply a mathematical operation, and combine the results
var = var_df.groupby( [‘column_name’] ).mean( )
sort_values( ) function - what it does + syntax
- Sorts values in a DataFrame or Series for a given text, index, or column that is passed within the parentheses
- Can add parameter named ‘ascending’ (type: bool), default is ascending=True
var = var_df.sort_values([‘Column_name’], ascending=False)
describe( ) method - what it does/returns
-Runs on DataFrame or Series
-Returns:
+Number of rows in DF or Series
+Average of the rows as mean
+St Dev of the rows as std
+Minimum value of the rows as min
+25th percentile as 25%
+50th percentile as 50%
+75th percentile as 75%
+Maximum value of the rows as max
describe( ) syntax
var_df.describe( )
cut( ) function - what it does + syntax
- Segments and sorts data values into bins
- When making a variable for ranges, must include a value lower than the lowest value (i.e. 0 in the case of school district analysis)
pandas.cut(var_df , var_ranges)
Add a List or Series to a DataFrame syntax
var_df = pandas.DataFrame({‘Col_Name1’: ‘Col_Values1’ , ‘Col_Name2’ : Col_Values2 , … })