Pandas Flashcards
pandas
The most popular library for data analysis in Python
What are the two core objects in pandas?
DataFrame
Series
Line of code to import pandas
import pandas as pd
DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value and corresponds to a row and a column
It can be thought of as a bunch of Series joined together
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
   Yes   No
0   50  131
1   21    2
pd.DataFrame()
This is the standard way of constructing a DataFrame
Within the brackets you place a dictionary in which the column names are the keys and the values are lists of entries
The list of row labels is known as the index. It is not part of the dictionary; it is passed separately through the index parameter to name the rows
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.
Series
A sequence of data values
If a DataFrame is a table, a Series is a list. It is in essence a single column of a DataFrame
You can assign row labels with the index parameter, the same way as with a DataFrame, and the single column name can be assigned with the name parameter
e.g. pd.Series([1, 2, 3], index=['Mon', 'Tue', 'Wed'], name='Date')
Reading a dataset into a DataFrame and checking the shape of the data
We use pd.read_csv(file_name) to read the file into a DataFrame
We then use the .shape attribute to check the shape of the data, e.g. DataFrameName.shape
This returns a tuple in the form (number of rows, number of columns)
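A minimal sketch of reading a CSV and checking its shape. The CSV text and the name reviews are made up for illustration, and io.StringIO stands in for a real file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,score\nAnn,90\nBen,85\nCara,77\n"

# read_csv accepts a file path or any file-like object.
reviews = pd.read_csv(io.StringIO(csv_text))

print(reviews.shape)   # (3, 2): 3 rows, 2 columns
print(reviews.head())  # first rows (up to 5)
```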
DataFrameName.head()
Shows the first 5 rows of the DataFrame
Saving a DataFrame as a csv
DataFrameName.to_csv('file_name.csv')
How to access the data held in a column of a dataframe? / How to access a series of a dataframe?
We can access it the same way we would access a value in a dictionary, e.g. DataFrameName['column_name']
Or we can use dot notation, which is like accessing an attribute/property of an object. This only works when the column name is a valid Python identifier (no spaces, for example)
DataFrameName.column_name
Accessing an individual data point from a DataFrame?
We chain two indexing operations
e.g. DataFrameName['column_name'][row label]
We are following the order column first, row second. With the default index the row labels are simply 0, 1, 2, ...
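A small sketch of column access and chained indexing, reusing the Bob/Sue data from the DataFrame card. It assumes the default 0, 1, 2, ... index:

```python
import pandas as pd

df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
                   'Sue': ['Pretty good.', 'Bland.']})

bob_by_key = df['Bob']   # dictionary-style access
bob_by_dot = df.Bob      # dot notation, same Series

# Column first, then the row label within that column.
value = df['Bob'][0]
print(value)  # I liked it.
```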
Using iloc
This is one of pandas' own accessors for retrieving rows or columns from a DataFrame. It works in the opposite order to chained indexing: row first, column second
It selects by integer position, the same way Python normally uses indices (counting from 0, with the end of a slice excluded)
e.g. DataFrameName.iloc[row index, column index]
We can still use index slicing etc. and we can also pass a list of indices
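A short illustration of position-based selection with iloc. The data is made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'Yes': [50, 21, 7], 'No': [131, 2, 40]})

first_row = df.iloc[0]        # whole first row, as a Series
cell = df.iloc[1, 0]          # row 1, column 0 -> 21
top_two = df.iloc[:2, 0]      # slice end is excluded: rows 0 and 1
picked = df.iloc[[0, 2], 1]   # a list of row positions, column 1
```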
Using loc
Similar to iloc in that it is row first, column second, but it selects by label rather than by integer position
DataFrameName.loc[row label, 'column_name']
loc treats slices differently: it uses them inclusively, e.g. [0:10] would return all the rows from 0 to 10 including 0 and 10, whereas iloc would stop at 9
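A sketch contrasting how loc and iloc treat the same slice. The column name value is an assumption for the example:

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

with_loc = df.loc[0:2, 'value']   # labels 0, 1 and 2: end included
with_iloc = df.iloc[0:2, 0]       # positions 0 and 1: end excluded

print(len(with_loc))   # 3
print(len(with_iloc))  # 2
```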
Manipulating the index by making the values of a column the index
We can use the set_index() method
e.g. DataFrameName.set_index('column_name', inplace = True)
Setting inplace = True makes the method modify the DataFrame itself instead of returning a new one
This method makes the values of the selected column the new row labels
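A minimal sketch of turning a column into the index. Note that the pandas method for this is DataFrame.set_index; the column names here are made up:

```python
import pandas as pd

df = pd.DataFrame({'product': ['A', 'B'], 'rating': [4, 2]})

# Without inplace, a new DataFrame is returned and df is unchanged.
indexed = df.set_index('product')

# With inplace=True, df itself is modified and nothing is returned.
df.set_index('product', inplace=True)
```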
Using pipe | and ampersand & for conditional selection
& is used when we are trying to select data with this property AND that property
| is used when we want to select data that has this property OR that property
Each condition must be wrapped in its own parentheses, e.g. (condition_1) & (condition_2)
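A sketch of combining conditions, with made-up country and points columns. Each condition sits in its own parentheses because & and | bind more tightly than comparisons such as == and >=:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'France', 'Italy', 'Spain'],
                   'points': [92, 88, 85, 95]})

# AND: Italian rows that also have at least 90 points.
both = df.loc[(df.country == 'Italy') & (df.points >= 90)]

# OR: rows that are Italian, or have at least 90 points, or both.
either = df.loc[(df.country == 'Italy') | (df.points >= 90)]
```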
isin() method
This lets you select data whose value is in a list of values
We could combine this with loc as follows
DataFrameName.loc[DataFrameName.column_name.isin([list of values we want that are in the column])]
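For instance, with a made-up country column, isin() combined with loc might look like this:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'France', 'Italy', 'Spain'],
                   'points': [92, 88, 85, 95]})

# Keep only the rows whose country is one of the listed values.
selected = df.loc[df.country.isin(['Italy', 'Spain'])]
```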
isnull() method and notnull() method
These two methods help identify all the rows that have no data in a specific column or those that do have data
e.g. DataFrameName.loc[DataFrameName.column_name.isnull()]
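A small sketch of both methods on a made-up price column with one missing value:

```python
import pandas as pd

df = pd.DataFrame({'wine': ['A', 'B', 'C'],
                   'price': [10.0, None, 30.0]})

missing = df.loc[df.price.isnull()]    # rows with no price
present = df.loc[df.price.notnull()]   # rows that have a price
```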
describe() method
Provides a high level summary of a specific column of data including the mean and quartiles etc
e.g. DataFrameName.column_name.describe()
unique() method
Provides a list of all the unique values in a column
e.g. DataFrameName.column_name.unique()
value_counts() method
Returns a Series of the unique values in a particular column and how often each occurs, most frequent first
e.g. df_name.column_name.value_counts()
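A quick illustration of unique() and value_counts() side by side, on a made-up colour column:

```python
import pandas as pd

df = pd.DataFrame({'colour': ['red', 'white', 'red', 'rose', 'red']})

distinct = df.colour.unique()       # values in order of first appearance
counts = df.colour.value_counts()   # Series: value -> number of occurrences

print(counts['red'])  # 3
```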
map() function
Used to substitute each value of a Series according to the input given, which could be a function (a dictionary or another Series also works)
The function you pass to map() should expect a single value from the Series, and return a transformed version of that value. map() returns a new Series where all the values have been transformed by your function.
e.g. df_name.column_name.map(function)
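A sketch of map() with a function that re-centres a made-up points column around its mean. The original column is left untouched:

```python
import pandas as pd

df = pd.DataFrame({'points': [80, 90, 100]})

mean = df.points.mean()  # 90.0

# The lambda receives one value at a time and returns its replacement.
centered = df.points.map(lambda p: p - mean)

print(centered.tolist())  # [-10.0, 0.0, 10.0]
```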
apply()
Similar to map() except it is used to transform a whole DataFrame using a custom function
It takes the custom function as its first argument, plus axis = 0 or 1 (or 'index' or 'columns')
When axis = 0 it applies the function to each column, when it is 1 it applies the function to each row
axis = 'index' is the same as axis = 0 (each column) and axis = 'columns' is the same as axis = 1 (each row)
e.g. DataFrameName.apply(custom_function, axis = 1)
It can also be used on an individual column, e.g. DataFrameName.column_name.apply(custom_function)
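A sketch of the axis options, using made-up columns a and b. With axis=0 the function receives one column at a time; with axis='columns' it receives one row at a time:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: the function is called once per column.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis='columns' (same as axis=1): the function is called once per row.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis='columns')

# On a single column, apply works value by value.
doubled = df.a.apply(lambda x: x * 2)
```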
idxmax() method
Used to get the row label/the index of the maximum value in a series
Series.idxmax(axis = 0, skipna = True)
axis only matters if we are applying idxmax to a whole DataFrame
If skipna = True (the default), NA values are ignored when looking for the maximum
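A minimal example of idxmax() on a Series, with made-up labels. The second Series shows the default skipna behaviour:

```python
import pandas as pd

sales = pd.Series([3, 9, 4], index=['Mon', 'Tue', 'Wed'])
best_day = sales.idxmax()
print(best_day)  # Tue

# NA values are skipped by default (skipna=True).
with_gap = pd.Series([1.0, float('nan'), 5.0], index=['x', 'y', 'z'])
best_label = with_gap.idxmax()
print(best_label)  # z
```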
groupby() method
Used to split the data into groups based on the criteria entered within the brackets such as column names etc
Grouping by more than one column produces a multi index
agg() method
Lets you run multiple functions on your dataset at once
reset_index() method
Used to convert a multi index back to a regular index: the index levels become ordinary columns and the rows are renumbered 0, 1, 2, ...
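A sketch tying groupby(), agg() and reset_index() together. The data and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'country': ['Italy', 'Italy', 'France', 'France'],
                   'variety': ['Red', 'White', 'Red', 'Red'],
                   'price': [10, 20, 30, 50]})

# Grouping by two columns gives a multi index of (country, variety).
summary = df.groupby(['country', 'variety']).price.agg(['min', 'max'])

# reset_index turns the index levels back into ordinary columns.
flat = summary.reset_index()
print(list(flat.columns))  # ['country', 'variety', 'min', 'max']
```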
sort_values() method
Used to sort a Series or DataFrame into an order based on the values it contains, by default this is ascending order
e.g. DataFrameName.sort_values(by = 'len', ascending = True)
We use by to name the column we are sorting on, and if ascending is set to False the result will be in descending order
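A short sketch of sorting by a column in both directions. The name and len columns are made up:

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'len': [3, 1, 2]})

asc = df.sort_values(by='len')                     # smallest first
desc = df.sort_values(by='len', ascending=False)   # largest first

print(asc.name.tolist())   # ['b', 'c', 'a']
print(desc.name.tolist())  # ['a', 'c', 'b']
```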
sort_index() method
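Used to sort a Series or DataFrame into order based on its index (row labels) rather than its values
e.g. DataFrameName.sort_index(axis = 0, ascending = True)
axis = 0 (the default) sorts by the row labels and axis = 1 sorts by the column names; ascending can be True or False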
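A sketch of sort_index() on made-up data. In pandas the axis argument takes 0 (row labels) or 1 (column names), not a column name:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[2, 0, 1])

by_rows = df.sort_index()                      # row labels 0, 1, 2
by_rows_desc = df.sort_index(ascending=False)  # row labels 2, 1, 0
by_cols = df.sort_index(axis=1)                # columns a, b
```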