Module 5: Intro to Pandas Flashcards
Overview and benefits of Pandas
In this phase of the analytic process, preparing the data, Pandas allows us to organize the data, and label it in a logical manner so that we can begin to understand our sample or population. Also referred to as data exploration. Pandas define data structures and tools for easy and intuitive data analysis. The difference between NumPy and Pandas are that Pandas work with heterogenous or tabular data while NumPy mostly works with homogenous numerical arrays.
- NumPy is really good for numerical operations like broadcasting
- Pandas builds on NumPy and uses it on tabular data (most common format Is csv, in its raw format it is a text file with a comma separated value)
What is a series in Pandas
In this phase of the analytic process, preparing the data, Pandas allows us to organize the data, and label it in a logical manner so that we can begin to understand our sample or population. Also referred to as data exploration. Pandas define data structures and tools for easy and intuitive data analysis. The difference between NumPy and Pandas are that Pandas work with heterogenous or tabular data while NumPy mostly works with homogenous numerical arrays.
- NumPy is really good for numerical operations like broadcasting
- Pandas builds on NumPy and uses it on tabular data (most common format Is csv, in its raw format it is a text file with a comma separated value)
What are dataframes in pandas
A pandas dataframe is a 2-dimensional tabular data structure with labeled columns and rows. You can think of dataframes as a group of pandas series where each series represents a column of data.
df = pd.DataFrame(data=[[8, 128, 27.5],
[10, 138.9, 34.5],
[16, 157.3, 91.1],
[6, 116.6, 21.4],
[14, 159.2, 54.4]],
columns=[‘Age’, ‘Height’, ‘Weight’])
We can use the info function to inspect the dataframe.
df.info()
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
Age 5 non-null int64
Height 5 non-null float64
Weight 5 non-null float64
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes
* There are 5 rows of data in the DataFrame with the index as a range of integers from 0 to 4
* The DataFrame has 3 columns of data, column names are Age, Height and Weight
* All columns have 5 records and there are no Null values in any of the columns
* The first column, Age, is a column of integers, the second and third columns, Height and Weight, are floating point numbers, which is also stated in the dtypes: float64(2), int64(1) line
* The last line will show approximately how many bytes of memory are used by pandas to store this DataFrame
Note: it shows that there are 5 entries, if one of the columns like age had anything less than 5, it would indicate to you that there are missing values, but you can confirm this:
Axes and Indexes
Axes and Indexes
To see dimensions of the dataframe we can use the shape method.
prov_support.shape
output:
(13, 4)
The columns method shows a list of all the columns and the index method returns the dataframes index.
prov_support.columns
prov_support.index
an important concept is axes, where axis=0 it is the rows index, and where axis=1 is the full set of column names.
Many methods in pandas use axis as a parameter. Let’s see an example, Pandas has the method sort_index() which will sort a DataFrame based on either index or column headers.
For example, if we want to sort the index based on ascending alphabetical order since they are strings, you would:
prov_support.sort_index(axis=0, ascending=True)
province_name 2016 2017 2018
province
AB Alberta 5772 5943 6157
BC British Columbia 6482 6680 6925
MB Manitoba 3531 3675 3965
NB New Brunswick 2741 2814 2956
NL Newfoundland and Labrador 724 734 750
NS Nova Scotia 3060 3138 3201
NT Northwest Territories 1281 1294 1319
NU Nunavut 1539 1583 1634
ON Ontario 21347 21101 21420
PE Prince Edward Island 584 601 638
QC Quebec 21372 22720 23749
SK Saskatchewan 1565 1613 1673
YT Yukon 946 973 1006
If you want to do the same thing for the columns all you would do is set the parameter axis=1
Hierarchical indexes in Dataframes:
Pandas also supports hierarchical indexes which are multi-level indexes. To introduce the concept here it is:
df_hierarch = pd.DataFrame(data=[4, 7, 2, 5, 6],
columns=[‘Data’],
index=
[[‘a’, ‘a’, ‘b’, ‘b’, ‘a’],
[‘x’, ‘y’, ‘x’, ‘y’, ‘x’]])
Data a x 4 y 7 b x 2 y 5 a x 6
df_hierarch.index
outputs:
MultiIndex(levels=[[‘a’, ‘b’], [‘x’, ‘y’]],
codes=[[0, 0, 1, 1, 0], [0, 1, 0, 1, 0]])
this has two levels, one level has indexes “a” and “b”, and the second level, has “x” and “y”
Applying functions to columns
Often times you want to perform some sort of functions on your rows or columns. If we want to calculate percentage change between the years 2017 to 2018 for example:
def percent_change(years):
yr2017, yr2018 = years
return (yr2018 - yr2017)/yr2017 * 100
this function takes one parameter which is expected to be a pair of data values and assigns them to the yr2017 and yr2018.
prov_support[[‘2017’, ‘2018’]].apply(percent_change, axis = 1)
province
NL 2.179837
PE 6.156406
NS 2.007648
NB 5.046198
QC 4.529049
ON 1.511777
MB 7.891156
SK 3.719777
AB 3.600875
BC 3.667665
YT 3.391572
NT 1.931994
NU 3.221731
dtype: float64
We can create a new column with the values calculated above and add it to the DataFrame:
prov_support[‘per_change’] = prov_support[[‘2017’, ‘2018’]].apply(percent_change, axis = 1)
prov_support
Pandas have an applymap() function which will apply another function to every element in the selected dataframe. If we want to format all number columns as floating point numbers, we can use the lambda function:
prov_support.loc[:,’2017’:’per_change’].applymap(lambda x: ‘%.2f’ % x)
Data aggregation and grouping in Pandas
The term split-apply-combine breaks down a big problem into manageable pieces, operates on each piece independently, and then puts all the pieces back together”
- the data is split into groups based on one or more keys. The keys for splitting are based on one of the axes of the dataframe, either rows axis=0 or columns axis=1
- the next step is apply when a function is applied to each group. The function can perform aggregate, transformation, or filtering operations. In the example above the data is summarized
- in the combine step, the results of the apply functions are merged into a result object which can be an array, dataframe, or series
examples of the functions for the apply step include:
- aggregation: compute a summary statistic for each group , compute sum or mean
- transformation” perform group-specific computations, standardize the data within a group, or replace Nas within a group based on a value of calculated from the data within this group
- filtration: discard some groups, based on group-wise computation that evaluates to true or false, filter out data based on a condition or based on the group sum/mean
Groupby function
First, let’s create a dataframe of a popular dataset:
iris = pd.read_csv(‘iris.data’, sep=’,’,
header=None, # the data file does not contain a header
names=[‘sepal length’,’sepal width’,’petal length’,’petal width’,’class’] # names of columns
)
iris.head()
sepal length sepal width petal length petal width class 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa
Then investigate the data:
iris.info()
<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length 150 non-null float64
sepal width 150 non-null float64
petal length 150 non-null float64
petal width 150 non-null float64
class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
- 5 columns of data,
- 4 columns contain numeric data,
- Object in pandas is a string so the column class is of the string data type,
- there are no Null values,
- and there are 3 classes of irises with 50 rows of data each for a total of 150 rows.
Since the class attribute contains repeating values, we can use it as a key to group the data by class, suppose we want to compute means of all attributes for each class of the flowers which will become our key.
Groupby() method is used to split the data into groups:
iris.groupby(‘class’)
For simplicity, let’s create a new variable for the GroupBy object
iris_grouped = iris.groupby(‘class’)
iris_grouped.groups
output:
{‘Iris-setosa’: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
dtype=’int64’),
‘Iris-versicolor’: Int64Index([50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
dtype=’int64’),
‘Iris-virginica’: Int64Index([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
dtype=’int64’)}
as expected there are 3 groups by the 3 values of the key class.
If we need to look at one group, we use the get_group() method.
# This method of the GroupBy object will output the content of a single group
iris_grouped.get_group(‘Iris-versicolor’)