Pandas Multi Index Flashcards
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index
Creates a MultiIndex from a list of (state, year) tuples.
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
pop = pop.reindex(index)
pop
Reassign the index.
Useful for turning a tuple-based index into a multi-level one: reindex with the MultiIndex built from the tuples.
pop[:, 2010]
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
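A minimal sketch of this slicing, assuming the population figures from the earlier card. The Series is built directly on the MultiIndex here, and .loc is used as the explicit equivalent of pop[:, 2010]:

```python
import pandas as pd

# Population data keyed by (state, year), taken from the cards above.
index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Empty slice on the first level, label 2010 on the second level.
pop_2010 = pop.loc[:, 2010]
print(pop_2010)
```

The result is indexed by state only, because the year level has been fixed to a single value.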
pop_df = pop.unstack()
pop_df
You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df.stack()
Naturally, the stack() method provides the opposite operation:
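The round trip between the two representations can be sketched as follows, assuming the same population figures as above (the variable name restored is only illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# unstack(): the inner index level (year) becomes the columns.
pop_df = pop.unstack()
print(pop_df)

# stack(): the columns move back into an inner index level.
restored = pop_df.stack()
print(restored)
```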
multi index creation
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
If you pass a dictionary with tuples as keys, Pandas recognizes this and uses a MultiIndex by default.
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
You can construct the MultiIndex from a simple list of arrays giving the index values within each level:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
You can construct it from a list of tuples giving the multiple index values of each point:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
You can even construct it from a Cartesian product of single indices:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists of integer positions that reference those level values). The codes argument was called labels before pandas 0.24.
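As a quick check, the four constructors on these cards should describe the same index. The sketch below uses the codes argument, which is the current name for what older pandas releases called labels; the variable names are only illustrative:

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Internal encoding: each code is a position into the matching level list.
from_codes = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Index.equals compares the index entries element by element.
all_equal = (from_arrays.equals(from_tuples)
             and from_tuples.equals(from_product)
             and from_product.equals(from_codes))
print(all_equal)
```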
pop.index.names = ['state', 'year']
pop
Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:
Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:
Benefit of multi dimensional indexing across columns and rows
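A sketch of how such a frame might be assembled and indexed by person. The values are random placeholders rather than real measurements, and the seed and scaling are assumptions made only for illustration:

```python
import numpy as np
import pandas as pd

# Hierarchical row index and hierarchical columns.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Mock data: random numbers shifted so they loosely resemble readings.
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10   # widen the spread of the HR columns
data += 37

health_data = pd.DataFrame(data, index=index, columns=columns)

# Indexing the top column level by name returns that person's sub-frame.
guido = health_data['Guido']
print(guido)
```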
indexing
pop['California', 2000]
We can access single elements by indexing with multiple terms:
pop.loc['California':'New York']
Partial slicing is available as well, as long as the MultiIndex is sorted:
pop[:, 2000]
With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
pop[pop > 22000000]
Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on Boolean masks:
pop[['California', 'Texas']]
Selection based on fancy indexing also works:
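The selection styles on the indexing cards can be tried together on one Series. This sketch assumes the population data from the earlier cards and uses .loc for the label-based lookups, which is the explicit form of the bracket shorthand shown above:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

single = pop.loc[('California', 2000)]       # one element, both levels given
partial = pop.loc['California':'New York']   # partial slice on a sorted index
large = pop[pop > 22000000]                  # Boolean mask
chosen = pop.loc[['California', 'Texas']]    # fancy indexing on the first level

print(single)
print(partial)
print(large)
print(chosen)
```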
multiply indexed DataFrames
health_data['Guido', 'HR']
(from multi dimension columns)
Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation:
health_data.iloc[:2, :2]
Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection. (The older ix indexer was removed in pandas 1.0.)
health_data.loc[:, ('Bob', 'HR')]
These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
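Working with slices inside these index tuples is awkward: writing a slice directly inside a tuple, as in health_data.loc[(:, 1), (:, 'HR')], is a syntax error. You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example: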
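Both approaches can be compared side by side. The sketch below rebuilds a mock health_data frame (placeholder values, fixed seed chosen only for illustration) and selects the heart rate columns for the first visit of each year:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(1)
health_data = pd.DataFrame(rng.normal(37, 1, size=(4, 6)),
                           index=index, columns=columns)

# Explicit slice() objects inside the tuples.
with_slice = health_data.loc[(slice(None), 1), (slice(None), 'HR')]

# The same selection written with IndexSlice.
idx = pd.IndexSlice
with_idx = health_data.loc[idx[:, 1], idx[:, 'HR']]

print(with_idx)
```

Both forms need a sorted index on the axis being sliced, which from_product provides here.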
data = data.sort_index()
data
A MultiIndex must be sorted before slicing operations such as partial slices.
Slicing an index that is not sorted raises an error; sort_index() fixes this.
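A small sketch of the problem and the fix, using a deliberately unsorted first level. The exact exception type is an assumption about current pandas behavior (an unsorted-index error that subclasses KeyError), so the code catches KeyError broadly:

```python
import numpy as np
import pandas as pd

# First level is ['a', 'c', 'b'], so the index is not lexically sorted.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6.0), index=index)

try:
    data.loc['a':'b']          # partial slice on an unsorted index
    failed = False
except KeyError as err:        # assumption: the error is a KeyError subclass
    failed = True
    print('slice failed:', err)

# Sorting the index makes the partial slice well defined.
data = data.sort_index()
sliced = data.loc['a':'b']
print(sliced)
```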
pop.unstack(level=0)
pop.unstack(level=1)
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
pop.unstack().stack()
The opposite of unstack() is stack(), which here can be used to recover the original series:
pop_flat = pop.reset_index(name='population')
pop_flat
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:
pop_flat.set_index(['state', 'year'])
Often when working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
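The two methods are inverses of each other, which a short round trip makes visible. This sketch assumes the population Series with named levels from the earlier cards; the name restored is only illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Index levels become ordinary columns; the values get the given name.
pop_flat = pop.reset_index(name='population')
print(pop_flat)

# Column values become index levels again.
restored = pop_flat.set_index(['state', 'year'])
print(restored)
```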
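data_mean = health_data.groupby(level='year').mean()
data_mean
We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, grouping on an index level with groupby(level=...) controls which subset of the data the aggregate is computed on. Older pandas releases accepted a level parameter on the aggregation methods themselves, as in mean(level='year'); that parameter was removed in pandas 2.0.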
data_mean.T.groupby(level='type').mean().T
By transposing, grouping on a column level, and transposing back, we can take the mean among levels on the columns as well. Older pandas releases wrote this as mean(axis=1, level='type').
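A sketch of both aggregations written with groupby(level=...), since the level keyword on mean() was removed in pandas 2.0. The mock health_data values and the seed are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(2)
health_data = pd.DataFrame(rng.normal(37, 1, size=(4, 6)),
                           index=index, columns=columns)

# Average the two visits within each year (row level 'year').
data_mean = health_data.groupby(level='year').mean()

# Average across subjects for each measurement type (column level 'type'):
# transpose, group on the level, transpose back.
type_mean = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(type_mean)
```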