Pandas Multi Index Flashcards
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index
creates
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
pop = pop.reindex(index)
pop
Reassign the index
Useful for building a multi-level index from tuples
pop[:, 2010]
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
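As a minimal sketch of that partial slice, assuming the population figures from the cards above and a Series named pop built directly on a MultiIndex (the name pop_2010 is only illustrative):

```python
import pandas as pd

# Population figures taken from the cards above.
index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('New York', 2000), ('New York', 2010),
    ('Texas', 2000), ('Texas', 2010),
])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# An empty slice on the first level keeps every state;
# 2010 selects a single value on the second level.
pop_2010 = pop[:, 2010]
print(pop_2010)
```

The result is a Series indexed by state only, because the year level has been consumed by the selection.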
pop_df = pop.unstack()
pop_df
You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df.stack()
Naturally, the stack() method provides the opposite operation:
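A small round-trip sketch, assuming a two-state subset of the population figures from the cards above (the name restored is only illustrative):

```python
import pandas as pd

# A two-state subset of the population data from the cards above.
index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('Texas', 2000), ('Texas', 2010),
])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# unstack() moves the inner level (year) into the columns.
pop_df = pop.unstack()
print(pop_df)

# stack() moves the columns back into an inner index level.
restored = pop_df.stack()
print(restored)
```

Here pop_df has one row per state and one column per year, and restored carries the same values as pop under a two-level index again.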
multi index creation
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
You can construct the MultiIndex from a simple list of arrays giving the index values within each level:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
You can construct it from a list of tuples giving the multiple index values of each point:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
You can even construct it from a Cartesian product of single indices:
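The from_arrays, from_tuples, and from_product calls on the preceding cards all describe the same four entries. A quick sketch to check this, with illustrative variable names:

```python
import pandas as pd

# Three ways of spelling the same two-level index.
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# equals() compares the index entries element by element.
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)
```

from_product is the shortest spelling when every combination of the levels is present; from_tuples and from_arrays also cover indices with missing combinations.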
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists of integer positions that reference these levels; this argument was named labels in older versions of pandas):
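A short sketch of how the internal encoding maps to index entries. It uses the codes keyword, which is the name recent pandas releases use for the integer positions (older releases called the same argument labels); the variable names are illustrative:

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]          # distinct values available on each level
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]   # positions into each level, one list per level

idx = pd.MultiIndex(levels=levels, codes=codes)

# Decode by hand: entry i is (levels[0][codes[0][i]], levels[1][codes[1][i]]).
decoded = [(levels[0][i], levels[1][j]) for i, j in zip(codes[0], codes[1])]
print(list(idx))
print(decoded)
```

Both printed lists contain the same four tuples, which is the sense in which the codes "reference" the levels.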
pop.index.names = ['state', 'year']
pop
Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
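Both routes can be sketched side by side, assuming a two-state subset of the population figures from the cards above (the names named_idx, pop_a, and pop_b are illustrative):

```python
import pandas as pd

values = [33871648, 37253956, 20851820, 25145561]
states_years = [['California', 'Texas'], [2000, 2010]]

# Route 1: pass names= to the constructor.
named_idx = pd.MultiIndex.from_product(states_years, names=['state', 'year'])
pop_a = pd.Series(values, index=named_idx)

# Route 2: build the index unnamed, then set the names afterwards.
pop_b = pd.Series(values, index=pd.MultiIndex.from_product(states_years))
pop_b.index.names = ['state', 'year']

print(pop_a.index.names)
print(pop_b.index.names)
```

Once the levels are named, the names appear as headers above the index when the Series is printed.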
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:
Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:
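A sketch of that selection, reusing the index and columns from the card above. The measurement values are random placeholders invented for illustration, and the names health_data and guido are assumptions rather than part of the cards:

```python
import numpy as np
import pandas as pd

# Hierarchical row index and column index, as on the card above.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values only: 4 rows (year x visit) by 6 columns (subject x type).
rng = np.random.default_rng(0)
data = rng.normal(37.0, 0.5, size=(4, 6)).round(1)

health_data = pd.DataFrame(data, index=index, columns=columns)

# Indexing by the top-level column label returns that person's sub-frame.
guido = health_data['Guido']
print(guido)
```

The guido frame keeps the year and visit row levels and has only the HR and Temp columns left, since the subject level was consumed by the selection.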
Benefit of multi-dimensional indexing across rows and columns
indexing
pop['California', 2000]
We can access single elements by indexing with multiple terms:
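A minimal sketch of element access, assuming a two-state subset of the population figures from the earlier cards (the names single and california are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([
    ('California', 2000), ('California', 2010),
    ('Texas', 2000), ('Texas', 2010),
])
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# One label per level returns a single value.
single = pop['California', 2000]

# Only the outer label returns a Series indexed by the remaining level.
california = pop['California']

print(single)
print(california)
```

Giving fewer labels than there are levels is partial indexing: the result keeps whatever levels were not specified.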