Pandas MultiIndex Flashcards

Question 1

Q

index = pd.MultiIndex.from_tuples(index)

index

Answer

A

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

creates a MultiIndex; a population Series indexed by it displays as

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
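
A minimal sketch of the whole card, assuming the population figures shown above are supplied as a plain list (the names tuples and populations are illustrative, not from the original card):

```python
import pandas as pd

# Illustrative inputs: the tuples and figures shown on this card.
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Build the two-level index, then attach it to the data.
index = pd.MultiIndex.from_tuples(tuples)
pop = pd.Series(populations, index=index)
print(pop)
```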

Question 2

Q

pop = pop.reindex(index)

pop

Answer

A

Reassigns the index.

Reindexing with the MultiIndex built from the tuples gives the Series a multi-level index.

Question 3

Q

pop[:, 2010]

Answer

A

To access all data for which the second index level is 2010, we can simply use the Pandas slicing notation.
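
A small sketch of this selection, assuming pop is the population Series from the earlier cards. It uses the explicit .loc indexer, which performs the same label-based lookup as the bracket form on the card:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Empty slice for the first level, a label for the second level.
pop_2010 = pop.loc[:, 2010]
print(pop_2010)
```

The result is indexed by state only, because the level that was fixed to a single label is dropped.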

Question 4

Q

pop_df = pop.unstack()

pop_df

Answer

A

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:

Question 5

Q

pop_df.stack()

Answer

A

Naturally, the stack() method provides the opposite operation:
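
The round trip can be sketched as follows, assuming pop is the population Series from the earlier cards (restored is an illustrative name):

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# unstack() moves the innermost index level (the year) into the columns.
pop_df = pop.unstack()

# stack() moves the columns back into the innermost index level.
restored = pop_df.stack()
print(pop_df)
```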

Question 6

Q

# multi-index creation
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

Answer

A

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:
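
A runnable version of this construction, as a sketch. The seeded generator is an assumption added here so the values are reproducible; the card itself uses np.random.rand:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility

# Two parallel index arrays become the two levels of the row MultiIndex.
df = pd.DataFrame(rng.random((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
print(df)
```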

Question 7

Q

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

Answer

A

If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
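
A short check of what the constructor produces from tuple keys, as a sketch using the dictionary from this card (pop is an illustrative name for the result):

```python
import pandas as pd

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}

# Tuple keys are turned into a two-level MultiIndex.
pop = pd.Series(data)
print(type(pop.index).__name__)
print(pop.index.nlevels)
```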

Question 8

Q

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

Answer

A

You can construct the MultiIndex from a simple list of arrays giving the index values within each level.

Question 9

Q

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

Answer

A

You can construct it from a list of tuples giving the multiple index values of each point:

Question 10

Q

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

Answer

A

You can even construct it from a Cartesian product of single indices:
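
The three class-method constructors on these cards build the same index here. A sketch that checks this (the variable names are illustrative):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Index.equals compares the entries of the two indexes.
print(from_arrays.equals(from_tuples))
print(from_tuples.equals(from_product))
```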

Question 11

Q

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

Answer

A

Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and codes (a list of lists of integer positions that reference these levels). Older pandas releases called the codes argument labels; it was renamed in pandas 0.24 and the old name was removed in 1.0.
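
A sketch of how the internal encoding maps back to index entries, assuming a pandas version that accepts the codes keyword (0.24 or later). Each code is a position in the corresponding level list:

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]

# Entry i is (levels[0][codes[0][i]], levels[1][codes[1][i]]).
encoded = pd.MultiIndex(levels=levels, codes=codes)
expected = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

print(list(encoded))
print(encoded.equals(expected))
```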

Question 12

Q

pop.index.names = ['state', 'year']

pop

Answer

A

Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
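
Both ways of naming the levels can be sketched as follows, assuming the population data from the earlier cards (named_index is an illustrative name):

```python
import pandas as pd

states = ['California', 'New York', 'Texas']
years = [2000, 2010]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Option 1: pass names to the constructor.
named_index = pd.MultiIndex.from_product([states, years],
                                         names=['state', 'year'])

# Option 2: set the names attribute after the fact.
pop = pd.Series(values, index=pd.MultiIndex.from_product([states, years]))
pop.index.names = ['state', 'year']

# Named levels can then be referred to by name instead of position.
print(pop.index.get_level_values('year'))
```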

Question 13

Q

# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

Answer

A

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:
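
The card defines only the row and column indices. A sketch of a complete mock-up follows, where the random values, their scaling, and the seed are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Made-up measurements: 4 rows (year x visit) by 6 columns (subject x type).
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10   # give the HR columns a wider spread
data += 37

health_data = pd.DataFrame(data, index=index, columns=columns)

# Selecting a top-level column label returns that subject's sub-frame.
guido = health_data['Guido']
print(guido)
```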

Question 14

Q

Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:

Answer

A

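Benefit of multidimensional indexing across both rows and columns: selecting a top-level column label (here, a person's name) returns a DataFrame with just that person's data.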

Question 15

Q

# indexing

pop['California', 2000]

Answer

A

We can access single elements by indexing with multiple terms:

Question 16

Q

pop.loc['California':'New York']

Answer

A

Partial slicing is available as well, as long as the MultiIndex is sorted.

Question 17

Q

pop[:, 2000]

Answer

A

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

Question 18

Q

pop[pop > 22000000]

Answer

A

Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on Boolean masks:

Question 19

Q

pop[['California', 'Texas']]

Answer

A

Selection based on fancy indexing also works:

Question 20

Q

# multiply indexed frame (multi-level columns)
health_data['Guido', 'HR']

Answer

A

Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation:

Question 21

Q

health_data.iloc[:2, :2]

Answer

A

Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection. (The older ix indexer has been removed from pandas.)

Question 22

Q

health_data.loc[:, ('Bob', 'HR')]

Answer

A

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:

Question 23

Q

idx = pd.IndexSlice

health_data.loc[idx[:, 1], idx[:, 'HR']]

Answer

A

Slices cannot be written inside an index tuple: an expression such as health_data.loc[(:, 1), (:, 'HR')] is a syntax error. You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation.
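
A sketch of tuple-based and IndexSlice-based selection on a frame shaped like health_data. The values are a simple arange (an assumption made so the selected cells are easy to verify by eye), not real measurements:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                           index=index, columns=columns)

# A tuple picks one fully specified column.
bob_hr = health_data.loc[:, ('Bob', 'HR')]

# IndexSlice allows slices inside each tuple:
# first visit of every year, HR column of every subject.
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```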

Question 24

Q

data = data.sort_index()

data

Answer

A

Sort the MultiIndex before slicing operations.

Partial slices raise an error (pandas.errors.UnsortedIndexError, a KeyError subclass) if the index is not lexically sorted.
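
A sketch of the failure and the fix, using a small made-up Series whose first level is deliberately out of order (the names data_sorted, sliced and slicing_failed are illustrative):

```python
import numpy as np
import pandas as pd

# First-level labels are given out of order, so the index is not sorted.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6), index=index)

try:
    data.loc['a':'b']          # expected to fail on the unsorted index
    slicing_failed = False
except KeyError as err:        # UnsortedIndexError subclasses KeyError
    slicing_failed = True
    print(type(err).__name__, err)

# After sorting, the same partial slice works.
data_sorted = data.sort_index()
sliced = data_sorted.loc['a':'b']
print(sliced)
```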

Question 25

Q

pop.unstack(level=0)

pop.unstack(level=1)

Answer

A

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

Question 26

Q

pop.unstack().stack()

Answer

A

The opposite of unstack() is stack(), which here can be used to recover the original series:

Question 27

Q

pop_flat = pop.reset_index(name='population')

pop_flat

Answer

A

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation.

Question 28

Q

pop_flat.set_index(['state', 'year'])

Answer

A

Often when working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
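
The two methods are inverses of each other, which can be sketched as follows, assuming the population Series with named levels from the earlier cards (restored is an illustrative name):

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Index levels become ordinary columns; the values get the given name.
pop_flat = pop.reset_index(name='population')

# Going back: build the MultiIndex from the two columns.
restored = pop_flat.set_index(['state', 'year'])['population']
print(pop_flat)
```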

Question 29

Q

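data_mean = health_data.groupby(level='year').mean()

data_mean

Answer

A

We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, grouping on an index level with groupby(level=...) controls which subset of the data the aggregate is computed on. Older pandas releases also accepted a level parameter directly, as in health_data.mean(level='year'); that parameter was removed in pandas 2.0.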

Question 30

Q

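data_mean.T.groupby(level='type').mean().T

Answer

A

By transposing, grouping on a column level, and transposing back, we can take the mean among levels on the columns as well. Older pandas releases wrote this with the axis keyword, as data_mean.mean(axis=1, level='type'); the level parameter was removed in pandas 2.0.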
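
A sketch of both aggregations written with groupby(level=...), which works in current pandas releases where mean() no longer takes a level argument. The frame uses simple arange values (an assumption made so the averages can be checked by hand), and by_type is an illustrative name:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                           index=index, columns=columns)

# Average the two visits within each year (row level 'year').
data_mean = health_data.groupby(level='year').mean()

# Average across subjects within each measurement type (column level 'type'):
# transpose so the column levels become row levels, group, transpose back.
by_type = data_mean.T.groupby(level='type').mean().T
print(by_type)
```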