Pandas Multi Index Flashcards

1
Q

index = pd.MultiIndex.from_tuples(index)

index

A

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

creates a MultiIndex; a population Series reindexed with it displays as

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
2
Q

pop = pop.reindex(index)

pop

A

Reassigns the index.

Useful for giving a Series a multi-level index built from tuples.
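
The reindexing step can be sketched end to end as follows. This is only an illustration: the variable names `tuples` and `populations` are assumptions, and the figures are the ones shown on the first card.

```python
import pandas as pd

# Index values and population figures taken from the first card.
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Start from a Series whose index is a flat list of tuples.
pop = pd.Series(populations, index=tuples)

# Build a MultiIndex from the same tuples and reindex the Series with it.
index = pd.MultiIndex.from_tuples(tuples)
pop = pop.reindex(index)
print(pop)
```

After the call, `pop.index` has two levels, so the state and the year can be addressed separately.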

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

pop[:, 2010]

A

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
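
A minimal sketch of this slice, assuming the population Series from the earlier cards is rebuilt here with `from_product` (the variable name `pop_2010` is an assumption):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Empty slice on the first level, label 2010 on the second level.
pop_2010 = pop[:, 2010]
print(pop_2010)
```

The result is a singly indexed Series keyed by state, because the year level has been consumed by the selection.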

4
Q

pop_df = pop.unstack()

pop_df

A

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
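
As a rough illustration, assuming the same population Series as on the earlier cards, `unstack()` moves the inner index level (the year) into the columns:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Rows are states, columns are years.
pop_df = pop.unstack()
print(pop_df)
```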

5
Q

pop_df.stack()

A

Naturally, the stack() method provides the opposite operation:

6
Q

multi index creation
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

A

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:
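
A self-contained version of this construction might look like the sketch below. The seeded generator `rng` is an assumption added so the example is repeatable; the card itself uses `np.random.rand`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for repeatability

# Passing a list of two arrays as the index yields a two-level MultiIndex.
df = pd.DataFrame(rng.random((4, 2)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
print(df)

# Selecting on the outer level returns the rows for that label.
print(df.loc['a'])
```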

7
Q
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
A
If you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
8
Q

pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

A

You can construct the MultiIndex from a simple list of arrays giving the index values within each level:

9
Q

pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

A

You can construct it from a list of tuples giving the multiple index values of each point:

10
Q

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

A

You can even construct it from a Cartesian product of single indices:
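
To see that the three class-method constructors on these cards describe the same index, a short check such as the following can help (the variable names are assumptions):

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three hold the same four (letter, number) pairs in the same order.
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)
```

`from_product` is usually the shortest form when every combination of the level values is wanted; `from_tuples` and `from_arrays` also cover indices with missing combinations.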

11
Q

pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

A

Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists of integer positions that reference these level values). Older Pandas versions called the codes argument labels.
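
The sketch below shows how the integer positions map back to level values. It assumes a recent pandas release, where the second argument is named `codes` (older releases called it `labels`); the name `mi` is an assumption.

```python
import pandas as pd

# levels: the distinct values available at each level.
# codes: for each row, the position of its value within the matching level.
mi = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                   codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# Row i is (levels[0][codes[0][i]], levels[1][codes[1][i]]).
print(list(mi))
```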

12
Q

pop.index.names = ['state', 'year']

pop

A

Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
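
Both ways of naming the levels can be compared side by side. This is a small sketch using the population figures from the earlier cards; `values`, `named_at_creation`, and `named_afterwards` are assumed names.

```python
import pandas as pd

states = ['California', 'New York', 'Texas']
years = [2000, 2010]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]

# Option 1: pass names to the constructor.
named_at_creation = pd.Series(
    values,
    index=pd.MultiIndex.from_product([states, years], names=['state', 'year']))

# Option 2: set the names attribute after the fact.
named_afterwards = pd.Series(
    values, index=pd.MultiIndex.from_product([states, years]))
named_afterwards.index.names = ['state', 'year']

print(named_afterwards)
```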

13
Q
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
A

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:
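
One possible way to fill such a frame with mock values is sketched below. The random numbers, the seeded generator `rng`, and the scaling used to make the values look like heart rates and temperatures are all assumptions for illustration, not real measurements.

```python
import numpy as np
import pandas as pd

# Hierarchical indices and columns, as on the card.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Mock data: 4 rows (year, visit) by 6 columns (subject, type).
rng = np.random.default_rng(0)
data = np.round(rng.normal(size=(4, 6)), 1)
data[:, ::2] *= 10   # widen the spread of the HR columns
data += 37           # shift everything to a body-temperature-like baseline

health_data = pd.DataFrame(data, index=index, columns=columns)
print(health_data)

# Indexing the top column level returns one subject's sub-frame.
guido = health_data['Guido']
```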

14
Q

Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place we can, for example, index the top-level column by the person’s name and get a full DataFrame containing just that person’s information:

A

Benefit of multi dimensional indexing across columns and rows

15
Q

indexing

pop['California', 2000]

A

We can access single elements by indexing with multiple terms:

16
Q

pop.loc['California':'New York']

A

Partial slicing is available as well, as long as the MultiIndex is sorted.

17
Q

pop[:, 2000]

A

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
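
The element access and the two kinds of partial selection on the preceding cards can be tried together. This is a sketch that assumes the population Series from the earlier cards; `ca_to_ny` and `year_2000` are assumed names.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Single element: one label per level.
single = pop['California', 2000]

# Label slice on the outer level (both endpoints are included).
ca_to_ny = pop.loc['California':'New York']

# Empty slice on the outer level, one label on the inner level.
year_2000 = pop[:, 2000]

print(single)
print(ca_to_ny)
print(year_2000)
```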

18
Q

pop[pop > 22000000]

A

Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on Boolean masks:

19
Q

pop[['California', 'Texas']]

A

Selection based on fancy indexing also works:
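
A short sketch of the Boolean-mask and fancy-indexing selections, again assuming the population Series from the earlier cards (`large` and `two_states` are assumed names):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Boolean mask: keep entries with more than 22 million people.
large = pop[pop > 22000000]

# Fancy indexing: a list of outer-level labels keeps all their inner rows.
two_states = pop[['California', 'Texas']]

print(large)
print(two_states)
```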

20
Q

multiply indexed frame
health_data['Guido', 'HR']
(from multi-dimensional columns)

A

Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation:

21
Q

health_data.iloc[:2, :2]

A

Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection. (The older ix indexer has been removed from Pandas.)

22
Q

health_data.loc[:, ('Bob', 'HR')]

A

These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:
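
The following sketch applies `iloc` and a tuple inside `loc` to a mock version of the health frame. The random values and the seeded generator are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(rng.normal(37, 5, size=(4, 6)).round(1),
                           index=index, columns=columns)

# Positional view: first two rows and first two columns.
corner = health_data.iloc[:2, :2]

# Label-based: all rows, the single column identified by the tuple.
bob_hr = health_data.loc[:, ('Bob', 'HR')]

print(corner)
print(bob_hr)
```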

23
Q

idx = pd.IndexSlice

health_data.loc[idx[:, 1], idx[:, 'HR']]

A

Working with slices inside these index tuples is awkward: writing a slice directly inside a tuple is a syntax error. You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example:
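
A runnable sketch of the `IndexSlice` selection, using mock values in place of the original data (the random numbers and the name `first_visit_hr` are assumptions):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(rng.normal(37, 5, size=(4, 6)).round(1),
                           index=index, columns=columns)

idx = pd.IndexSlice

# Rows: any year, visit 1. Columns: any subject, type 'HR'.
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```

Both the row index and the column index here are sorted, which this kind of slicing relies on.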

24
Q

data = data.sort_index()

data

A

A MultiIndex must be sorted before slicing operations.

Partial slices on an index that is not properly sorted raise an error.
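
The effect can be reproduced with a deliberately unsorted index. This is a sketch with assumed names and random values; it relies on the unsorted-index error being a subclass of `KeyError`.

```python
import numpy as np
import pandas as pd

# Outer level is in the order a, c, b, so the index is not sorted.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.default_rng(0).random(6), index=index)

# Partial slicing on the unsorted index fails.
raised = False
try:
    data.loc['a':'b']
except KeyError as err:
    raised = True
    print('slice failed:', err)

# After sorting, the same slice works.
data_sorted = data.sort_index()
sliced = data_sorted.loc['a':'b']
print(sliced)
```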

25
Q

pop.unstack(level=0)

pop.unstack(level=1)

A

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

26
Q

pop.unstack().stack()

A

The opposite of unstack() is stack(), which here can be used to recover the original series:
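
The level argument and the round trip can be checked with a short sketch, assuming the named population Series from the earlier cards (`states_as_columns`, `years_as_columns`, and `round_trip` are assumed names):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# level=0 moves the states into the columns; level=1 moves the years.
states_as_columns = pop.unstack(level=0)
years_as_columns = pop.unstack(level=1)

# Unstacking and stacking again recovers the original Series.
round_trip = pop.unstack().stack()

print(states_as_columns)
print(years_as_columns)
```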

27
Q

pop_flat = pop.reset_index(name='population')

pop_flat

A

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:

28
Q

pop_flat.set_index(['state', 'year'])

A

Often when working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
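
The two methods undo each other, as this sketch suggests. It assumes the population Series with its levels named `state` and `year`; `restored` is an assumed name.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
                index=index)

# Index levels become ordinary columns; the values get the given name.
pop_flat = pop.reset_index(name='population')
print(pop_flat)

# Going the other way: build a MultiIndex from the two columns.
restored = pop_flat.set_index(['state', 'year'])
print(restored)
```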

29
Q

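data_mean = health_data.groupby(level='year').mean()

data_mean

A

We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, grouping by an index level with groupby(level=...) controls which subset of the data the aggregate is computed on. (Older Pandas versions accepted a level parameter directly, as in health_data.mean(level='year'); that parameter has since been removed.)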

30
Q

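data_mean.T.groupby(level='type').mean().T

A

By grouping on a level of the columns instead (here by transposing, grouping, and transposing back), we can take the mean among levels on the columns as well. (Older Pandas versions wrote this as data_mean.mean(axis=1, level='type').)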
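
Recent pandas releases no longer accept a `level` argument on `mean()`, so the sketch below expresses both aggregations with `groupby(level=...)` instead. The mock values and the name `type_mean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(rng.normal(37, 5, size=(4, 6)).round(1),
                           index=index, columns=columns)

# Average the two visits within each year (row level 'year').
data_mean = health_data.groupby(level='year').mean()

# Average across subjects for each measurement type (column level 'type'):
# transpose so the column levels become row levels, group, transpose back.
type_mean = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(type_mean)
```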