Objective and Motivation
The MultiIndex API has been gaining popularity over the years, however, not everything about it is fully understood in terms of the structure, working, and associated operations.
One important operation is filtering. Filtering is a common requirement, but the use cases are diverse. Accordingly, certain methods and functions will be more applicable to some use cases than others.
In summary, the aim of this post is to touch upon some common filtering problems and use cases, demonstrate various different methods to solve these problems, and discuss their applicability. Some of the high-level questions this post seeks to address are
- Slicing based on a single value/label
- Slicing based on multiple labels from one or more levels
- Filtering on boolean conditions and expressions
- Which methods are applicable in what circumstances
These problems have been broken down into 6 concrete questions, enumerated below. For simplicity, the example DataFrames in the setup below only have two levels, and do not have duplicate index keys. Most solutions presented to the problems can generalize to N levels.
This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).
Questions
Question 1-6 will be asked in context to the setup below.
mux = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw') ], names=['one', 'two']) df = pd.DataFrame({'col': np.arange(len(mux))}, mux) col one two a t 0 u 1 v 2 w 3 b t 4 u 5 v 6 w 7 t 8 c u 9 v 10 d w 11 t 12 u 13 v 14 w 15
Question 1: Selecting a Single Item
How do I select rows having "a" in level "one"?
col
one two
a t 0
u 1
v 2
w 3
Additionally, how would I be able to drop level "one" in the output?
col
two
t 0
u 1
v 2
w 3
Question 1b
How do I slice all rows with value "t" on level "two"?
col
one two
a t 0
b t 4
t 8
d t 12
Question 2: Selecting Multiple Values in a Level
How can I select rows corresponding to items "b" and "d" in level "one"?
col
one two
b t 4
u 5
v 6
w 7
t 8
d w 11
t 12
u 13
v 14
w 15
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?
col
one two
a t 0
w 3
b t 4
w 7
t 8
d w 11
t 12
w 15
Question 3: Slicing a Single Cross Section (x, y)
How do I retrieve a cross section, i.e., a single row having a specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by
col
one two
c u 9
Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]
How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?
col
one two
c u 9
a w 3
Question 5: One Item Sliced per Level
How can I retrieve all rows corresponding to "a" in level "one" and "u" in level "two"?
col
one two
a t 0
u 1
v 2
w 3
b t 4
t 8
d t 12
Question 6: Arbitrary Slicing
How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".
col
one two
a u 1
v 2
b u 5
v 6
d w 11
w 15
Question 7 will use a unique setup consisting of a numeric level:
np.random.seed(0) mux2 = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), np.random.choice(10, size=16) ], names=['one', 'two']) df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2) col one two a 5 0 0 1 3 2 3 3 b 7 4 9 5 3 6 5 7 2 8 c 4 9 7 10 d 6 11 8 12 8 13 1 14 6 15
Question 6: Inequality-based filtering on Numeric Levels
How do I get all rows where values in level "two" are greater than 5?
col
one two
b 7 4
9 5
c 7 10
d 6 11
8 12
8 13
6 15
from How do I slice or filter MultiIndex DataFrame levels?
No comments:
Post a Comment