二、多级索引

1. 多级索引及其表的结构

为了更加清晰地说明具有多级索引的 DataFrame 结构,下面新构造一张表,读者可以忽略这里的构造方法,它们将会在第4小节被更详细地讲解。

In [70]: np.random.seed(0)
In [71]: multi_index = pd.MultiIndex.from_product([list('ABCD'),
   ....:               df.Gender.unique()], names=('School', 'Gender'))
   ....: 
In [72]: multi_column = pd.MultiIndex.from_product([['Height', 'Weight'],
   ....:                df.Grade.unique()], names=('Indicator', 'Grade'))
   ....: 
In [73]: df_multi = pd.DataFrame(np.c_[(np.random.randn(8,4)*5 + 163).tolist(),
   ....:                               (np.random.randn(8,4)*5 + 65).tolist()],
   ....:                         index = multi_index,
   ....:                         columns = multi_column).round(1)
   ....: 
In [74]: df_multi
Out[74]: 
Indicator       Height                           Weight                
Grade         Freshman Senior Sophomore Junior Freshman Senior Sophomore Junior
School Gender                                                          
A      Female    171.8  165.0     167.9  174.2     60.6   55.1      63.3   65.8
       Male      172.3  158.1     167.8  162.2     71.2   71.0      63.1   63.5
B      Female    162.5  165.1     163.7  170.3     59.8   57.9      56.5   74.8
       Male      166.8  163.6     165.2  164.7     62.5   62.8      58.7   68.9
C      Female    170.5  162.0     164.6  158.7     56.9   63.9      60.5   66.9
       Male      150.2  166.3     167.3  159.3     62.4   59.1      64.9   67.1
D      Female    174.3  155.7     163.2  162.1     65.3   66.5      61.8   63.2
       Male      170.7  170.3     163.8  164.9     61.6   63.2      60.9   56.4

下图通过颜色区分,标记了 DataFrame 的结构。与单层索引的表一样,具备元素值、行索引和列索引三个部分。其中,这里的行索引和列索引都是 MultiIndex 类型,只不过 索引中的一个元素是元组 而不是单层索引中的标量。例如,行索引的第四个元素为 ("B", "Male") ,列索引的第二个元素为 ("Height", "Senior") ,这里需要注意,外层连续出现相同的值时,第一次之后出现的会被隐藏显示,使结果的可读性增强。

与单层索引类似, MultiIndex 也具有名字属性,图中的 SchoolGender 分别对应了表的第一层和第二层行索引的名字, IndicatorGrade 分别对应了第一层和第二层列索引的名字。

索引的名字和值属性分别可以通过 namesvalues 获得:

In [75]: df_multi.index.names
Out[75]: FrozenList(['School', 'Gender'])
In [76]: df_multi.columns.names
Out[76]: FrozenList(['Indicator', 'Grade'])
In [77]: df_multi.index.values
Out[77]: 
array([('A', 'Female'), ('A', 'Male'), ('B', 'Female'), ('B', 'Male'),
       ('C', 'Female'), ('C', 'Male'), ('D', 'Female'), ('D', 'Male')],
      dtype=object)
In [78]: df_multi.columns.values
Out[78]: 
array([('Height', 'Freshman'), ('Height', 'Senior'),
       ('Height', 'Sophomore'), ('Height', 'Junior'),
       ('Weight', 'Freshman'), ('Weight', 'Senior'),
       ('Weight', 'Sophomore'), ('Weight', 'Junior')], dtype=object)

如果想要得到某一层的索引,则需要通过 get_level_values 获得:

In [79]: df_multi.index.get_level_values(0)
Out[79]: Index(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'], dtype='object', name='School')

但对于索引而言,无论是单层还是多层,用户都无法通过 index_obj[0] = item 的方式来修改元素,也不能通过 index_name[0] = new_name 的方式来修改名字,关于如何修改这些属性的话题将在第三节被讨论。

2. 多级索引中的loc索引器

熟悉了结构后,现在回到原表,将学校和年级设为索引,此时的行为多级索引,列为单级索引,由于默认状态的列索引不含名字,因此对应于刚刚图中 IndicatorGrade 的索引名位置是空缺的。

In [80]: df_multi = df.set_index(['School', 'Grade'])
In [81]: df_multi.head()
Out[81]: 
                                                   Name  Gender  Weight Transfer
School                        Grade                                     
Shanghai Jiao Tong University Freshman     Gaopeng Yang  Female    46.0        N
Peking University             Freshman   Changqiang You    Male    70.0        N
Shanghai Jiao Tong University Senior            Mei Sun    Male    89.0        N
Fudan University              Sophomore    Xiaojuan Sun  Female    41.0        N
                              Sophomore     Gaojuan You    Male    74.0        N

由于多级索引中的单个元素以元组为单位,因此之前在第一节介绍的 lociloc 方法完全可以照搬,只需把标量的位置替换成对应的元组。

当传入元组列表或单个元组或返回前二者的函数时,需要先进行索引排序以避免性能警告:

In [82]: with warnings.catch_warnings():
   ....:     warnings.filterwarnings('error')
   ....:     try:
   ....:         df_multi.loc[('Fudan University', 'Junior')].head()
   ....:     except Warning as w:
   ....:         Warning_Msg = w
   ....: 
In [83]: Warning_Msg
Out[83]: pandas.errors.PerformanceWarning('indexing past lexsort depth may impact performance.')
In [84]: df_sorted = df_multi.sort_index()
In [85]: df_sorted.loc[('Fudan University', 'Junior')].head()
Out[85]: 
                                  Name  Gender  Weight Transfer
School           Grade                                 
Fudan University Junior      Yanli You  Female    48.0        N
                 Junior  Chunqiang Chu    Male    72.0        N
                 Junior   Changfeng Lv    Male    76.0        N
                 Junior     Yanjuan Lv  Female    49.0      NaN
                 Junior  Gaoqiang Zhou  Female    43.0        N
In [86]: df_sorted.loc[[('Fudan University', 'Senior'),
   ....:               ('Shanghai Jiao Tong University', 'Freshman')]].head()
   ....: 
Out[86]: 
                                    Name  Gender  Weight Transfer
School           Grade                                   
Fudan University Senior  Chengpeng Zheng  Female    38.0        N
                 Senior        Feng Zhou  Female    47.0        N
                 Senior        Gaomei Lv  Female    34.0        N
                 Senior        Chunli Lv  Female    56.0        N
                 Senior   Chengpeng Zhou    Male    81.0        N
In [87]: df_sorted.loc[df_sorted.Weight > 70].head() # 布尔列表也是可用的
Out[87]: 
                                     Name Gender  Weight Transfer
School           Grade                                   
Fudan University Freshman       Feng Wang   Male    74.0        N
                 Junior     Chunqiang Chu   Male    72.0        N
                 Junior      Changfeng Lv   Male    76.0        N
                 Senior    Chengpeng Zhou   Male    81.0        N
                 Senior    Chengpeng Qian   Male    73.0        Y
In [88]: df_sorted.loc[lambda x:('Fudan University','Junior')].head()
Out[88]: 
                                  Name  Gender  Weight Transfer
School           Grade                                 
Fudan University Junior      Yanli You  Female    48.0        N
                 Junior  Chunqiang Chu    Male    72.0        N
                 Junior   Changfeng Lv    Male    76.0        N
                 Junior     Yanjuan Lv  Female    49.0      NaN
                 Junior  Gaoqiang Zhou  Female    43.0        N

当使用切片时需要注意,在单级索引中只要切片端点元素是唯一的,那么就可以进行切片,但在多级索引中,无论元组在索引中是否重复出现,都必须经过排序才能使用切片,否则报错:

In [89]: try:
   ....:     df_multi.loc[('Fudan University', 'Senior'):].head()
   ....: except Exception as e:
   ....:     Err_Msg = e
   ....: 
In [90]: Err_Msg
Out[90]: pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
In [91]: df_sorted.loc[('Fudan University', 'Senior'):].head()
Out[91]: 
                                    Name  Gender  Weight Transfer
School           Grade                                   
Fudan University Senior  Chengpeng Zheng  Female    38.0        N
                 Senior        Feng Zhou  Female    47.0        N
                 Senior        Gaomei Lv  Female    34.0        N
                 Senior        Chunli Lv  Female    56.0        N
                 Senior   Chengpeng Zhou    Male    81.0        N
In [92]: df_unique = df.drop_duplicates(subset=['School','Grade']
   ....:                               ).set_index(['School', 'Grade'])
   ....: 
In [93]: df_unique.head()
Out[93]: 
                                                   Name  Gender  Weight Transfer
School                        Grade                                     
Shanghai Jiao Tong University Freshman     Gaopeng Yang  Female    46.0        N
Peking University             Freshman   Changqiang You    Male    70.0        N
Shanghai Jiao Tong University Senior            Mei Sun    Male    89.0        N
Fudan University              Sophomore    Xiaojuan Sun  Female    41.0        N
Tsinghua University           Freshman      Xiaoli Qian  Female    51.0        N
In [94]: try:
   ....:     df_unique.loc[('Fudan University', 'Senior'):].head()
   ....: except Exception as e:
   ....:     Err_Msg = e
   ....: 
In [95]: Err_Msg
Out[95]: pandas.errors.UnsortedIndexError('Key length (2) was greater than MultiIndex lexsort depth (0)')
In [96]: df_unique.sort_index().loc[('Fudan University', 'Senior'):].head()
Out[96]: 
                                        Name  Gender  Weight Transfer
School            Grade                                      
Fudan University  Senior     Chengpeng Zheng  Female    38.0        N
                  Sophomore     Xiaojuan Sun  Female    41.0        N
Peking University Freshman    Changqiang You    Male    70.0        N
                  Junior             Juan Xu  Female     NaN        N
                  Senior          Changli Lv  Female    41.0        N

此外,在多级索引中的元组有一种特殊的用法,可以对多层的元素进行交叉组合后索引,但同时需要指定 loc 的列,全选则用 : 表示。其中,每一层需要选中的元素用列表存放,传入 loc 的形式为 [(level_0_list, level_1_list), cols] 。例如,想要得到所有北大和复旦的大二大三学生,可以如下写出:

In [97]: res = df_multi.loc[(['Peking University', 'Fudan University'],
   ....:                     ['Sophomore', 'Junior']), :]
   ....: 
In [98]: res.head()
Out[98]: 
                                     Name  Gender  Weight Transfer
School            Grade                                   
Peking University Sophomore   Changmei Xu  Female    43.0        N
                  Sophomore  Xiaopeng Qin    Male     NaN        N
                  Sophomore        Mei Xu  Female    39.0        N
                  Sophomore   Xiaoli Zhou  Female    55.0        N
                  Sophomore      Peng Han  Female    34.0      NaN
In [99]: res.shape
Out[99]: (33, 4)

下面的语句和上面类似,但仍然传入的是元素(这里为元组)的列表,它们的意义是不同的,表示的是选出北大的大三学生和复旦的大二学生:

In [100]: res = df_multi.loc[[('Peking University', 'Junior'),
   .....:                     ('Fudan University', 'Sophomore')]]
   .....: 
In [101]: res.head()
Out[101]: 
                                   Name  Gender  Weight Transfer
School            Grade                                 
Peking University Junior        Juan Xu  Female     NaN        N
                  Junior  Changjuan You  Female    47.0        N
                  Junior       Gaoli Xu  Female    48.0        N
                  Junior   Gaoquan Zhou    Male    70.0        N
                  Junior      Qiang You  Female    56.0        N
In [102]: res.shape
Out[102]: (16, 4)

3. IndexSlice对象

前面介绍的方法,即使在索引不重复的时候,也只能对元组整体进行切片,而不能对每层进行切片,也不允许将切片和布尔列表混合使用,引入 IndexSlice 对象就能解决这个问题。 Slice 对象一共有两种形式,第一种为 loc[idx[*,*]] 型,第二种为 loc[idx[*,*],idx[*,*]] 型,下面将进行介绍。为了方便演示,下面构造一个 索引不重复的 DataFrame

In [103]: np.random.seed(0)
In [104]: L1,L2 = ['A','B','C'],['a','b','c']
In [105]: mul_index1 = pd.MultiIndex.from_product([L1,L2],names=('Upper', 'Lower'))
In [106]: L3,L4 = ['D','E','F'],['d','e','f']
In [107]: mul_index2 = pd.MultiIndex.from_product([L3,L4],names=('Big', 'Small'))
In [108]: df_ex = pd.DataFrame(np.random.randint(-9,10,(9,9)),
   .....:                     index=mul_index1,
   .....:                     columns=mul_index2)
   .....: 
In [109]: df_ex
Out[109]: 
Big          D        E        F  
Small        d  e  f  d  e  f  d  e  f
Upper Lower                   
A     a      3  6 -9 -6 -6 -2  0  9 -5
      b     -3  3 -8 -3 -2  5  8 -4  4
      c     -1  0  7 -4  6  6 -9  9 -6
B     a      8  5 -2 -9 -8  0 -9  1 -6
      b      2  9 -7 -9 -9 -5 -4 -3 -1
      c      8  6 -5  0  1 -8 -8 -2  0
C     a     -6 -3  2  5  9 -9  5 -6  3
      b      1  2 -5 -3 -5  6 -6  3 -5
      c     -1  5  6 -6  6  4  7  8 -4

为了使用 silce 对象,先要进行定义:

In [110]: idx = pd.IndexSlice

【a】 loc[idx[*,*]]

这种情况并不能进行多层分别切片,前一个 * 表示行的选择,后一个 * 表示列的选择,与单纯的 loc 是类似的:

In [111]: df_ex.loc[idx['C':, ('D', 'f'):]]
Out[111]: 
Big          D  E        F  
Small        f  d  e  f  d  e  f
Upper Lower             
C     a      2  5  9 -9  5 -6  3
      b     -5 -3 -5  6 -6  3 -5
      c      6 -6  6  4  7  8 -4

另外,也支持布尔序列的索引

In [112]: df_ex.loc[idx[:'A', lambda x:x.sum()>0]] # 列和大于0
Out[112]: 
Big          D     F
Small        d  e  e
Upper Lower   
A     a      3  6  9
      b     -3  3 -4
      c     -1  0  9

【b】 loc[idx[*,*],idx[*,*]]

这种情况能够分层进行切片,前一个 idx 指代的是行索引,后一个是列索引。

In [113]: df_ex.loc[idx[:'A', 'b':], idx['E':, 'e':]]
Out[113]: 
Big          E     F   
Small        e  f  e  f
Upper Lower    
A     b     -2  5 -4  4
      c      6  6  9 -6

但需要注意的是,此时不支持使用函数:

In [114]: try:
   .....:     df_ex.loc[idx[:'A', lambda x: 'b'], idx['E':, 'e':]]
   .....: except Exception as e:
   .....:     Err_Msg = e
   .....: 
In [115]: Err_Msg
Out[115]: KeyError(<function __main__.<lambda>(x)>)

4. 多级索引的构造

前面提到了多级索引表的结构和切片,那么除了使用 set_index 之外,如何自己构造多级索引呢?常用的有 from_tuples, from_arrays, from_product 三种方法,它们都是 pd.MultiIndex 对象下的函数。

from_tuples 指根据传入由元组组成的列表进行构造:

In [116]: my_tuple = [('a','cat'),('a','dog'),('b','cat'),('b','dog')]
In [117]: pd.MultiIndex.from_tuples(my_tuple, names=['First','Second'])
Out[117]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_arrays 指根据传入列表中,对应层的列表进行构造:

In [118]: my_array = [list('aabb'), ['cat', 'dog']*2]
In [119]: pd.MultiIndex.from_arrays(my_array, names=['First','Second'])
Out[119]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])

from_product 指根据给定多个列表的笛卡尔积进行构造:

In [120]: my_list1 = ['a','b']
In [121]: my_list2 = ['cat','dog']
In [122]: pd.MultiIndex.from_product([my_list1,
   .....:                             my_list2],
   .....:                            names=['First','Second'])
   .....: 
Out[122]: 
MultiIndex([('a', 'cat'),
            ('a', 'dog'),
            ('b', 'cat'),
            ('b', 'dog')],
           names=['First', 'Second'])