一、缺失值的统计和删除

1. 缺失信息的统计

缺失数据可以使用 isnaisnull (两个函数没有区别)来查看每个单元格是否缺失,结合 mean 可以计算出每列缺失值的比例:

In [3]: df = pd.read_csv('data/learn_pandas.csv',
   ...:                  usecols = ['Grade', 'Name', 'Gender', 'Height',
   ...:                             'Weight', 'Transfer'])
   ...: 
In [4]: df.isna().head()
Out[4]: 
   Grade   Name  Gender  Height  Weight  Transfer
0  False  False   False   False   False     False
1  False  False   False   False   False     False
2  False  False   False   False   False     False
3  False  False   False    True   False     False
4  False  False   False   False   False     False
In [5]: df.isna().mean() # 查看缺失的比例
Out[5]: 
Grade       0.000
Name        0.000
Gender      0.000
Height      0.085
Weight      0.055
Transfer    0.060
dtype: float64

如果想要查看某一列缺失或者非缺失的行,可以利用 Series 上的 isna 或者 notna 进行布尔索引。例如,查看身高缺失的行:

In [6]: df[df.Height.isna()].head()
Out[6]: 
        Grade          Name  Gender  Height  Weight Transfer
3   Sophomore  Xiaojuan Sun  Female     NaN    41.0        N
12     Senior      Peng You  Female     NaN    48.0      NaN
26     Junior     Yanli You  Female     NaN    48.0        N
36   Freshman  Xiaojuan Qin    Male     NaN    79.0        Y
60   Freshman    Yanpeng Lv    Male     NaN    65.0        N

如果想要同时对几个列,检索出全部为缺失或者至少有一个缺失或者没有缺失的行,可以使用 isna, notnaany, all 的组合。例如,对身高、体重和转系情况这3列分别进行这三种情况的检索:

In [7]: sub_set = df[['Height', 'Weight', 'Transfer']]
In [8]: df[sub_set.isna().all(1)] # 全部缺失
Out[8]: 
      Grade          Name Gender  Height  Weight Transfer
102  Junior  Chengli Zhao   Male     NaN     NaN      NaN
In [9]: df[sub_set.isna().any(1)].head() # 至少有一个缺失
Out[9]: 
        Grade           Name  Gender  Height  Weight Transfer
3   Sophomore   Xiaojuan Sun  Female     NaN    41.0        N
9      Junior        Juan Xu  Female   164.8     NaN        N
12     Senior       Peng You  Female     NaN    48.0      NaN
21     Senior  Xiaopeng Shen    Male   166.0    62.0      NaN
26     Junior      Yanli You  Female     NaN    48.0        N
In [10]: df[sub_set.notna().all(1)].head() # 没有缺失
Out[10]: 
       Grade            Name  Gender  Height  Weight Transfer
0   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1   Freshman  Changqiang You    Male   166.5    70.0        N
2     Senior         Mei Sun    Male   188.9    89.0        N
4  Sophomore     Gaojuan You    Male   174.0    74.0        N
5   Freshman     Xiaoli Qian  Female   158.0    51.0        N

2. 缺失信息的删除

数据处理中经常需要根据缺失值的大小、比例或其他特征来进行行样本或列特征的删除, pandas 中提供了 dropna 函数来进行操作。

dropna 的主要参数为轴方向 axis (默认为0,即删除行)、删除方式 how 、删除的非缺失值个数阈值 thresh非缺失值 没有达到这个数量的相应维度会被删除)、备选的删除子集 subset ,其中 how 主要有 anyall 两种参数可以选择。

例如,删除身高体重至少有一个缺失的行:

In [11]: res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
In [12]: res.shape
Out[12]: (174, 6)

例如,删除超过15个缺失值的列:

In [13]: res = df.dropna(1, thresh=df.shape[0]-15) # 身高被删除
In [14]: res.head()
Out[14]: 
       Grade            Name  Gender  Weight Transfer
0   Freshman    Gaopeng Yang  Female    46.0        N
1   Freshman  Changqiang You    Male    70.0        N
2     Senior         Mei Sun    Male    89.0        N
3  Sophomore    Xiaojuan Sun  Female    41.0        N
4  Sophomore     Gaojuan You    Male    74.0        N

当然,不用 dropna 同样是可行的,例如上述的两个操作,也可以使用布尔索引来完成:

In [15]: res = df.loc[df[['Height', 'Weight']].notna().all(1)]
In [16]: res.shape
Out[16]: (174, 6)
In [17]: res = df.loc[:, ~(df.isna().sum()>15)]
In [18]: res.head()
Out[18]: 
       Grade            Name  Gender  Weight Transfer
0   Freshman    Gaopeng Yang  Female    46.0        N
1   Freshman  Changqiang You    Male    70.0        N
2     Senior         Mei Sun    Male    89.0        N
3  Sophomore    Xiaojuan Sun  Female    41.0        N
4  Sophomore     Gaojuan You    Male    74.0        N