1. 缺失信息的统计
缺失数据可以使用 isna
或 isnull
(两个函数没有区别)来查看每个单元格是否缺失,结合 mean
可以计算出每列缺失值的比例:
In [3]: df = pd.read_csv('data/learn_pandas.csv',
...: usecols = ['Grade', 'Name', 'Gender', 'Height',
...: 'Weight', 'Transfer'])
...:
In [4]: df.isna().head()
Out[4]:
Grade Name Gender Height Weight Transfer
0 False False False False False False
1 False False False False False False
2 False False False False False False
3 False False False True False False
4 False False False False False False
In [5]: df.isna().mean() # 查看缺失的比例
Out[5]:
Grade 0.000
Name 0.000
Gender 0.000
Height 0.085
Weight 0.055
Transfer 0.060
dtype: float64
如果想要查看某一列缺失或者非缺失的行,可以利用 Series
上的 isna
或者 notna
进行布尔索引。例如,查看身高缺失的行:
In [6]: df[df.Height.isna()].head()
Out[6]:
Grade Name Gender Height Weight Transfer
3 Sophomore Xiaojuan Sun Female NaN 41.0 N
12 Senior Peng You Female NaN 48.0 NaN
26 Junior Yanli You Female NaN 48.0 N
36 Freshman Xiaojuan Qin Male NaN 79.0 Y
60 Freshman Yanpeng Lv Male NaN 65.0 N
如果想要同时对几个列,检索出全部为缺失或者至少有一个缺失或者没有缺失的行,可以使用 isna, notna
和 any, all
的组合。例如,对身高、体重和转系情况这3列分别进行这三种情况的检索:
In [7]: sub_set = df[['Height', 'Weight', 'Transfer']]
In [8]: df[sub_set.isna().all(1)] # 全部缺失
Out[8]:
Grade Name Gender Height Weight Transfer
102 Junior Chengli Zhao Male NaN NaN NaN
In [9]: df[sub_set.isna().any(1)].head() # 至少有一个缺失
Out[9]:
Grade Name Gender Height Weight Transfer
3 Sophomore Xiaojuan Sun Female NaN 41.0 N
9 Junior Juan Xu Female 164.8 NaN N
12 Senior Peng You Female NaN 48.0 NaN
21 Senior Xiaopeng Shen Male 166.0 62.0 NaN
26 Junior Yanli You Female NaN 48.0 N
In [10]: df[sub_set.notna().all(1)].head() # 没有缺失
Out[10]:
Grade Name Gender Height Weight Transfer
0 Freshman Gaopeng Yang Female 158.9 46.0 N
1 Freshman Changqiang You Male 166.5 70.0 N
2 Senior Mei Sun Male 188.9 89.0 N
4 Sophomore Gaojuan You Male 174.0 74.0 N
5 Freshman Xiaoli Qian Female 158.0 51.0 N
2. 缺失信息的删除
数据处理中经常需要根据缺失值的大小、比例或其他特征来进行行样本或列特征的删除, pandas
中提供了 dropna
函数来进行操作。
dropna
的主要参数为轴方向 axis
(默认为0,即删除行)、删除方式 how
、删除的非缺失值个数阈值 thresh
( 非缺失值 没有达到这个数量的相应维度会被删除)、备选的删除子集 subset
,其中 how
主要有 any
和 all
两种参数可以选择。
例如,删除身高体重至少有一个缺失的行:
In [11]: res = df.dropna(how = 'any', subset = ['Height', 'Weight'])
In [12]: res.shape
Out[12]: (174, 6)
例如,删除超过15个缺失值的列:
In [13]: res = df.dropna(1, thresh=df.shape[0]-15) # 身高被删除
In [14]: res.head()
Out[14]:
Grade Name Gender Weight Transfer
0 Freshman Gaopeng Yang Female 46.0 N
1 Freshman Changqiang You Male 70.0 N
2 Senior Mei Sun Male 89.0 N
3 Sophomore Xiaojuan Sun Female 41.0 N
4 Sophomore Gaojuan You Male 74.0 N
当然,不用 dropna
同样是可行的,例如上述的两个操作,也可以使用布尔索引来完成:
In [15]: res = df.loc[df[['Height', 'Weight']].notna().all(1)]
In [16]: res.shape
Out[16]: (174, 6)
In [17]: res = df.loc[:, ~(df.isna().sum()>15)]
In [18]: res.head()
Out[18]:
Grade Name Gender Weight Transfer
0 Freshman Gaopeng Yang Female 46.0 N
1 Freshman Changqiang You Male 70.0 N
2 Senior Mei Sun Male 89.0 N
3 Sophomore Xiaojuan Sun Female 41.0 N
4 Sophomore Gaojuan You Male 74.0 N