第八章 文本数据(参考答案)

Ex1:房屋信息数据集

In [157]: df = pd.read_excel('data/house_info.xls', usecols=[
   .....:                 'floor','year','area','price'])
   .....: 
In [158]: df.year = pd.to_numeric(df.year.str[:-2]).astype('Int64')
In [159]: df.head(3)
Out[159]: 
      floor  year    area price
0   高层(共6层)  1986  58.23㎡  155万
1  中层(共20层)  2020     88㎡  155万
2  低层(共28层)  2010  89.33㎡  365万
In [160]: pat = '(\w层)(共(\d+)层)'
In [161]: new_cols = df.floor.str.extract(pat).rename(
   .....:                     columns={0:'Level', 1:'Highest'})
   .....: 
In [162]: df = pd.concat([df.drop(columns=['floor']), new_cols], 1)
In [163]: df.head(3)
Out[163]: 
   year    area price Level Highest
0  1986  58.23㎡  155万    高层       6
1  2020     88㎡  155万    中层      20
2  2010  89.33㎡  365万    低层      28
In [164]: s_area = pd.to_numeric(df.area.str[:-1])
In [165]: s_price = pd.to_numeric(df.price.str[:-1])
In [166]: df['avg_price'] = ((s_price/s_area)*10000).astype(
   .....:                     'int').astype('string') + '元/平米'
   .....: 
In [167]: df.head(3)
Out[167]: 
   year    area price Level Highest  avg_price
0  1986  58.23㎡  155万    高层       6  26618元/平米
1  2020     88㎡  155万    中层      20  17613元/平米
2  2010  89.33㎡  365万    低层      28  40859元/平米

Ex2:《权力的游戏》剧本数据集

In [168]: df = pd.read_csv('data/script.csv')
In [169]: df.columns = df.columns.str.strip()
In [170]: df.groupby(['Season', 'Episode'])['Sentence'].count().head()
Out[170]: 
Season    Episode   
Season 1  Episode 1     327
          Episode 10    266
          Episode 2     283
          Episode 3     353
          Episode 4     404
Name: Sentence, dtype: int64
In [171]: df.set_index('Name').Sentence.str.split().str.len(
   .....:  ).groupby('Name').mean().sort_values(ascending=False).head()
   .....: 
Out[171]: 
Name
male singer          109.000000
slave owner           77.000000
manderly              62.000000
lollys stokeworth     62.000000
dothraki matron       56.666667
Name: Sentence, dtype: float64
In [172]: s = pd.Series(df.Sentence.values, index=df.Name.shift(-1))
In [173]: s.str.count('\?').groupby('Name').sum().sort_values(ascending=False).head()
Out[173]: 
Name
tyrion lannister    527
jon snow            374
jaime lannister     283
arya stark          265
cersei lannister    246
dtype: int64