第四章 分组(参考答案)

Ex1:汽车数据集

现有一份关于汽车的数据集,其中 Brand, Disp., HP 分别代表汽车品牌、发动机蓄量、发动机输出。

In [74]: df = pd.read_csv('data/car.csv')
In [75]: df.head(3)
Out[75]: 
             Brand  Price Country  Reliability  Mileage   Type  Weight  Disp.   HP
0   Eagle Summit 4   8895     USA          4.0       33  Small    2560     97  113
1  Ford Escort   4   7402     USA          2.0       33  Small    2345    114   90
2   Ford Festiva 4   6319   Korea          4.0       37  Small    1845     81   63
In [76]: df.groupby('Country').filter(lambda x:x.shape[0]>2).groupby(
   ....:            'Country')['Price'].agg([(
   ....:            'CoV', lambda x: x.std()/x.mean()), 'mean', 'count'])
   ....: 
Out[76]: 
                CoV          mean  count
Country                           
Japan      0.387429  13938.052632     19
Japan/USA  0.240040  10067.571429      7
Korea      0.243435   7857.333333      3
USA        0.203344  12543.269231     26
In [77]: df.shape[0]
Out[77]: 60
In [78]: condition = ['Head']*20+['Mid']*20+['Tail']*20
In [79]: df.groupby(condition)['Price'].mean()
Out[79]: 
Head     9069.95
Mid     13356.40
Tail    15420.65
Name: Price, dtype: float64
In [80]: res = df.groupby('Type').agg({'Price': ['max'], 'HP': ['min']})
In [81]: res.columns = res.columns.map(lambda x:'_'.join(x))
In [82]: res
Out[82]: 
         Price_max  HP_min
Type                
Compact      18900      95
Large        17257     150
Medium       24760     110
Small         9995      63
Sporty       13945      92
Van          15395     106
In [83]: def normalize(s):
   ....:     s_min, s_max = s.min(), s.max()
   ....:     res = (s - s_min)/(s_max - s_min)
   ....:     return res
   ....: 
In [84]: df.groupby('Type')['HP'].transform(normalize).head()
Out[84]: 
0    1.00
1    0.54
2    0.00
3    0.58
4    0.80
Name: HP, dtype: float64
In [85]: df.groupby('Type')[['HP', 'Disp.']].apply(
   ....:    lambda x:np.corrcoef(x['HP'].values, x['Disp.'].values)[0,1])
   ....: 
Out[85]: 
Type
Compact    0.586087
Large     -0.242765
Medium     0.370491
Small      0.603916
Sporty     0.871426
Van        0.819881
dtype: float64

Ex2:实现transform函数

In [86]: class my_groupby:
   ....:     def __init__(self, my_df, group_cols):
   ....:         self.my_df = my_df.copy()
   ....:         self.groups = my_df[group_cols].drop_duplicates()
   ....:         if isinstance(self.groups, pd.Series):
   ....:             self.groups = self.groups.to_frame()
   ....:         self.group_cols = self.groups.columns.tolist()
   ....:         self.groups = {i: self.groups[i].values.tolist(
   ....:                        ) for i in self.groups.columns}
   ....:         self.transform_col = None
   ....:     def __getitem__(self, col):
   ....:         self.pr_col = [col] if isinstance(col, str) else list(col)
   ....:         return self
   ....:     def transform(self, my_func):
   ....:         self.num = len(self.groups[self.group_cols[0]])
   ....:         L_order, L_value = np.array([]), np.array([])
   ....:         for i in range(self.num):
   ....:             group_df = self.my_df.reset_index().copy()
   ....:             for col in self.group_cols:
   ....:                 group_df = group_df[group_df[col]==self.groups[col][i]]
   ....:             group_df = group_df[self.pr_col]
   ....:             if group_df.shape[1] == 1:
   ....:                 group_df = group_df.iloc[:, 0]
   ....:             group_res = my_func(group_df)
   ....:             if not isinstance(group_res, pd.Series):
   ....:                 group_res = pd.Series(group_res,
   ....:                                       index=group_df.index,
   ....:                                       name=group_df.name)
   ....:             L_order = np.r_[L_order, group_res.index]
   ....:             L_value = np.r_[L_value, group_res.values]
   ....:         self.res = pd.Series(pd.Series(L_value, index=L_order).sort_index(
   ....:                    ).values,index=self.my_df.reset_index(
   ....:                    ).index, name=my_func.__name__)
   ....:         return self.res
   ....: 
In [87]: my_groupby(df, 'Type')
Out[87]: <__main__.my_groupby at 0x26402866c08>

单列分组:

In [88]: def f(s):
   ....:     res = (s-s.min())/(s.max()-s.min())
   ....:     return res
   ....: 
In [89]: my_groupby(df, 'Type')['Price'].transform(f).head()
Out[89]: 
0    0.733592
1    0.372003
2    0.109712
3    0.186244
4    0.177525
Name: f, dtype: float64
In [90]: df.groupby('Type')['Price'].transform(f).head()
Out[90]: 
0    0.733592
1    0.372003
2    0.109712
3    0.186244
4    0.177525
Name: Price, dtype: float64

多列分组:

In [91]: my_groupby(df, ['Type','Country'])['Price'].transform(f).head()
Out[91]: 
0    1.000000
1    0.000000
2    0.000000
3    0.000000
4    0.196357
Name: f, dtype: float64
In [92]: df.groupby(['Type','Country'])['Price'].transform(f).head()
Out[92]: 
0    1.000000
1    0.000000
2    0.000000
3    0.000000
4    0.196357
Name: Price, dtype: float64

标量广播:

In [93]: my_groupby(df, 'Type')['Price'].transform(lambda x:x.mean()).head()
Out[93]: 
0    7682.384615
1    7682.384615
2    7682.384615
3    7682.384615
4    7682.384615
Name: <lambda>, dtype: float64
In [94]: df.groupby('Type')['Price'].transform(lambda x:x.mean()).head()
Out[94]: 
0    7682.384615
1    7682.384615
2    7682.384615
3    7682.384615
4    7682.384615
Name: Price, dtype: float64

跨列计算:

In [95]: my_groupby(df, 'Type')['Disp.', 'HP'].transform(
   ....:                lambda x: x['Disp.']/x.HP).head()
   ....: 
Out[95]: 
0    0.858407
1    1.266667
2    1.285714
3    0.989130
4    1.097087
Name: <lambda>, dtype: float64