Ex1:汽车数据集
现有一份关于汽车的数据集,其中 Brand, Disp., HP
分别代表汽车品牌、发动机蓄量、发动机输出。
In [74]: df = pd.read_csv('data/car.csv')
In [75]: df.head(3)
Out[75]:
Brand Price Country Reliability Mileage Type Weight Disp. HP
0 Eagle Summit 4 8895 USA 4.0 33 Small 2560 97 113
1 Ford Escort 4 7402 USA 2.0 33 Small 2345 114 90
2 Ford Festiva 4 6319 Korea 4.0 37 Small 1845 81 63
In [76]: df.groupby('Country').filter(lambda x:x.shape[0]>2).groupby(
....: 'Country')['Price'].agg([(
....: 'CoV', lambda x: x.std()/x.mean()), 'mean', 'count'])
....:
Out[76]:
CoV mean count
Country
Japan 0.387429 13938.052632 19
Japan/USA 0.240040 10067.571429 7
Korea 0.243435 7857.333333 3
USA 0.203344 12543.269231 26
In [77]: df.shape[0]
Out[77]: 60
In [78]: condition = ['Head']*20+['Mid']*20+['Tail']*20
In [79]: df.groupby(condition)['Price'].mean()
Out[79]:
Head 9069.95
Mid 13356.40
Tail 15420.65
Name: Price, dtype: float64
In [80]: res = df.groupby('Type').agg({'Price': ['max'], 'HP': ['min']})
In [81]: res.columns = res.columns.map(lambda x:'_'.join(x))
In [82]: res
Out[82]:
Price_max HP_min
Type
Compact 18900 95
Large 17257 150
Medium 24760 110
Small 9995 63
Sporty 13945 92
Van 15395 106
In [83]: def normalize(s):
....: s_min, s_max = s.min(), s.max()
....: res = (s - s_min)/(s_max - s_min)
....: return res
....:
In [84]: df.groupby('Type')['HP'].transform(normalize).head()
Out[84]:
0 1.00
1 0.54
2 0.00
3 0.58
4 0.80
Name: HP, dtype: float64
In [85]: df.groupby('Type')[['HP', 'Disp.']].apply(
....: lambda x:np.corrcoef(x['HP'].values, x['Disp.'].values)[0,1])
....:
Out[85]:
Type
Compact 0.586087
Large -0.242765
Medium 0.370491
Small 0.603916
Sporty 0.871426
Van 0.819881
dtype: float64
Ex2:实现transform函数
In [86]: class my_groupby:
....: def __init__(self, my_df, group_cols):
....: self.my_df = my_df.copy()
....: self.groups = my_df[group_cols].drop_duplicates()
....: if isinstance(self.groups, pd.Series):
....: self.groups = self.groups.to_frame()
....: self.group_cols = self.groups.columns.tolist()
....: self.groups = {i: self.groups[i].values.tolist(
....: ) for i in self.groups.columns}
....: self.transform_col = None
....: def __getitem__(self, col):
....: self.pr_col = [col] if isinstance(col, str) else list(col)
....: return self
....: def transform(self, my_func):
....: self.num = len(self.groups[self.group_cols[0]])
....: L_order, L_value = np.array([]), np.array([])
....: for i in range(self.num):
....: group_df = self.my_df.reset_index().copy()
....: for col in self.group_cols:
....: group_df = group_df[group_df[col]==self.groups[col][i]]
....: group_df = group_df[self.pr_col]
....: if group_df.shape[1] == 1:
....: group_df = group_df.iloc[:, 0]
....: group_res = my_func(group_df)
....: if not isinstance(group_res, pd.Series):
....: group_res = pd.Series(group_res,
....: index=group_df.index,
....: name=group_df.name)
....: L_order = np.r_[L_order, group_res.index]
....: L_value = np.r_[L_value, group_res.values]
....: self.res = pd.Series(pd.Series(L_value, index=L_order).sort_index(
....: ).values,index=self.my_df.reset_index(
....: ).index, name=my_func.__name__)
....: return self.res
....:
In [87]: my_groupby(df, 'Type')
Out[87]: <__main__.my_groupby at 0x26402866c08>
单列分组:
In [88]: def f(s):
....: res = (s-s.min())/(s.max()-s.min())
....: return res
....:
In [89]: my_groupby(df, 'Type')['Price'].transform(f).head()
Out[89]:
0 0.733592
1 0.372003
2 0.109712
3 0.186244
4 0.177525
Name: f, dtype: float64
In [90]: df.groupby('Type')['Price'].transform(f).head()
Out[90]:
0 0.733592
1 0.372003
2 0.109712
3 0.186244
4 0.177525
Name: Price, dtype: float64
多列分组:
In [91]: my_groupby(df, ['Type','Country'])['Price'].transform(f).head()
Out[91]:
0 1.000000
1 0.000000
2 0.000000
3 0.000000
4 0.196357
Name: f, dtype: float64
In [92]: df.groupby(['Type','Country'])['Price'].transform(f).head()
Out[92]:
0 1.000000
1 0.000000
2 0.000000
3 0.000000
4 0.196357
Name: Price, dtype: float64
标量广播:
In [93]: my_groupby(df, 'Type')['Price'].transform(lambda x:x.mean()).head()
Out[93]:
0 7682.384615
1 7682.384615
2 7682.384615
3 7682.384615
4 7682.384615
Name: <lambda>, dtype: float64
In [94]: df.groupby('Type')['Price'].transform(lambda x:x.mean()).head()
Out[94]:
0 7682.384615
1 7682.384615
2 7682.384615
3 7682.384615
4 7682.384615
Name: Price, dtype: float64
跨列计算:
In [95]: my_groupby(df, 'Type')['Disp.', 'HP'].transform(
....: lambda x: x['Disp.']/x.HP).head()
....:
Out[95]:
0 0.858407
1 1.266667
2 1.285714
3 0.989130
4 1.097087
Name: <lambda>, dtype: float64