How do I apply a function to the results of a groupby in pandas (Python)?
I use this code to count the different Quality values for each user within each cluster.
>>> for name, group in df.groupby(["Cluster_id", "User"]):
...     print('group name:', name)
...     print('group rows:')
...     print(group)
...     print('counts of Quality values:')
...     print(group["Quality"].value_counts())
...     input()  # pause until Enter is pressed
...
The output I currently get is:
group rows:
                tag                       user                    quality  cluster
676    black fabric  http://steve.nl/user_1002          usefulness-useful        1
708      blond wood  http://steve.nl/user_1002          usefulness-useful        1
709      blond wood  http://steve.nl/user_1002    problematic-misspelling        1
1410         eames?  http://steve.nl/user_1002      usefulness-not_useful        1
1411         eames?  http://steve.nl/user_1002  problematic-misperception        1
3649  rocking chair  http://steve.nl/user_1002          usefulness-useful        1
3650  rocking chair  http://steve.nl/user_1002  problematic-misperception        1
counts of Quality Values:
usefulness-useful            3
problematic-misperception    2
usefulness-not_useful        1
problematic-misspelling      1
What I want to do now is add a check, namely:
if quality == 'usefulness-useful':
    good = good + 1
else:
    bad = bad + 1
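I tried putting the output:
counts of Quality Values:
usefulness-useful            3
problematic-misperception    2
usefulness-not_useful        1
problematic-misspelling      1
into a variable and then iterating over that variable line by line, but it did not work. Can anyone suggest how I can do calculations on particular rows?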
1 Answer
Once you have a group, you can iterate over its rows with the .iterrows() method. It yields the index and the data of each row:
In [33]: for row_number, row in group.iterrows():
   ....:     print(row_number)
   ....:     print(row)
   ....:
676
Tag                        black fabric
User          http://steve.nl/user_1002
Quality               usefulness-useful
Cluster_id                            1
Name: 676
708
Tag                          blond wood
User          http://steve.nl/user_1002
Quality               usefulness-useful
Cluster_id                            1
Name: 708
[etc]
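To try the loop on actual data, the sketch below rebuilds the group from the seven rows shown in the question. It assumes the column names used in this answer's output (Tag, User, Quality, Cluster_id) and Python 3 syntax; `get_group` is used here only as a convenient way to pick out one group by its key.

```python
import pandas as pd

# The seven rows from the question, re-entered by hand.
df = pd.DataFrame(
    {
        "Tag": ["black fabric", "blond wood", "blond wood", "eames?",
                "eames?", "rocking chair", "rocking chair"],
        "User": ["http://steve.nl/user_1002"] * 7,
        "Quality": [
            "usefulness-useful",
            "usefulness-useful",
            "problematic-misspelling",
            "usefulness-not_useful",
            "problematic-misperception",
            "usefulness-useful",
            "problematic-misperception",
        ],
        "Cluster_id": [1] * 7,
    },
    index=[676, 708, 709, 1410, 1411, 3649, 3650],
)

# Select a single group by its (Cluster_id, User) key.
group = df.groupby(["Cluster_id", "User"]).get_group(
    (1, "http://steve.nl/user_1002")
)

# iterrows() yields (index label, row as a Series) pairs.
first_index, first_row = next(group.iterrows())
print(first_index)
print(first_row["Quality"])
```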
Each row can also be indexed like a dictionary, for example:
In [48]: row
Out[48]:
Tag                       rocking chair
User          http://steve.nl/user_1002
Quality       problematic-misperception
Cluster_id                            1
Name: 3650
In [49]: row["User"]
Out[49]: 'http://steve.nl/user_1002'
In [50]: row["Tag"]
Out[50]: 'rocking chair'
So you can write your loop like this:
good = 0
bad = 0
for row_number, row in group.iterrows():
    if row['Quality'] == 'usefulness-useful':
        good += 1
    else:
        bad += 1
print('good', good, 'bad', bad)
which gives:
good 3 bad 4
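The same tally can also be computed without an explicit loop, by comparing the whole column at once and summing the resulting booleans. This is a minimal sketch, assuming Python 3; the Series below simply re-enters the Quality values of the group from the question.

```python
import pandas as pd

# Quality values of the example group, re-entered by hand.
quality = pd.Series(
    [
        "usefulness-useful",
        "usefulness-useful",
        "problematic-misspelling",
        "usefulness-not_useful",
        "problematic-misperception",
        "usefulness-useful",
        "problematic-misperception",
    ],
    index=[676, 708, 709, 1410, 1411, 3649, 3650],
    name="Quality",
)

# Element-wise comparison gives a boolean Series; True counts as 1 in a sum.
is_good = quality == "usefulness-useful"
good = int(is_good.sum())
bad = int((~is_good).sum())
print("good", good, "bad", bad)
```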
That is perfectly fine if it is the clearest approach for you. Alternatively, you can start directly from the counts of the Quality column:
In [54]: counts = group["Quality"].value_counts()
In [55]: counts
Out[55]:
usefulness-useful            3
problematic-misperception    2
usefulness-not_useful        1
problematic-misspelling      1
In [56]: counts['usefulness-useful']
Out[56]: 3
And since bad = total - good, we have:
In [57]: counts.sum() - counts['usefulness-useful']
Out[57]: 4
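To get these two numbers for every (Cluster_id, User) pair at once, rather than one group at a time, one option is to flag the good rows first and then sum the flags per group. The following is only a sketch under assumptions: it uses Python 3, the column names from this answer, and two made-up rows for a hypothetical second user (user_1003) so that more than one group appears in the result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "User": ["http://steve.nl/user_1002"] * 7
                + ["http://steve.nl/user_1003"] * 2,  # hypothetical user
        "Quality": [
            # Rows from the question.
            "usefulness-useful",
            "usefulness-useful",
            "problematic-misspelling",
            "usefulness-not_useful",
            "problematic-misperception",
            "usefulness-useful",
            "problematic-misperception",
            # Made-up rows, for illustration only.
            "usefulness-useful",
            "usefulness-not_useful",
        ],
        "Cluster_id": [1] * 9,
    }
)

# Flag each row as good or bad, then add up the flags within each group.
is_good = df["Quality"].eq("usefulness-useful")
summary = (
    df.assign(good=is_good, bad=~is_good)
      .groupby(["Cluster_id", "User"])[["good", "bad"]]
      .sum()
)
print(summary)
```

The result is a DataFrame indexed by (Cluster_id, User), so a single pair can be looked up with `summary.loc[(1, "http://steve.nl/user_1002")]`.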