Grouping and aggregating a list of dictionaries in Python
I have a list of dictionaries that I want to aggregate in Python:
data = [{"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 10},
{"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 50},
{"startDate": 456, "endDate": 789, "campaignName": "def", "campaignCfid": 123, "budgetImpressions": 80}]
I want to sum budgetImpressions for each campaign.
So the end result should be:
data = [{"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 60},
{"startDate": 456, "endDate": 789, "campaignName": "def", "campaignCfid": 123, "budgetImpressions": 80}]
Note that campaignCfid, startDate and endDate are always the same for a given campaignName.
Can this be done in Python? I tried itertools without much success. Would pandas be a better fit?
2 Answers
5
Just to show that plain Python is sometimes perfectly fine for this kind of task:
In [11]: from collections import Counter
from itertools import groupby
In [12]: data = [{"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 10}, {"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 50}, {"startDate": 456, "endDate": 789, "campaignName": "def", "campaignCfid": 123, "budgetImpressions": 80}]
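# groupby only merges adjacent rows, so data must already be ordered by campaignName.
# pop() also removes campaignName from each dict in data.
In [13]: g = groupby(data, lambda x: x.pop('campaignName'))
In [14]: d = {}
for campaign, campaign_data in g:
    c = Counter()
    for row in campaign_data:
        c.update(row)  # adds each remaining field to the running totals
    d[campaign] = c  # store dict(c) here if you want a dict rather than a Counter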
In [15]: d
Out[15]:
{'abc': Counter({'campaignCfid': 1578, 'endDate': 912, 'startDate': 246, 'budgetImpressions': 60}),
'def': Counter({'endDate': 789, 'startDate': 456, 'campaignCfid': 123, 'budgetImpressions': 80})}
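If you already have a collection of lists or dicts, there is no need to convert it to a DataFrame; staying in pure Python is usually cheaper. Note that Counter adds up every numeric field, so in the output above campaignCfid, startDate and endDate are summed along with budgetImpressions.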
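For the exact list-of-dicts shape the question asks for, here is a minimal sketch that still uses itertools.groupby but sums only budgetImpressions. The helper name sum_impressions and its key/value parameters are hypothetical, chosen for illustration. It relies on the question's guarantee that the other fields are identical within a campaign, and it sorts first because groupby only merges adjacent rows.

```python
from itertools import groupby
from operator import itemgetter

data = [
    {"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 10},
    {"startDate": 123, "endDate": 456, "campaignName": "abc", "campaignCfid": 789, "budgetImpressions": 50},
    {"startDate": 456, "endDate": 789, "campaignName": "def", "campaignCfid": 123, "budgetImpressions": 80},
]


def sum_impressions(rows, key="campaignName", value="budgetImpressions"):
    """Hypothetical helper: one dict per `key`, with `value` summed.

    Assumes every other field is identical within a group, so the
    first row of each group can serve as the template.
    """
    get_key = itemgetter(key)
    result = []
    # groupby only merges adjacent rows, so sort by the grouping key first.
    for _, group in groupby(sorted(rows, key=get_key), key=get_key):
        group = list(group)
        merged = dict(group[0])  # shallow copy, so the input rows are not mutated
        merged[value] = sum(row[value] for row in group)
        result.append(merged)
    return result


summary = sum_impressions(data)
print(summary)
```

Unlike the pop-based version, this leaves data untouched, so it can be run repeatedly on the same list.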
1
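Yes, use pandas. It works very well here. You can use its groupby functionality to group the rows and then aggregate them by summing. If you want the result as a list of dicts, this will do it: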
import pandas as pd

data = [{"startDate": 123, "endDate": 456, "campaignName": 'abc',
         "campaignCfid": 789, "budgetImpressions": 10},
        {"startDate": 123, "endDate": 456, "campaignName": 'abc',
         "campaignCfid": 789, "budgetImpressions": 50},
        {"startDate": 456, "endDate": 789, "campaignName": 'def',
         "campaignCfid": 123, "budgetImpressions": 80}]
df = pd.DataFrame(data)
# Group on the columns that identify a campaign; the one remaining column,
# budgetImpressions, is summed within each group.
grouped = df.groupby(['startDate', 'endDate', 'campaignCfid',
                      'campaignName']).sum()
# reset_index() turns the group keys back into ordinary columns.
print(grouped.reset_index().to_dict('records'))
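This prints the following (output from a Python 2 session, where the trailing L marks a long integer; on Python 3 the numbers print without it and the keys may appear in a different order):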
[{'startDate': 123L, 'campaignCfid': 789L, 'endDate': 456L, 'budgetImpressions': 60L, 'campaignName': 'abc'}, {'startDate': 456L, 'campaignCfid': 123L, 'endDate': 789L, 'budgetImpressions': 80L, 'campaignName': 'def'}]