Converting duplicate dictionary items into unique items with arrays of IDs

[ {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8}, {'name': 'None on file', 'document_id': 40, 'annotation_id': 5}, {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9}, {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11} ]

[ {'name': 'Craig McKray', 'document_ids': [50], 'annotation_ids': [8, 9]}, {'name': 'None on file', 'document_ids': [40], 'annotation_ids': [5]}, {'name': 'Western Union', 'document_ids': [61], 'annotation_ids': [11]} ]

result = []
# resolve duplicate names
result_row = defaultdict(list)
for item in data:
    for double in data:
        if item['name'] == double['name']:
            result_row['name'] = item['name']
            result_row['record_ids'].append(item['document_id'])
            result_row['annotation_ids'].append(item['annotation_id'])
    result.append(result_row)

3 Answers

User

#1 · Edited 2024-06-16 10:13:29

new = dict()
for x in people:
    if x['name'] in new:
        new[x['name']].append({'document_id': x['document_id'], 'annotation_id': x['annotation_id']})
    else:
        new[x['name']] = [{'document_id': x['document_id'], 'annotation_id': x['annotation_id']}]

It isn't exactly the structure you asked for, but the format should serve the same purpose.

This is the output:

{'Craig McKray': [{'annotation_id': 8, 'document_id': 50}, {'annotation_id': 9, 'document_id': 50}], 'Western Union': [{'annotation_id': 11, 'document_id': 61}], 'None on file': [{'annotation_id': 5, 'document_id': 40}]}

Here, I think this might work better for you:

from collections import defaultdict
new = defaultdict(dict)

for x in people:
    if x['name'] in new:
        new[x['name']]['document_ids'].append(x['document_id'])
        new[x['name']]['annotation_ids'].append(x['annotation_id'])
    else:
        new[x['name']]['document_ids'] = [x['document_id']]
        new[x['name']]['annotation_ids'] = [x['annotation_id']]
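If the list-of-dicts shape from the question is needed, the grouped mapping can be flattened afterwards. The following is only a minimal sketch, assuming the input list is named `people` as above; the helper name `to_rows` and the choice to skip repeated IDs (so that `document_ids` comes out as `[50]` rather than `[50, 50]`) are assumptions for illustration, not part of the answer's code.

```python
from collections import defaultdict

people = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

def to_rows(items):
    """Group items by name and return one dict per unique name (hypothetical helper)."""
    grouped = defaultdict(lambda: {'document_ids': [], 'annotation_ids': []})
    for x in items:
        entry = grouped[x['name']]
        # Skip IDs already recorded so each list stays unique, in first-seen order.
        if x['document_id'] not in entry['document_ids']:
            entry['document_ids'].append(x['document_id'])
        if x['annotation_id'] not in entry['annotation_ids']:
            entry['annotation_ids'].append(x['annotation_id'])
    # Dicts preserve insertion order (Python 3.7+), so names keep first-seen order.
    return [{'name': name, **ids} for name, ids in grouped.items()]

rows = to_rows(people)
for row in rows:
    print(row)
```

The membership checks are linear in the length of each ID list, which is fine for small inputs; for large groups a set per name would likely be a better fit.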

User

#2 · Edited 2024-06-16 10:13:29

My take on this problem:

result = []
# resolve duplicate names
all_names = []
for i, item in enumerate(data):
    if item['name'] in all_names:
        continue
    result_row = {'name': item['name'], 'record_ids': [item['document_id']],
                  'annotation_ids':[item['annotation_id']]}
    all_names.append(item['name'])
    for j, double in enumerate(data):
        if item['name'] == double['name'] and i != j:
            result_row['record_ids'].append(double['document_id'])
            result_row['annotation_ids'].append(double['annotation_id'])
    result.append(result_row)

User

#3 · Edited 2024-06-16 10:13:29

A more functional approach using itertools.groupby might look like this. It's a bit cryptic, so I'll explain.

from itertools import groupby
from operator import itemgetter

inp = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8}, 
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11}
]

def groupvals(vals):

    namegetter = itemgetter('name')
    doccanngetter = itemgetter('document_id', 'annotation_id')

    for grouper, grps in groupby(sorted(vals, key=namegetter), key=namegetter):

        # Transpose the (document_id, annotation_id) pairs; set() drops duplicates
        docanns = [set(param) for param in zip(*(doccanngetter(g) for g in grps))]
        yield {'name': grouper, 'document_ids': sorted(docanns[0]), 'annotation_ids': sorted(docanns[1])}


for result in groupvals(inp):
    print(result)

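To use groupby you need a sorted list, so sort by name first and then group by name. Next, pull out the document_id and annotation_id values and zip them. The effect is to put all the document_ids into one sequence and all the annotation_ids into another. You can then call set on each to remove duplicates, and use a generator to yield each group as a dict.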
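The two less obvious steps, sorting before groupby and transposing with zip, can be seen in isolation. This is a small illustrative sketch with made-up sample values (`names` and `pairs` are assumptions, not data from the question):

```python
from itertools import groupby

names = ['b', 'a', 'b']

# Without sorting, groupby only merges *adjacent* equal keys, so 'b' shows up twice.
unsorted_keys = [key for key, _ in groupby(names)]

# After sorting, equal keys sit next to each other and each key appears once.
sorted_keys = [key for key, _ in groupby(sorted(names))]

# zip(*pairs) transposes (document_id, annotation_id) pairs into two columns.
pairs = [(50, 8), (50, 9)]
doc_ids, ann_ids = zip(*pairs)

print(unsorted_keys)
print(sorted_keys)
print(doc_ids, ann_ids)
```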

I used a generator because it avoids the need to build up a result list. You could build a list instead if you prefer.
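For comparison, a list-building variant might look roughly like the sketch below. The function name `groupvals_list` is hypothetical, and the plural keys and sorted ID lists are assumptions chosen to match the output format requested in the question.

```python
from itertools import groupby
from operator import itemgetter

def groupvals_list(vals):
    """List-returning variant of the grouping logic (hypothetical name)."""
    namegetter = itemgetter('name')
    results = []
    for name, grp in groupby(sorted(vals, key=namegetter), key=namegetter):
        grp = list(grp)  # materialize: a groupby group can only be iterated once
        results.append({
            'name': name,
            # Set comprehensions drop duplicates; sorted() gives a stable order.
            'document_ids': sorted({g['document_id'] for g in grp}),
            'annotation_ids': sorted({g['annotation_id'] for g in grp}),
        })
    return results

inp = [
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 8},
    {'name': 'None on file', 'document_id': 40, 'annotation_id': 5},
    {'name': 'Craig McKray', 'document_id': 50, 'annotation_id': 9},
    {'name': 'Western Union', 'document_id': 61, 'annotation_id': 11},
]

results = groupvals_list(inp)
for row in results:
    print(row)
```

Wrapping the generator call as list(groupvals(inp)) also materializes everything at once, so the generator version loses nothing when a list is eventually needed.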
