Fastest way to extract values from an array?

Posted 2024-04-25 20:39:01


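I have a set of 11 million documents in Elasticsearch, and each document has an array of identifiers. Each identifier is a dict containing a type, a value, and a date. Here is a sample record: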

{
  "name": "Bob",
  "identifiers": [
    {
      "date": "2019-01-01",
      "type": "a",
      "value": "abcd"
    },
    {
      "date": "2019-01-01",
      "type": "b",
      "value": "efgh"
    }
  ]
}

Every night I need to transfer these records to a Parquet data store, where only the identifier values are kept, in an array. For example:

{
  "name": "Bob",
  "identifiers": ["abcd", "efgh"]
}

I do this by looping over all the records and flattening the identifiers. Here is my flattening transformer:

    def _transform_identifier_values(self, identifiers: List[dict]):
        ret = [
            identifier['value']
            for identifier in identifiers
        ]
        return ret

This works, but it is slow. Is there a faster way to do it? Perhaps a native implementation I could take advantage of?

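Edit:

I tried Sunny's suggestions. To my surprise, the original version actually performed best. My assumption had been that itemgetter would be more efficient.

This is how I tested it: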

import time
from functools import partial
from operator import itemgetter


def main():

    docs = []
    for i in range(10_000_000):
        docs.append({
            'name': 'Bob',
            'identifiers': [
                {
                    'date': '2019-01-01',
                    'type': 'a',
                    'value': 'abcd'
                },
                {
                    'date': '2019-01-01',
                    'type': 'b',
                    'value': 'efgh'
                }
            ]
        })

    start = time.time()
    for doc in docs:
        _transform_identifier_values_original(doc['identifiers'])
    end = time.time()

    print(f'Original took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_getter(doc['identifiers'])
    end = time.time()

    print(f'Item getter took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial_lambda(doc['identifiers'])
    end = time.time()

    print(f'Lambda partial took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial(doc['identifiers'])
    end = time.time()

    print(f'Partial took {end-start} seconds')


def _transform_identifier_values_original(identifiers):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret


def _transform_identifier_values_getter(identifiers):
    return list(map(itemgetter('value'), identifiers))


def _transform_identifier_values_partial_lambda(identifiers):
    flatten_ids = partial(lambda o: list(map(itemgetter('value'), o)))
    return flatten_ids(identifiers)


def _transform_identifier_values_partial(identifiers):
    flatten = partial(map, itemgetter('value'))
    return list(flatten(identifiers))

if __name__ == '__main__':
    main()

Results:

Original took 4.6204328536987305 seconds

Item getter took 7.186180114746094 seconds

Lambda partial took 10.534514904022217 seconds

Partial took 9.07079291343689 seconds
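
A smaller, repeatable version of this comparison can be sketched with timeit. Note that the test above builds a new itemgetter (and, in two variants, a new partial) on every call, so part of what it measures is that construction cost. The sketch below creates the getter once. The document count, the repeat counts, and the names DOCS, GET_VALUE, run, and benchmark are assumptions made for illustration, and the timings will vary by machine and Python version.

```python
import timeit
from operator import itemgetter

# Hypothetical sample data: same shape as the record in the question,
# but far fewer documents so the benchmark finishes quickly.
DOCS = [
    {
        'name': 'Bob',
        'identifiers': [
            {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
            {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
        ],
    }
    for _ in range(10_000)
]

# Build the getter once instead of on every call.
GET_VALUE = itemgetter('value')


def with_comprehension(identifiers):
    return [identifier['value'] for identifier in identifiers]


def with_itemgetter(identifiers):
    return list(map(GET_VALUE, identifiers))


def run(transform):
    # Apply one transform to every document's identifier list.
    for doc in DOCS:
        transform(doc['identifiers'])


def benchmark(transform, repeat=3, number=5):
    # Take the minimum over several repeats to reduce noise.
    return min(timeit.repeat(lambda: run(transform), repeat=repeat, number=number))


if __name__ == '__main__':
    for name, fn in [('comprehension', with_comprehension),
                     ('itemgetter', with_itemgetter)]:
        print(f'{name}: {benchmark(fn):.4f} seconds')
```

With only two identifiers per document, fixed per-call overhead is likely to matter more than the per-item lookup, so it may be worth repeating the measurement with longer identifier lists as well.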


2 answers

I came up with a solution:

def changeJSON(dictionary):
    new_dict = {'name': dictionary['name'], 'identifiers': []}
    for i in dictionary['identifiers']:
        new_dict['identifiers'].append(i['value'])
    return new_dict

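This function takes a single dictionary and returns a dictionary in the new format you need. You can then use a function from the built-in json library that takes a list of dictionaries and dumps them to a JSON file.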
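
As a rough sketch of how that could fit together, assuming the json function meant here is json.dump (the original reference did not survive): dump_records and the output path are hypothetical names introduced only for this example.

```python
import json


def changeJSON(dictionary):
    new_dict = {'name': dictionary['name'], 'identifiers': []}
    for i in dictionary['identifiers']:
        new_dict['identifiers'].append(i['value'])
    return new_dict


def dump_records(records, path):
    # Hypothetical helper: flatten each record, then write the whole list as JSON.
    # Assumes json.dump is the library function the answer refers to.
    flattened = [changeJSON(record) for record in records]
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(flattened, f)
    return flattened


if __name__ == '__main__':
    sample = [{
        'name': 'Bob',
        'identifiers': [
            {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
            {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
        ],
    }]
    # 'flattened.json' is a placeholder output path.
    print(dump_records(sample, 'flattened.json'))
```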

You could try making use of the following:

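from operator import itemgetter
from typing import List

# Drop-in replacement for the method in the question (hence the self parameter)
def _transform_identifier_values(self, identifiers: List[dict]):
    return list(map(itemgetter('value'), identifiers))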

You could even turn it into a partial function:

from operator import itemgetter
from functools import partial
# obj is assumed to be a single document dict, like the sample record in the question
flatten_ids = partial(lambda o: list(map(itemgetter('value'), o['identifiers'])))
print(flatten_ids(obj))

If you want to avoid the lambda, you can try:

flatten = partial(map, itemgetter('value'))
print(list(flatten(obj['identifiers'])))
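
As a quick sanity check, the variants can be run against the sample record from the question to confirm that they return the same list. This is only a sketch: obj is assumed to be one document dict, and the names get_value, via_comprehension, via_map, and via_partial are made up for the example.

```python
from functools import partial
from operator import itemgetter

# Assumed input: one document shaped like the sample record in the question.
obj = {
    'name': 'Bob',
    'identifiers': [
        {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
        {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
    ],
}

# Create the getter and the partial once so they can be reused across documents.
get_value = itemgetter('value')
flatten = partial(map, get_value)

via_comprehension = [identifier['value'] for identifier in obj['identifiers']]
via_map = list(map(get_value, obj['identifiers']))
via_partial = list(flatten(obj['identifiers']))

print(via_comprehension)                            # ['abcd', 'efgh']
print(via_comprehension == via_map == via_partial)  # True
```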

I'd be curious to see the results.
