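I have a set of 11 million documents in Elasticsearch, each of which has an array of identifiers. Each identifier is a dict containing a type, a value, and a date. Here is a sample record: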
{
    "name": "Bob",
    "identifiers": [
        {
            "date": "2019-01-01",
            "type": "a",
            "value": "abcd"
        },
        {
            "date": "2019-01-01",
            "type": "b",
            "value": "efgh"
        }
    ]
}
I need to transfer these records every night to a Parquet data store in which only the identifier values are kept, in an array. For example:
{
    "name": "Bob",
    "identifiers": ["abcd", "efgh"]
}
I do this by looping over all the records and flattening the identifiers. Here is my flattening transformer:
# Method on the transformer class; needs `from typing import List`.
def _transform_identifier_values(self, identifiers: List[dict]):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret
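This works, but it is slow. Is there a faster way? Perhaps a native implementation I could take advantage of.
Edit:
I tried Sunny's suggestion. To my surprise, the original actually performed best. My assumption was that itemgetter would be more efficient.
Here is how I tested it: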
import time
from functools import partial
from operator import itemgetter
def main():
    docs = []
    for i in range(10_000_000):
        docs.append({
            'name': 'Bob',
            'identifiers': [
                {
                    'date': '2019-01-01',
                    'type': 'a',
                    'value': 'abcd'
                },
                {
                    'date': '2019-01-01',
                    'type': 'b',
                    'value': 'efgh'
                }
            ]
        })

    start = time.time()
    for doc in docs:
        _transform_identifier_values_original(doc['identifiers'])
    end = time.time()
    print(f'Original took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_getter(doc['identifiers'])
    end = time.time()
    print(f'Item getter took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial_lambda(doc['identifiers'])
    end = time.time()
    print(f'Lambda partial took {end-start} seconds')

    start = time.time()
    for doc in docs:
        _transform_identifier_values_partial(doc['identifiers'])
    end = time.time()
    print(f'Partial took {end-start} seconds')


def _transform_identifier_values_original(identifiers):
    ret = [
        identifier['value']
        for identifier in identifiers
    ]
    return ret


def _transform_identifier_values_getter(identifiers):
    return list(map(itemgetter('value'), identifiers))


def _transform_identifier_values_partial_lambda(identifiers):
    flatten_ids = partial(lambda o: list(map(itemgetter('value'), o)))
    return flatten_ids(identifiers)


def _transform_identifier_values_partial(identifiers):
    flatten = partial(map, itemgetter('value'))
    return list(flatten(identifiers))


if __name__ == '__main__':
    main()
Results:
Original took 4.6204328536987305 seconds
Item getter took 7.186180114746094 seconds
Lambda partial took 10.534514904022217 seconds
Partial took 9.07079291343689 seconds
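A single pass timed with time.time() can be noisy, so it may be worth repeating each measurement. The following is only a sketch of how the standard-library timeit module could be used for that, keeping the best of several runs. The helper name best_time and the 10,000-document sample size are assumptions for illustration and are not part of the test above.

```python
import timeit
from operator import itemgetter


def flatten_comprehension(identifiers):
    # Same logic as the original list comprehension.
    return [identifier['value'] for identifier in identifiers]


def flatten_getter(identifiers):
    # Same logic as the itemgetter variant, built on every call.
    return list(map(itemgetter('value'), identifiers))


def best_time(func, docs, repeat=3):
    """Return the best time in seconds over `repeat` full passes across docs."""
    def run():
        for doc in docs:
            func(doc['identifiers'])
    # number=1: each timing is exactly one pass over all documents.
    return min(timeit.repeat(run, number=1, repeat=repeat))


# Small hypothetical sample; scale the range up for a more realistic run.
docs = [
    {
        'name': 'Bob',
        'identifiers': [
            {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
            {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
        ],
    }
    for _ in range(10_000)
]

timings = {
    'comprehension': best_time(flatten_comprehension, docs),
    'itemgetter': best_time(flatten_getter, docs),
}
for name, seconds in timings.items():
    print(f'{name}: {seconds:.4f} seconds')
```

Taking the minimum of several runs reduces the influence of background activity on the machine; the absolute numbers will differ from those reported above.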
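I came up with a solution:
This function takes a single dictionary and returns it in the new format you need. You can then use a function from the built-in json library that takes a list of dictionaries and dumps them to a JSON file.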
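As a rough sketch of that idea, assuming the json function meant here is json.dump: the names flatten_doc and dump_docs below are hypothetical, and the in-memory buffer stands in for a real output file.

```python
import io
import json


def flatten_doc(doc: dict) -> dict:
    """Return a copy of doc whose 'identifiers' holds only the value strings."""
    return {
        **doc,
        'identifiers': [ident['value'] for ident in doc.get('identifiers', [])],
    }


def dump_docs(docs, fp) -> None:
    """Flatten every document and write the resulting list to fp as JSON."""
    json.dump([flatten_doc(doc) for doc in docs], fp)


sample = [
    {
        'name': 'Bob',
        'identifiers': [
            {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
            {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
        ],
    }
]

# An in-memory buffer is used here; a file opened with open(path, 'w')
# can be passed in exactly the same way.
buffer = io.StringIO()
dump_docs(sample, buffer)
print(buffer.getvalue())
```

Note that this produces JSON rather than Parquet, so a separate conversion step would still be needed for the nightly load.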
You could try making use of ^{}
You could even turn it into a partial function:
If you want to avoid the lambda, you could try:
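A minimal sketch of what these suggestions might look like, assuming they refer to operator.itemgetter and functools.partial as tested in the question's edit. The names get_value, iter_values and flatten_identifiers are hypothetical. Unlike the benchmarked versions, the helper objects here are built once at import time rather than on every call.

```python
from functools import partial
from operator import itemgetter

# Built once, so each call pays no construction cost for these objects.
get_value = itemgetter('value')
iter_values = partial(map, get_value)  # no lambda involved


def flatten_identifiers(identifiers):
    """Return the list of 'value' entries from a list of identifier dicts."""
    return list(iter_values(identifiers))


identifiers = [
    {'date': '2019-01-01', 'type': 'a', 'value': 'abcd'},
    {'date': '2019-01-01', 'type': 'b', 'value': 'efgh'},
]
print(flatten_identifiers(identifiers))  # ['abcd', 'efgh']
```

Whether this beats the plain list comprehension would need to be measured; the results in the question suggest the comprehension is hard to improve on for lists this short.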
I'd love to see the results.