A faster way than dict to assign indices to Python strings

feifei77.w70-e2.ezcname.com reseauocoz.cluster007.ovh.net cse-web-cl.comunique-se.com.br ext-cust.squarespace.com ext-cust.squarespace.com ext-cust.squarespace.com ext-cust.squarespace.com ghs.googlehosted.com isutility.web9.hubspot.com sendv54sxu8f12g.ihance.net sites.smarsh.io www.triblocal.com.s3-website-us-east-1.amazonaws.com *.2bask.com *.819.cn

2 answers

User

Answer 1 · edited 2024-04-20 07:38:00

The bottleneck in your code is the w.write call inside the for loop. Build the dict first and write to the file afterwards; that will run much faster.
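A minimal sketch of that idea follows. The question's original code is not shown here, so the function names, the one-index-per-line output format, and the use of an in-memory buffer in place of the real file handle `w` are all assumptions made for illustration.

```python
import io


def assign_indices(words):
    """Map each distinct word to a 1-based index, in first-seen order."""
    index = {}
    for word in words:
        if word not in index:
            index[word] = len(index) + 1
    return index


def write_indices(words, out):
    """Build the whole mapping first, then write the output in one call."""
    words = list(words)  # the input is iterated twice, so materialize it
    index = assign_indices(words)
    # A single write() replaces one write() per loop iteration.
    out.write("".join(f"{index[word]}\n" for word in words))
    return index


# Hypothetical usage: io.StringIO stands in for the real output file.
buf = io.StringIO()
mapping = write_indices(["a.com", "b.com", "a.com"], buf)
```

Whether this is faster in practice depends on how the file was opened and buffered, so it is worth timing against the original loop.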

User

Answer 2 · edited 2024-04-20 07:38:00

Using a set instead of a dict is slightly more memory-friendly. With the unique_everseen() example from the itertools documentation at https://docs.python.org/3/library/itertools.html, you can do the following:

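# unique_everseen() is a recipe in the itertools docs, not a function in the
# itertools module: copy it from there, or import it from more_itertools.
for idx, word in enumerate(unique_everseen(reader), 1):
    print(idx)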
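For a self-contained version, here is a simplified sketch of the unique_everseen() recipe (without the optional key argument that the documented recipe accepts). The sample list standing in for `reader` is an assumption; in the question it would be whatever iterable yields the strings.

```python
def unique_everseen(iterable):
    """Yield each element the first time it appears, preserving order."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item


# Hypothetical stand-in for the question's `reader` iterable.
reader = [
    "ghs.googlehosted.com",
    "ext-cust.squarespace.com",
    "ext-cust.squarespace.com",
    "sites.smarsh.io",
]

# Each distinct string gets the next index, starting at 1.
indexed = list(enumerate(unique_everseen(reader), 1))
```

Note that this assigns an index only to the first occurrence of each string; repeated strings are skipped rather than looked up, so it suits the case where one index per unique string is all that is needed.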

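Another approach, which scales to larger datasets, is to use some kind of persistent key/value store that keeps the data on disk rather than in an in-memory mapping, for example LevelDB (via Plyvel). It could look like this: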

import itertools
import plyvel

db = plyvel.DB('my-database', create_if_missing=True)
cnt = itertools.count(1)  # start counting at 1
for word in reader:
    key = word.encode('utf-8')
    value = db.get(key)
    if value is not None:
        # We've seen this word before.
        idx = int(value)
    else:
        # We've not seen this word before.
        idx = next(cnt)
        db.put(key, str(idx).encode('ascii'))

    print(idx)
