Python: split a huge list into multiple lists; loop over each list

2 answers

User

#1 · edited 2024-04-24 03:05:04

Put the verb sublists into a single list:

# note: a slice whose start is greater than its stop, e.g. words[13081:9811], is empty
verbs = [words[:3269],words[3269:13080],words[13081:9811],words[9812:6542],
         words[6543:3273],words[3274:len(words)]]

Then loop over the indices of that list. With the index we can both build the output path and access the right element of verbs.

import csv
import itertools
import Levenshtein  # third-party package

for i in range(len(verbs)):
    output = '{}ortho{}.csv'.format(path, i+1)
    with open(output, 'w', newline='') as f:  # text mode; 'wb' only works on Python 2
        writer = csv.writer(f, delimiter=",", lineterminator="\n")
        for a, b in itertools.product(verbs[i], words):
            if a < b:
                d = Levenshtein.distance(a, b)  # compute the distance only once
                if d <= 5:
                    writer.writerow([a, b, d])
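
As a side note, the sublists can also be derived from a list of cut points instead of hand-written slices, which avoids empty or gapped slices. This is only a sketch: split_at is a made-up helper name, and the words and cut points below are placeholder data, not the ones from the question.

```python
def split_at(items, cut_points):
    """Split items into consecutive sublists at the given indices.

    Each stop index is reused as the next start index, so no element
    is skipped and none appears twice.
    """
    bounds = [0] + sorted(cut_points) + [len(items)]
    return [items[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


# Placeholder data and cut points, for illustration only.
words = ["w%d" % n for n in range(10)]
verbs = split_at(words, [3, 6])
```

Changing the number of sublists then only means changing the list of cut points.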

User

#2 · edited 2024-04-24 03:05:04

There are some problems with your code, and a few more things you could improve:

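Instead of having six different variables each for verbs and output, use two lists; this makes it easier to adjust the "split points" or the number of sublists, and you don't have to copy-paste the code block for comparing the words six times; just use another loop
The sublist words[13081:9811] is empty, and so is any other slice whose second index is lower than the first
The second index of a slice is exclusive, so a stop index has to be reused as the next start index: words[:3269] and words[3269:13080] fit together without a gap, but a slice that starts one past the previous stop, such as words[13081:...] after words[3269:13080], leaves words[13080] in neither sublist; the same goes for the following lists
In case this was your intention: splitting the list will not reduce complexity or running time, as you still have to compare every word; a*x + b*x + c*x is the same as (a+b+c) * x
Instead of checking a < b and throwing away half of the product, use itertools.combinations (but this only works when not splitting the list)
If you are only interested in pairs with an edit distance <= 5, you can do some other checks first, such as comparing the lengths of the two words or the difference between their sets of contained characters; both checks are faster than the actual edit distance computation, which is O(n²), and may rule out many combinations
For the same reason, don't compute the edit distance twice, once in the check and once when writing it to the file; compute it once and store it in a temporary variable
If you split the files so that the output does not get too big for Excel to handle (as I understand from one of your comments), your approach might not work, because the size of the output files can vary drastically depending on how many matches there are in each sublist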
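
The cheap pre-checks mentioned above could look roughly like this. It is only a sketch of one possible might_be_close: the two lower bounds used here (length difference and characters occurring in only one of the words) are my own choice, and the function may only return False when the edit distance is guaranteed to exceed max_dist.

```python
def might_be_close(a, b, max_dist):
    """Return False only if the edit distance must be greater than max_dist."""
    # Only insertions and deletions change the length, by one per edit,
    # so the length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > max_dist:
        return False
    # Every distinct character that occurs in just one of the words has to be
    # deleted, inserted or substituted, which costs at least one edit each.
    chars_a, chars_b = set(a), set(b)
    only_a = len(chars_a - chars_b)
    only_b = len(chars_b - chars_a)
    return max(only_a, only_b) <= max_dist
```

A True result does not mean the words are close, only that the real distance still has to be computed.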

Combining the above, you could try something like this (not tested):

import csv
import itertools
import Levenshtein  # third-party package

# chunks() and might_be_close() are helper functions, described below
path = '/Users/path/'
with open(path + 'wordlist.txt') as infile:
    words = set(s.strip() for s in infile)

combs = itertools.combinations(words, 2)
max_count = 10**6 # or whatever Excel can handle
for i, chunk in enumerate(chunks(combs, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b in chunk:
            if might_be_close(a, b, 5):
                d = Levenshtein.distance(a, b)
                if d <= 5:
                    writer.writerow([a, b, d])

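Here, chunks is a function that splits an iterator into chunks, and might_be_close is a helper function that compares e.g. the lengths or the sets of contained letters, as described above. The sizes of the output files may still vary, but no file will ever have more than max_count rows.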
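
The chunks function itself is not shown here. One possible version, assuming it should accept any iterable and is allowed to hold one chunk in memory at a time, is built on itertools.islice:

```python
import itertools


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        # islice consumes up to `size` items and leaves the rest untouched.
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

For example, list(chunks(range(7), 3)) gives three chunks, the last one shorter than the others. With max_count = 10**6 each chunk is a list of up to a million pairs, so memory use stays bounded even though the total number of combinations is much larger.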

Or try this, to get output files with exactly max_count entries each (the last file may have fewer):

# words, path and the imports are the same as in the previous snippet

def filter_matches(combs, max_dist):
    # lazily yield only the pairs within max_dist, together with their distance
    for a, b in combs:
        if might_be_close(a, b, max_dist):
            d = Levenshtein.distance(a, b)
            if d <= max_dist:
                yield a, b, d

max_count = 10**6 # or whatever Excel can handle
matches = filter_matches(itertools.combinations(words, 2), 5)
for i, chunk in enumerate(chunks(matches, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b, d in chunk:
            writer.writerow([a, b, d])

Here, the filter_matches generator pre-filters the combinations, and the matches are then chunked to the right size.
