Efficiently extracting a subset of a large file in Python

import numpy as np

def get_shorter_subset(fname, new_len):
    """Extract a random shorter subset of length new_len from a given file"""
    out_lines = []
    with open(fname + "short.out", 'w') as out_file:
        with open(fname, 'r') as in_file:
            all_lines = in_file.readlines()
            total = len(all_lines)
            print("Total lines:", total)
            for i in range(new_len):
                line = np.random.choice(all_lines)
                out_lines.append(line.rstrip('\t\r\n'))
                #out_file.write(line.rstrip('\t\r\n'))
                print("Done with", i, "lines")
                all_lines.remove(line)
        out_file.write("\n".join(out_lines))

3 answers

User

#1 · edited 2024-05-23 18:40:22

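So, the problem is:

Reading every line into memory with all_lines = in_file.readlines() may not be the best approach. But if you are going to do that, definitely do not call all_lines.remove(line): it is an O(N) operation, and doing it inside a loop gives you quadratic complexity.

I suspect you would get a huge performance improvement just by doing something to the following effect: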

# Pick new_len distinct line indices instead of removing lines from the list
idx = np.arange(total, dtype=np.int32)
idx = np.random.choice(idx, size=new_len, replace=False)
for i in idx:
    out_file.write(all_lines[i])
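
Put together as a complete function, the idea might look like the sketch below. It is only a sketch under a few assumptions: it uses the standard library's `random.sample` in place of `np.random.choice(..., replace=False)` to pick distinct indices, and the `seed` parameter and the returned output path are additions for illustration that the original function does not have.

```python
import random


def get_shorter_subset(fname, new_len, seed=None):
    """Write new_len randomly chosen lines of fname to fname + "short.out".

    `seed` and the return value are illustrative additions.
    """
    rng = random.Random(seed)
    with open(fname, "r") as in_file:
        all_lines = in_file.readlines()

    # Sample distinct indices once; no list.remove() inside a loop.
    # Raises ValueError if new_len is larger than the number of lines.
    idx = rng.sample(range(len(all_lines)), new_len)

    out_name = fname + "short.out"
    with open(out_name, "w") as out_file:
        for i in idx:
            out_file.write(all_lines[i].rstrip("\t\r\n") + "\n")
    return out_name
```

This still reads the whole file into memory, so it addresses the quadratic loop but not the memory use.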

User

#2 · edited 2024-05-23 18:40:22

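You read in all the lines, hold them in memory, and then perform 250K expensive operations on the result. Each time you remove a line from the list, Python has to search for it and shift the remaining entries.

Instead, just take a random sample. For example, if you have 5 million lines and want 5% of the file: read the file line by line, roll a random float for each line, and if it is <= 0.05, write that line to the output.

With a sample this large, the output will come out very close to the desired size.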
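
The line-by-line approach above can be sketched as follows. The function name `sample_lines`, its parameters, and the returned count are hypothetical names chosen for this example, not anything from the question.

```python
import random


def sample_lines(in_name, out_name, fraction, seed=None):
    """Copy each line of in_name to out_name with probability `fraction`.

    Returns the number of lines written. All names here are illustrative.
    """
    rng = random.Random(seed)
    kept = 0
    with open(in_name, "r") as in_file, open(out_name, "w") as out_file:
        for line in in_file:  # streams the file one line at a time
            # rng.random() is in [0.0, 1.0), so `<` keeps a line with
            # probability exactly `fraction`.
            if rng.random() < fraction:
                out_file.write(line)
                kept += 1
    return kept
```

Calling `sample_lines("big.txt", "big_short.txt", 0.05)` on a 5-million-line file would keep about 250,000 lines, with a standard deviation of roughly 490 lines, so the count is close to the target but not exact.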

User

#3 · edited 2024-05-23 18:40:22

You could also try using mmap:

https://docs.python.org/3.6/library/mmap.html
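
One way mmap could be applied here, as a rough sketch rather than a tuned solution: map the file read-only, record the byte offset where each line starts, then copy out a random selection of lines by slicing the map. The function name `sample_lines_mmap` and its parameters are assumptions made for this example.

```python
import mmap
import random


def sample_lines_mmap(in_name, out_name, new_len, seed=None):
    """Write new_len distinct random lines of in_name to out_name via mmap.

    Illustrative sketch; mmap raises ValueError on an empty file.
    """
    rng = random.Random(seed)
    with open(in_name, "rb") as in_file:
        with mmap.mmap(in_file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Record the byte offset at which each line starts.
            starts = [0]
            pos = mm.find(b"\n")
            while pos != -1:
                starts.append(pos + 1)
                pos = mm.find(b"\n", pos + 1)
            size = len(mm)
            if starts[-1] == size:
                starts.pop()  # file ends with a newline: no line starts there
            starts.append(size)  # sentinel marking the end of the last line

            # Choose distinct line numbers, then slice each line out of the map.
            chosen = rng.sample(range(len(starts) - 1), new_len)
            with open(out_name, "wb") as out_file:
                for i in chosen:
                    line = mm[starts[i]:starts[i + 1]]
                    out_file.write(line.rstrip(b"\r\n") + b"\n")
    return new_len
```

This keeps one integer offset per line in memory instead of the line text, and leaves paging of the file contents to the operating system. Whether it is actually faster than the streaming approach depends on the file and the machine, so it is worth measuring.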
