Transposing a large array without loading it into memory

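    import gzip

    # Pseudocode for the intended result: read the gzipped input line by line
    # and write each line as a column of the output.
    with open("output.txt", "w") as out:
        with gzip.open("file.txt", "rt") as file:
            for line in file:
                # pseudocode: str has no transpose(), and write() cannot
                # place text "as a column"
                transposed_line = line.transpose()
                out.write(transposed_line, as.column)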

    import numpy as np
    import random

    # create example array and write to file
    with open("array.txt", "w") as out:
        num_columns = 8
        num_lines = 24
        for i in range(num_lines):
            line = []
            for column in range(num_columns):
                line.append(str(random.choice([0, 1])))
            out.write(" ".join(line) + "\n")

    # iterate over chunks of dimensions num_columns×num_columns, transpose them, and append to file
    with open("array.txt", "r") as array:
        with open("transposed_array.txt", "w") as out:
            for chunk_start in range(0, num_lines, num_columns):
                # get chunk and transpose
                chunk = np.genfromtxt(array, max_rows=num_columns, dtype=int).T
                # write out chunk
                # NOTE: seek() takes a byte offset from the start of the file, so this
                # does not place the chunk to the right of the previous one; savetxt
                # then writes whole lines starting at that byte.
                out.seek(chunk_start + num_columns, 0)
                np.savetxt(out, chunk, fmt="%s", delimiter=' ', newline='\n')

    0 0 0 1 1 0 0 0
    0 1 1 0 1 1 0 1
    0 1 1 0 1 1 0 0
    1 0 0 0 0 1 0 1
    1 1 0 0 0 1 0 1
    0 0 1 1 0 0 1 0
    0 0 1 1 1 1 1 0
    1 1 1 1 1 0 1 1
    0 1 1 0 1 1 1 0
    1 1 0 1 1 0 0 0
    1 1 0 1 1 0 1 1
    1 0 0 1 1 0 1 0
    0 1 0 1 0 1 0 0
    0 0 1 0 0 1 0 0
    1 1 1 0 0 1 1 1
    1 0 0 0 0 0 0 0
    0 1 1 1 1 1 1 1
    1 1 1 1 0 1 0 1
    1 0 1 1 1 0 0 0
    0 1 0 1 1 1 1 1
    1 1 1 1 1 1 0 1
    0 0 1 1 0 1 1 1
    0 1 1 0 1 1 0 1
    0 0 1 0 1 1 0 1

    0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0
    0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 0
    0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1
    1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0
    1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 1
    0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
    0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
    0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1

    import numpy as np
    import random

    # create example array and write to file
    num_columns = 4
    num_lines = 8
    with open("array.txt", "w") as out:
        for i in range(num_lines):
            line = []
            for column in range(num_columns):
                line.append(str(random.choice([0, 1])))
            out.write(" ".join(line) + "\n")

    # iterate over chunks of dimensions num_columns×chunk_length, transpose them, and append to file
    chunk_length = 7
    with open("array.txt", "r") as array:
        with open("transposed_array.txt", "w") as out:
            for chunk_start in range(0, num_lines, chunk_length):
                # get chunk and transpose; atleast_2d keeps a final one-row chunk 2-D
                chunk = np.atleast_2d(
                    np.genfromtxt(array, max_rows=chunk_length, dtype=str)
                ).T
                # write out chunk: each output line occupies 2 * num_lines bytes
                # (one digit plus one separator or newline per value)
                empty_line = 2 * (num_lines - (chunk_length + chunk_start))
                for i, line in enumerate(chunk):
                    new_pos = 2 * num_lines * i + 2 * chunk_start
                    out.seek(new_pos)
                    # pad with spaces so later chunks can overwrite the rest of the line
                    out.write(" ".join(line) + " " * empty_line + "\n")

2 Answers

User

Answer 1 · edited 2024-04-20 07:31:07

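In your working but slow solution, you read the input file 5000 times. That will not be fast, but the only easy way to minimize the reads is to read everything into memory.

You could try a compromise: read, say, 50 columns at a time into memory (about 50 MB) and write them to the file as rows. That way you read the file "only" 100 times. Try a few different block sizes until you find a performance/memory trade-off you are happy with.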

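You would do this with three nested loops:

1. Loop over the number of chunks (100 in this case)
2. Loop over the lines of the input file
3. Loop over the columns in the chunk (50 here)

In the innermost loop, collect the column values into a two-dimensional array so that each column of the chunk becomes one row, appending one value per iteration of the middle loop. In the outermost loop, clear the array before entering the inner loops and write it to the file as rows afterwards. Each iteration of loop 1 therefore writes 50 rows of a million columns.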
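The three loops above can be sketched as follows. This is a minimal illustration, not the poster's code: the function name `transpose_by_column_blocks` and its parameters are hypothetical, and it reads an uncompressed space-separated file (for the gzipped input, `gzip.open(path, "rt")` could replace `open`).

```python
import os
import tempfile

def transpose_by_column_blocks(src_path, dst_path, num_columns, block_size):
    """Transpose a space-separated text matrix, block_size columns per pass."""
    with open(dst_path, "w") as out:
        # Loop 1: one full pass over the input per block of columns.
        for block_start in range(0, num_columns, block_size):
            block_end = min(block_start + block_size, num_columns)
            # Cleared at the start of every pass; one list per output row.
            rows = [[] for _ in range(block_end - block_start)]
            with open(src_path) as src:
                # Loop 2: every line of the input file.
                for line in src:
                    values = line.split()
                    # Loop 3: only the columns that belong to this block.
                    for k in range(block_start, block_end):
                        rows[k - block_start].append(values[k])
            # After the pass, each collected column is written as one row.
            for row in rows:
                out.write(" ".join(row) + "\n")

# Small demonstration on a 2 x 3 matrix, two columns per pass.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "array.txt")
    dst = os.path.join(tmp, "transposed.txt")
    with open(src, "w") as f:
        f.write("1 0 0\n0 1 1\n")
    transpose_by_column_blocks(src, dst, num_columns=3, block_size=2)
    with open(dst) as f:
        result = f.read()
print(result)
```

Memory use is bounded by `block_size` times the number of input lines, and the input is read once per block, which is the trade-off described above.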

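You cannot really insert into the middle of a normal file without loading the entire target file into memory, because you would have to shift the trailing bytes forward yourself. Since you know your exact file size, however, you could pre-allocate the file and seek to the right position for every byte you write. Doing 5 billion seeks is probably not very fast either. If your ones and zeros are fairly evenly distributed, you could initialize the file with all zeros and then write only the ones (or the other way around), which halves the number of seeks.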
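A rough sketch of the pre-allocate-and-seek idea, assuming single-digit values separated by one space and `\n` line endings, so every output line is exactly `2 * num_rows` bytes. The function name `transpose_with_seeks` is made up for this example.

```python
import os
import tempfile

def transpose_with_seeks(src_path, dst_path, num_rows, num_columns):
    """Pre-fill the output with zeros, then seek and write only the ones."""
    # Each output line holds num_rows digits, num_rows - 1 spaces and "\n".
    line_len = 2 * num_rows
    zero_line = (" ".join("0" * num_rows) + "\n").encode("ascii")
    with open(dst_path, "wb") as out:
        for _ in range(num_columns):
            out.write(zero_line)
    # Binary mode keeps byte offsets exact on every platform.
    with open(src_path) as src, open(dst_path, "r+b") as out:
        for r, line in enumerate(src):
            for c, value in enumerate(line.split()):
                if value == "1":
                    # Input cell (r, c) becomes output cell (c, r).
                    out.seek(c * line_len + 2 * r)
                    out.write(b"1")

# Small demonstration on a 2 x 3 matrix.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "array.txt")
    dst = os.path.join(tmp, "transposed.txt")
    with open(src, "w") as f:
        f.write("1 0 0\n0 1 1\n")
    transpose_with_seeks(src, dst, num_rows=2, num_columns=3)
    with open(dst, "rb") as f:
        seek_result = f.read()
print(seek_result.decode("ascii"))
```

The input is read only once here, at the cost of one seek per `1` in the data.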

Edit: added details on how the chunking could be implemented.

User

Answer 2 · edited 2024-04-20 07:31:07

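If your numbers are all 0 or 1, every line has the same length in bytes, so you can use file.seek to move around in the file instead of reading in data and ignoring it. This may not be very efficient with a gzipped input file, though. Since you are writing an uncompressed file, you can also use seek to jump around in the output.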
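To illustrate why fixed-length lines make seeking possible, here is a small sketch. It assumes one digit per value, a single space as separator and `\n` line endings, and the helper name `read_cell` is hypothetical. An in-memory `io.BytesIO` stands in for an uncompressed file opened in binary mode.

```python
import io

def read_cell(f, row, col, num_columns):
    """Return the digit at (row, col) of a binary-mode, fixed-width file."""
    # Every line is num_columns digits, num_columns - 1 spaces and "\n".
    line_len = 2 * num_columns
    f.seek(row * line_len + 2 * col)
    return f.read(1).decode("ascii")

# A 2 x 3 matrix held in memory instead of on disk.
data = io.BytesIO(b"1 0 0\n0 1 1\n")
cell = read_cell(data, row=1, col=2, num_columns=3)
column_1 = [read_cell(data, r, 1, num_columns=3) for r in range(2)]
print(cell, column_1)
```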

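A more efficient way to transpose an array is to read in a chunk that fits into RAM (e.g. 1000x1000), transpose it with numpy.transpose, and then write the chunk to its location in the transposed array. For your array, which has 5000 columns but 1M rows, it is probably easiest to use 5000x5000 chunks, i.e. read 5000 complete rows of the input matrix at a time. This avoids having to seek around in the compressed input file. You then have to write each block to the output file, leaving blank space for the columns that come from subsequent rows of the input.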
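The reading half of this approach might look like the sketch below. The generator name `iter_transposed_chunks` is hypothetical, and `zip(*rows)` is used in place of numpy.transpose only to keep the example free of dependencies; with NumPy installed, `np.array(rows).T` would do the same job.

```python
import gzip
import itertools
import os
import tempfile

def iter_transposed_chunks(path, chunk_rows):
    """Yield (first_input_row, transposed_chunk) from a gzipped text matrix."""
    with gzip.open(path, "rt") as f:
        start = 0
        while True:
            # Read the next chunk_rows complete lines; no seeking is needed.
            rows = [line.split() for line in itertools.islice(f, chunk_rows)]
            if not rows:
                break
            # zip(*rows) turns the columns of the chunk into rows.
            yield start, [list(col) for col in zip(*rows)]
            start += len(rows)

# Small demonstration: a 3 x 2 matrix read two rows at a time.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt.gz")
    with gzip.open(path, "wt") as f:
        f.write("1 0\n0 1\n1 1\n")
    chunks = list(iter_transposed_chunks(path, chunk_rows=2))
print(chunks)
```

The compressed input is read strictly front to back, so gzip never has to decompress the same data twice.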

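More detail on how to write the chunks to the 5000xN output file (as requested in a comment):

To write the first 5000x5000 chunk:

Seek to the start of the file
Write the first row of the chunk (5000 elements)
Seek to the start of the second row of the output (i.e. to offset 2N in the file, or 2N+1 if you have CRLF line endings)
Write the second row of the chunk
Seek to the start of the third row of the file
etc.

To write the second chunk:

Seek to position 5000 (zero-based) of the first output row, i.e. byte offset 2×5000 within that row
Write the first row of the chunk
Seek to position 5000 of the second output row
Write the second row of the chunk
etc.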
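The seek-and-write steps above could be sketched like this, assuming one digit per value, a single space as separator and `\n` line endings (so each output row is `2 * num_rows` bytes, the 2N offset mentioned above). The helper name `write_chunk` is hypothetical, and `io.BytesIO` stands in for the output file opened in binary mode.

```python
import io

def write_chunk(out, chunk, chunk_start, num_rows):
    """Write a transposed chunk into its place in a binary-mode output.

    chunk[i] holds the values of output row i that come from input rows
    chunk_start .. chunk_start + len(chunk[i]) - 1.
    """
    line_len = 2 * num_rows  # bytes per output row, including "\n"
    for i, row in enumerate(chunk):
        # Start of output row i, plus two bytes per value already placed.
        out.seek(i * line_len + 2 * chunk_start)
        # End the row with "\n" if this chunk reaches the last column,
        # otherwise leave a separator for the chunk that follows.
        is_last = chunk_start + len(row) == num_rows
        out.write((" ".join(row) + ("\n" if is_last else " ")).encode("ascii"))

# Input of 3 rows x 2 columns, written as two chunks of 2 and 1 input rows.
out = io.BytesIO()
write_chunk(out, [["1", "0"], ["0", "1"]], chunk_start=0, num_rows=3)
write_chunk(out, [["1"], ["1"]], chunk_start=2, num_rows=3)
written = out.getvalue()
print(written.decode("ascii"))
```

Seeking past the current end of the file leaves a gap that later chunks fill in, so no separate pre-allocation pass is required in this sketch.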
