numpy.memmap for an array of strings?

7 votes

2 answers

2936 views

数据工程师

Asked 2025-04-16 17:02

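Is there a way to use numpy.memmap to map a large on-disk array of strings into memory?

I know it can be done for floats and similar types, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution may use any reasonable file format.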

2 Answers

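The most flexible option would be to switch to a database or some other more complex file structure.

However, you may have good reasons for keeping things as a plain text file...

Because you control how the file is created, one option is simply to write a second file that contains only the starting position (in bytes) of each string in the first file.

This takes a bit more work, but essentially you could do something like this: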

import os


class IndexedText:
    """Newline-separated strings plus a side file of byte offsets."""

    def __init__(self, filename, mode='r'):
        if mode not in ('r', 'w', 'a'):
            raise ValueError('Only read, write, and append are supported')
        idxname = filename + 'idx'

        # Load any existing offsets before the index is opened for writing.
        self.indices = []
        if mode != 'w' and os.path.exists(idxname):
            with open(idxname) as idxfile:
                self.indices = [int(line) for line in idxfile]

        # Binary mode makes tell()/seek() work in plain byte offsets.
        self._mainfile = open(filename, mode + 'b')
        self._idxfile = open(idxname, mode)

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self._mainfile.close()
        self._idxfile.close()

    def __getitem__(self, idx):
        self._mainfile.seek(self.indices[idx])
        # You might want to remove the automatic stripping...
        return self._mainfile.readline().rstrip(b'\n').decode('utf-8')

    def write(self, line):
        if not line.endswith('\n'):
            line += '\n'
        # Record where this string starts before writing it.
        position = self._mainfile.tell()
        self.indices.append(position)
        self._idxfile.write('%d\n' % position)
        self._mainfile.write(line.encode('utf-8'))

    def writelines(self, lines):
        for line in lines:
            self.write(line)


def main():
    with IndexedText('test.txt', 'w') as outfile:
        outfile.write('Yep')
        outfile.write('This is a somewhat longer string!')
        outfile.write('But we should be able to index this file easily')
        outfile.write('Without needing to read the entire thing in first')

    with IndexedText('test.txt', 'r') as infile:
        print(infile[2])
        print(infile[0])
        print(infile[3])

if __name__ == '__main__':
    main()
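
To tie this back to numpy.memmap: the offsets could also be stored as raw 64-bit integers, so that the index file itself can be memory-mapped rather than parsed line by line. The following is only a rough sketch of that variation, not part of the class above; the file names, the sample strings, and the read_line helper are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample strings and file names, for illustration only.
lines = ["Yep", "This is a somewhat longer string!", "third"]
folder = tempfile.mkdtemp()
text_path = os.path.join(folder, "test.txt")
idx_path = os.path.join(folder, "test.idx")

# Write the strings and collect the byte offset at which each one starts.
offsets = []
with open(text_path, "wb") as f:
    for line in lines:
        offsets.append(f.tell())
        f.write(line.encode("utf-8") + b"\n")

# Store the offsets as raw int64 values (no header), 8 bytes per string.
np.array(offsets, dtype=np.int64).tofile(idx_path)

# Later: map the index instead of parsing it, then seek into the text file.
index = np.memmap(idx_path, dtype=np.int64, mode="r")


def read_line(i):
    """Return string i by seeking to its recorded byte offset."""
    with open(text_path, "rb") as f:
        f.seek(int(index[i]))
        return f.readline().rstrip(b"\n").decode("utf-8")
```

With this layout, read_line(2) only touches one 8-byte entry of the index and one line of the text file, so neither file has to be loaded in full.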

Answered 2025-04-16 by Python大师


If all the strings have the same length, as the term "array" suggests, this is easy to do:

import numpy
a = numpy.memmap("data", dtype="S10")

This example is for strings of length 10.
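
As a rough end-to-end illustration of the fixed-length case, the sketch below first writes a small file of 10-byte records and then maps it back. The sample words and the temporary path are assumptions made for the example; only the memmap call with dtype="S10" comes from the answer.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data; "S10" means 10 bytes per item, NUL-padded.
words = np.array([b"alpha", b"beta", b"gamma", b"0123456789"], dtype="S10")

path = os.path.join(tempfile.mkdtemp(), "data")
words.tofile(path)  # raw bytes, no header: 4 items * 10 bytes

# mode="r" maps the file read-only; the shape is inferred from the file size.
a = np.memmap(path, dtype="S10", mode="r")

first = a[0]                  # trailing NUL padding is stripped on access
last = a[-1].decode("ascii")  # items are bytes; decode to get a str
```

Note that the file size has to be an exact multiple of the item size (10 bytes here), otherwise numpy.memmap cannot infer the shape and raises an error.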

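Edit: Since the strings apparently do not all have the same length, you need to index the file to get O(1) access to each item. That requires reading the whole file once and keeping the starting positions of all strings in memory. However, I think that doing the indexing with NumPy means first creating an in-memory array as large as the file. That array can be discarded once the indices have been extracted.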
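
One way this indexing pass might look, as a sketch only: it assumes the strings are separated by newlines and that the file ends with a newline, and the sample file contents and the get_string helper are invented for the example. The comparison against the newline byte is the step that allocates the temporary file-sized array mentioned above.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file: newline-terminated strings of varying length.
path = os.path.join(tempfile.mkdtemp(), "strings.txt")
with open(path, "wb") as f:
    f.write(b"Yep\nThis is longer\nlast line\n")

raw = np.memmap(path, dtype=np.uint8, mode="r")

# The comparison builds a temporary boolean array as large as the file.
newlines = np.flatnonzero(raw == ord("\n"))

# Each string starts right after the previous newline; the first starts at 0.
starts = np.concatenate(([0], newlines[:-1] + 1))
ends = newlines  # exclusive end of each string (position of its newline)


def get_string(i):
    """Return string i by slicing the mapped bytes using the stored offsets."""
    return raw[starts[i]:ends[i]].tobytes().decode("utf-8")
```

Only starts and ends need to stay in memory afterwards; for very large files the scan could be done in chunks to limit the size of the temporary array.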

Answered 2025-04-16 by Python大师
