How to solve memory issues while multiprocessing using Pool.map()?

3 Answers

User

Answer 1 · edited 2024-06-07 00:12:09

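When using multiprocessing.Pool, a number of child processes are created with the fork() system call. Each of those processes starts off with an exact copy of the parent process's memory at that time. Because you load the csv before you create the Pool of size 3, each of the 3 processes in the pool will unnecessarily hold a copy of the data frame. (gen_matrix_df as well as gen_matrix_df_list will exist in the current process and in each of the 3 child processes, so there will be 4 copies of each of these structures in memory.)

Try creating the Pool before loading the file (at the very beginning of the script, actually). That should reduce the memory usage.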
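A minimal sketch of that ordering, assuming the fork start method is available (Linux). `load_big_table` and `work` are hypothetical stand-ins for the CSV loading and for matrix_to_vcf, they are not from the question:

```python
import multiprocessing as mp


def work(chunk):
    # Hypothetical stand-in for matrix_to_vcf: just sums the chunk.
    return sum(chunk)


def load_big_table():
    # Hypothetical stand-in for the expensive pd.read_csv call.
    return [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]


def run():
    ctx = mp.get_context('fork')  # assumption: a platform that supports fork
    # The workers are forked here, while the parent process is still small ...
    with ctx.Pool(3) as pool:
        # ... and the data is loaded only afterwards, in the parent alone.
        chunks = load_big_table()
        return pool.map(work, chunks)
```

The trade-off is that every chunk now has to be pickled and sent to a worker, because the workers no longer inherit the data.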

If it's still too high, you can:

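Dump gen_matrix_df_list to a file, 1 item per line, e.g.:

import base64
import pickle

with open('tempfile.txt', 'w') as f:
    for item in gen_matrix_df_list.items():
        # base64 keeps each pickled (key, value) pair on a single text line
        f.write(base64.b64encode(pickle.dumps(item)).decode('ascii') + '\n')

Use Pool.imap() on an iterator over the lines that you dumped to this file, e.g.:
```
with open('tempfile.txt', 'r') as f:
    items = (pickle.loads(base64.b64decode(line)) for line in f)
    # consume the results while the file is still open
    results = list(p.imap(matrix_to_vcf, items))
```
(Note that matrix_to_vcf takes a (key, value) tuple in the example above, not just a value.)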

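I hope this helps.

Note: I haven't tested the code above. It's only meant to demonstrate the idea.

User

Answer 2 · edited 2024-06-07 00:12:09

Prerequisite

In Python (in the following I use a 64-bit build of Python 3.6.5), everything is an object. This has its overhead, and with sys.getsizeof we can see exactly the size of an object in bytes: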
```
>>> import sys
>>> sys.getsizeof(42)
28
>>> sys.getsizeof('T')
50
```
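When a child process is created with the fork system call (the default on *nix for the Python version used here, see multiprocessing.get_start_method()), the parent's physical memory is not copied; the copy-on-write technique is used instead.
A forked child process will still report the full RSS (resident set size) of the parent process. Because of this, PSS (proportional set size) is a more appropriate metric for estimating the memory usage of a forking application. Here's an example from the page: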

Process A has 50 KiB of unshared memory
Process B has 300 KiB of unshared memory
Both process A and process B have 100 KiB of the same shared memory region
Since the PSS is defined as the sum of the unshared memory of a process and the proportion of memory shared with other processes, the PSS for these two processes are as follows:
PSS of process A = 50 KiB + (100 KiB / 2) = 100 KiB
PSS of process B = 300 KiB + (100 KiB / 2) = 350 KiB
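The same arithmetic as a small sketch. The helper name `pss` and its argument layout are made up for illustration, they are not part of any tool mentioned here:

```python
def pss(unshared_kib, shared_regions):
    """Proportional set size in KiB.

    shared_regions is an iterable of (size_kib, sharers) pairs, where
    sharers is the number of processes that map the region.
    """
    return unshared_kib + sum(size / sharers for size, sharers in shared_regions)


# Both processes share one 100 KiB region between the two of them.
pss_a = pss(50, [(100, 2)])
pss_b = pss(300, [(100, 2)])
print(pss_a, pss_b)  # 100.0 350.0
```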

Data frame

Now let's look at your DataFrame alone. memory_profiler will help us.

justpd.py

#!/usr/bin/env python3

import pandas as pd
from memory_profiler import profile

@profile
def main():
    with open('genome_matrix_header.txt') as header:
        header = header.read().rstrip('\n').split('\t')

    gen_matrix_df = pd.read_csv(
        'genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)

    gen_matrix_df.info()
    gen_matrix_df.info(memory_usage='deep')

if __name__ == '__main__':
    main()

Now let's use the profiler:

mprof run justpd.py
mprof plot

We can see the plot:

[Figure: mprof plot for justpd.py]

and the line-by-line trace:

Line #    Mem usage    Increment   Line Contents
================================================
     6     54.3 MiB     54.3 MiB   @profile
     7                             def main():
     8     54.3 MiB      0.0 MiB       with open('genome_matrix_header.txt') as header:
     9     54.3 MiB      0.0 MiB           header = header.read().rstrip('\n').split('\t')
    10                             
    11   2072.0 MiB   2017.7 MiB       gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
    12                                 
    13   2072.0 MiB      0.0 MiB       gen_matrix_df.info()
    14   2072.0 MiB      0.0 MiB       gen_matrix_df.info(memory_usage='deep')

We can see that the data frame takes ~2 GiB, with a peak at ~3 GiB while it is being built. What's more interesting is the output of info.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000000 entries, 0 to 3999999
Data columns (total 34 columns):
...
dtypes: int64(2), object(32)
memory usage: 1.0+ GB

But info(memory_usage='deep') ("deep" means deep introspection of the data by interrogating object dtypes, see below) gives:

memory usage: 7.9 GB

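Huh?! Looking from outside the process we can make sure that memory_profiler's figures are correct. sys.getsizeof also shows the same inflated value for the frame (most likely because of a custom __sizeof__), and so will other tools that use it to estimate the objects returned by gc.get_objects(), e.g. pympler.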

# added after read_csv
from pympler import tracker
tr = tracker.SummaryTracker()
tr.print_diff()

gives:

                                             types |   # objects |   total size
================================================== | =========== | ============
                 <class 'pandas.core.series.Series |          34 |      7.93 GB
                                      <class 'list |        7839 |    732.38 KB
                                       <class 'str |        7741 |    550.10 KB
                                       <class 'int |        1810 |     49.66 KB
                                      <class 'dict |          38 |      7.43 KB
  <class 'pandas.core.internals.SingleBlockManager |          34 |      3.98 KB
                             <class 'numpy.ndarray |          34 |      3.19 KB

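So where do these 7.93 GiB come from? Let's try to explain it. We have 4M rows and 34 columns, which gives us 136M values. They are either int64 or object (which is a 64-bit pointer; see using pandas with large data for a detailed explanation). Thus we have 136 * 10 ** 6 * 8 / 2 ** 20 ~ 1038 MiB only for the values in the data frame. What about the remaining ~6.9 GiB?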
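The estimate can be recomputed directly. This is only the arithmetic from the paragraph above (4M rows times 34 columns is 136M slots of 8 bytes each); the variable names are mine:

```python
rows = 4 * 10 ** 6
cols = 34
slot_size = 8  # bytes: an int64 value or a 64-bit pointer to a Python object

values = rows * cols
slots_mib = values * slot_size / 2 ** 20
reported_deep_gib = 7.93  # figure printed by info(memory_usage='deep')
remainder_gib = reported_deep_gib - slots_mib / 2 ** 10

print(values)                   # 136000000
print(round(slots_mib, 1))      # 1037.6
print(round(remainder_gib, 2))  # 6.92
```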

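String interning

To understand this behaviour it's necessary to know that Python does string interning. There are two good articles about string interning in Python 2. Besides the Unicode change in Python 3 and PEP 393 in Python 3.3, the C structures have changed, but the idea is the same. Basically, every short string that looks like an identifier will be cached by Python in an internal dictionary, and references will point to the same Python object. In other words, we can say it behaves like a singleton. The articles mentioned above explain what significant memory and performance improvements it gives. We can check whether a string is interned using the interned bits of the state field of PyASCIIObject: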

import ctypes

class PyASCIIObject(ctypes.Structure):
     _fields_ = [
         ('ob_refcnt', ctypes.c_size_t),
         ('ob_type', ctypes.py_object),
         ('length', ctypes.c_ssize_t),
         ('hash', ctypes.c_int64),
         ('state', ctypes.c_int32),
         ('wstr', ctypes.c_wchar_p)
    ]

Then:

>>> a = 'name'
>>> b = '!@#$'
>>> a_struct = PyASCIIObject.from_address(id(a))
>>> a_struct.state & 0b11
1
>>> b_struct = PyASCIIObject.from_address(id(b))
>>> b_struct.state & 0b11
0

With two strings we can also do an identity comparison (which, in the case of CPython, is an address comparison in memory).

>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> gen_matrix_df.REF[0] is gen_matrix_df.REF[6]
True
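Strings built at run time are not necessarily shared automatically, but sys.intern can force it. A minimal sketch, relying on CPython behaviour; the example strings are arbitrary:

```python
import sys

# Built at run time, so the compiler cannot fold them into one constant.
a = ''.join(['A', 'C', 'G', 'T'])
b = ''.join(['A', 'C', 'G', 'T'])

equal = a == b        # the contents match
same_before = a is b  # CPython detail: typically two distinct objects here

a = sys.intern(a)
b = sys.intern(b)
same_after = a is b   # both names now refer to the single interned object
```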

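Because of interning, with regard to the object dtype, the data frame allocates at most 20 strings (one per amino acid). Though, it's worth noting that pandas recommends categorical types for enumerations.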
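To illustrate the categorical suggestion, here is a small sketch on synthetic data (not the genome matrix from the question). It assumes pandas is installed, and the 20-letter alphabet and the row count are arbitrary choices of mine:

```python
import random

import pandas as pd

random.seed(0)
alphabet = list('ACDEFGHIKLMNPQRSTVWY')  # 20 one-letter codes, as in the text
values = [random.choice(alphabet) for _ in range(100_000)]

as_object = pd.Series(values, dtype=object)
as_category = as_object.astype('category')

# deep=True also counts the Python string objects behind the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The categorical column stores one small integer code per row plus a single copy of each category, so its reported footprint should be a fraction of the object column's.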

Pandas memory

Thus we can explain the naive estimate of 7.93 GiB like this:

>>> rows = 4 * 10 ** 6
>>> int_cols = 2
>>> str_cols = 32
>>> int_size = 8
>>> str_size = 58  
>>> ptr_size = 8
>>> (int_cols * int_size + str_cols * (str_size + ptr_size)) * rows / 2 ** 30
7.927417755126953

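Note that str_size is 58 bytes, not 50 as we saw above for a 1-character literal. That's because PEP 393 defines compact and non-compact strings. You can check it with sys.getsizeof(gen_matrix_df.REF[0]).

Actual memory consumption should be ~1 GiB, as reported by gen_matrix_df.info(), but it's twice as much. We can assume it has something to do with memory (pre)allocation done by pandas or NumPy. The following experiment shows that it's not without reason (multiple runs show the same picture):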

Line #    Mem usage    Increment   Line Contents
================================================
     8     53.1 MiB     53.1 MiB   @profile
     9                             def main():
    10     53.1 MiB      0.0 MiB       with open("genome_matrix_header.txt") as header:
    11     53.1 MiB      0.0 MiB           header = header.read().rstrip('\n').split('\t')
    12                             
    13   2070.9 MiB   2017.8 MiB       gen_matrix_df = pd.read_csv('genome_matrix_final-chr1234-1mb.txt', sep='\t', names=header)
    14   2071.2 MiB      0.4 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
    15   2071.2 MiB      0.0 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[gen_matrix_df.keys()[0]])
    16   2040.7 MiB    -30.5 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    ...
    23   1827.1 MiB    -30.5 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    24   1094.7 MiB   -732.4 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    25   1765.9 MiB    671.3 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    26   1094.7 MiB   -671.3 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    27   1704.8 MiB    610.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    28   1094.7 MiB   -610.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    29   1643.9 MiB    549.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    30   1094.7 MiB   -549.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    31   1582.8 MiB    488.1 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    32   1094.7 MiB   -488.1 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])    
    33   1521.9 MiB    427.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])    
    34   1094.7 MiB   -427.2 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    35   1460.8 MiB    366.1 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    36   1094.7 MiB   -366.1 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    37   1094.7 MiB      0.0 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])
    ...
    47   1094.7 MiB      0.0 MiB       gen_matrix_df = gen_matrix_df.drop(columns=[random.choice(gen_matrix_df.keys())])

I want to finish this section with a quote from a fresh article about design issues and the future pandas2 by the original author of pandas.

pandas rule of thumb: have 5 to 10 times as much RAM as the size of your dataset

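Process tree

Finally, let's come to the pool and see whether it can make use of copy-on-write. We'll use smemstat (available from an Ubuntu repository) to estimate process-group memory sharing and glances to write down system-wide free memory. Both can write JSON.

We'll run the original script with Pool(2). We'll need 3 terminal windows.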

smemstat -l -m -p "python3.6 script.py" -o smemstat.json 1
glances -t 1 --export-json glances.json
mprof run -M script.py

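Then mprof plot produces:

[Figure: mprof plot for script.py]

The sum chart (mprof run --nopython --include-children ./script.py) looks like:

[Figure: mprof plot including child processes]

Note that the two charts above show RSS. The hypothesis is that, because of copy-on-write, it doesn't reflect actual memory usage. Now we have two JSON files, from smemstat and glances. I'll use the following script to convert the JSON files to CSV.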

#!/usr/bin/env python3

import csv
import sys
import json

def smemstat():
  with open('smemstat.json') as f:
    smem = json.load(f)

  rows = []
  fieldnames = set()    
  for s in smem['smemstat']['periodic-samples']:
    row = {}
    for ps in s['smem-per-process']:
      if 'script.py' in ps['command']:
        for k in ('uss', 'pss', 'rss'):
          row['{}-{}'.format(ps['pid'], k)] = ps[k] // 2 ** 20

    # smemstat produces empty samples, backfill from previous
    if rows:            
      for k, v in rows[-1].items():
        row.setdefault(k, v)

    rows.append(row)
    fieldnames.update(row.keys())

  with open('smemstat.csv', 'w') as out:
    dw = csv.DictWriter(out, fieldnames=sorted(fieldnames))
    dw.writeheader()
    dw.writerows(rows)

def glances():
  rows = []
  fieldnames = ['available', 'used', 'cached', 'mem_careful', 'percent',
    'free', 'mem_critical', 'inactive', 'shared', 'history_size',
    'mem_warning', 'total', 'active', 'buffers']
  with open('glances.csv', 'w') as out:
    dw = csv.DictWriter(out, fieldnames=fieldnames)
    dw.writeheader()
    with open('glances.json') as f:
      for l in f:
        d = json.loads(l)
        dw.writerow(d['mem'])

if __name__ == '__main__':
  globals()[sys.argv[1]]()

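First let's look at free memory.

[Figure: system-wide free memory over time]

The difference between the first value and the minimum is ~4.15 GiB. And here is how the PSS figures look:

[Figure: PSS per process]

And the sum:

[Figure: summed PSS]

Thus we can see that, because of copy-on-write, actual memory consumption is ~4.15 GiB. But we're still serialising data to send it to the worker processes via Pool.map. Can we leverage copy-on-write here as well?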

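Shared data

To use copy-on-write we need to have list(gen_matrix_df_list.values()) accessible globally, so that the worker processes can still read it after the fork.

Let's modify the code after del gen_matrix_df in main like this: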

...
global global_gen_matrix_df_values
global_gen_matrix_df_values = list(gen_matrix_df_list.values())
del gen_matrix_df_list

p = Pool(2)
result = p.map(matrix_to_vcf, range(len(global_gen_matrix_df_values)))
...

Remove the del gen_matrix_df_list that comes later.

And modify the first lines of matrix_to_vcf like this:

def matrix_to_vcf(i):
    matrix_df = global_gen_matrix_df_values[i]
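Put together as a self-contained sketch, assuming the fork start method (Linux). `shared_values` and `process_item` are hypothetical stand-ins for global_gen_matrix_df_values and matrix_to_vcf:

```python
import multiprocessing as mp

# Module-level name, so forked workers inherit it without any pickling.
shared_values = []


def process_item(i):
    # Hypothetical stand-in for matrix_to_vcf: look the item up by index.
    return sum(shared_values[i])


def run():
    global shared_values
    shared_values = [list(range(n)) for n in range(1, 6)]

    ctx = mp.get_context('fork')  # assumption: fork is available (Linux)
    # The pool must be created after the global has been populated.
    with ctx.Pool(2) as pool:
        # Only small integers travel through the task queue.
        return pool.map(process_item, range(len(shared_values)))
```

The order matters: a worker only sees what the parent had in memory at the moment of the fork, so anything assigned to the global afterwards is invisible to it.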

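Now let's re-run it. Free memory:

[Figure: system-wide free memory over time]

Process tree:

[Figure: PSS per process]

And its sum:

[Figure: summed PSS]

Thus we're at a maximum of ~2.9 GiB of actual memory usage (the peak the main process reaches while building the data frame), and copy-on-write has helped!

As a side note, there's so-called copy-on-read, the behaviour of Python's reference-cycle garbage collector described in Instagram Engineering (which led to gc.freeze in issue31558). But gc.disable() doesn't have an impact in this particular case.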
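For reference, a sketch of the gc.freeze pattern. Note that gc.freeze() was added in Python 3.7, so it isn't available on the 3.6.5 used in this answer, and given the observation above it may well make no difference for this workload. The `data` list is a made-up stand-in:

```python
import gc

gc.disable()  # avoid collections while the large structures are built
data = [[i] for i in range(10000)]  # hypothetical stand-in for the real data
gc.freeze()   # move everything tracked so far to the permanent generation
frozen = gc.get_freeze_count()

# ... fork the workers here (e.g. create the Pool) ...

gc.enable()    # in a real program this would typically happen in each child
gc.unfreeze()  # only done here to restore this interpreter's normal state
```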

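Update

An alternative to copy-on-write, copy-less data sharing can be delegating it to the kernel from the beginning by using numpy.memmap. An example implementation appears in the High Performance Data Processing in Python talk. The tricky part is then to make pandas use the mmaped NumPy array.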
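A minimal sketch of the memmap idea on its own, without pandas. It assumes NumPy is installed; the file name, shape and dtype are arbitrary choices for illustration:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'shared.dat')  # hypothetical file name

# Writer: create the file-backed array and fill it.
writer = np.memmap(path, dtype=np.int64, mode='w+', shape=(1000, 4))
writer[:] = np.arange(4000).reshape(1000, 4)
writer.flush()

# Reader (could be any other process): map the same file read-only.
reader = np.memmap(path, dtype=np.int64, mode='r', shape=(1000, 4))
total = int(reader[:, 0].sum())
```

Every process that maps the file reads the same page-cache pages, so the kernel does the sharing and nothing has to be pickled.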

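User

Answer 3 · edited 2024-06-07 00:12:09

I had the same issue. I needed to process a huge text corpus while keeping a knowledge base of a few DataFrames with millions of rows loaded in memory. I think this issue is common, so I'll keep my answer general.

A combination of settings solved the problem for me (1 & 3 & 5 alone might do it for you):

1. Use Pool.imap (or imap_unordered) instead of Pool.map. This iterates over the data lazily rather than loading all of it into memory before processing starts.
2. Set a value for the chunksize parameter. This will make imap faster too.
3. Set a value for the maxtasksperchild parameter.
4. Append output to disk rather than keeping it in memory, either immediately or every time it reaches a certain size.
5. Run the code in batches. You can use itertools.islice if you have an iterator. The idea is to split your list(gen_matrix_df_list.values()) into three or more lists, then pass only the first third to map or imap, then the second third in another run, and so on. Since you have a list, you can simply slice it in the same line of code.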
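The five tips combined in one sketch, assuming the fork start method (Linux). `work`, `batches`, the batch size and the other numbers are placeholders of mine and would need tuning for a real workload:

```python
import itertools
import multiprocessing as mp


def work(x):
    # Hypothetical stand-in for the real per-item processing.
    return x * x


def batches(iterable, size):
    # Tip 5: yield successive lists of at most `size` items.
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def run(items, out_path):
    ctx = mp.get_context('fork')  # assumption: fork is available (Linux)
    # Tip 3: each worker is replaced after it has completed 50 tasks.
    with ctx.Pool(2, maxtasksperchild=50) as pool, open(out_path, 'w') as out:
        for batch in batches(items, 100):
            # Tips 1 and 2: results arrive lazily, work is sent in chunks of 10.
            for result in pool.imap(work, batch, chunksize=10):
                out.write('{}\n'.format(result))  # tip 4: append to disk
```

Calling run(range(250), 'out.txt') would write 250 squares, one per line, without ever holding the full result list in memory.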
