Redirected output from a subprocess call getting lost?

Posted 2024-05-23 19:06:48


I have some Python code that goes roughly like this, using some libraries that you may or may not have:

import subprocess

# Open it for writing
vcf_file = open(local_filename, "w")

# Download the region to the file.
subprocess.check_call(["bcftools", "view",
    options.truth_url.format(sample_name), "-r",
    "{}:{}-{}".format(ref_name, ref_start, ref_end)], stdout=vcf_file)

# Close parent process's copy of the file object
vcf_file.close()

# Upload it
file_id = job.fileStore.writeGlobalFile(local_filename)

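Basically, I am launching a child process that is supposed to download some data for me and print it to standard output. I redirect that data to a file, and then, once the subprocess call returns, I close my handle to the file and copy the file somewhere else.

Sometimes the copy appears to be missing data, as though it happened before the data the child process wrote to standard output had made it to disk where I can see it.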

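Looking at the C/C++ standards (since bcftools is implemented in C/C++), it appears that when a program exits normally, all open streams (including standard output) are flushed and closed. See section [lib.support.start.term] of the C++ standard, which describes the behavior of exit(), which is called implicitly when main() returns: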

--Next, all open C streams (as mediated by the function signatures declared in <cstdio>) with unwritten buffered data are flushed, all open C streams are closed, and all files created by calling tmpfile() are removed.

--Finally, control is returned to the host environment. If status is zero or EXIT_SUCCESS, an implementation-defined form of the status successful termination is returned. If status is EXIT_FAILURE, an implementation-defined form of the status unsuccessful termination is returned. Otherwise the status returned is implementation-defined.

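So, before the child process exits, it closes (and therefore flushes) standard output.

However, the Linux manual page for close() notes that closing a file descriptor does not necessarily guarantee that the data written to it has actually reached the disk: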

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored, use fsync(2). (It will depend on the disk hardware at this point.)

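So it would seem that, when a process exits, its standard output stream gets flushed, but if that stream is actually backed by a file descriptor pointing to a file on disk, the write to disk is not guaranteed to have completed. I suspect that this may be what is happening here.

So, my actual questions are:

  1. Is my reading of the specs correct? Can a child process appear to its parent to have terminated before its redirected standard output is available on disk?

  2. Is it possible to somehow wait until all data written by the child process to files has actually been synced to disk by the OS?

  3. Should I be calling flush() or some Python version of fsync() on the parent process's copy of the file object? Can that force writes to the same file descriptor by child processes to be committed to disk?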


1 answer

Answer 1 (anonymous user) · Posted 2024-05-23 19:06:48

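Yes, it may take minutes before the data is physically written to disk. But you can read it back long before that.

Unless you are worried about a power failure or a kernel panic, it does not matter whether the data is on disk. What matters is whether the kernel considers the data written.

Once check_call() returns, it is safe to read the file. If you do not see all of the data, that may indicate a bug in bcftools, or that the upload step does not upload all of the data in the file. You could try to work around the former by disabling block-buffering mode for bcftools' stdout (provide a pseudo-tty, use a command-line utility that changes the buffering mode, etc.).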
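The claim that the file is readable as soon as check_call() returns can be checked with a small experiment. The sketch below is only an illustration: it uses a Python one-liner as a stand-in for the real child process (an assumption, since bcftools may not be installed), and the helper name run_and_read_back is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_and_read_back(payload):
    """Redirect a child's stdout to a file, then read the file back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w") as out_file:
            # Stand-in child (assumption): echoes its first argument to stdout.
            subprocess.check_call(
                [sys.executable, "-c",
                 "import sys; sys.stdout.write(sys.argv[1])", payload],
                stdout=out_file,
            )
        # check_call() has returned, so the child has exited and its
        # stdout stream has been flushed to the kernel.
        with open(path) as in_file:
            return in_file.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    text = run_and_read_back("line 1\nline 2\n")
    print(repr(text))
```

If a check like this passes but the real pipeline still loses data, that points toward the child program or the later copy step rather than toward kernel write-back.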

Q: Is my reading of the specs correct? Can a child process appear to its parent to have terminated before its redirected standard output is available on disk?

Yes. Yes.

Q: Is it possible to somehow wait until all data written by the child process to files has actually been synced to disk by the OS?

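No, fsync() is not enough in the general case. Most likely you do not need it anyway (reading the data back is a different issue from making sure it has been written to disk).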

Q: Should I be calling flush() or some Python version of fsync() on the parent process's copy of the file object? Can that force writes to the same file descriptor by child processes to be committed to disk?

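It is pointless. .flush() flushes buffers that are internal to the parent process (you could use open(filename, 'wb', 0) to avoid creating an unnecessary buffer in the parent).

Syncing, by contrast, works on a file descriptor (the child has its own file descriptor). I don't know whether the kernel uses different buffers for different file descriptors that refer to the same disk file. Again, it doesn't matter: if you observe data loss (without a crash), fsync() won't help here.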
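To illustrate what the parent-side buffer does, here is a minimal sketch, assuming the parent writes something of its own to the file before launching the child (the question's code does not do this). The CHILD command and the header_then_child helper are hypothetical names used only for the example.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in child command (assumption): writes b"child\n" to its stdout.
CHILD = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(b'child\\n')"]


def header_then_child(path, buffering=0):
    """Write a header from the parent, then let the child append output."""
    with open(path, "wb", buffering) as out_file:
        # With buffering=0 this write goes straight to the file descriptor.
        # With a buffered file object it may sit in the parent's buffer
        # until flush()/close(), and so can land after the child's output.
        out_file.write(b"header\n")
        subprocess.check_call(CHILD, stdout=out_file)
    with open(path, "rb") as in_file:
        return in_file.read()


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "out.bin")
    print(header_then_child(out_path, 0))
```

When the parent never writes to the file itself, as in the question, its buffer stays empty and flush() has nothing to do, which is why the answer calls it pointless.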
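If surviving a power failure or kernel crash really is the concern, one option is to call os.fsync() on the parent's descriptor after the child exits. The sketch below is an assumption about how that might look, not something the answer recommends; it has no effect on whether other processes can already read the data. The run_then_sync helper and the stand-in child command are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_then_sync(cmd, path):
    """Run cmd with stdout redirected to path, then ask the OS to sync it."""
    # buffering=0: the parent keeps no userspace buffer of its own.
    with open(path, "wb", 0) as out_file:
        subprocess.check_call(cmd, stdout=out_file)
        # Request that the file's data be pushed to the storage device.
        # This is about durability, not visibility to other processes.
        os.fsync(out_file.fileno())


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "synced.bin")
    # Stand-in child (assumption): writes a short byte string to stdout.
    child = [sys.executable, "-c",
             "import sys; sys.stdout.buffer.write(b'data\\n')"]
    run_then_sync(child, out_path)
    with open(out_path, "rb") as in_file:
        print(in_file.read())
```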

Q: Just to be clear, I see that you're asserting that the data should indeed be readable by other processes, because the relevant OS buffers are shared between processes. But what's your source for that assertion? Is there a place in a spec or the Linux documentation you can point to that guarantees that those buffers are shared?

Look for "After a write() to a regular file has successfully returned" in the POSIX specification of write():

Any successful read() from each byte position in the file that was modified by that write shall return the data specified by the write() for that position until such byte positions are again modified.
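The quoted guarantee can be observed with two independent file descriptors on the same file. This is a minimal sketch, assuming a POSIX-like system and a payload small enough to be written in a single call; the helper name write_then_read is hypothetical.

```python
import os
import tempfile


def write_then_read(path, data):
    """Write through one descriptor, read back through another."""
    write_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    read_fd = os.open(path, os.O_RDONLY)
    try:
        # No fsync() and no close() happen between the write and the read.
        written = os.write(write_fd, data)
        return os.read(read_fd, written)
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "shared.bin")
    print(write_then_read(demo_path, b"hello"))
```

The read sees the data even though nothing has been forced to disk, which matches the answer's point that visibility and durability are separate questions.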
