Split a file into multiple files based on a pattern (the cut can fall inside a line)

3 answers

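User

Answer 1 · edited 2024-06-08 05:57:24

Perl can parse a large file line by line instead of pulling the whole file into memory. Here is a short script (with an explanation):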

perl -n -E 'if (/(.*)(<\?xml.*)/s) {   # /s lets "." match the newline, so $2 keeps the line ending
   print $fh $1 if $1;
   open $fh, ">output." . ++$i;
   print $fh $2;
} else { print $fh $_ }'  in.txt

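perl -n: the -n flag loops over your file line by line, placing each line in $_.

-E: runs the following text as code (by default Perl expects a script filename).

(.*)(<\?xml.*): if the line contains <?xml, the regex captures split it into $1 (everything before the marker) and $2 (the marker and the rest of the line). Because (.*) is greedy, a line with several markers is split only at the last one.

print $fh $1 if $1: prints the start of the line to the old file.

open $fh, ">output." . ++$i;: creates a new file handle for writing.

print $fh $2: prints the rest of the line to the new file.

} else { print $fh $_ }: if the line does not match <?xml, it is simply printed to the current file handle.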

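Note: this script assumes that your input file starts with <?xml. Otherwise $fh is not yet open when the first lines are printed.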
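
To see how the two capture groups divide a line, the same pattern can be tried in Python, where "." follows the same rule of not matching a newline unless a flag is given. This is only an illustration with a made-up sample line, not part of the original script:

```python
import re

# Made-up sample line: the tail of one document, then the start of the next.
line = "tail of old doc<?xml 1>start of new doc\n"

# Default behaviour: "." stops at the newline, so group 2 has no line ending.
m = re.search(r"(.*)(<\?xml.*)", line)
before, after = m.group(1), m.group(2)
print(repr(before))  # 'tail of old doc'
print(repr(after))   # '<?xml 1>start of new doc'

# With DOTALL (the counterpart of Perl's /s modifier) group 2 keeps the newline.
m_dotall = re.search(r"(.*)(<\?xml.*)", line, re.DOTALL)
print(repr(m_dotall.group(2)))  # '<?xml 1>start of new doc\n'
```

Whether the line ending survives therefore depends on that modifier, which matters when the pieces are written back out to separate files.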

User

Answer 2 · edited 2024-06-08 05:57:24

To perform the split without reading everything into RAM:

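def files():
    # Lazily open numbered output files: 1.part, 2.part, ...
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.part' % n, 'w')


pat = '<?xml'
filename = 'in.txt'  # input path (placeholder; adjust as needed)
fs = files()
outfile = next(fs)

with open(filename) as infile:
    for line in infile:
        if pat not in line:
            outfile.write(line)
        else:
            # The line holds one or more markers: cut it at each one.
            items = line.split(pat)
            outfile.write(items[0])
            for item in items[1:]:
                outfile.close()  # finish the previous part
                outfile = next(fs)
                outfile.write(pat + item)

outfile.close()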

Warning: this does not work if your pattern spans multiple lines (i.e. contains "\n"). If it does, consider the mmap solution.
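
If mapping the file is not an option, reading fixed-size chunks and carrying over a short tail also copes with a marker that crosses a line or chunk boundary. The following is a minimal sketch under stated assumptions, not the answer's own method: split_stream, its parameters, and the N.part naming are hypothetical, and the marker is a literal byte string rather than a regex.

```python
import os


def split_stream(in_path, out_dir, pat=b"<?xml", chunk_size=1 << 20):
    """Split in_path before every occurrence of pat; return the part paths.

    Hypothetical helper: the first part holds whatever precedes the first
    marker (possibly nothing), and every later part starts with the marker.
    """
    paths = []

    def new_part():
        path = os.path.join(out_dir, "%d.part" % (len(paths) + 1))
        paths.append(path)
        return open(path, "wb")

    keep = len(pat) - 1  # longest tail that could be an unfinished marker
    out = new_part()
    buf = b""
    with open(in_path, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            buf += chunk
            # Cut at each complete marker currently in the buffer.
            pos = buf.find(pat)
            while pos != -1:
                out.write(buf[:pos])
                out.close()
                out = new_part()
                out.write(pat)
                buf = buf[pos + len(pat):]
                pos = buf.find(pat)
            if not chunk:  # end of file: flush whatever is left
                out.write(buf)
                break
            # Flush everything except a tail that may be a split marker.
            safe = max(0, len(buf) - keep)
            out.write(buf[:safe])
            buf = buf[safe:]
    out.close()
    return paths
```

Memory use is bounded by chunk_size plus the marker length, regardless of how long the lines are. A call such as split_stream("in.txt", "/output/dir") would produce 1.part, 2.part, and so on in that directory.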

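User

Answer 3 · edited 2024-06-08 05:57:24

For a file of this size, you may want to use the mmap module so that you don't have to handle chunking the file yourself. From its documentation: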

Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.

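Here is a simple example that shows how to find every occurrence of <?xml #> in the file (where # is a number). You could write the chunks to new files as you go, but I haven't written that part.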

import mmap
import re

# a regex to match the "xml" nodes; a bytes pattern, because an mmap holds bytes
r = re.compile(rb'<\?xml\s\d+>')

with open('so.txt', 'r+b') as f:
    # a length of 0 maps the whole file
    mp = mmap.mmap(f.fileno(), 0)
    for m in r.finditer(mp):
        # here you can start collecting the starting positions and
        # writing chunks to new files
        print(m.start())
    mp.close()