Python: How do I split a .txt file into two or more files with the same number of lines in each?

Posted 2024-06-10 09:25:46

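(I believe I have been searching Stack Exchange and the internet for hours, but I couldn't find the right answer.)

What I'm trying to do here is count the number of lines in a file, which I did with the following code: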

# Does not load the whole file into memory
def file_len(fname):
    count = 0  # stays 0 if the file is empty
    with open(fname) as f:
        for count, _ in enumerate(f, 1):
            pass
    print(count)

file_len('bigdata.txt')

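Then I take the number of lines and divide it by 2, 3, etc. (so that the 2, 3, etc. files have equal numbers of lines). bigdata.txt has 1000000 lines and 1000000 / 2 = 500000, so here I would have two files with 500000 lines each, one holding lines 1 to 500000 and the other lines 500001 to 1000000. I already have code that looks for a pattern in the original file (bigdata.txt), but I don't want to look for any pattern, I just want to split the file in half or so. The code is as follows: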

# Does not load the whole file into memory
with open('bigdata.txt', 'r') as r:
    with open('fhalf', 'w') as f:
        for line in r:
            # Splits the file at the first occurrence of the pattern.
            # Note that the matching line itself is not written to either
            # file, which is not good since I need all the data.
            if line == 'pattern\n':
                break
            f.write(line)
    with open('shalf.txt', 'w') as f:
        for line in r:
            f.write(line)

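So I'm looking for a simple solution. I know there is one, I just can't think of it at the moment. An example would be: file1.txt and file2.txt, each with the same number of lines, give or take one. Thank you all for your time.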


1 Answer
User
#1 · Posted 2024-06-10 09:25:46

.readlines() reads all the lines into a list; then work out how many lines each file should get, and start writing!

num_files = 2
with open('bigdata.txt') as in_file:
    lines = in_file.readlines()
    lines_per_file = len(lines) // num_files
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            for i in range(n * lines_per_file, (n+1) * lines_per_file):
                out_file.write(lines[i])

And a full test:

$ cat bigdata.txt 
line1
line2
line3
line4
line5
line6
$ python -q
>>> num_files = 2
>>> with open('bigdata.txt') as in_file:
...     lines = in_file.readlines()
...     lines_per_file = len(lines) // num_files
...     for n in range(num_files):
...         with open('file{}.txt'.format(n+1), 'w') as out_file:
...             for i in range(n * lines_per_file, (n+1) * lines_per_file):
...                 out_file.write(lines[i])
... 
>>> 
$ more file*
::::::::::::::
file1.txt
::::::::::::::
line1
line2
line3
::::::::::::::
file2.txt
::::::::::::::
line4
line5
line6
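
One thing to watch with the integer division above: when the line count is not a multiple of num_files, the leftover lines are never written (7 lines split in 2 gives 3 + 3, and the 7th line is lost). Since the question allows the files to differ by one line, here is a minimal sketch that spreads the remainder over the first files. The helper names split_lines and write_chunks are made up for this illustration and are not part of the original answer.

```python
def split_lines(lines, num_files):
    """Split a list of lines into num_files chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), num_files)
    chunks = []
    start = 0
    for n in range(num_files):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if n < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def write_chunks(chunks, name_pattern='file{}.txt'):
    """Write each chunk to its own file: file1.txt, file2.txt, ..."""
    for n, chunk in enumerate(chunks, 1):
        with open(name_pattern.format(n), 'w') as out_file:
            out_file.writelines(chunk)
```

To use it, read the input with .readlines() as before and call write_chunks(split_lines(lines, num_files)). This still loads the whole file into memory.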

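If you can't read bigdata.txt into memory, then the .readlines() solution won't cut it.

You'll have to write the lines as you read them, which is no big deal.

As for working out the length in the first place, this question discusses some methods; my favorite is Kyle's sum() method.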

num_files = 2
with open('bigdata.txt') as f:
    num_lines = sum(1 for _ in f)  # counts lines one at a time and closes the file afterwards
lines_per_file = num_lines // num_files
with open('bigdata.txt') as in_file:
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            for _ in range(lines_per_file):
                out_file.write(in_file.readline())
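
As with the in-memory version, the integer division here leaves any remainder lines unwritten. A possible variant, offered as a sketch rather than as the answer's own method, uses itertools.islice to pull a fixed number of lines from the open file for each output and gives the first few files one extra line. The function name split_file and its name_pattern parameter are assumptions made for this example.

```python
from itertools import islice


def split_file(src, num_files, name_pattern='file{}.txt'):
    """Split src into num_files files whose line counts differ by at most one.

    The source is read twice (once to count, once to copy) and is never
    loaded into memory as a whole.
    """
    with open(src) as in_file:
        num_lines = sum(1 for _ in in_file)
    base, extra = divmod(num_lines, num_files)

    names = []
    with open(src) as in_file:
        for n in range(num_files):
            # The first `extra` files each get one additional line.
            size = base + (1 if n < extra else 0)
            name = name_pattern.format(n + 1)
            with open(name, 'w') as out_file:
                # islice takes the next `size` lines from the shared file iterator.
                out_file.writelines(islice(in_file, size))
            names.append(name)
    return names
```

Calling split_file('bigdata.txt', 2) would write file1.txt and file2.txt in the current directory and return their names.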
