How do I count word frequencies in a large file concurrently?

chr1 10011 141 0 157 4 41 50
chr1 10012 146 1 158 4 42 51
chr1 10013 150 0 163 4 43 53
chr1 10014 164 3 167 4 44 54

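3 answers

User

#1 · edited 2024-05-14 07:38:30

A 30 GB text file is large enough to put your problem in big-data territory, so I suggest using big-data tools such as Hadoop or Spark. The "producer-consumer flow" you describe is essentially what the MapReduce algorithm was designed for, and word-frequency counting is a classic MapReduce problem. Search for it and you will find plenty of examples.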
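To make the map/reduce shape concrete without a cluster, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Spark code, and the names `map_chunk` and `reduce_counts` are made up for illustration; the two small lists stand in for pieces of a large file.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count whitespace-separated words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Two small chunks stand in for the pieces of a large file.
chunks = [
    ["chr1 10011 141", "chr1 10012 146"],
    ["chr1 10013 150"],
]
partial_counts = [map_chunk(chunk) for chunk in chunks]
total = reduce(reduce_counts, partial_counts, Counter())
print(total["chr1"])
```

In a real MapReduce framework the map calls would run on different machines and the reduce step would be grouped by key, but the split, count, merge structure is the same.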

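User

#2 · edited 2024-05-14 07:38:30

The idea is to split the large file into smaller files, start a number of worker processes that each do the counting and return a counter, and merge the counters at the end.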

from itertools import islice
from multiprocessing import Pool
from collections import Counter
import os

NUM_OF_LINES = 3  # lines per chunk file; a demo value, use far more for a real 30 GB file
INPUT_FILE = 'huge.txt'
POOL_SIZE = 10


def slice_huge_file():
    """Split INPUT_FILE into sub_huge_<n>.txt files of NUM_OF_LINES lines each."""
    cnt = 0
    with open(INPUT_FILE) as f:
        while True:
            next_n_lines = list(islice(f, NUM_OF_LINES))
            if not next_n_lines:
                break
            cnt += 1
            with open('sub_huge_{}.txt'.format(cnt), 'w') as out:
                out.writelines(next_n_lines)


def count_file_words(input_file):
    """Count whitespace-separated words in one chunk file."""
    counter = Counter()
    with open(input_file, 'r') as f:
        for line in f:
            counter.update(line.split())
    return counter


if __name__ == '__main__':
    slice_huge_file()
    sub_files = [os.path.join('.', f) for f in os.listdir('.') if f.startswith('sub_huge')]
    # Each worker process counts one chunk file and returns its Counter.
    with Pool(POOL_SIZE) as pool:
        results = pool.map(count_file_words, sub_files)
    # Merge the partial counters into the final result.
    final_counter = Counter()
    for counter in results:
        final_counter += counter
    print(final_counter)
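A possible variation, sketched here as an assumption rather than as part of the answer above: instead of writing chunk files to disk, read batches of lines lazily and hand them straight to the pool. The names `iter_batches`, `count_batch` and `count_words_parallel`, and the default sizes, are hypothetical.

```python
from collections import Counter
from itertools import islice
from multiprocessing import Pool


def iter_batches(path, batch_lines):
    """Yield lists of up to batch_lines lines, reading the file lazily."""
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_lines))
            if not batch:
                return
            yield batch


def count_batch(lines):
    """Count whitespace-separated words in one batch of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def count_words_parallel(path, pool_size=4, batch_lines=100_000):
    """Feed batches to a worker pool and merge the partial counters."""
    total = Counter()
    with Pool(pool_size) as pool:
        batches = iter_batches(path, batch_lines)
        for partial in pool.imap_unordered(count_batch, batches):
            total.update(partial)
    return total
```

Call `count_words_parallel('huge.txt')` from under an `if __name__ == '__main__':` guard, as in the script above. Two caveats: every batch is pickled and sent to a worker, which costs time, and the pool pulls batches from the generator without waiting for workers to finish, so memory use is not strictly bounded.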

User

#3 · edited 2024-05-14 07:38:30

I have never tested this code, but it should work.

The first thing to do is check the number of lines

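f = 'myfile.txt'

def file_len(path):
    """Return the number of lines in the file at path (0 for an empty file)."""
    count = 0
    with open(path) as fh:
        for count, _ in enumerate(fh, start=1):
            pass
    return count

num_lines = file_len(f)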

Split the data into n partitions

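(The code for this step did not survive in the source. The next step uses two names it would have defined, `parts` and `split_size`.)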
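One way to produce a `parts` list and a `split_size` value for the next step is sketched below. This is an assumption, not the answerer's own code: the helper name `make_partitions` is hypothetical, and it assumes `parts` holds 1-based starting line numbers and `split_size` is the number of lines per partition.

```python
def make_partitions(num_lines, n):
    """Return (parts, split_size) for splitting num_lines lines into about n partitions.

    parts is a list of 1-based starting line numbers; split_size is the
    number of lines in each partition (the last one may be shorter).
    """
    # Ceiling division so that n partitions are enough to cover every line.
    split_size = max(1, -(-num_lines // n))
    parts = list(range(1, num_lines + 1, split_size))
    return parts, split_size


# Example: 10 lines in 3 partitions.
parts, split_size = make_partitions(10, 3)
print(parts, split_size)
```

Under this scheme each partition covers lines `start` through `start + split_size - 1`, so a worker should read exactly `split_size` lines from its start line; otherwise neighbouring partitions overlap by a line.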

Now start the work:

from multiprocessing import Process
import linecache

jobs = []

# Start one process per partition; each gets its starting line and the partition size.
for start_line in parts:
    p = Process(target=function_here, args=('myfile.txt', start_line, split_size))
    jobs.append(p)
    p.start()

# Wait for every worker to finish.
for p in jobs:
    p.join()

An example of the function

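def function_here(your_file_name, line_number, split_size):
    # Process the lines of one partition; printing stands in for the real work.
    # linecache.getline uses 1-based line numbers and returns '' when out of range.
    for current_line in range(line_number, (line_number + split_size) + 1):
        print(linecache.getline(your_file_name, current_line))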

However, you still need to check the number of lines before doing anything else.
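To turn the per-partition function into an actual word counter, a worker could return a `Counter` for its line range instead of printing. The sketch below is a hypothetical replacement (the name `count_line_range` is not from the answer) and assumes 1-based line numbers; it reads the range with `itertools.islice` rather than `linecache`.

```python
from collections import Counter
from itertools import islice


def count_line_range(path, start_line, n_lines):
    """Count words in n_lines lines of path, beginning at 1-based start_line."""
    counts = Counter()
    with open(path) as f:
        # islice skips the first start_line - 1 lines, then yields n_lines lines.
        for line in islice(f, start_line - 1, start_line - 1 + n_lines):
            counts.update(line.split())
    return counts
```

A bare `Process` does not hand its return value back to the parent, so one option is `Pool.starmap(count_line_range, [(path, start, split_size) for start in parts])` followed by merging the returned counters. Each worker still has to read past the lines before its own range, so later partitions start more slowly than earlier ones.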
