Problem parsing text files with the Python multiprocessing package

Posted 2024-04-24 21:37:08

Male | A programmer who enjoys writing Python code.

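In one of my research projects, I am trying to count occurrences of certain words across 170,000 text files. I have a working for loop that does the job, but it is painful to see only 20% of the CPU being used: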

import re
import pandas as pd

WORDS = ['wda', 'wdb', 'wdc', 'wdd', 'wde', 'wdf']

def normalize_text(text):
    # placeholder: some process to normalize the text
    return text

# I created a filelist dataframe prior to executing this function
def countwords(filelist):
    global wc
    rows = []
    for location in filelist['location']:
        # the with block closes each file as soon as it has been read
        with open(location, encoding='latin-1') as file:
            text = normalize_text(file.read())
        count = {'file': location}
        for word in WORDS:
            count[word] = len(re.findall(word, text))
        rows.append(count)
    # DataFrame.append was removed in pandas 2.0, so build the frame once
    wc = pd.DataFrame(rows, columns=['file'] + WORDS)
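
Separately from parallelism, the loop above scans each text once per word. A minimal sketch of a single-pass count follows, assuming the target words are literal strings and that none of them is contained in another (otherwise the counts could differ from separate `re.findall` calls). The names `PATTERN` and `count_in_text` are hypothetical and not part of the original code.

```python
import re
from collections import Counter

WORDS = ['wda', 'wdb', 'wdc', 'wdd', 'wde', 'wdf']
# Escape each word so it is matched literally, then join into one alternation.
PATTERN = re.compile('|'.join(re.escape(word) for word in WORDS))


def count_in_text(text):
    """Return a dict mapping every word in WORDS to its number of matches."""
    # One pass over the text; Counter returns 0 for words that never matched.
    found = Counter(match.group(0) for match in PATTERN.finditer(text))
    return {word: found[word] for word in WORDS}
```

Whether this is actually faster depends on the texts and the word list, so it would need to be timed on a sample of the real files.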

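I am trying to modify the code to use the multiprocessing package, to see whether I can process several files at once.
I am new to the package; here is a modified version: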

import os
import re
from multiprocessing.pool import ThreadPool

import glob2

def countwords(filedest):
    words = ['wda', 'wdb', 'wdc', 'wdd', 'wde', 'wdf']
    count = {'file': filedest}
    # normalize_text is the function defined in the first listing
    with open(filedest, encoding='latin-1') as file:
        text = normalize_text(file.read())
    for word in words:
        count[word] = len(re.findall(word, text))
    return count

# joining the parts separately avoids the invalid "\*" escape sequence
mydir = os.path.join('C:\\', 'filedest', '*.txt')

if __name__ == '__main__':
    tasks = glob2.glob(mydir)
    pool = ThreadPool()
    results = pool.map_async(countwords, tasks)
    pool.close()
    pool.join()
    # results are handled after pool.join

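I noticed that the run time is abnormal: after a keyboard interrupt, the code keeps running for a while and then finishes the task (but if I do not interrupt it, it hangs indefinitely and blows up my memory). I have tried switching between a process pool and a thread pool, and wrapping the results with another function, but nothing seems to help. I am using the Anaconda distribution with the Spyder IDE. Thank you for your time and help.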
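
For reference, here is a minimal sketch of a process-based variant, not a confirmed fix for the hang described above. It assumes the work is dominated by the regex counting, which threads in CPython generally cannot run in parallel, and it uses `Pool.imap_unordered` with a `chunksize` so that results arrive in batches. The `count_all` helper, the `PATTERNS` table, and the `chunksize` value of 100 are hypothetical choices; `normalize_text` is a placeholder because the question does not show it.

```python
import glob
import os
import re
from multiprocessing import Pool

WORDS = ['wda', 'wdb', 'wdc', 'wdd', 'wde', 'wdf']
# One compiled pattern per word, built once per process at import time.
PATTERNS = {word: re.compile(word) for word in WORDS}


def normalize_text(text):
    # Placeholder for the normalization step, which the question does not show.
    return text


def countwords(filedest):
    """Count each target word in one file and return a flat dict."""
    with open(filedest, encoding='latin-1') as file:
        text = normalize_text(file.read())
    count = {'file': filedest}
    for word, pattern in PATTERNS.items():
        count[word] = len(pattern.findall(text))
    return count


def count_all(paths, processes=None, chunksize=100):
    """Run countwords over paths in worker processes; order is not preserved."""
    paths = list(paths)
    if not paths:
        return []
    with Pool(processes=processes) as pool:
        # imap_unordered yields each small dict as soon as a worker finishes it.
        return list(pool.imap_unordered(countwords, paths, chunksize=chunksize))


if __name__ == '__main__':
    # The directory is the one from the question; adjust as needed.
    tasks = glob.glob(os.path.join('C:\\', 'filedest', '*.txt'))
    results = count_all(tasks)
    print(len(results), 'files counted')
```

The multiprocessing documentation notes that `Pool` examples may not work in an interactive interpreter, because worker processes must be able to import the main module. Saving the sketch as a script and running it from a terminal rather than from the Spyder console may therefore be worth trying.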


