Python multiprocessing - text processing

Asked 2025-04-16 00:13

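I'm trying to build a multiprocessing version of some text classification code that I found here (along with some other cool stuff). The full code is at the bottom of this post.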

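I've tried a few approaches. I started with a lambda function, but it failed with an error saying it could not be pickled (!?), so I tried a simplified version of the original code: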

  negids = movie_reviews.fileids('neg')
  posids = movie_reviews.fileids('pos')

  p = Pool(2)
  negfeats =[]
  posfeats =[]

  for f in negids:
      words = movie_reviews.words(fileids=[f])
      negfeats = p.map(featx, words)  # not same form as below - using for debugging

  print len(negfeats)

Unfortunately, this doesn't work either. I get the following error:

File "/usr/lib/python2.6/multiprocessing/pool.py", line 148, in map
    return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.6/multiprocessing/pool.py", line 422, in get
    raise self._value
ZeroDivisionError: float division

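Any idea what I might be doing wrong? Should I be using pool.apply_async instead? (That didn't seem to solve the problem either, though maybe I'm looking in the wrong direction.)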

import collections
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def evaluate_classifier(featx):
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(featx(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(featx(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # Explicit integer division, so the cutoffs are always valid slice indices.
    negcutoff = len(negfeats)*3//4
    poscutoff = len(posfeats)*3//4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

    classifier = NaiveBayesClassifier.train(trainfeats)
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)

    for i, (feats, label) in enumerate(testfeats):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)

    print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)
    print 'pos precision:', nltk.metrics.precision(refsets['pos'], testsets['pos'])
    print 'pos recall:', nltk.metrics.recall(refsets['pos'], testsets['pos'])
    print 'neg precision:', nltk.metrics.precision(refsets['neg'], testsets['neg'])
    print 'neg recall:', nltk.metrics.recall(refsets['neg'], testsets['neg'])
    classifier.show_most_informative_features()

2 Answers

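Are you trying to parallelize classification, training, or both? Word counting and scoring can probably be parallelized fairly easily, but I'm not sure the same is true of feature extraction and training. For classification, I recommend execnet. I've used it for parallel and distributed part-of-speech tagging with good results.

The basic idea with execnet is that you train a single classifier and send it to every execnet node. You then split the files among the nodes and have each node classify the files it receives. Finally, the results are sent back to the master node. I haven't tried serializing a classifier (saving it to a file) yet, so I'm not sure that works, but if a part-of-speech tagger can be serialized, I'd expect a classifier can be too.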
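The train-once, distribute, collect pattern described above can be sketched without execnet. The following is only an illustration that swaps in the standard-library multiprocessing module for execnet, and `KeywordClassifier`, `init_worker`, and `classify_doc` are hypothetical names invented for the example (a real NLTK classifier would take the place of the toy class). It also shows where a classifier that cannot be pickled would fail: at the `pickle.dumps` call in the parent process.

```python
import pickle
from multiprocessing import Pool


class KeywordClassifier(object):
    """Hypothetical stand-in for a trained classifier (not an NLTK class)."""

    def __init__(self, negative_words):
        self.negative_words = frozenset(negative_words)

    def classify(self, feats):
        # Label a feature dict 'neg' if it contains any known negative word.
        return 'neg' if self.negative_words.intersection(feats) else 'pos'


_classifier = None  # set once per worker process by init_worker


def init_worker(pickled_classifier):
    # Runs once in each worker: rebuild the classifier from its pickled bytes.
    global _classifier
    _classifier = pickle.loads(pickled_classifier)


def classify_doc(words):
    # Build bag-of-words features and classify them with the worker's copy.
    feats = dict((word, True) for word in words)
    return _classifier.classify(feats)


if __name__ == "__main__":
    trained = KeywordClassifier(["awful", "boring"])  # "train" once in the parent
    payload = pickle.dumps(trained)  # raises here if the object cannot be pickled
    docs = [["a", "boring", "film"], ["great", "fun"], ["awful", "plot"]]
    pool = Pool(2, initializer=init_worker, initargs=(payload,))
    try:
        labels = pool.map(classify_doc, docs)  # results come back in input order
    finally:
        pool.close()
        pool.join()
    print(labels)
```

Sending the classifier once per worker through the initializer, rather than once per document, keeps the per-task payload down to the document itself.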

Answered 2025-04-16 by Python大师

About your simplified version: are you using a different featx function from the one at http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/?

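The exception most likely occurs inside featx, and multiprocessing merely re-raises it in the parent process. It doesn't include the original traceback, though, which makes the error less useful than it could be.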
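One way to recover the worker-side traceback is to wrap the worker function so that any exception is re-raised with the formatted traceback in its message. This is a minimal sketch, not the asker's code: the `featx` below is a hypothetical extractor that divides by the number of words (so it fails on empty input), and `featx_with_traceback` is a name made up for the example.

```python
import traceback
from multiprocessing import Pool


def featx(words):
    # Hypothetical feature extractor that fails on empty input, standing in
    # for whatever raised inside the real featx.
    return {"avg_len": sum(len(w) for w in words) / float(len(words))}


def featx_with_traceback(words):
    # Re-raise any worker error with the worker-side traceback in its message,
    # so the parent process can see where the failure happened.
    try:
        return featx(words)
    except Exception:
        raise RuntimeError("featx failed for %r:\n%s"
                           % (words, traceback.format_exc()))


if __name__ == "__main__":
    pool = Pool(2)
    try:
        pool.map(featx_with_traceback, [["good", "film"], []])
    except RuntimeError as err:
        print(err)  # the message carries the original ZeroDivisionError traceback
    finally:
        pool.close()
        pool.join()
```

The wrapper has to be a module-level function so that it can be pickled and sent to the worker processes.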

Try running it without pool.map() first (that is, use negfeats = [featx(x) for x in words]), or add some debugging output inside featx.
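A serial run like that can also report which input triggers the failure. The sketch below assumes a hypothetical `featx` that fails on empty input, and `first_failure` is a helper name invented for this example; it calls the function on each item in order and returns the first one that raises.

```python
def featx(words):
    # Hypothetical extractor that fails on empty input (an assumption made
    # purely for illustration).
    return {"avg_len": sum(len(w) for w in words) / float(len(words))}


def first_failure(func, items):
    # Call func on each item in order; return (index, item, exception) for the
    # first call that raises, or None if every call succeeds.
    for index, item in enumerate(items):
        try:
            func(item)
        except Exception as exc:
            return index, item, exc
    return None


if __name__ == "__main__":
    inputs = [["good", "film"], [], ["bad"]]
    print(first_failure(featx, inputs))
```

Once the failing input is known, it can be passed to featx directly in an interactive session to get a full traceback.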

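If that still doesn't help, post the entire script you're working with in your original question (simplified if possible) so that others can run your code and give more specific answers. Note that the following code does actually work (it's an adaptation of your simplified version):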

from nltk.corpus import movie_reviews
from multiprocessing import Pool

def featx(words):
    return dict([(word, True) for word in words])

if __name__ == "__main__":
    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    p = Pool(2)
    negfeats =[]
    posfeats =[]

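    for f in negids:
        words = movie_reviews.words(fileids=[f])
        # featx is called once per token here, and negfeats is replaced on
        # every pass, so only the last file's results are kept.
        negfeats = p.map(featx, words)

    print len(negfeats)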
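Note that `p.map(featx, words)` hands featx one token at a time, so a featx written for a whole token list receives a single string instead. To mirror the list comprehensions in evaluate_classifier, the map would run over documents rather than tokens. The sketch below is one possible shape for that, using toy token lists in place of the corpus; `labelled_feats` and `parallel_feats` are hypothetical names, and it assumes each file's tokens are turned into a plain list before being sent to the workers. It uses `functools.partial` over a module-level function because, unlike a lambda, that can be pickled.

```python
from functools import partial
from multiprocessing import Pool


def featx(words):
    # Bag-of-words extractor, as in the answer's code.
    return dict((word, True) for word in words)


def labelled_feats(words, label):
    # Build the (features, label) pair that evaluate_classifier works with.
    return (featx(words), label)


def parallel_feats(pool, docs, label):
    # partial() of a module-level function can be pickled; a lambda cannot.
    return pool.map(partial(labelled_feats, label=label), docs)


if __name__ == "__main__":
    # Toy token lists standing in for
    # [list(movie_reviews.words(fileids=[f])) for f in negids]
    neg_docs = [["dull", "plot"], ["bad", "acting", "bad", "script"]]
    pool = Pool(2)
    try:
        negfeats = parallel_feats(pool, neg_docs, "neg")
    finally:
        pool.close()
        pool.join()
    print(len(negfeats))  # one entry per document
```

Whether this is faster than the serial version depends on how expensive featx is compared with the cost of pickling each document's tokens and sending them to a worker.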

Answered 2025-04-16 by Python大师
