How can I improve the performance of POS n-grams?
I'm doing text classification with a support vector machine (SVM), using part-of-speech (POS) n-grams as features. However, extracting just the POS unigrams took me a full two hours. I have 5,000 texts, each around 300 words. Here is my code:
import nltk

def posNgrams(s, n):
    '''Calculate POS n-grams and return a dictionary of counts.'''
    text = nltk.word_tokenize(s)
    text_tags = nltk.pos_tag(text)
    taglist = [tag for word, tag in text_tags]  # keep only the POS tags
    output = {}
    for i in range(len(taglist) - n + 1):       # slide a window of width n
        g = ' '.join(taglist[i:i + n])
        output.setdefault(g, 0)
        output[g] += 1
    return output
I tried the same approach with character n-grams and it only took a few minutes. Can you give me some advice on how to make my POS n-gram extraction faster?
1 Answer
Tested on a server with the following configuration, from inxi -C:
CPU(s): 2 Hexa core Intel Xeon CPU E5-2430 v2s (-HT-MCP-SMP-) cache: 30720 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx)
Clock Speeds: 1: 2500.036 MHz
Normally the standard answer would be to use the batch tagging function pos_tag_sents, but it does not appear to be any faster.
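For reference, a minimal usage sketch (the input sentences here are made up for illustration): pos_tag_sents expects a list of already-tokenized sentences and tags them all in a single call:

from nltk import pos_tag_sents, word_tokenize

sents = ["This is one sentence.", "Here is another."]      # illustrative input
tagged = pos_tag_sents([word_tokenize(s) for s in sents])  # one call for all sentences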
Let's profile the individual steps leading up to POS tagging (using only a single core):
import time
from nltk.corpus import brown
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk import pos_tag_sents

# Load the Brown corpus.
start = time.time()
brown_corpus = brown.raw()
loading_time = time.time() - start
print("Loading brown corpus took", loading_time)

# Sentence-tokenize the corpus.
start = time.time()
brown_sents = sent_tokenize(brown_corpus)
sent_time = time.time() - start
print("Sentence tokenizing corpus took", sent_time)

# Word-tokenize the corpus.
start = time.time()
brown_words = [word_tokenize(i) for i in brown_sents]
word_time = time.time() - start
print("Word tokenizing corpus took", word_time)

# Loading, sent_tokenize and word_tokenize all together.
start = time.time()
brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
tokenize_time = time.time() - start
print("Loading and tokenizing corpus took", tokenize_time)

# POS-tagging one sentence at a time.
start = time.time()
brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
tagging_time = time.time() - start
print("Tagging sentence by sentence took", tagging_time)

# Tagging all sentences in one batch with pos_tag_sents.
start = time.time()
brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
tagging_time = time.time() - start
print("Tagging sentences by batch took", tagging_time)
[Output]:
Loading brown corpus took 0.154870033264
Sentence tokenizing corpus took 3.77206301689
Word tokenizing corpus took 13.982845068
Loading and tokenizing corpus took 17.8847839832
Tagging sentence by sentence took 1114.65085101
Tagging sentences by batch took 1104.63432097
Note: before NLTK 3.0, pos_tag_sents was called batch_pos_tag.
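If your code has to run on both sides of that rename, a small compatibility shim (my own suggestion, not part of the original answer) keeps a single name in use:

try:
    from nltk import pos_tag_sents                        # NLTK >= 3.0
except ImportError:
    from nltk import batch_pos_tag as pos_tag_sents       # older NLTK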
Overall, I think you need to consider either preprocessing your data with a different POS tagger, or parallelizing the tagging yourself (see the sketch below).
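A minimal sketch of the parallel route, assuming NLTK's default tagger; the document list and process count are illustrative placeholders. A process pool is used here in place of the threading suggestion, because tagging is CPU-bound and CPython threads would serialize on the GIL:

from multiprocessing import Pool

from nltk import pos_tag, sent_tokenize, word_tokenize

def tag_document(text):
    '''Tokenize and POS-tag one document; runs inside a worker process.'''
    return [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]

if __name__ == '__main__':
    documents = ["..."]                 # hypothetical placeholder for your 5000 texts
    with Pool(processes=4) as pool:     # process count is illustrative
        tagged_docs = pool.map(tag_document, documents)

Since the work is CPU-bound, the speedup from this approach should scale roughly with the number of cores available.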