Published 2024-05-16 15:35:44
A user asks:
I know the Treebank corpus is already tagged, but unlike with the Brown corpus, I can't figure out how to get a dictionary of its tags. For example:
>>> import nltk
>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
Doesn't this work for the Treebank corpus?
Quick solution:
>>> from nltk.corpus import treebank
>>> from nltk import ConditionalFreqDist as cfd
>>> from itertools import chain
>>> treebank_tagged_words = list(chain(*list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))))
>>> wordcounts = cfd(treebank_tagged_words)
>>> treebank_tagged_words[0]
(u'Pierre', u'NNP')
>>> wordcounts[u'Pierre']
FreqDist({u'NNP': 1})
>>> treebank_tagged_words[100]
(u'asbestos', u'NN')
>>> wordcounts[u'asbestos']
FreqDist({u'NN': 11})
For details, see https://en.wikipedia.org/wiki/User:Alvations/NLTK_cheatsheet/CorporaReaders#Penn_Tree_Bank
See also: Is there a way of avoiding so many list(chain(*list_of_list))?
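On that "so many list(chain(*...))" point: `itertools.chain.from_iterable` flattens one nesting level lazily, so applying it twice replaces the doubly nested `list(chain(*list(chain(*...))))` pattern above. A minimal sketch on toy data shaped like the Treebank output (a list per file, of lists per sentence, of (word, tag) pairs):

```python
from itertools import chain

# Toy stand-in for the per-file, per-sentence output of
# treebank.parsed_sents(...): file -> sentences -> (word, tag) pairs.
per_file = [
    [[("Pierre", "NNP"), ("Vinken", "NNP")], [("the", "DT")]],
    [[("asbestos", "NN")]],
]

# chain.from_iterable flattens one level; two applications flatten both,
# with no intermediate list() materialization until the final call.
flat = list(chain.from_iterable(chain.from_iterable(per_file)))
print(flat)
# [('Pierre', 'NNP'), ('Vinken', 'NNP'), ('the', 'DT'), ('asbestos', 'NN')]
```

The result is the same flat list of tagged words that the nested `chain(*...)` version produces, just easier to read.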
Note that NLTK's Penn Treebank sample contains only 3000-odd sentences, while the Brown corpus has over 57,000.
To split the sentences into training and test sets:
If you want to use the Brown corpus (which does not contain parsed sentences), you can use tagged_sents():
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents()
>>> len(brown_tagged_sents)
57340
>>> brown_tagged_sents[0]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> total_len = len(brown_tagged_sents)
>>> train_len = int(90 * total_len/100)
>>> train_set = brown_tagged_sents[:train_len]
>>> train_brown_tagged_words = cfd(chain(*train_set))
>>> train_brown_tagged_words['asbestos']
FreqDist({u'NN': 1})
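The logic of that split-then-count step can be sketched without NLTK at all: a `ConditionalFreqDist` over tagged words is essentially a mapping from word to a tag counter. A minimal stdlib sketch on toy tagged sentences (the sentence data here is invented for illustration, not from the Brown corpus):

```python
from collections import Counter, defaultdict

# Toy tagged sentences standing in for brown.tagged_sents().
tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD")],
    [("The", "AT"), ("election", "NN")],
    [("no", "AT"), ("evidence", "NN")],
]

# 90/10 split at the sentence level, as in the answer above.
train_len = int(90 * len(tagged_sents) / 100)
train_set = tagged_sents[:train_len]
test_set = tagged_sents[train_len:]

# Minimal stand-in for nltk.ConditionalFreqDist: word -> Counter of tags.
word_tag_counts = defaultdict(Counter)
for sent in train_set:
    for word, tag in sent:
        word_tag_counts[word][tag] += 1

print(word_tag_counts["The"])  # Counter({'AT': 2})
```

Splitting at the sentence level (rather than slicing the flat word list) keeps each sentence wholly inside either the training or the test set.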
As @alexis said, that works fine unless you need to split the corpus at the sentence level. The tagged_words() function also exists in NLTK's Penn Treebank API:
>>> from nltk.corpus import treebank
>>> from nltk.corpus import brown
>>> treebank.tagged_words()
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), ...]
>>> brown.tagged_words()
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
>>> type(treebank.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
>>> type(brown.tagged_words())
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
>>> from nltk import ConditionalFreqDist as cfd
>>> cfd(brown.tagged_words())
<ConditionalFreqDist with 56057 conditions>
>>> cfd(treebank.tagged_words())
<ConditionalFreqDist with 12408 conditions>
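Once you have such a word-to-tag-frequency table, a common use is looking up each word's most frequent tag (e.g. for a unigram baseline tagger). A hedged stdlib sketch, with toy counts standing in for the real ConditionalFreqDist (the `most_likely_tag` helper and its default fallback are illustrative, not part of NLTK's API):

```python
from collections import Counter

# Toy word -> tag-frequency table standing in for the ConditionalFreqDist
# built from treebank.tagged_words(); the "back" counts are invented.
wordcounts = {
    "Pierre": Counter({"NNP": 1}),
    "asbestos": Counter({"NN": 11}),
    "back": Counter({"RB": 5, "NN": 3, "VB": 1}),
}

def most_likely_tag(word, default="NN"):
    """Return the most frequent tag seen for word, or a default for unknowns."""
    freqs = wordcounts.get(word)
    return freqs.most_common(1)[0][0] if freqs else default

print(most_likely_tag("back"))    # RB
print(most_likely_tag("unseen"))  # NN (fallback for out-of-vocabulary words)
```

NLTK's own `FreqDist` objects support the same `most_common()` lookup, since `FreqDist` subclasses `Counter`.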