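How do I extract words from the sample corpora that ship with NLTK?
NLTK provides samples of several corpora, listed here:
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
I want plain text without any markup, and I don't know how to extract it. The parts I want are:
1) nps_chat: the unpacked files have names like 10-19-20s_706posts.xml. They are XML and look like this: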
<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
<t pos="VBD" word="left"/>
<t pos="IN" word="with"/>
<t pos="DT" word="this"/>
<t pos="JJ" word="gay"/>
<t pos="NN" word="name"/>
</terminals>
</Post>
...
...
I only want the actual post text:
now im left with this gay name
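How can I do this, in NLTK or with another tool, so that the posts are saved to local disk with the markup removed?
2) Switchboard transcript. This file (named discourse after unpacking) uses the format below. I want to strip the leading markers: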
o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
sd^e B.2 utt2: I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
b A.3 utt1: Uh-huh. /
sd A.3 utt2: I never have either. /
% B.4 utt1: You haven't, {F huh. } /
...
...
I only want:
Okay, /
have you ever served as a juror? /
Never. /
I've never been served on the jury, never been called up in a jury, although some of my friends have been jurors. /
Uh-huh. /
I never have either. /
You haven't, {F huh. } /
...
...
Thank you very much!
2 Answers
1
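You can use the .words() method of the corpus reader that nltk provides.
For example, after from nltk.corpus import nps_chat you can write: content = nps_chat.words()
This gives you a list-like sequence of every word in the corpus,
which might look like this: ['now', 'im', 'left', 'with', 'this', 'gay', 'name', ...]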
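Note that .words() returns individual tokens, while the question asks for whole posts. As a minimal sketch of that case, the snippet below parses the XML directly with the standard library and reads the text that sits in each Post element before its terminals child. The sample string is abridged from the question, and the helper name post_texts is made up for illustration; it assumes the real files use the same unqualified Post tag. If the corpus is installed through NLTK, nps_chat.xml_posts() should give the same Post elements, with the post string in each element's .text.

```python
import xml.etree.ElementTree as ET

# Abridged sample in the shape shown in the question (assumption: real
# files use the same <Post> tag, possibly nested under other elements).
sample = """<Posts>
<Post class="Statement" user="10-19-20sUser7">now im left with this gay name<terminals>
<t pos="RB" word="now"/>
<t pos="PRP" word="im"/>
</terminals>
</Post>
</Posts>"""


def post_texts(xml_string):
    """Return the raw text of every <Post>, ignoring the <terminals> markup."""
    root = ET.fromstring(xml_string)
    # Element.text is the text before the first child element,
    # which here is exactly the post string.
    return [(post.text or "").strip() for post in root.iter("Post")]


texts = post_texts(sample)
print("\n".join(texts))
```

To save the result, the same list can be written with an ordinary open(path, "w") call, one post per line.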
2
First, you need to create a corpus reader for your corpus. nltk.corpus has a number of ready-made corpus readers you can use, for example:
AlpinoCorpusReader
BNCCorpusReader
BracketParseCorpusReader
CMUDictCorpusReader
CategorizedCorpusReader
CategorizedPlaintextCorpusReader
CategorizedTaggedCorpusReader
ChunkedCorpusReader
ConllChunkCorpusReader
ConllCorpusReader
CorpusReader
DependencyCorpusReader
EuroparlCorpusReader
IEERCorpusReader
IPIPANCorpusReader
IndianCorpusReader
MacMorphoCorpusReader
NPSChatCorpusReader
NombankCorpusReader
PPAttachmentCorpusReader
Pl196xCorpusReader
PlaintextCorpusReader
PortugueseCategorizedPlaintextCorpusReader
PropbankCorpusReader
RTECorpusReader
SensevalCorpusReader
SinicaTreebankCorpusReader
StringCategoryCorpusReader
SwadeshCorpusReader
SwitchboardCorpusReader
SyntaxCorpusReader
TaggedCorpusReader
TimitCorpusReader
ToolboxCorpusReader
VerbnetCorpusReader
WordListCorpusReader
WordNetCorpusReader
WordNetICCorpusReader
XMLCorpusReader
YCOECorpusReader
Once you have created a corpus reader like this:
c = nltk.corpus.whateverCorpusReaderYouChoose(directoryWithCorpus, regexForFileTypes)
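you can extract the words from the corpus with the following code (this assumes the reader you chose provides paras(); some readers, such as NPSChatCorpusReader, expose posts() and words() instead):
words = []
for para in c.paras():
    for sentence in para:
        words.extend(sentence)  # each sentence is a list of words
This gives you a single list of all the words in all the paragraphs of your corpus.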
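For the Switchboard transcript in the question, a plain regular expression may be enough, with no corpus reader involved. This is only a sketch: it assumes every utterance line has the shape shown in the question (a dialogue-act tag, a speaker and turn such as A.1, then uttN:), and the names PREFIX and strip_prefix are made up for illustration. Lines that do not match the pattern are returned unchanged.

```python
import re

# Assumed line shape: <tag> <speaker>.<turn> utt<n>: <text>
PREFIX = re.compile(r"^\S+\s+[AB]\.\d+\s+utt\d+:\s*")


def strip_prefix(line):
    """Drop the leading tag, speaker and utterance markers from one line."""
    return PREFIX.sub("", line.rstrip("\n"), count=1)


# Lines copied from the question's sample.
sample = """o A.1 utt1: Okay, /
qy A.1 utt2: have you ever served as a juror? /
ng B.2 utt1: Never. /
% B.4 utt1: You haven't, {F huh. } /"""

utterances = [strip_prefix(line) for line in sample.splitlines()]
print("\n".join(utterances))
```

Reading the lines from the discourse file and writing utterances back out with open(path, "w") would give the plain-text copy on disk.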
Hope this helps.