Reading and writing POS-tagged sentences from text files using NLTK and Python

Asked 2025-04-16 15:14

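Does anyone know of an existing module or an easy way to read and write part-of-speech-tagged sentences to and from text files? I'm using Python and the Natural Language Toolkit (NLTK). For example, this code: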

import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

print(tagged)

returns a nested list like this:

[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.')], [('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('-', ':'), ('never', 'RB'), ('mind', 'VBP'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('-', ':'), ('having', 'VBG'), ('little', 'RB'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'RB'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'NN'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.')]]

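I know I could easily dump this into a pickle, but I really want to export it as one segment of a larger text file. I'd like to export the list to a text file, come back to it later, parse it, and recover the original list structure. Is there a built-in function in NLTK for this? I've looked but can't find one...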

Example output:

<headline>Article headline</headline>
<body>Call me Ishmael...</body>
<pos_tags>[[('Call', 'NNP'), ('me', 'PRP'), ('Ishmael', 'NNP')...</pos_tags>


2 Answers

NLTK has a standard file format for tagged text. It looks like this:

Call/NNP me/PRP Ishmael/NNP ./.

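You should use this format, since it lets you read your files with NLTK's TaggedCorpusReader and other similar classes, and gives you the full range of corpus reader functions. Confusingly, NLTK has no high-level function for writing a tagged corpus in this format, but that's probably because it's pretty trivial: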

for sent in tagged:
    print(" ".join(word + "/" + tag for word, tag in sent))

(NLTK does provide nltk.tag.tuple2str(), but it only handles one word; it's simpler to just type word+"/"+tag.)
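
For anyone who wants the round trip without a corpus reader, here is a minimal sketch that writes the word/tag format to a file and parses it back. The helper names write_tagged and read_tagged and the file name file1.txt are made up for illustration, not part of NLTK. It assumes no token contains whitespace, and it splits each token on its last slash.

```python
import os
import tempfile


def write_tagged(path, tagged_sents):
    """Write one sentence per line as space-separated word/tag tokens."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in tagged_sents:
            f.write(" ".join(word + "/" + tag for word, tag in sent) + "\n")


def read_tagged(path):
    """Parse the file back into a list of lists of (word, tag) tuples."""
    tagged_sents = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            # Split on the LAST slash so a word such as "1/2" keeps its slash.
            tagged_sents.append([tuple(tok.rsplit("/", 1)) for tok in tokens])
    return tagged_sents


# Hypothetical sample data and a throwaway location for the demo file.
tagged = [[("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]]
path = os.path.join(tempfile.mkdtemp(), "file1.txt")

write_tagged(path, tagged)
restored = read_tagged(path)
print(restored == tagged)
```

A file written this way should also be readable by the corpus reader, since it is the same one-sentence-per-line layout.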

If you save your tagged text in one or more files fileN.txt in this format, you can read it back with nltk.corpus.reader.TaggedCorpusReader like this:

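import nltk

# Raw string, so the backslash in the file-name regex is passed through unchanged
mycorpus = nltk.corpus.reader.TaggedCorpusReader("path/to/corpus", r"file.*\.txt")
print(mycorpus.fileids())
print(mycorpus.sents()[0])
for sent in mycorpus.tagged_sents():
    ...  # <etc>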

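Note that the sents() method gives you the untagged text, albeit a bit oddly spaced. There's no need to include both tagged and untagged versions in the file, as in your example.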

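TaggedCorpusReader doesn't support file headers (for the title etc.), but if you really need that, you can derive your own class that reads the file metadata and then handles the rest like TaggedCorpusReader.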
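
As a rough illustration of the "read the metadata, then handle the rest" idea, the sketch below uses a plain function rather than a TaggedCorpusReader subclass. The function name parse_article and the sample ARTICLE string are hypothetical; the tag names come from the question's example output. It assumes the tagged part is stored in word/tag format inside <pos_tags>, and it uses regular expressions instead of a real XML parser, so it would not cope with tokens containing "<" or with nested tags.

```python
import re

# Hypothetical sample record, modeled on the question's example output.
ARTICLE = """<headline>Article headline</headline>
<pos_tags>
Call/NNP me/PRP Ishmael/NNP ./.
Some/DT years/NNS ago/RB
</pos_tags>
"""


def parse_article(text):
    """Return (metadata dict, tagged sentences) for one article string."""
    headline = re.search(r"<headline>(.*?)</headline>", text, re.DOTALL)
    body = re.search(r"<pos_tags>(.*?)</pos_tags>", text, re.DOTALL)
    metadata = {"headline": headline.group(1).strip() if headline else None}

    tagged_sents = []
    for line in (body.group(1) if body else "").splitlines():
        tokens = line.split()
        if tokens:  # ignore the blank lines around the block
            tagged_sents.append([tuple(tok.rsplit("/", 1)) for tok in tokens])
    return metadata, tagged_sents


metadata, tagged_sents = parse_article(ARTICLE)
print(metadata["headline"])
print(tagged_sents[0])
```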

Answered 2025-04-16 by Python大师


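It seems like you could use pickle.dumps and insert its output into your text file, perhaps with a tag wrapper to make automated loading easier. That should cover what you need.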

Can you be more specific about what you would like the text output to look like? Are you aiming for something more human-readable?
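
To make the pickle.dumps suggestion above concrete, here is a small sketch. In Python 3 pickle.dumps returns bytes, so this version base64-encodes the result before placing it in a text file; the base64 step and the <pos_tags> wrapper are my assumptions about how one might embed it, not something NLTK provides. The result is not human-readable.

```python
import base64
import pickle

# Hypothetical sample data in the same shape as the question's list.
tagged = [[("Call", "NNP"), ("me", "PRP"), ("Ishmael", "NNP"), (".", ".")]]

# pickle.dumps returns bytes; base64 turns them into plain ASCII text.
payload = base64.b64encode(pickle.dumps(tagged)).decode("ascii")
record = "<pos_tags>" + payload + "</pos_tags>"

# Read back: strip the wrapper tag, decode, and unpickle.
inner = record[len("<pos_tags>"):-len("</pos_tags>")]
restored = pickle.loads(base64.b64decode(inner))
print(restored == tagged)
```

Only unpickle data you trust, since loading a pickle can execute arbitrary code.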

EDIT: adding some code

from xml.dom.minidom import Document, parseString
import nltk

sentences = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."

tagged = nltk.sent_tokenize(sentences.strip())
tagged = [nltk.word_tokenize(sent) for sent in tagged]
tagged = [nltk.pos_tag(sent) for sent in tagged]

# Write to xml string
doc = Document()

base = doc.createElement("Document")
doc.appendChild(base)

headline = doc.createElement("headline")
htext = doc.createTextNode("Article Headline")
headline.appendChild(htext)
base.appendChild(headline)

body = doc.createElement("body")
btext = doc.createTextNode(sentences)
body.appendChild(btext)
base.appendChild(body)

pos_tags = doc.createElement("pos_tags")
tagtext = doc.createTextNode(repr(tagged))
pos_tags.appendChild(tagtext)
base.appendChild(pos_tags)

xmlstring = doc.toxml()

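# Read back the tagged sentences
import ast

doc2 = parseString(xmlstring)
el = doc2.getElementsByTagName("pos_tags")[0]
text = el.firstChild.nodeValue
# literal_eval only accepts Python literals, so it is safer than eval here
tagged2 = ast.literal_eval(text)

print("Equal?", tagged == tagged2)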

Answered 2025-04-16 by Python大师

