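POS-tagging a .txt file from the Inaugural Address Corpus

User

Answer 1 · edited 2024-04-26 03:13:35

word_tokenize expects a string, but file.readlines() gives you a list. Converting the list into a single string solves the problem.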

import nltk
from nltk import word_tokenize

# Read the whole file into one string (readlines() would return a list of lines)
with open('test.txt', 'r') as f:
    text = f.read()

tokens = word_tokenize(text)   # word_tokenize expects a single string
tagged = nltk.pos_tag(tokens)  # pos_tag expects a list of tokens, not str(tokens)
print(tagged)
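
The difference between readlines() and read() can be checked with the standard library alone. This is only a small sketch; the temporary file and its contents are made up for illustration and are not part of the original question.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents, for illustration only)
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("user is here\npass is there\n")

with open(path) as f:
    lines = f.readlines()   # a list of str, one item per line

with open(path) as f:
    whole = f.read()        # a single str

joined = "".join(lines)     # joining the list gives the same single str

print(type(lines).__name__, type(whole).__name__)
print(joined == whole)

os.remove(path)             # clean up the throwaway file
```

Either `whole` or `joined` is the kind of value word_tokenize accepts; `lines` is not.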

User

Answer 2 · edited 2024-04-26 03:13:35

I suggest you do the following:

import nltk
# nltk.download('all')  # only needed the first time you use nltk
from nltk import word_tokenize

with open('1865-Lincoln.txt') as f:  # a with block closes the file automatically
    lines = f.readlines()  # read all lines from the file into a list
    for line in lines:  # process one line (one string) at a time
        token_text = word_tokenize(line)  # tokenize the line
        print(token_text)  # for debugging purposes
        pos_tagged_token = nltk.pos_tag(token_text)  # tag the tokens of this line
        print(pos_tagged_token)

For a text file with the following contents:

user is here
pass is there

the result is:

['user', 'is', 'here']
[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]
['pass', 'is', 'there']
[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]

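It works for me; I'm using Python 3.6, in case that matters. Hope this helps!

Edit 1: So your problem is that you are passing a list of strings (the output of readlines()) to word_tokenize(), which expects a single string. pos_tag(), in turn, works on the resulting sequence of words; the doc says

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word

Therefore you need to pass the text line by line, i.e. one string at a time. That is why you get the TypeError: expected string or bytes-like object error.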
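
The same error message can be reproduced without NLTK, which may help to see where it comes from. The sketch below uses Python's re module with the pattern r"\w+" as a stand-in; it is not NLTK's actual tokenizer, only an assumption that the tokenizer hits the same kind of regex call internally.

```python
import re

lines = ["user is here\n", "pass is there\n"]  # what readlines() returns

try:
    re.findall(r"\w+", lines)   # a list where a str is expected
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Passing one string at a time works
tokens_per_line = [re.findall(r"\w+", line) for line in lines]
print(tokens_per_line)
```

The first call fails because regex functions only accept str or bytes; the list comprehension succeeds because each element of `lines` is a str.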

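User

Answer 3 · edited 2024-04-26 03:13:35

Most likely, 1865-Lincoln.txt refers to President Lincoln's inaugural address. It is available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip

The original source of the documents is the Inaugural Address Corpus.

If we check how NLTK reads the files using LazyCorpusLoader, we see that they are Latin-1 encoded.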

inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')

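If your default encoding is set to utf8, that is most likely where the TypeError: expected string or bytes-like object comes from.

You should open the file with an explicit encoding so that the text is decoded properly, i.e.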

[code block missing in the source]
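
A minimal sketch of what opening the file with an explicit encoding could look like; this is not the original snippet. The helper names read_text and tag_text and the demo file are made up for illustration, and encoding="latin1" is taken from the corpus loader definition above. tag_text assumes NLTK and its tokenizer and tagger data are installed.

```python
import tempfile
from pathlib import Path


def read_text(path, encoding="latin1"):
    """Read a whole file as one str, decoding it with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


def tag_text(text):
    """Tokenize and POS-tag a string (needs NLTK plus its data packages)."""
    # Imported here so that read_text can be tried without NLTK installed
    from nltk import pos_tag, word_tokenize
    return pos_tag(word_tokenize(text))


# Demo: write Latin-1 bytes to a temporary file (hypothetical contents),
# then read them back with the matching encoding
demo_path = Path(tempfile.mkdtemp()) / "demo-latin1.txt"
demo_path.write_bytes("caf\u00e9 is here\n".encode("latin1"))

decoded = read_text(demo_path)
print(repr(decoded))

# With NLTK installed, the decoded string can then be tagged:
# print(tag_text(decoded))
```

For the real file, something like tag_text(read_text('1865-Lincoln.txt')) would be the equivalent call, under the same assumptions.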

But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.

>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
