Tokenizing a .txt file from the Inaugural Address Corpus

Posted 2024-04-26 03:13:35


I'm having a hard time figuring this out. I'm new to coding. I'm trying to read a .txt file, tokenize it, and POS-tag the words in it.

Here is what I have so far:

import nltk
from nltk import word_tokenize
import re

file = open('1865-Lincoln.txt', 'r').readlines()
text = word_tokenize(file)
string = str(text)
nltk.pos_tag(string)

My problem is that it keeps giving me the error TypeError: expected string or bytes-like object.


3 Answers

word_tokenize expects a string, but file.readlines() gives you a list. Converting the list to a single string solves the problem.

import nltk
from nltk import word_tokenize

# readlines() returns a list of lines; join them into one string for word_tokenize
with open('test.txt', 'r') as f:
    text = ''.join(f.readlines())
tokens = word_tokenize(text)
# pass the token list directly to pos_tag to tag by words;
# wrapping it in str() first would make pos_tag iterate over single characters
nltk.pos_tag(tokens)
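
To see where the error message itself comes from, here is a minimal sketch that uses only the standard library. It assumes, purely as a stand-in, that the tokenizer applies regular expressions to its input: `simple_tokenize` below is a hypothetical helper, not NLTK's tokenizer, and `io.StringIO` takes the place of a real file.

```python
import io
import re

# Stand-in for an open text file (hypothetical content).
fake_file = io.StringIO("user is here\npass is there\n")

lines = fake_file.readlines()   # a list of strings, one per line
fake_file.seek(0)
whole_text = fake_file.read()   # a single string


def simple_tokenize(text):
    """Hypothetical regex tokenizer standing in for word_tokenize."""
    return re.findall(r"\w+", text)


try:
    simple_tokenize(lines)      # passing the list, as in the question
    error_message = None
except TypeError as exc:
    error_message = str(exc)

tokens = simple_tokenize(whole_text)  # passing one string works

print(type(lines).__name__, type(whole_text).__name__)
print(error_message)
print(tokens)
```

The regex call rejects the list before any tokenizing happens, which is why joining the lines (or calling read() instead of readlines()) makes the error go away.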

I suggest you do the following:

import nltk
# nltk.download('all') # only for the first time when you use nltk
from nltk import word_tokenize
import re

with open('1865-Lincoln.txt') as f: # with - open is recommended for file reading
    lines = f.readlines() # first get all the lines from file, store it
    for i in range(0, len(lines)): # for each line, do the following
        token_text = word_tokenize(lines[i]) # tokenize each line, store in token_text
        print (token_text) # for debug purposes
        pos_tagged_token = nltk.pos_tag(token_text) # pass the token_text to pos_tag()
        print (pos_tagged_token)

For a text file containing the following:

user is here

pass is there

The result is:

['user', 'is', 'here']

[('user', 'NN'), ('is', 'VBZ'), ('here', 'RB')]

['pass', 'is', 'there']

[('pass', 'NN'), ('is', 'VBZ'), ('there', 'RB')]

It works for me; I'm using Python 3.6, in case that matters. Hope this helps!

Edit 1: So your problem is that you are passing a list of strings to pos_tag(), whereas the docs say:

A part-of-speech tagger, or POS-tagger, processes a sequence of words, and attaches a part of speech tag to each word

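So you need to pass the text in line by line, i.e. one string at a time: word_tokenize() expects a single string, and pos_tag() expects the resulting list of tokens. That is why you get the TypeError: expected string or bytes-like object error.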
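
A related pitfall is the `string = str(text)` line in the question. The sketch below uses a hand-written token list (an assumption, so that it runs without NLTK) to show what anything that iterates over its input would see once the list has been converted with str().

```python
tokens = ["user", "is", "here"]   # what a tokenizer returns: a list of words

as_string = str(tokens)           # the printable form of the list, not the words
print(as_string)

# Iterating over a string yields single characters, so a tagger that
# walks over its input would receive these instead of words.
first_items_from_string = list(as_string)[:4]
first_items_from_list = list(tokens)[:4]

print(first_items_from_string)
print(first_items_from_list)
```

Passing the token list itself keeps each word intact, which is what a tagger that "processes a sequence of words" needs.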

Most likely, 1865-Lincoln.txt refers to President Lincoln's inaugural address. It is available in NLTK from https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/inaugural.zip

The original source of the documents is the Inaugural Address Corpus.

If we check how NLTK reads the files using LazyCorpusLoader, we see that the files are Latin-1 encoded.

inaugural = LazyCorpusLoader(
    'inaugural', PlaintextCorpusReader, r'(?!\.).*\.txt', encoding='latin1')

If your default encoding is set to utf8, that is most likely where the TypeError: expected string or bytes-like object occurs.

You should open the file with an explicit encoding (for this corpus, encoding='latin1') so that the text is decoded correctly.
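
A minimal sketch of opening a file with an explicit encoding, using only the standard library. The file content and the temporary path are assumptions made for illustration; with the real corpus file you would pass its path to open() instead. In this sketch a mismatched encoding shows up as a UnicodeDecodeError.

```python
import os
import tempfile

# Write a small Latin-1 encoded file (hypothetical content) to a temp path.
sample = "café naïve\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("latin1"))
    path = tmp.name

try:
    # Explicit encoding that matches the file: decodes cleanly.
    with open(path, encoding="latin1") as fin:
        decoded = fin.read()

    # Mismatched encoding: the bytes 0xE9 and 0xEF are not valid UTF-8 here.
    try:
        with open(path, encoding="utf8") as fin:
            fin.read()
        mismatch_error = None
    except UnicodeDecodeError as exc:
        mismatch_error = type(exc).__name__
finally:
    os.remove(path)

print(repr(decoded))
print(mismatch_error)
```

Once the text is decoded into a str, each line (or the whole text) can be handed to word_tokenize and the resulting tokens to pos_tag.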

But technically, you can access the inaugural corpus directly as a corpus object in NLTK, i.e.

>>> from nltk.corpus import inaugural
>>> from nltk import pos_tag
>>> tagged_sents = [pos_tag(sent) for sent in inaugural.sents('1865-Lincoln.txt')]
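
If you only want to confirm that the corpus is installed before tagging, a small sketch along these lines may help. `lincoln_preview` is a hypothetical helper name; it relies on the corpus reader's fileids() and raw() methods and returns None when NLTK or the inaugural corpus data is not available.

```python
def lincoln_preview(fileid="1865-Lincoln.txt", n_chars=80):
    """Return the first n_chars of the address, or None if NLTK or the
    inaugural corpus data is not installed (hypothetical helper)."""
    try:
        from nltk.corpus import inaugural
        if fileid not in inaugural.fileids():
            return None
        return inaugural.raw(fileid)[:n_chars]
    except (ImportError, LookupError):
        # ImportError: NLTK is missing; LookupError: corpus data is missing.
        return None


preview = lincoln_preview()
print(preview)
```

Because the corpus reader already knows the file encoding, reading through it avoids having to pass an encoding to open() yourself.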
