Python: Removing part-of-speech tags from a text file

2 votes
3 answers
2302 views
Asked 2025-04-17 18:48

I have a text file in which every word carries a part-of-speech (POS) tag.

For example: Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./. How/wrb dared/vbn they/ppss

Is there a way to read this file and strip the POS tags, so that the result becomes:

Needless to say , I was furious at this unparalleled intrusion upon free enterprise . How dared they

In short, I want to remove the slash (/) after each word along with everything that follows it.

words = re.findall('\w+',open(input_file).read())

The code above gets rid of the slashes, but tag abbreviations such as jj and ppss still show up. How can I also remove whatever characters follow each slash?

3 Answers

0

This code takes Wooble's comment into account, as well as your need to process a list of strings, as far as I understand it:

li = [ ('//Needless/jj to/to say/vb ,/, '
        'I/ppss was/bedz fur/ious/jj at/in this/dt '
        'unparalleled/jj intrusion/nn upon/in '
        'free/jj enterprise/nn ./. '
        'How/wrb dared/vbn they/ppss'),
       '/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj']

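import re

# Remove a "/tag" only when it ends a token: the slash must not follow
# whitespace or another slash, and the tag must run up to whitespace or the end.
def clean(s, r=re.compile(r'(?<![\s/])/[^\s/]+(?![\S/])')):
    return r.sub('', s)

x = map(clean, li)

print('\n\n'.join(x))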

Result

//Needless to say , I was fur/ious at this unparalleled intrusion upon free enterprise . How dared they

/Before to say , /I am h/a/p/p/y
1

As Wooble suggested, you can do this with two nested splits inside a comprehension:

s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
print(" ".join(word.split("/")[0] for word in s.split()))

Output:

Needless to say , I was furious at this unparalleled intrusion upon free enterprise .

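s.split() breaks the sentence into individual tokens. word.split("/") separates each English word (or punctuation mark) from its tag. word.split("/")[0] keeps only the word and discards the tag. " ".join() combines the resulting words back into a single string.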
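To apply the same idea to a whole file, as the question asks, something along these lines may work. It is only a sketch: the helper names `strip_tags` and `strip_tags_from_file` are made up for illustration, the file is assumed to be UTF-8, and `rsplit("/", 1)` is swapped in for `split("/")` so that a word which itself contains a slash keeps it.

```python
def strip_tags(line):
    """Drop the trailing /tag from every whitespace-separated token."""
    # rsplit with maxsplit=1 cuts only at the last slash, so a token such as
    # "fur/ious/jj" becomes "fur/ious" instead of "fur".
    return " ".join(token.rsplit("/", 1)[0] for token in line.split())


def strip_tags_from_file(path):
    """Return the untagged lines of the file at `path` (hypothetical helper)."""
    # Assumption: the file is UTF-8 encoded and small enough to hold in memory.
    with open(path, encoding="utf-8") as handle:
        return [strip_tags(line) for line in handle]


print(strip_tags("Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj"))
```

Note that `line.split()` collapses runs of whitespace, so the original spacing and trailing newlines are not preserved.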

4

Would this work?

>>> import re
>>> s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
>>> re.sub(r'/[^\s]+','',s)
'Needless to say , I was furious at this unparalleled intrusion upon free enterprise .'

This removes any text that starts with / and runs up to the next whitespace character.
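One thing to be aware of is that this pattern deletes everything from the first slash in a token onward, so a word that itself contains a slash would be truncated. If that matters for your data, a slightly stricter pattern might help. The sketch below is a variation rather than part of the answer above, and the names `TAG_AT_TOKEN_END` and `remove_tags` are hypothetical:

```python
import re

# Hypothetical name: a slash plus a slash-free tag, matched only when the tag
# is followed by whitespace or the end of the string.
TAG_AT_TOKEN_END = re.compile(r"/[^\s/]+(?=\s|$)")


def remove_tags(text):
    """Remove only the final /tag of each token, leaving inner slashes alone."""
    return TAG_AT_TOKEN_END.sub("", text)


sample = "fur/ious/jj at/in"
print(re.sub(r"/[^\s]+", "", sample))  # the simpler pattern truncates the word
print(remove_tags(sample))             # the stricter pattern keeps "fur/ious"
```

Unlike the split-based approach, a regex substitution leaves the original whitespace untouched, which can be convenient when the file's line structure should be kept.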
