Python: removing POS tags from a text file
I have a text file in which every word carries a part-of-speech (POS) tag.
For example: Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./. How/wrb dared/vbn they/ppss
Is there a way to read this file and strip the POS tags, so that the result becomes:
Needless to say , I was furious at this unparalleled intrusion upon free enterprise . How dared they
In short, I want to remove the slash (/) after each word, along with whatever follows it.
words = re.findall('\w+',open(input_file).read())
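The code above gets rid of the slashes, but tags such as jj and ppss still show up in the result. So how can I also remove whatever characters follow the slash?
3 Answers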
0
This code takes Wooble's comment into account, as well as your need to process a list of strings, as far as I understand it:
import re

li = [('//Needless/jj to/to say/vb ,/, '
       'I/ppss was/bedz fur/ious/jj at/in this/dt '
       'unparalleled/jj intrusion/nn upon/in '
       'free/jj enterprise/nn ./. '
       'How/wrb dared/vbn they/ppss'),
      '/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj']

# Strip only a trailing "/tag": the slash must follow a word character,
# and the tag must run up to whitespace or the end of the string.
def clean(s, r=re.compile(r'(?<![\s/])/[^\s/]+(?![\S/])')):
    return r.sub('', s)

x = map(clean, li)
print('\n\n'.join(x))
Result:
//Needless to say , I was fur/ious at this unparalleled intrusion upon free enterprise . How dared they
/Before to say , /I am h/a/p/p/y
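To make the pattern easier to read, here is the same regular expression written with `re.VERBOSE` and one comment per piece. This is only an annotated sketch of the answer's pattern; the names `TAG` and `strip_tags` are chosen here for illustration.

```python
import re

# The answer's pattern, split into its four parts (names are illustrative).
TAG = re.compile(r"""
    (?<![\s/])   # not preceded by whitespace or by another slash
    /            # the slash that introduces the tag
    [^\s/]+      # the tag: one or more characters, none of them whitespace or a slash
    (?![\S/])    # followed by whitespace or by the end of the string
""", re.VERBOSE)

def strip_tags(text):
    """Remove a trailing /tag from each token, leaving inner slashes alone."""
    return TAG.sub("", text)

print(strip_tags("fur/ious/jj at/in /I/ppss"))  # fur/ious at /I
```

Only the last slash-delimited piece of each token is treated as a tag, which is why `fur/ious` and the leading slash of `/I` survive.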
1
As Wooble suggested, you can do this with two nested splits inside a generator expression:
s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
print(" ".join(word.split("/")[0] for word in s.split()))
Output:
Needless to say , I was furious at this unparalleled intrusion upon free enterprise .
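s.split() breaks the sentence into individual tokens.
word.split("/") separates each word (or punctuation mark) from its POS tag.
word.split("/")[0] keeps only the word and discards the tag.
" ".join() merges the resulting words back into a single string.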
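One caveat with `word.split("/")[0]`: if a word itself contains a slash, everything after the first slash is dropped. A possible variant, assuming the tag is always the part after the last slash, is to use `rsplit("/", 1)` instead. The function name `strip_tags` is made up for this sketch.

```python
def strip_tags(line):
    """Drop the part after the last '/' of every whitespace-separated token."""
    # rsplit("/", 1) splits at most once, starting from the right, so a token
    # such as "fur/ious/jj" becomes ["fur/ious", "jj"]; tokens with no slash
    # come back unchanged.
    return " ".join(token.rsplit("/", 1)[0] for token in line.split())

print(strip_tags("fur/ious/jj at/in this/dt ./."))  # fur/ious at this .
```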
4
Would this work?
>>> import re
>>> s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
>>> re.sub(r'/[^\s]+','',s)
'Needless to say , I was furious at this unparalleled intrusion upon free enterprise .'
This deletes any text that starts with / and runs up to the next whitespace character.
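Since the question is about a file rather than a single string, the same substitution can be applied line by line. The following is a minimal sketch under a few assumptions: Python 3, a UTF-8 encoded file, and a helper name (`strip_tags_from_file`) invented here. It uses `/\S+`, which is equivalent to the `/[^\s]+` pattern above.

```python
import re

# A slash followed by one or more non-whitespace characters.
TAG = re.compile(r"/\S+")

def strip_tags_from_file(path, encoding="utf-8"):
    """Yield each line of the file at `path` with its POS tags removed."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # Drop the trailing newline, then delete every "/tag" run.
            yield TAG.sub("", line.rstrip("\n"))
```

Iterating over `strip_tags_from_file(input_file)` then gives the cleaned lines one at a time, so the whole file never has to be held in memory. Note that this simple pattern also cuts words that contain a slash of their own, so `fur/ious/jj` would become `fur`.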