Extracting tagged terms between '<>' in a sentence, including the nested case '<<>>'

Posted 2024-04-20 07:28:58


I have a named-entity recognition news dataset (text).

Here is an example:

<LOC Qatar> and <LOC Japan>, who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups.

I am trying to extract the terms between <>. The problem is the nested tag; the output is:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

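This is wrong, because '<EVENT <S Asian>' and '<E Cup>' should be a single string rather than two.

I tried a regex, but it does not handle the nesting: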

import re
s = """<LOC Qatar> and <LOC Japan>, 
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
re.findall(r'<.*?>', s)

Actual result:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian>',
 '<E Cup>',
 '<DATE February>']

Expected result:

['<LOC Qatar>',
 '<LOC Japan>',
 '<EVENT <S Asian> <E Cup>>',
 '<DATE February>']

1 answer
User
#1 · Posted 2024-04-20 07:28:58

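You want the recursive pattern mentioned in the comments. The third-party regex module supports recursion; the standard re module does not.

The code is as follows: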

# Import the third-party regex module (pip install regex)
import regex as reg

# Your string
s = """<LOC Qatar> and <LOC Japan>, 
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""

# Match pattern: (?R) recurses into the whole pattern, so nested <...> stay balanced
my_list = reg.findall(r"<((?:[^<>]|(?R))*)>", s)
print(my_list)
# ['LOC Qatar', 'LOC Japan', 'EVENT <S Asian> <E Cup>', 'DATE February']

If you really want the terms wrapped in <>, you can add them back:

my_list = ['<' + elt + '>' for elt in my_list]
print(my_list)
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
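
If installing the regex module is not an option, the same result can probably be had with the standard library alone by counting bracket depth. This is only a sketch of an alternative, not the approach from the answer above; the function name top_level_tags is made up for illustration, and it assumes the tags are balanced (an unclosed '<' yields nothing for that tag).

```python
def top_level_tags(text):
    """Return every outermost <...> span in text, nested tags included."""
    spans = []
    depth = 0     # number of '<' currently open
    start = None  # index of the '<' that opened the current outermost tag
    for i, ch in enumerate(text):
        if ch == "<":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ">" and depth > 0:  # a stray '>' outside any tag is ignored
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 1])
    return spans


s = """<LOC Qatar> and <LOC Japan>,
who met in the <EVENT <S Asian> <E Cup>> final in <DATE February>, are in third place in their groups."""
print(top_level_tags(s))
# ['<LOC Qatar>', '<LOC Japan>', '<EVENT <S Asian> <E Cup>>', '<DATE February>']
```

Because the scan keeps the surrounding brackets, there is no need to add '<' and '>' back afterwards.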
