Using RegEx for phrase patterns in EntityRuler

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'REGEX': "[Aa]ppl[e|es])"}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

2 Answers

User

Answer 1 · edited 2024-04-25 00:17:21

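You are missing the top-level token attribute that the regular expression should be matched against. Because that top-level token attribute is missing, the REGEX key is ignored and the pattern is interpreted as "any token".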

Working code

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
patterns = [{"label": "FRT", "pattern": [{'TEXT' : {'REGEX': "[Aa]ppl[e|es]"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

doc = nlp(u"Apple is red. Granny Smith apples are green.")
print([(ent.text, ent.label_) for ent in doc.ents])

Output

[('Apple', 'FRT'), ('Granny Smith', 'BRN'), ('apples', 'FRT')]

In fact, you can also use the following pattern for apple:

{"label": "FRT", "pattern": [{'LOWER' : {'REGEX': "appl[e|es]"}}]}

User

Answer 2 · edited 2024-04-25 00:17:21

You need to fix the whole code by using the following patterns declaration:

patterns = [{"label": "FRT", "pattern": [{"TEXT" : {"REGEX": "[Aa]pples?"}}]},
            {"label": "BRN", "pattern": [{"LOWER": "granny"}, {"LOWER": "smith"}]}]

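There are two issues: 1) the REGEX operator does not work on its own; it must be defined under a top-level token attribute such as TEXT or LOWER, and 2) the regex you are using is broken because it uses a character class where a grouping construct was intended.

Note that [e|es] is a regex character class, so it matches a single character: e, s, or |. Therefore, given an Appl| is red. string, the result will contain [('Appl|', 'FRT')]. You need either a non-capturing group, (?:es|e), or simply es?, which matches an e followed by an optional s.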
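The difference between the character class and the intended pattern can be checked with Python's built-in re module alone, without spaCy. This is only a minimal sketch: it uses re.search on the assumption that this is how the REGEX operator is applied to a token's text, and the sample token strings are made up for illustration.

```python
import re

# Character class: [e|es] matches exactly ONE character out of e, |, s.
char_class = re.compile(r"[Aa]ppl[e|es]")
# Intended pattern: an "e" followed by an optional "s".
optional_s = re.compile(r"[Aa]pples?")

# Hypothetical token strings, chosen to expose the difference.
for token in ["Apple", "apples", "Appl|", "Appls"]:
    print(token, bool(char_class.search(token)), bool(optional_s.search(token)))
```

Both patterns accept Apple and apples, but only the character class also accepts Appl| and Appls.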

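Also, consider the following scenarios:

[{"TEXT" : {"REGEX": "[Aa]pples?"}}] will find Apple, apple, Apples, apples, but will not find APPLES
[{"LOWER" : {"REGEX": "apples?"}}] will find Apple, apple, Apples, apples, APPLES, aPPleS, etc., and also stapples (a misspelling of staples)
[{"TEXT" : {"REGEX": r"\b[Aa]pples?\b"}}] will find Apple, apple, Apples, apples, but will not find APPLES or stapples, because \b is a word boundary.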
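These scenarios can be approximated with plain re, without running spaCy. The sketch below is an assumption-laden simulation, not spaCy itself: the helper `matches` is hypothetical, TEXT is modelled as the raw token string, LOWER as its lowercased form, and the regex is applied with re.search.

```python
import re

# Hypothetical single-token inputs covering the cases discussed above.
tokens = ["Apple", "apple", "Apples", "apples", "APPLES", "aPPleS", "stapples"]

def matches(pattern, attr):
    """Return the tokens whose TEXT or LOWER value contains a regex match."""
    regex = re.compile(pattern)
    result = []
    for tok in tokens:
        # LOWER compares against the lowercased token, TEXT against the raw one.
        value = tok.lower() if attr == "LOWER" else tok
        if regex.search(value):
            result.append(tok)
    return result

text_hits = matches(r"[Aa]pples?", "TEXT")
lower_hits = matches(r"apples?", "LOWER")
bounded_hits = matches(r"\b[Aa]pples?\b", "TEXT")

print(text_hits)
print(lower_hits)
print(bounded_hits)
```

In this simulation the unanchored TEXT pattern also reports stapples, since a search matches inside a longer string; only the \b version rejects it.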

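Working code: the script from the question, with the patterns declaration above substituted for the original one.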
