Pattern-based punctuation with spaCy

Posted 2024-05-26 11:54:24


As a test, I am using spaCy to add punctuation to a text once a span has been identified.

import spacy, en_core_web_sm
from spacy.matcher import Matcher

# Load the small English pipeline
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
Punctuation_patterns = [[{'POS': 'NOUN'},{'POS': 'NOUN'},{'POS': 'NOUN'}],
                        ]

# spaCy v3 signature; on spaCy v2 use: matcher.add('PUNCTUATION', None, *Punctuation_patterns)
matcher.add('PUNCTUATION', Punctuation_patterns)
doc = nlp("The cat cat cat sat on the mat. The dog sat on the mat.")
matches = matcher(doc)
spans = []
for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    layer1 = (' '.join(['"{}"'.format(span.text)if token.dep_ == 'ROOT'  else '{}'.format(token) for token in doc]))
    print (layer1)

Output:

The cat cat cat "cat cat cat" on the mat . The dog "cat cat cat" on the mat .

Expected output:

The "cat cat cat" sat on the mat. The dog sat on the mat.

I am only using ROOT as a test. How can I use the span matched by spaCy to get the desired output?

Edit 1: the case with multiple detections (e.g. dog as well)

for match_id, start, end in matches:
    span = doc[start:end]  # the matched slice of the doc
    spans.append({'start': span.start_char, 'end': span.end_char})
    result = doc.text

for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
    print (result)

Current output:

The "cat cat cat" sat on the mat. The dog dog dog sat on the mat.
The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

Expected:

  The "cat cat cat" sat on the mat. The "dog dog dog" sat on the mat.

1 answer
User
#1 · Posted 2024-05-26 11:54:24

You can use

result = doc.text
for match_id, start, end in matches:
    span = doc[start:end]
    result = result.replace(span.text, f'"{span.text}"', 1)
print (result)

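That is, define a variable, result, to hold the output and initialize it with doc.text. Then loop over the matches and replace each matched span's text with the same text wrapped in double quotes. Because print is called once, after the loop, only the final string is shown. Note that the third argument to replace (1) limits each call to the first occurrence of that text in the string.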
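Since replace works on the text rather than on positions, it can quote the wrong place if the same words also appear earlier in the string. A possible alternative, offered only as a sketch, is to insert the quotes by character offset. The helper name quote_spans and the hard-coded offsets below are hypothetical; in the question's code the offsets would come from span.start_char and span.end_char, and the sketch assumes the spans do not overlap.

```python
def quote_spans(text, char_spans):
    """Wrap each (start, end) character range of text in double quotes.

    Assumes the ranges do not overlap.
    """
    # Work from the last span to the first so that inserting quote
    # characters never shifts the offsets of spans still to be processed.
    for start, end in sorted(char_spans, reverse=True):
        text = text[:start] + '"' + text[start:end] + '"' + text[end:]
    return text


text = "The cat cat cat sat on the mat. The dog dog dog sat on the mat."

# Hypothetical offsets standing in for (span.start_char, span.end_char)
# of the two matched spans "cat cat cat" and "dog dog dog".
spans = [(4, 15), (36, 47)]

print(quote_spans(text, spans))
```

With the Matcher from the question, the list could be built as [(doc[s:e].start_char, doc[s:e].end_char) for _, s, e in matches]. If a pattern can produce overlapping matches (for example four nouns in a row), the spans would need to be filtered first; spacy.util.filter_spans is one option for that.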
