Printing abbreviations and hyphenated words

Posted 2024-05-16 03:53:35


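I need to identify all abbreviations and all hyphenated words in my sentences, and print them as they are found. My code does not seem to do this identification correctly: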

import re

sentence_stream2=df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent) #abbreviations
    hypns_= re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words

    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

One sentence in my corpus is: DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI

Its output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

The expected output is:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
['APIs','BI']
['event-driven','Self-service']

3 answers

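Your abbreviation rule does not match anything here: it only finds capital letters that are each followed by a dot. If you want to find words with more than one consecutive capital letter, you can use the following rule: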

abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations

This will match APIs and BI:

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"

import re

abbs_ = re.findall(r'(?:[A-Z]\.)+', t) #abbreviations
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t) #abbreviations
hypns_= re.findall(r'\w+-\w+', t) #hyphenated words fixed

print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)

Output:

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]  # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service']  # fixed hyphen rule
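As for why the hyphen rule needed fixing: in the question's pattern the group `(?:-\w+)*` may repeat zero times, so every plain word matches as well. A minimal sketch comparing the two quantifiers on a shortened version of the sample sentence (the variable names `loose` and `strict` are only for illustration):

```python
import re

t = "DevOps with APIs & event-driven architecture"

# '*' allows zero hyphenated parts, so every plain word matches too
loose = re.findall(r'\w+(?:-\w+)*', t)

# '+' requires at least one "-word" part, so only hyphenated words match
strict = re.findall(r'\w+(?:-\w+)+', t)

print(loose)   # every word in the sentence
print(strict)  # only the hyphenated word
```

The `\w+-\w+` rule above works the same way for two-part words. The `+` variant additionally keeps words with several hyphens, such as state-of-the-art, in one piece.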

The capital-letter rule will most likely not find every abbreviation, for example those in a string like this:

t = "Prof. Dr. S. Quakernack"

So you will probably need to tune it against more of your data, using for example http://www.regex101.com
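One possible starting point for such tuning, offered only as a sketch and not as part of the rules above: a pattern for a capitalized token that ends in a dot. The pattern and the name `dotted` are assumptions for illustration.

```python
import re

t = "Prof. Dr. S. Quakernack"

# a capital letter, optional further letters, then a literal dot
dotted = re.findall(r'\b[A-Z][A-Za-z]*\.', t)

print(dotted)
```

This also matches any capitalized word that merely ends a sentence, so in practice it would probably need a list of known abbreviations or further rules on top.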

I would suggest:

abbs_ = re.findall(r'\b[A-Z]+s?\b', sent) #abbreviations
hypns_ = re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words

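"As you know, in my classes I got everything"

Is "As" an abbreviation? If not, you need to discard a single capital letter, whether or not it is followed by an s, and only collect runs of at least two capital letters, optionally followed by an s, as in APIs. So: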

abbs_ = re.findall(r'\b(?:[A-Z][A-Z]+s?)\b', sent) #abbreviations
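The \b word boundaries are needed so that you do not harvest part of something like iNoNaGiRL because of a pair of capital letters inside it.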
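The effect of both changes can be checked with a small sketch. The test sentence and the variable names are made up for illustration:

```python
import re

s = "As you know, I use APIs"

# one or more capitals: also picks up "As" and the pronoun "I"
one_plus = re.findall(r'\b[A-Z]+s?\b', s)

# at least two capitals: keeps only the real abbreviation
two_plus = re.findall(r'\b[A-Z][A-Z]+s?\b', s)

# without \b, a pair of capitals inside a mixed-case token is harvested
no_boundary = re.findall(r'[A-Z][A-Z]+s?', "iNoNaGiRL")

# with \b, nothing is taken from inside the token
with_boundary = re.findall(r'\b[A-Z][A-Z]+s?\b', "iNoNaGiRL")

print(one_plus, two_plus, no_boundary, with_boundary)
```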

Then you need to get the hyphenated words: a word (\w+) followed by at least one sequence of a hyphen plus a word:

hypns_ = re.findall(r'\b\w+(?:-\w+)+\b', sent) #hyphenated words (non-capturing group, so findall returns whole matches)
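Putting the two rules together, here is a minimal runnable sketch of the loop from the question. It assumes a plain list `sentences` in place of the `df1['Open End Text']` column, and `find_terms` is a hypothetical helper name. The hyphen pattern uses a non-capturing group so that `findall` returns the whole word and not just the last group.

```python
import re

# at least two capitals, optional plural s, bounded on both sides
ABBREVIATION = re.compile(r'\b[A-Z][A-Z]+s?\b')
# a word followed by one or more "-word" parts
HYPHENATED = re.compile(r'\b\w+(?:-\w+)+\b')

def find_terms(sentence):
    """Return (abbreviations, hyphenated words) found in one sentence."""
    return ABBREVIATION.findall(sentence), HYPHENATED.findall(sentence)

# stand-in for df1['Open End Text'] (assumption: an iterable of strings)
sentences = [
    "DevOps with APIs & event-driven architecture using cloud "
    "Data Analytics environment Self-service BI",
]

for sent in sentences:
    abbs_, hypns_ = find_terms(sent)
    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)
```

On the sample sentence this gives the two lists the question asked for. Mixed-case names such as DevOps are left out because the capitals are not adjacent at a word boundary.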
