Printing abbreviations and hyphenated words

import re

sentence_stream2 = df1['Open End Text']
for sent in sentence_stream2:
    abbs_ = re.findall(r'(?:[A-Z]\.)+', sent)  # abbreviations
    hypns_ = re.findall(r'\w+(?:-\w+)*', sent)  # hyphenated words
    print("new sentence:")
    print(sent)
    print(abbs_)
    print(hypns_)

new sentence:
DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]
['DevOps', 'with', 'APIs', 'event-driven', 'architecture', 'using', 'cloud', 'Data', 'Analytics', 'environment', 'Self-service', 'BI']

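3 answers

User

#1 · edited 2024-05-16 03:53:35

Your abbreviation rule does not match anything here: it only finds capital letters that are each followed by a period. If you want to find words containing two or more consecutive capital letters, you can use the following rule: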

abbs_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', sent) #abbreviations

This matches APIs and BI.

t = "DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI"

import re

abbs_ = re.findall(r'(?:[A-Z]\.)+', t) #abbreviations
cap_ = re.findall(r'(?:[A-Z]{2,}s?\.?)', t) #abbreviations
hypns_= re.findall(r'\w+-\w+', t) #hyphenated words fixed

print("new sentence:")
print(t)
print(abbs_)
print(cap_)
print(hypns_)

Output:

DevOps with APIs & event-driven architecture using cloud Data Analytics environment Self-service BI
[]  # your abbreviation rule - does not find any capital letter followed by .
['APIs', 'BI'] # cap_ rule
['event-driven', 'Self-service']  # fixed hyphen rule

This will most likely not find every abbreviation, for example those in a string like this:

t = "Prof. Dr. S. Quakernack"

so you will probably need to tune it against more data, for example with http://www.regex101.com
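As a rough illustration of that kind of tuning, the sketch below adds a pattern for title-style abbreviations. The pattern itself is an assumption and not part of the answer above: it treats a capital letter, up to three lowercase letters, and a trailing period as an abbreviation.

```python
import re

t = "Prof. Dr. S. Quakernack"

# Assumed heuristic: one capital, at most three lowercase letters, then a period.
# The limit of three is arbitrary and should be adjusted to the real data.
title_abbs_ = re.findall(r'\b[A-Z][a-z]{0,3}\.', t)

print(title_abbs_)  # ['Prof.', 'Dr.', 'S.']
```

This heuristic also picks up a short capitalized word that happens to end a sentence, such as "Yes.", so it would need checking against the actual survey text.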

User

#2 · edited 2024-05-16 03:53:35

I suggest:

abbs_ = re.findall(r'\b[A-Z]+s?\b', sent) #abbreviations
hypns_ = re.findall(r'\w+(?:-\w+)*', sent) #hyphenated words
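To see what these two patterns return, here is a small sketch run on the sentence from the question plus one extra sentence that is made up for illustration. The names `ABBR`, `WORD` and `hyphenated` are hypothetical helpers, and the `'-' in w` filter is an added assumption, since `\w+(?:-\w+)*` on its own matches every word.

```python
import re

ABBR = re.compile(r'\b[A-Z]+s?\b')    # runs of capitals, optional trailing s
WORD = re.compile(r'\w+(?:-\w+)*')    # any word, with optional hyphenated parts

def hyphenated(sent):
    """Keep only the matches that actually contain a hyphen (assumed filter)."""
    return [w for w in WORD.findall(sent) if '-' in w]

t = ("DevOps with APIs & event-driven architecture using cloud "
     "Data Analytics environment Self-service BI")

print(ABBR.findall(t))    # ['APIs', 'BI']
print(hyphenated(t))      # ['event-driven', 'Self-service']

# Made-up sentence: single capitals such as "As" and "I" are matched as well.
print(ABBR.findall("As you know, I work with APIs"))  # ['As', 'I', 'APIs']
```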

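User

#3 · edited 2024-05-16 03:53:35

“As you know, in my course, I got everything”

Is "As" an abbreviation? If not, you need to discard single capital letters, whether or not they are followed by an s, and collect only runs of at least two capitals, optionally followed by an s, as in APIs. So: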

abbs_ = re.findall(r'\b(?:[A-Z][A-Z]+s?)\b', sent) #abbreviations

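The \b anchors are needed so that you do not harvest part of something like iNoNaGiRL because of a pair of capitals inside the word.
Then you need the hyphenated words: one word (\w+) followed by at least one hyphen-plus-word sequence:
hypns_ = re.findall(r'\b\w+(?:-\w+)+\b', sent)  # hyphenated words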
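A minimal sketch that puts both rules together follows. The function name `extract` is hypothetical, and `[A-Z]{2,}` is used as an equivalent spelling of `[A-Z][A-Z]+`. Both patterns use only non-capturing groups, because `re.findall` returns the group contents instead of the whole match when a pattern contains a capturing group.

```python
import re

def extract(sent):
    """Return (abbreviations, hyphenated words) found in one sentence."""
    # Two or more capitals, optional trailing s, bounded on both sides.
    abbs = re.findall(r'\b[A-Z]{2,}s?\b', sent)
    # One word followed by at least one "-word" part.
    hypns = re.findall(r'\b\w+(?:-\w+)+\b', sent)
    return abbs, hypns

t = ("DevOps with APIs & event-driven architecture using cloud "
     "Data Analytics environment Self-service BI")

print(extract(t))
# (['APIs', 'BI'], ['event-driven', 'Self-service'])
print(extract("iNoNaGiRL"))
# ([], [])
```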
