Python regex: finding substrings centered on a number

Posted 2024-06-17 13:36:13


I have a string. I want to split it into substrings, each consisting of a number with (at most) 4 words on either side. If the substrings overlap, they should be merged.

import re

Sampletext = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
re.findall(r'(\s[*\s]){1,4}\d(\s[*\s]){1,4}', Sampletext)
desired_output = ['the way I know 54 how to take praise', 'to take praise for 65 excellent questions 34 thank you for asking']

1 answer
User
#1 · Posted 2024-06-17 13:36:13

Overlapping matches: use lookaheads

This will do it:

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# The lookahead consumes nothing, so overlapping windows are all found; group 1 holds the text.
for match in re.finditer(r"(?=((?:\b\w+\b ){4}\d+(?: \b\w+\b){4}))", subject):
    print(match.group(1))
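
Why the lookahead matters: `re.finditer` normally resumes scanning after the end of the previous match, so windows that share text are skipped. Wrapping the pattern in `(?=(...))` makes each match zero-width, so the scan moves forward one character at a time and the capture group carries the text. A toy illustration, where the short string and pattern are made up for the purpose:

```python
import re

text = "a 1 b 2 c 3 d"
pattern = r"[a-z] \d [a-z]"  # letter, digit, letter, separated by single spaces

# Ordinary scan: each match consumes its text, so "b 2 c" is never seen.
plain = [m.group(0) for m in re.finditer(pattern, text)]

# Lookahead scan: nothing is consumed, so every starting position is tried.
overlapping = [m.group(1) for m in re.finditer(rf"(?=({pattern}))", text)]

print(plain)        # ['a 1 b', 'c 3 d']
print(overlapping)  # ['a 1 b', 'b 2 c', 'c 3 d']
```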

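What is a word?

The output depends on your definition of a word. Here, I allow digits in a word (`\w` matches them), which yields the following output.

Output (digits allowed in words)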

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank
for 65 excellent questions 34 thank you for asking
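
The second and third windows appear because `\w` also matches digits, so a token such as 34 counts as a word. A small check of that behavior, using made-up tokens and comparing `\w+` against a letters-only class:

```python
import re

tokens = ["thank", "34", "x9"]
results = {}
for tok in tokens:
    # \w covers letters, digits and underscore
    is_w_word = re.fullmatch(r"\w+", tok) is not None
    # letters only, case-insensitive
    is_alpha_word = re.fullmatch(r"[a-z]+", tok, re.IGNORECASE) is not None
    results[tok] = (is_w_word, is_alpha_word)
    print(tok, is_w_word, is_alpha_word)
```

Only "thank" passes both tests; "34" and "x9" are words under `\w+` but not under `[a-z]+`.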

Option 2: no digits in words

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# Words are letters only, so a number can no longer serve as one of the four surrounding words.
for match in re.finditer(r"(?=((?:\b[a-z]+\b ){4}\d+(?: \b[a-z]+\b){4}))", subject, re.IGNORECASE):
    print(match.group(1))

Output 2

the way I know 54 how to take praise

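Option 3: extend to four uninterrupted non-digit words

Based on your comment, this option extends to the left and right of the pivot number until four uninterrupted non-digit words are matched. Commas are ignored.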
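
The rule can also be sketched procedurally, without a regex. This is only a rough alternative and not the answer's method: the helper name `number_windows`, the whitespace tokenization, and `str.isdigit` as the test for a number are all assumptions. It groups numbers that are separated by fewer than four words, then takes up to four words on each side of each group.

```python
def number_windows(text, width=4):
    """Hypothetical helper: windows of up to `width` words around each group of numbers."""
    tokens = text.split()
    pivots = [i for i, tok in enumerate(tokens) if tok.isdigit()]

    # Put neighbouring numbers in one cluster when fewer than `width` words separate them.
    clusters = []
    for i in pivots:
        if clusters and i - clusters[-1][-1] - 1 < width:
            clusters[-1].append(i)
        else:
            clusters.append([i])

    # Extend each cluster by up to `width` tokens on both sides, clipped to the text.
    windows = []
    for cluster in clusters:
        start = max(cluster[0] - width, 0)
        end = min(cluster[-1] + width, len(tokens) - 1)
        windows.append(" ".join(tokens[start:end + 1]))
    return windows


sample = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
for window in number_windows(sample):
    print(window)
```

On the sample string this prints the two substrings from the question's desired output. Unlike the regex that follows, it clips a window at the start or end of the text instead of requiring exactly four words, and it treats punctuation attached to a word as part of that word.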

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
# Four words, a number, then lazily extend until four consecutive letter-only words follow.
for match in re.finditer(r"(?=((?:\b[a-z]+[ ,]+){4}(?:\d+ (?:[a-z]+ ){1,3}?)*?\d+.*?(?:[ ,]+[a-z]+){4}))", subject, re.IGNORECASE):
    print(match.group(1))

Output 3

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank you for asking
One Two Three Four 55 Extend 66 a b c d
AA BB CC DD 71 HH DD, JJ FF
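
The Option 3 pattern is dense, so it may help to read it spelled out with `re.VERBOSE`. The sketch below is intended to be the same pattern with comments added; the name `OPTION3` is made up, and literal spaces are written as `[ ]` because verbose mode ignores unescaped whitespace outside character classes.

```python
import re

OPTION3 = re.compile(
    r"""
    (?=(                                  # lookahead with a capture, so matches may overlap
        (?:\b[a-z]+[ ,]+){4}              # four letter-only words before the first number
        (?:\d+[ ](?:[a-z]+[ ]){1,3}?)*?   # lazily: further numbers, each followed by 1-3 words
        \d+                               # a number
        .*?                               # lazily skip ahead ...
        (?:[ ,]+[a-z]+){4}                # ... until four consecutive letter-only words follow
    ))
    """,
    re.IGNORECASE | re.VERBOSE,
)

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
matches = [m.group(1) for m in OPTION3.finditer(subject)]
for text in matches:
    print(text)
```

The lazy `.*?` does most of the work here: it grows one character at a time and stops at the first point where four consecutive letter-only words follow, which is how a window ends up spanning several numbers.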
