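How to use a regular expression to get a pattern that repeats multiple times in a string

3 Answers

User

#1 · Edited 2024-04-20 13:40:47

Interesting requirement. The code is explained in the comments; this is a very fast solution that uses only regular expressions: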

import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP"


# 1: First strip the tag from the target words (word/NNP -> word).
#    str.replace would do, but re.sub is used here to show how a
#    back-reference (\index_of_group) reuses a captured group.
# Equivalent alternatives:
#   re.sub(r'/NNP', '', text)
#   text.replace('/NNP', '')
_text = re.sub(r'(\w+)/NNP', r'\1', text)

# Capture each run of untagged words; the non-capturing groups on either
# side consume the surrounding whitespace so it is not part of the result.
RE_FIND_ALL = r'(?:\s+|^)((?:(?:\s)?\w+(?=\s+|$))+)(?:\s+|$)'
print('RESULT : ', re.findall(RE_FIND_ALL, _text))

Output:

  RESULT :  ['European0', 'European1 Community1 Community2', 'European2', 'European2']

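Explanation of the regular expression:

(?:\s+|^): skips the leading whitespace (or matches the start of the string).
((?:(?:\s)?\w+(?=\s+|$))+): the outer capturing group wraps the non-capturing subgroup (?:(?:\s)?\w+(?=\s+|$)), which matches every word that is followed by whitespace or the end of the line. Because the repetition sits inside the outer group, the whole run of words is captured as one string. If the repeated subgroup itself were the capturing group, findall would return only one word of the run.
(?:\s+|$): consumes the trailing whitespace after the run.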
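To make the three parts above easier to read, the same pattern can be laid out with re.VERBOSE. This is only a sketch: RE_RUNS, stripped and the shortened sample text are names and data picked here for illustration, they are not part of the answer.

```python
import re

text = ("export1/VB European0/NNP export/VB European1/NNP "
        "Community1/NNP Community2/NNP French/JJ European2/NNP")

# Strip the /NNP tag first, as the answer does.
stripped = re.sub(r'(\w+)/NNP', r'\1', text)

RE_RUNS = re.compile(r"""
    (?:\s+|^)        # leading whitespace or start of string, not captured
    (                # outer group: the whole run, this is what findall returns
      (?:            # one word of the run
        \s?          # optional single separator before the word
        \w+          # the word itself
        (?=\s+|$)    # must be followed by whitespace or end, so word/TAG fails
      )+
    )
    (?:\s+|$)        # trailing whitespace or end of string, not captured
""", re.VERBOSE)

runs = RE_RUNS.findall(stripped)
print(runs)  # ['European0', 'European1 Community1 Community2', 'European2']
```

The lookahead is what keeps tokens such as export/VB out of a run: the word is followed by a slash, not by whitespace, so the repetition stops there.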

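I had to remove /NNP from the target words because you want each run of word/NNP tokens kept together in one group. A pattern like (word)/NNP (word)/NNP returns two elements in one group, not a single piece of text. Once the tag is removed the text is simply word word, so a regex such as ((?:\w+\s)+) can capture the run of words. It is not quite that simple, though, because we must only capture words that do not end in /sequence_of_letters. Done this way, there is no need to loop over the matched groups and join the elements to build the final text.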
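A small comparison may help here. It is a sketch on a made-up sample sentence (sample, pairs and runs are illustrative names), showing what separate capturing groups give you versus stripping the tag first:

```python
import re

sample = "European/NNP Community/NNP French/JJ"

# One capturing group per word: findall returns a tuple of pieces,
# and the pattern is tied to exactly two words.
pairs = re.findall(r'(\w+)/NNP (\w+)/NNP', sample)
print(pairs)  # [('European', 'Community')]

# Strip the tag first, then capture the whole run with a single group.
stripped = re.sub(r'(\w+)/NNP', r'\1', sample)
runs = re.findall(r'(?:\s+|^)((?:\s?\w+(?=\s+|$))+)(?:\s+|$)', stripped)
print(runs)   # ['European Community']
```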

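Note: both solutions work as long as every word has the form word/sequence_of_letters. If some of your words are not in that format, you have to fix them first: append /NNP to each such word if you want to keep it, or /DUMMY if you want to drop it.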
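One way to do that clean-up, as a rough sketch: add_default_tag is a hypothetical helper written for this note, and it assumes tokens are separated by whitespace and that any token containing a slash is already tagged.

```python
def add_default_tag(text, default_tag="DUMMY"):
    """Append /default_tag to every whitespace-separated token without a tag."""
    return " ".join(
        tok if "/" in tok else f"{tok}/{default_tag}"
        for tok in text.split()
    )

mixed = "European/NNP Community export/VB"

# Drop the untagged word later on by giving it a throwaway tag...
print(add_default_tag(mixed))         # European/NNP Community/DUMMY export/VB
# ...or keep it in the run by tagging it as NNP.
print(add_default_tag(mixed, "NNP"))  # European/NNP Community/NNP export/VB
```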

Using re.split, which is slower because I use a list comprehension to clean up the result:

import re
# make it more complex
text = "export1/VB European0/NNP export/VB European1/NNP Community1/NNP Community2/NNP French/JJ European2/NNP export/VB European2/NNP export/VB export/VB"

# Split on every word/TAG token whose tag does not start with "N"
RE_SPLIT = r'\w+/[^N]\w+'
# Drop the empty pieces, then strip the /NNP tags and the surrounding spaces
result = [x.replace('/NNP', '').strip() for x in re.split(RE_SPLIT, text) if x.strip()]
print('RESULT:  ', result)
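One thing to watch with the split pattern: [^N] only looks at the first letter of the tag, so a different tag that also starts with N (NN, for instance) is not treated as a separator and stays glued to the run. If that matters for your data, a negative lookahead is one possible variation. This is a sketch and not the answer's pattern; RE_SPLIT_NOT_NNP and the sample sentence are made up for illustration.

```python
import re

text = "export/VB European/NNP Community/NNP market/NN French/JJ European/NNP"

# Pattern from the answer: market/NN is not split off.
original = [x.replace('/NNP', '').strip()
            for x in re.split(r'\w+/[^N]\w+', text) if x.strip()]
print(original)  # ['European Community market/NN', 'European']

# Variation: split on any word/TAG token whose tag is not exactly NNP.
RE_SPLIT_NOT_NNP = r'\w+/(?!NNP\b)\w+'
result = [x.replace('/NNP', '').strip()
          for x in re.split(RE_SPLIT_NOT_NNP, text) if x.strip()]
print(result)    # ['European Community', 'European']
```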

User

#2 · Edited 2024-04-20 13:40:47

IIUC, itertools.groupby is a better fit for this kind of job:

from itertools import groupby

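def join_token(string_, type_='NNP'):
    res = []
    # Split each token into [word, tag] and group consecutive tokens by tag
    for k, g in groupby([i.split('/') for i in string_.split()], key=lambda x: x[1]):
        if k == type_:
            res.append(' '.join(i[0] for i in g))
    return res

# Example sentence (the one that produces the output below)
tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
join_token(tagged_sent_str)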

Output:

['European Community', 'European']

No modification is needed if you expect three or more consecutive tokens of the same type:

str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB" 

join_token(str2)

Output:

['European Community Union', 'European']
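The reason no change is needed is that groupby only merges items that are adjacent and share the same key, so a run can be any length. To see every run it produces, here is a sketch with a hypothetical helper, tagged_runs, that returns each run together with its tag. It assumes every token has the form word/TAG.

```python
from itertools import groupby

def tagged_runs(sentence):
    """Return (tag, phrase) for each run of consecutive tokens sharing a tag."""
    # rsplit keeps any earlier slash inside the word part
    tokens = [tok.rsplit('/', 1) for tok in sentence.split()]
    return [
        (tag, ' '.join(word for word, _ in group))
        for tag, group in groupby(tokens, key=lambda t: t[1])
    ]

str2 = "European/NNP Community/NNP Union/NNP French/JJ European/NNP export/VB"
print(tagged_runs(str2))
# [('NNP', 'European Community Union'), ('JJ', 'French'),
#  ('NNP', 'European'), ('VB', 'export')]
```

Filtering that list on the tag you care about gives the same result as join_token.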

User

#3 · Edited 2024-04-20 13:40:47

You want to match a pattern but with some parts of it removed. You can get that with two consecutive regular expressions:

import re

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"

# First find each run of consecutive word/NNP tokens, then strip the tags
[re.sub(r"/NNP", "", s) for s in re.findall(r"\w+/NNP(?:\s+\w+/NNP)*", tagged_sent_str)]

Output:

['European Community', 'European']
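If you need this for more than one tag, the two steps can be wrapped in a function. The following is a sketch under a few assumptions: extract_runs is a hypothetical name, the tag is passed through re.escape in case it contains regex metacharacters, and a \b is added after the tag so that a longer tag such as NNPS is not mistaken for NNP (the one-liner above does not make that distinction).

```python
import re

def extract_runs(sentence, tag="NNP"):
    """Return each run of consecutive word/tag tokens with the tag removed."""
    t = re.escape(tag)
    # Step 1: find runs of consecutive word/TAG tokens
    pattern = rf"\w+/{t}\b(?:\s+\w+/{t}\b)*"
    # Step 2: strip the tag from every token in each run
    return [re.sub(rf"/{t}\b", "", run) for run in re.findall(pattern, sentence)]

tagged_sent_str = "European/NNP Community/NNP French/JJ European/NNP export/VB"
print(extract_runs(tagged_sent_str))        # ['European Community', 'European']
print(extract_runs(tagged_sent_str, "VB"))  # ['export']
```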
