Adding a comma and a single space between run-together words using a regular expression

Posted 2024-05-15 07:40:54


I have a nested list, list_3, which looks like this:

[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: $240,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]

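I want to use a regular expression to insert a comma followed by a space between each pair of run-together words (e.g. HowSector:, SoftwareYear, 2010One). So far I have tried writing re.sub code that selects all characters with no whitespace between them and replaces them, but I ran into some problems: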


for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)

Error:

return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

I would like the output to be:

[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]

Is there a way to do this with a regular expression?


3 Answers

I believe you can use the following Python code:

import re

rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*)*:'
rep = r', \1:'
re.sub(rgx, rep, s)

where s is the string.


Python's regex engine performs the following operations when matching:

(?<=          : begin positive lookbehind
  [a-z\d]     : match a letter or digit
)             : end positive lookbehind
(             : begin capture group 1
  [A-Z$]      : match a capital letter or '$'
  [A-Za-z]*   : match 0+ letters
  (?: +\S+?)  : match 1+ spaces greedily, 1+ non-spaces
                non-greedily in a non-capture group
  *           : execute non-capture group 0+ times
)             : end capture group
*             : execute capture group 0+ times
:             : match ':'

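Note that the positive lookbehind and the characters permitted for each token in the capture group may need to be adjusted to suit your requirements.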

The replacement string (, \1:) produces the string ', ', followed by the contents of capture group 1, followed by a colon.
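As a quick check, the pattern above can be run against a shortened version of the string from the question. This is only a minimal sketch: the sample string s is an abbreviated excerpt chosen for illustration, not the full data, and the variable name result is not part of the original answer.

```python
import re

# Pattern and replacement exactly as given in the answer above.
rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*)*:'
rep = r', \1:'

# Abbreviated sample string (an assumption, for illustration only).
s = ('Company OverviewCompany: HowSector: SoftwareYear Founded: '
     '2010One Sentence Pitch: Find food you like')

# Each "label:" that directly follows a lowercase letter or digit
# gets ', ' inserted in front of it.
result = re.sub(rgx, rep, s)
print(result)
```

Because the pattern is anchored on the trailing colon, it only splits in front of field labels, so it is worth testing against the full strings before relying on it.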

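The main problem is that your nested list does not have a fixed depth. Sometimes it has two levels and sometimes three. That is why you get the error above: where the list has three levels, re.sub receives a list, rather than a string, as its third argument.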
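The failure can be reproduced on a tiny list with mixed depth. This is a minimal sketch; the list nested and its contents are hypothetical stand-ins for list_3, and the exact wording of the exception message may differ between Python versions.

```python
import re

# Hypothetical data: the first entry is two levels deep, the second is three.
nested = [['aB'], [['cD']]]

results = []
for inner in nested:
    for item in inner:
        try:
            # Works when item is a string ...
            results.append(re.sub(r'\s\s+', ', ', item))
        except TypeError as exc:
            # ... but fails when item is itself a list.
            results.append(f'TypeError: {exc}')

print(results)
```

The first item passes through re.sub unchanged, while the second raises the same TypeError shown in the question.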

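The second problem is that the regular expression you are using is not the right one. The simplest regular expression we can use here should (at a minimum) find a non-whitespace character followed by an uppercase letter.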
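To see why the pattern from the question cannot work, it helps to test it directly. The sketch below assumes that r'\s\s+' is what was intended (the r prefix ended up inside the quotes); the sample string is a short excerpt used only for illustration.

```python
import re

# Short excerpt of the data (for illustration only).
sample = 'HowSector: SoftwareYear Founded: 2010One'

# 'r\s\s+' as written in the question is the same pattern as this raw string:
# a literal "r", one whitespace character, then one or more whitespace characters.
question_pattern = r'r\s\s+'

# Presumably the intended pattern: a run of two or more whitespace characters.
intended_pattern = r'\s\s+'

found_question = re.search(question_pattern, sample)
found_intended = re.search(intended_pattern, sample)

# Neither pattern finds anything: the run-together words have no whitespace
# between them, so a whitespace-based pattern has nothing to replace.
print(found_question, found_intended)
```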

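In the example code below I use re.compile (since the same regular expression will be used repeatedly, we may as well precompile it for a small performance gain), and I simply print the output. You will need to find a way to get the output into the format you want.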

import re

regex = re.compile(r'(\S)([A-Z])')
replacement = r'\1, \2'
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            print(regex.sub(replacement, string_or_list))
        else:
            for string in string_or_list:
                print(regex.sub(replacement, string))

Output:

Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly

If the list of lists is arbitrarily deep, you can traverse it recursively, process the strings (using this regex), and produce the same structure:

import re   
from collections.abc import Iterable 

def process(l):
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield type(el)(process(el))
        else:
            yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))   

With LoL as the example list of lists, the result is:

>>> list(process(LoL))
[['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]
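Note that this result leaves 2010One, 2018One and 000Investors joined, because the lookbehind only accepts lowercase letters. One possible variant, sketched below, also treats a digit as the left side of a boundary and a '$' as the right side. The names BOUNDARY and add_separators are hypothetical, and the widened character classes are an assumption about what the question wants, not part of the original answer.

```python
import re

# Zero-width boundary: a lowercase letter or digit on the left,
# an uppercase letter or '$' on the right (assumed rule, see lead-in).
BOUNDARY = re.compile(r'(?<=[a-z\d])(?=[A-Z$])')

def add_separators(item):
    """Return a copy of item with ', ' inserted at every boundary.

    Strings are processed directly; anything else is treated as a
    list-like container and rebuilt as a list, level by level.
    """
    if isinstance(item, str):
        return BOUNDARY.sub(', ', item)
    return [add_separators(child) for child in item]

# Small mixed-depth example (hypothetical data).
nested = [['HowSector: SoftwareYear Founded: 2010One'],
          [['Duke$ Raised: $240,000Investors: Friends & familyTraction (MAU)']]]
fixed = add_separators(nested)
print(fixed)
```

Because the lookbehind requires a lowercase letter or digit, acronyms such as (MAU) are left intact, unlike with the (\S)([A-Z]) pattern.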
