Matching and grouping sets of similar strings in a loop

Asked 2025-04-17 21:31

I am trying to match and group similar strings from a list, but I don't know how to go about it.

I have the following list:

tablenames =[
            'SCS_q104',
            'SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid',
            'SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid',
            'SCS_q102[{SCS_q102$_$_$SCS_q102_3}].SCS_q102_grid',
            'SCS_q102[{SCS_q102$_$_$SCS_q102_4}].SCS_q102_grid',
            'SCS_q105',
            'SCS_q106',
            'SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid',
            'SCS_q107[{SCS_q107$_$_$SCS_q107_2}].SCS_q107_grid',
            'SCS_q107[{SCS_q107$_$_$SCS_q107_3}].SCS_q107_grid',
            'SCS_q108',
            'SCS_q109',
            ]

Expected result:

groupofgrids = [[
        'SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid',
        'SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid',
        'SCS_q102[{SCS_q102$_$_$SCS_q102_3}].SCS_q102_grid',
        'SCS_q102[{SCS_q102$_$_$SCS_q102_4}].SCS_q102_grid',
        ], [
        'SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid',
        'SCS_q107[{SCS_q107$_$_$SCS_q107_2}].SCS_q107_grid',
        'SCS_q107[{SCS_q107$_$_$SCS_q107_3}].SCS_q107_grid',
        ]]

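From the expected result above you can see how I want to group these strings: if the content before and after the brackets is the same as in the previous string, the two strings belong to the same group.

In this example there are two groups.

The result only needs to group the matching strings; a list of lists or some kind of dictionary are both fine.

What I have tried so far: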

groupofgrids = []
for item in tablenames:
    if "." in item:
        suffix = item.split(".")[-1]
        if suffix in item:
            groupofgrids.append(item)

print(groupofgrids)

This does not group the similar strings the way I wanted, and I am not sure how to go about it.

Any suggestions?


2 Answers

Does this work for you:

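import pprint

group = dict()

for elm in tablenames:
    try:
        # Only names with exactly one '.' unpack into two parts
        f, s = elm.split('.')
    except ValueError:
        # No '.' (or more than one): not a grid entry, skip it
        pass
    else:
        # Group by the suffix that follows the '.'
        group.setdefault(s, []).append(elm)

pprint.pprint(list(group.values()))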

Output (the order of the groups depends on dict ordering and may differ between Python versions):

[['SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid',
  'SCS_q107[{SCS_q107$_$_$SCS_q107_2}].SCS_q107_grid',
  'SCS_q107[{SCS_q107$_$_$SCS_q107_3}].SCS_q107_grid'],
 ['SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid',
  'SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid',
  'SCS_q102[{SCS_q102$_$_$SCS_q102_3}].SCS_q102_grid',
  'SCS_q102[{SCS_q102$_$_$SCS_q102_4}].SCS_q102_grid']]
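
The same idea can also be sketched without the try/except, assuming that grouping by the text after the last "." is enough for your data. The names `group_by_suffix` and `sample` are made up for this illustration, and `sample` is only a shortened copy of the list in the question:

```python
from collections import defaultdict


def group_by_suffix(names):
    """Group names that contain a '.' by the text after the last '.'."""
    groups = defaultdict(list)
    for name in names:
        # rpartition returns ('', '', name) when there is no '.' in name
        head, sep, suffix = name.rpartition('.')
        if sep:
            groups[suffix].append(name)
    # On Python 3.7+ the groups come back in first-seen order
    return list(groups.values())


# Shortened sample data (assumption: same shape as the question's list)
sample = [
    'SCS_q104',
    'SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid',
    'SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid',
    'SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid',
]
result = group_by_suffix(sample)
print(result)
```

Unlike `split('.')` with unpacking, `rpartition` does not skip names that contain more than one ".", so whether this fits depends on what such names should mean in your data.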

Answered 2025-04-17 by Python大师


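Since similarity is based on the parts of the string outside the brackets [...], we need to extract those substrings, join them with a separator (I used "-" here), and use the joined string as the dictionary key.
Try this: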

import json
import re

# Capture the text before "[...]" and the text after "]."
regex = re.compile(r'(.*?)\[.*?\]\.(.*)')
groupofgrids = {}
for item in tablenames:
    matches = regex.findall(item)
    # Names without a bracketed part produce no match and are skipped
    if len(matches) > 0 and len(matches[0]) == 2:
        key = "-".join(matches[0])
        if key in groupofgrids:
            groupofgrids[key].append(item)
        else:
            groupofgrids[key] = [item]
print(json.dumps(groupofgrids, sort_keys=True, indent=4))
#OUTPUT
'''
{
    "SCS_q102-SCS_q102_grid": [
        "SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_3}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_4}].SCS_q102_grid"
    ], 
    "SCS_q107-SCS_q107_grid": [
        "SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid", 
        "SCS_q107[{SCS_q107$_$_$SCS_q107_2}].SCS_q107_grid", 
        "SCS_q107[{SCS_q107$_$_$SCS_q107_3}].SCS_q107_grid"
    ]
}
'''

If you want a nested list instead, do this:

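# list() is needed on Python 3, where values() returns a view that json cannot serialize
li = list(groupofgrids.values())
print(json.dumps(li, sort_keys=True, indent=4))
#OUTPUT (the order of the groups may differ between Python versions)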
'''
[
    [
        "SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid", 
        "SCS_q107[{SCS_q107$_$_$SCS_q107_2}].SCS_q107_grid", 
        "SCS_q107[{SCS_q107$_$_$SCS_q107_3}].SCS_q107_grid"
    ], 
    [
        "SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_3}].SCS_q102_grid", 
        "SCS_q102[{SCS_q102$_$_$SCS_q102_4}].SCS_q102_grid"
    ]
]
'''
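
If the only thing that differs inside a group is the bracketed part, a shorter variant is to delete that part and use what is left as the key. This is just a sketch under that assumption; `BRACKETED`, `group_grids` and `sample` are hypothetical names introduced here, not part of the code above:

```python
import re

# Matches one bracketed part such as "[{SCS_q102$_$_$SCS_q102_1}]"
BRACKETED = re.compile(r'\[.*?\]')


def group_grids(names):
    """Group names by what remains after removing the bracketed part."""
    groups = {}
    for name in names:
        if '[' not in name:
            continue  # plain names such as 'SCS_q104' are not grids
        key = BRACKETED.sub('', name)
        groups.setdefault(key, []).append(name)
    return groups


# Shortened sample data (assumption: same shape as the question's list)
sample = [
    'SCS_q104',
    'SCS_q102[{SCS_q102$_$_$SCS_q102_1}].SCS_q102_grid',
    'SCS_q102[{SCS_q102$_$_$SCS_q102_2}].SCS_q102_grid',
    'SCS_q105',
    'SCS_q107[{SCS_q107$_$_$SCS_q107_1}].SCS_q107_grid',
]
groups = group_grids(sample)
nested = list(groups.values())
print(nested)
```

On Python 3.7 and later a plain dict keeps insertion order, so `nested` lists the groups in the order they first appear in the input.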

Answered 2025-04-17 by Python大师

