Removing sub-elements from a large list of strings, provided no duplicates result

Asked 2025-04-12 01:27

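I have a large list of strings (about 15,000). Each string is unique within the list. Every string consists of words separated by dots. I need an algorithm that removes the middle parts, one at a time, as long as removing a part does not create a duplicate.

My code works only when every string has exactly three parts, and for that case it is fine. When there are several middle parts, I don't know how to process each one in turn from left to right.

Any good ideas?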

import pandas as pd

def remove_middle_words(terms):
    df = pd.DataFrame({'terms': terms})
    # Assumes every term has exactly three dot-separated parts
    df[['first_word', 'middle_word', 'last_word']] = df['terms'].str.split('.', expand=True)

    # Count how many terms share the same (first, last) pair
    unique_first_last = df.groupby(['first_word', 'last_word']).size().reset_index(name='count')
    # The middle word may be dropped only if the pair occurs once
    unique_first_last['remove_middle'] = unique_first_last['count'] == 1

    df = df.merge(unique_first_last[['first_word', 'last_word', 'remove_middle']],
                  on=['first_word', 'last_word'], how='left')
    df['new_terms'] = df.apply(
        lambda row: f"{row['first_word']}.{row['last_word']}" if row['remove_middle'] else row['terms'],
        axis=1)

    return df['new_terms'].tolist()

# Cases 3 and 4 work
terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6']
new_terms = remove_middle_words(terms)
print(new_terms)

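Examples:

Case 1 (the code above fails):

A.B1.C1.D1 --> A.D1
A.B1.C1.D2 --> A.D2 (both B1 and C1 can be removed)

Case 2 (the code above fails):

A.B2.C2.D3 --> A.C2.D3
A.B2.C3.D3 --> A.C3.D3 (only B2 can be removed, because removing C2 or C3 would produce a duplicate A.D3)

Case 3 (the code above works):

A.B3.C4 --> A.C4
A.B3.C5 --> A.C5 (B3 can be removed)

Case 4 (the code above works):

A.B4.C6
A.B5.C6 (nothing can be removed, because removing B4 or B5 would produce a duplicate A.C6)

Case 5 (the code above fails):

A.B10.C10.D1
A.B20.C10.D1
A.B20.C20.D1 (nothing can be removed, because removing any of the B or C parts would produce a duplicate A.D1)

Case 6a (the code above fails):

A.B100.C100.D100.D1 --> A.B100.D1
A.B200.C100.D100.D1 --> A.B200.D1 (C100 and D100 can be removed; the remaining B part is unique)

Case 6b (the code above fails):

A.B300.C200.D100.D1 --> A.C200.D1
A.B300.C300.D100.D1 --> A.C300.D1 (B300 and D100 can be removed; the remaining C part is unique)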
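The rules implied by these examples can be written down as a small checker, which may help when comparing candidate solutions. This is only a sketch: the names `is_subsequence` and `is_valid_shortening` are hypothetical, and it assumes that a valid result keeps the first and last part of every string, only drops middle parts without reordering them, and leaves the output list free of duplicates. It checks validity only; it does not check that as many parts as possible were removed.

```python
def is_subsequence(short, full):
    """Return True if the items of `short` appear in `full` in the same order."""
    it = iter(full)
    # `part in it` advances the iterator, so order is respected.
    return all(part in it for part in short)


def is_valid_shortening(originals, shortened):
    """Check a proposed result against the assumed rules (hypothetical helper)."""
    if len(originals) != len(shortened):
        return False
    # The shortened strings must still be unique.
    if len(set(shortened)) != len(shortened):
        return False
    for orig, short in zip(originals, shortened):
        o, s = orig.split("."), short.split(".")
        # A string with two or more parts must keep at least first and last.
        if not (min(len(o), 2) <= len(s) <= len(o)):
            return False
        if s[0] != o[0] or s[-1] != o[-1]:
            return False
        # Middle parts may only be dropped, never reordered or invented.
        if not is_subsequence(s[1:-1], o[1:-1]):
            return False
    return True


# Case 1: both middle parts dropped, results stay unique.
print(is_valid_shortening(["A.B1.C1.D1", "A.B1.C1.D2"], ["A.D1", "A.D2"]))
# Case 4 shortened too far: both strings collapse to A.C6.
print(is_valid_shortening(["A.B4.C6", "A.B5.C6"], ["A.C6", "A.C6"]))
```

The first call prints `True` and the second prints `False`.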

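Tags: data structures, string processing, deduplication algorithms, algorithm design, string splitting, string lists, uniqueness checks, middle-part handling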

1 Answer

from collections import Counter

terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6',
         'A1.B1.C1.D1', 'A1.B1.C1.D2', "D1",
         "A.B10.C10.D1", "A.B20.C10.D1", "A.B20.C20.D1",
         "A.B100.C100.D100.D1", "A.B200.C100.D100.D1",
         "A.B300.C200.D100.D1", "A.B300.C300.D100.D1",
         ""
         ]

def rem_mid_words(txt1):
    """Keep only the first and last dot-separated words of txt1."""
    if "." in txt1:
        l1 = txt1.split(".")
        return f"{l1[0]}.{l1[-1]}"  # first word, ".", last word
    return txt1  # string does not contain ".": nothing to remove

# Count how often each fully shortened form occurs
term_count = Counter(rem_mid_words(txt1) for txt1 in terms)

# Use the shortened form only where it is unique
new_terms = [rem_mid_words(txt1)
             if term_count[rem_mid_words(txt1)] < 2
             else txt1 for txt1 in terms]
print(new_terms)


# start of adjusted answer

def rem_one_word(txt1, len1, pos):
    # Example "A1.B2.C3.D4.E5"
    # pos       0. 1. 2. 3. 4
    # length    5
    # if len1 is 5 and pos = 1 then remove B2 and return A1.C3.D4.E5
    # otherwise return original txt1
    if txt1 is not None and "." in txt1:
        l1 = txt1.split(".")
        if len(l1) == len1:
            txt1 = ".".join(l1[:pos] + l1[pos + 1:])
    return txt1

def word_len(txt1):
    # Number of dot-separated words; 0 if there is no "." at all
    if "." in txt1:
        return len(txt1.split("."))
    return 0

def shorten_words(terms):   # input: list of strings
    # Start with the longest strings (measured in dot-separated words),
    # then work down one length at a time. For each length, try each
    # middle position from left to right and accept a shortened string
    # only if it occurs exactly once among all candidates.
    max_len = max(word_len(txt1) for txt1 in terms)
    terms1 = list(terms)
    for len1 in range(max_len, 2, -1):
        for pos in range(1, len1 - 1):
            temp_words = [rem_one_word(txt1, len1, pos) for txt1 in terms1]
            temp_count = Counter(temp_words)
            terms1 = [tmp if temp_count[tmp] < 2 else old
                      for old, tmp in zip(terms1, temp_words)]
    return terms1

terms1 = shorten_words(terms)

for old, new in zip(terms, terms1):
    print(old, "   ", new)
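The key step in the adjusted answer is that all candidates for one position are computed first, counted together, and only then accepted where the candidate is unique. The sketch below isolates that single pass. It is a simplified illustration, not the answer's code: the name `drop_part_if_unique` is hypothetical, and unlike the answer it does not restrict the pass to strings of one particular length.

```python
from collections import Counter


def drop_part_if_unique(terms, pos):
    """Drop the part at index `pos` wherever the shortened term stays unique.

    Hypothetical helper for illustration: one pass over one position.
    """
    candidates = []
    for term in terms:
        parts = term.split(".")
        # Only middle parts may go: never index 0 and never the last index.
        if 0 < pos < len(parts) - 1:
            candidates.append(".".join(parts[:pos] + parts[pos + 1:]))
        else:
            candidates.append(term)
    # Count all candidates together so colliding terms block each other.
    counts = Counter(candidates)
    return [cand if counts[cand] == 1 else term
            for term, cand in zip(terms, candidates)]


# Cases 3 and 4 from the question, processed together.
print(drop_part_if_unique(["A.B3.C4", "A.B3.C5", "A.B4.C6", "A.B5.C6"], 1))
```

This prints `['A.C4', 'A.C5', 'A.B4.C6', 'A.B5.C6']`: B3 is dropped from the first two strings, while the last two are kept because both would collapse to A.C6. Checking strings one at a time against a set of existing names would not be enough here, since A.C6 is not in the list until one of the two has already been shortened.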


Answered 2025-04-12 by Python大师
