Removing sub-elements from a large list of strings, provided no duplicates result

0 votes
1 answer
75 views
Asked 2025-04-12 01:27

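I have a large list of strings (about 15,000). Every string in the list is unique. Each string consists of words separated by dots. I need an algorithm that removes the middle parts, one at a time, as long as doing so does not create duplicates.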

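My code works correctly only when each string has exactly three elements, and that case is fine. When there is more than one middle part, I don't know how to process each part separately, from left to right.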

Any good ideas?

import pandas as pd

def remove_middle_words(terms):
    # Works only when every term has exactly three dot-separated parts
    df = pd.DataFrame({'terms': terms})
    df[['first_word', 'middle_word', 'last_word']] = df['terms'].str.split('.', expand=True)

    # Count how many terms share each (first_word, last_word) pair
    unique_first_last = df.groupby(['first_word', 'last_word']).size().reset_index().rename(columns={0: 'count'})
    unique_first_last['remove_middle'] = unique_first_last['count'] == 1

    # Drop the middle word only where the (first_word, last_word) pair is unique
    df = df.merge(unique_first_last[['first_word', 'last_word', 'remove_middle']], on=['first_word', 'last_word'], how='left')
    df['new_terms'] = df.apply(lambda row: row['terms'] if not row['remove_middle'] else f"{row['first_word']}.{row['last_word']}", axis=1)

    return df['new_terms'].tolist()

# Cases 3 and 4 work
terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6']
new_terms = remove_middle_words(terms)
print(new_terms)
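
For terms with more than three parts, one possible starting point is to split each term into a first part, a list of middle parts, and a last part, instead of three fixed columns. The sketch below uses plain Python. `split_term` is a hypothetical helper name, and the sketch assumes every term has at least two dot-separated parts.

```python
def split_term(term):
    """Split a dotted term into (first, middle_parts, last).

    Assumption: the term has at least two dot-separated parts;
    a term without any "." raises ValueError during unpacking.
    """
    first, *middle, last = term.split(".")
    return first, middle, last


print(split_term("A.B1.C1.D1"))  # ('A', ['B1', 'C1'], 'D1')
print(split_term("A.B3.C4"))     # ('A', ['B3'], 'C4')
```

With the middle parts held in a list, each one can be considered for removal on its own, from left to right.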

Examples:

Case 1 (the code above does not work):

  • A.B1.C1.D1 --> A.D1
  • A.B1.C1.D2 --> A.D2 (both B1 and C1 can be removed)

Case 2 (the code above does not work):

  • A.B2.C2.D3 --> A.C2.D3
  • A.B2.C3.D3 --> A.C3.D3 (only B2 can be removed, because removing C2 or C3 would produce A.D3 twice)

Case 3 (the code above works):

  • A.B3.C4 --> A.C4
  • A.B3.C5 --> A.C5 (B3 can be removed)

Case 4 (the code above works):

  • A.B4.C6
  • A.B5.C6 (nothing can be removed, because removing B4 or B5 would produce A.C6 twice)

Case 5 (the code above does not work):

  • A.B10.C10.D1
  • A.B20.C10.D1
  • A.B20.C20.D1 (nothing can be removed, because removing any B or C part would produce a duplicate A.D1)

Case 6a (the code above does not work):

  • A.B100.C100.D100.D1 --> A.B100.D1
  • A.B200.C100.D100.D1 --> A.B200.D1 (C100 and D100 can be removed; the remaining B part is unique)

Case 6b (the code above does not work):

  • A.B300.C200.D100.D1 --> A.C200.D1
  • A.B300.C300.D100.D1 --> A.C300.D1 (B300 and D100 can be removed; the remaining C part is unique)
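
The rule shared by the cases above appears to be that a part may be dropped only if the shortened strings stay unique. A minimal sketch of that check, applied to Case 2, might look like this. `drop_at` and `duplicates` are hypothetical helper names, not part of the original code.

```python
from collections import Counter


def drop_at(term, pos):
    """Return the term with the dot-separated part at index pos removed."""
    parts = term.split(".")
    return ".".join(parts[:pos] + parts[pos + 1:])


def duplicates(terms):
    """Return the sorted list of strings that occur more than once."""
    counts = Counter(terms)
    return sorted(t for t, n in counts.items() if n > 1)


case2 = ["A.B2.C2.D3", "A.B2.C3.D3"]
print(duplicates([drop_at(t, 1) for t in case2]))  # dropping B2: []
print(duplicates([drop_at(t, 2) for t in case2]))  # dropping C2/C3: ['A.B2.D3']
```

An empty result means the removal at that position is safe for this group of terms.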

1 Answer

1
import pandas as pd
import numpy as np

terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6', 
         'A1.B1.C1.D1', 'A1.B1.C1.D2', "D1",
        "A.B10.C10.D1", "A.B20.C10.D1", "A.B20.C20.D1", 
        "A.B100.C100.D100.D1", "A.B200.C100.D100.D1",
        "A.B300.C200.D100.D1", "A.B300.C300.D100.D1",
        ""    
        ]

def rem_mid_words(txt1):
    # Keep only the first and last dot-separated parts
    if "." in txt1:
        l1 = txt1.split(".")
        return f"{l1[0]}.{l1[-1]}"   # pos 0 term & . & (-1 =) last term
    # String does not contain ".": nothing to remove, return it unchanged
    return txt1

term_count = {rem_mid_words(txt1): 0 for txt1 in terms}  # initial counting dictionary
for x in terms:
    key = rem_mid_words(x)
    term_count[key] += 1

new_terms = [rem_mid_words(txt1) 
             if term_count[rem_mid_words(txt1)] < 2
             else txt1 for txt1 in terms]
print(new_terms)


# start of adjusted answer

def rem_one_word(txt1, len1, pos):
    # Example "A1.B2.C3.D4.E5"
    # pos       0. 1. 2. 3. 4
    # length    5
    # if len1 is 5 and pos = 1 then remove B2 and return A1.C3.D4.E5
    # otherwise return original txt1

    if txt1 is not None and "." in txt1:
        l1 = txt1.split(".")
        if len(l1) == len1:
            txt1 = ".".join(l1[:pos] + l1[(1+pos):])
    return txt1

def word_len(txt1):
    # Number of dot-separated parts; 0 if the string contains no "."
    if "." in txt1:
        return len(txt1.split("."))
    return 0

def shorten_words(terms):   # input list of strings
    # find the largest number of dot-separated parts in any term
    # then work downwards one length at a time, removing one middle part
    # per pass and keeping the shortened string only if it is unique

    max_len = max([word_len(txt1) for txt1 in terms])
    terms1 = terms
    for len1 in range(max_len, 2, -1):
        print("string length", len1)
        for pos in range(1, len1 - 1):   # middle positions only
            print("pos", pos)
            temp_words = [rem_one_word(txt1, len1, pos) for txt1 in terms1]
            temp_count = {x: 0 for x in temp_words}
            for x in temp_words: temp_count[x] += 1
            terms2 = [temp_words[i] if temp_count[temp_words[i]] < 2
                      else terms1[i] for i in range(len(terms1))]
            terms1 = terms2
    return terms1

terms1 = shorten_words(terms)

for i in range(len(terms)):
    print(terms[i], "   ", terms1[i])
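
The same pass-by-pass idea can also be sketched without NumPy, using `collections.Counter` for the duplicate check. This is an illustrative rewrite, not the answerer's code: `drop_part` and `shorten_terms` are hypothetical names, and the removal order (longest terms first, positions left to right) is assumed to mirror the loop above.

```python
from collections import Counter


def drop_part(term, n_parts, pos):
    """Drop the part at index pos, but only if term has exactly n_parts parts."""
    parts = term.split(".")
    if len(parts) != n_parts:
        return term
    return ".".join(parts[:pos] + parts[pos + 1:])


def shorten_terms(terms):
    """Remove middle parts one at a time while all strings stay unique."""
    current = list(terms)
    max_len = max((len(t.split(".")) for t in current), default=0)
    for n_parts in range(max_len, 2, -1):   # longest terms first
        for pos in range(1, n_parts - 1):   # middle positions, left to right
            candidates = [drop_part(t, n_parts, pos) for t in current]
            counts = Counter(candidates)
            # Accept a shortened string only if nothing else maps to it
            current = [c if counts[c] == 1 else t
                       for c, t in zip(candidates, current)]
    return current


print(shorten_terms(["A.B3.C4", "A.B3.C5", "A.B4.C6", "A.B5.C6"]))
# ['A.C4', 'A.C5', 'A.B4.C6', 'A.B5.C6']
```

Because each pass compares the candidates against the whole list, a removal accepted for one group of terms can affect what later passes allow for another group, so results on a mixed list may differ from running each case in isolation.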
