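Eliminating sub-elements in a large number of strings, provided no duplicates result
I have a large list of strings (about 15,000). Each string is unique within the list. Every string consists of words separated by dots. I need an algorithm that removes the middle parts, one at a time, as long as doing so does not create duplicates.
My code only works when a string has exactly three elements in total, and for that case it is fine. When there are several middle parts, however, I don't know how to process each part separately from left to right.
Any good ideas?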
import pandas as pd

def remove_middle_words(terms):
    df = pd.DataFrame({'terms': terms})
    df[['first_word', 'middle_word', 'last_word']] = df['terms'].str.split('.', expand=True)
    unique_first_last = df.groupby(['first_word', 'last_word']).size().reset_index().rename(columns={0:'count'})
    unique_first_last['remove_middle'] = unique_first_last['count'] == 1
    df = df.merge(unique_first_last[['first_word', 'last_word', 'remove_middle']], on=['first_word', 'last_word'], how='left')
    df['new_terms'] = df.apply(lambda row: row['terms'] if not row['remove_middle'] else f"{row['first_word']}.{row['last_word']}", axis=1)
    return df['new_terms'].tolist()

# cases 3 and 4 work
terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6']
new_terms = remove_middle_words(terms)
print(new_terms)
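Examples:
Case 1 (the code above does not work):
- A.B1.C1.D1 --> A.D1
- A.B1.C1.D2 --> A.D2 (both B1 and C1 can be removed)
Case 2 (the code above does not work):
- A.B2.C2.D3 --> A.C2.D3
- A.B2.C3.D3 --> A.C3.D3 (only B2 can be removed, because removing C2 or C3 would produce the duplicate A.D3)
Case 3 (the code above works):
- A.B3.C4 --> A.C4
- A.B3.C5 --> A.C5 (B3 can be removed)
Case 4 (the code above works):
- A.B4.C6
- A.B5.C6 (nothing can be removed, because removing B4 or B5 would produce the duplicate A.C6)
Case 5 (the code above does not work):
- A.B10.C10.D1
- A.B20.C10.D1
- A.B20.C20.D1 (nothing can be removed, because removing any B or C part would produce the duplicate A.D1)
Case 6a (the code above does not work):
- A.B100.C100.D100.D1 --> A.B100.D1
- A.B200.C100.D100.D1 --> A.B200.D1 (C100 and D100 can be removed; the remaining B parts are unique)
Case 6b (the code above does not work):
- A.B300.C200.D100.D1 --> A.C200.D1
- A.B300.C300.D100.D1 --> A.C300.D1 (B300 and D100 can be removed; the remaining C parts are unique)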
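The cases listed here can also be written down as data, which makes it easy to check any candidate function against them. This is only a sketch: `CASES` and `failing_cases` are names invented for this illustration, and it assumes that each case is evaluated as its own independent list.

```python
# Each entry: case name -> (input strings, expected output), copied from the examples.
CASES = {
    "1": (["A.B1.C1.D1", "A.B1.C1.D2"], ["A.D1", "A.D2"]),
    "2": (["A.B2.C2.D3", "A.B2.C3.D3"], ["A.C2.D3", "A.C3.D3"]),
    "3": (["A.B3.C4", "A.B3.C5"], ["A.C4", "A.C5"]),
    "4": (["A.B4.C6", "A.B5.C6"], ["A.B4.C6", "A.B5.C6"]),
    "5": (["A.B10.C10.D1", "A.B20.C10.D1", "A.B20.C20.D1"],
          ["A.B10.C10.D1", "A.B20.C10.D1", "A.B20.C20.D1"]),
    "6a": (["A.B100.C100.D100.D1", "A.B200.C100.D100.D1"],
           ["A.B100.D1", "A.B200.D1"]),
    "6b": (["A.B300.C200.D100.D1", "A.B300.C300.D100.D1"],
           ["A.C200.D1", "A.C300.D1"]),
}


def failing_cases(shorten):
    """Return the names of the cases for which shorten(inputs) != expected."""
    return [name for name, (inputs, expected) in CASES.items()
            if shorten(list(inputs)) != expected]


# A function that changes nothing passes only the cases where nothing may be removed.
print(failing_cases(lambda terms: terms))
```

Any function that takes a list of strings and returns the shortened list can be passed to `failing_cases`; an empty result means all seven cases are reproduced.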
1 Answer
import numpy as np

terms = ['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6',
         'A1.B1.C1.D1', 'A1.B1.C1.D2', "D1",
         "A.B10.C10.D1", "A.B20.C10.D1", "A.B20.C20.D1",
         "A.B100.C100.D100.D1", "A.B200.C100.D100.D1",
         "A.B300.C200.D100.D1", "A.B300.C300.D100.D1",
         ""
         ]

def rem_mid_words(txt1):
    # Keep only the first and last word; strings without "." are returned unchanged.
    if "." in txt1:
        l1 = txt1.split(".")
        return f"{l1[0]}.{l1[-1]}"  # first term, ".", last term (index -1)
    return txt1

# Count how many strings map to each shortened form.
term_count = {rem_mid_words(txt1): 0 for txt1 in terms}  # initial counting dictionary
for x in terms:
    term_count[rem_mid_words(x)] += 1

# Use the shortened form only where it is unique; otherwise keep the original.
new_terms = [rem_mid_words(txt1)
             if term_count[rem_mid_words(txt1)] < 2
             else txt1 for txt1 in terms]
print(new_terms)
# start of adjusted answer
def rem_one_word(txt1, len1, pos):
    # Example  "A1.B2.C3.D4.E5"
    # pos       0  1  2  3  4
    # length 5
    # If len1 is 5 and pos is 1, remove B2 and return "A1.C3.D4.E5";
    # otherwise return the original txt1.
    if txt1 is not None and "." in txt1:
        l1 = txt1.split(".")
        if len(l1) == len1:
            txt1 = ".".join(l1[:pos] + l1[(1 + pos):])
    return txt1

def word_len(txt1):
    # Number of dot-separated words; 0 for strings without ".".
    if "." in txt1:
        return len(txt1.split("."))
    else:
        return 0

def shorten_words(terms):  # input: list of strings
    # Find the longest string in terms of "."-separated words, then iterate
    # from that length downwards, removing one middle word at a time and
    # keeping a shortened string only if its count is < 2.
    max_len = max([word_len(txt1) for txt1 in terms])
    terms1 = terms
    for len1 in np.arange(max_len, 2, -1):
        print("string length", len1)
        for pos in (1 + np.arange(len1 - 2)):
            print("pos", pos)
            temp_words = [rem_one_word(txt1, len1, pos) for txt1 in terms1]
            temp_count = {x: 0 for x in temp_words}
            for x in temp_words:
                temp_count[x] += 1
            terms2 = [temp_words[i] if temp_count[temp_words[i]] < 2
                      else terms1[i] for i in range(len(terms1))]
            terms1 = terms2
    return terms1

terms1 = shorten_words(terms)
for i in range(len(terms)):
    print(terms[i], "  ", terms1[i])
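For comparison, here is a minimal sketch of a slightly different reading of the examples; it is not the code above. It assumes (this is my assumption, the question does not state it) that a middle part is dropped for a whole group of strings or not at all, and that a group consists of all strings sharing the same first and last word. Removing middle parts never changes those two words, so strings from different groups can never collide. The names `shorten_group` and `shorten_all` are made up for this illustration.

```python
from collections import defaultdict


def shorten_group(terms):
    """Drop middle parts left to right while the group's strings stay distinct."""
    parts = [t.split(".") for t in terms]
    pos = 1  # index of the middle part currently being tested
    while any(len(p) > pos + 1 for p in parts):
        # Tentatively drop the part at `pos` wherever it is a middle part.
        trial = [p[:pos] + p[pos + 1:] if len(p) > pos + 1 else p for p in parts]
        if len({tuple(p) for p in trial}) == len(trial):
            parts = trial  # accepted: the next part has shifted into `pos`
        else:
            pos += 1       # rejected: keep this part and test the next one
    return [".".join(p) for p in parts]


def shorten_all(terms):
    """Apply shorten_group to each (first word, last word) group, keeping input order."""
    groups = defaultdict(list)
    for i, term in enumerate(terms):
        words = term.split(".")
        groups[(words[0], words[-1])].append(i)
    result = list(terms)
    for indices in groups.values():
        shortened = shorten_group([terms[i] for i in indices])
        for i, short in zip(indices, shortened):
            result[i] = short
    return result


# Cases 3 and 4 from the question, processed together.
print(shorten_all(['A.B3.C4', 'A.B3.C5', 'A.B4.C6', 'A.B5.C6']))
# prints ['A.C4', 'A.C5', 'A.B4.C6', 'A.B5.C6']
```

Each pass of the loop either shortens the strings or moves on to the next position, so it always terminates. Note that under this reading the cases from the question that share the words A and D1 (cases 1, 5, 6a and 6b) land in the same group when passed in one list, so they only reproduce the expected results when each case is given to `shorten_group` on its own.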