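How do I detect identical parts of strings?
I am trying to break this question about a decoding algorithm into smaller questions. This is part I.
Question:
- There are two strings: s1 and s2
- Part of s1 is identical to part of s2
- Space is the separator
- How can I extract the identical parts?
Example 1: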
s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"
the identical parts are "2010", "-", "1" and "visitor"
Example 2:
s1 = "Welcome, John!"
s2 = "Welcome, Peter!"
the identical parts are "Welcome," and "!"
Example 3: (to clarify the "!" example)
s1 = "Welcome, Sam!"
s2 = "Welcome, Tom!"
the identical parts are "Welcome," and "m!"
Python and Ruby are preferred. Thanks.
4 Answers
1
s1 = "12 November 2010 - 1 visitor"
s2 = "6 July 2010 - 100 visitors"
l1 = s1.split()
for item in l1:
    # Substring test against the whole of s2, not word by word
    if item in s2:
        print(item)
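This splits on whitespace.
Splitting on word boundaries instead (to catch the ! in example 2) did not work in Python when this was written, because re.split() could not split on zero-length matches. Since Python 3.7 it can.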
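One way around that is to collect tokens with re.findall() instead of splitting, which does not depend on how re.split() treats empty matches. This is only a sketch: the tokens() helper and its pattern are my own assumptions, not something from the question. Punctuation becomes its own token, so example 2 yields "Welcome" and "," separately rather than "Welcome,".

```python
import re

def tokens(s):
    # Hypothetical helper: runs of word characters, or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", s)

s1 = "Welcome, John!"
s2 = "Welcome, Peter!"

seen = set(tokens(s2))
# Keep the tokens of s1 that also occur in s2, in s1's order
shared = [t for t in tokens(s1) if t in seen]
print(shared)  # ['Welcome', ',', '!']
```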
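The third example, where any substring of a word counts as a possible match, makes things a lot more complicated because there are so many candidates: for 1234 I would have to check 1234, 123, 234, 12, 23, 34, 1, 2, 3 and 4, and the number of candidates grows quickly with every added digit (a word of n characters has n(n+1)/2 substrings).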
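To make the size of that search concrete, here is a small sketch that lists every contiguous substring of a word, longest first. The substrings() name is hypothetical and this is only an illustration of the enumeration, not a matching algorithm.

```python
def substrings(word):
    """Return every contiguous substring of word, longest first."""
    n = len(word)
    return [word[i:i + size]
            for size in range(n, 0, -1)      # candidate length, longest first
            for i in range(n - size + 1)]    # start position

candidates = substrings("1234")
print(candidates)       # ['1234', '123', '234', '12', '23', '34', '1', '2', '3', '4']
print(len(candidates))  # 10
```

Each word of s1 would need its candidates checked against each word of s2, which is why the word-by-word approaches in this thread are so much simpler.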
3
For the first example:
>>> s1 = 'November 2010 - 1 visitor'
>>> s2 = '6 July 2010 - 100 visitors'
>>>
>>> [i for i in s1.split() if any(j for j in s2.split() if i in j)]
['2010', '-', '1', 'visitor']
>>>
For the second example:
>>> s1 = "Welcome, John!"
>>> s2 = "Welcome, Peter!"
>>> [i for i in s1.replace('!',' !').split() if any(j for j in s2.replace('!',' !').split() if i in j)]
['Welcome,', '!']
>>>
Note: the code above does not work for the third example, which the OP added just now.
3
EDIT: Updated this example to work with all of the examples, including #1:
def scan(s1, s2):
    # Find the longest common prefix of s1 and s2
    # Returns None if there is no match
    l = len(s1)
    while 1:
        if not l:
            return None
        elif s1[:l] == s2[:l]:
            return s1[:l]
        else:
            l -= 1

def contains(s1, s2):
    D = {}  # Remove duplicates using a dict
    L1 = s1.split(' ')
    L2 = s2.split(' ')
    # Don't add results which have already
    # been processed to satisfy example #1!
    DProcessed = {}
    for x in L1:
        yy = 0  # Index of the current word in L2
        for y in L2:
            if yy in DProcessed:
                yy += 1
                continue
            # Scan from the start to the end of the words
            a = scan(x, y)
            if a:
                DProcessed[yy] = None
                D[a] = None
                break
            # Scan from the end to the start of the words
            a = scan(x[::-1], y[::-1])
            if a:
                DProcessed[yy] = None
                D[a[::-1]] = None
                break
            yy += 1
    return list(D.keys())

print(contains("12 November 2010 - 1 visitor",
               "6 July 2010 - 100 visitors"))
print(contains("Welcome, John!",
               "Welcome, Peter!"))
print(contains("Welcome, Sam!",
               "Welcome, Tom!"))
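Output (the order within each list may differ between Python versions, because the results are collected in a dict):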
['1', 'visitor', '-', '2010']
['Welcome,', '!']
['Welcome,', 'm!']
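As a side note, the prefix scan in the answer above could probably be replaced by the standard library's os.path.commonprefix(), which compares strings character by character. A minimal sketch, assuming the hypothetical wrapper names common_prefix() and common_suffix(); unlike scan, these return an empty string rather than None when nothing matches.

```python
import os.path

def common_prefix(a, b):
    # os.path.commonprefix works on any strings, not only on paths
    return os.path.commonprefix([a, b])

def common_suffix(a, b):
    # Reverse both words, take the common prefix, then reverse it back
    return common_prefix(a[::-1], b[::-1])[::-1]

print(common_prefix("visitor", "visitors"))  # visitor
print(common_suffix("Sam!", "Tom!"))         # m!
```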