Longest common substring without cutting a word - Python

3 votes
9 answers
3904 views
Asked 2025-04-18 00:23

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[out]:

foo bar

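But how do I make sure the longest common substring respects English word boundaries and does not cut a word in the middle? Take, for example, these sentences: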

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

This outputs the following, which is NOT what I want, because it cuts the word kappa from s2:

a foo bar

The output I want is still:

foo bar

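I've also tried an ngram approach to get a longest common substring that respects word boundaries, but is there another way that works on the strings directly, without computing ngrams? (see answers)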
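The n-gram approach mentioned above is not shown in the question. As a rough sketch of what it might look like, assuming whitespace tokenization and Python 3 (the code elsewhere in this thread is Python 2), one can intersect the word n-grams of both sentences, longest first. The names word_ngrams and longest_common_ngram are hypothetical, chosen only for illustration.

```python
def word_ngrams(words, n):
    """Return the set of all runs of n consecutive words, as tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def longest_common_ngram(s1, s2):
    """Longest run of whole words shared by s1 and s2 ('' if none)."""
    w1, w2 = s1.split(), s2.split()
    # Try the largest n first, so the first hit is the longest match.
    for n in range(min(len(w1), len(w2)), 0, -1):
        common = word_ngrams(w1, n) & word_ngrams(w2, n)
        if common:
            # Several n-grams may tie; return the one occurring first in s1.
            for i in range(len(w1) - n + 1):
                if tuple(w1[i:i + n]) in common:
                    return " ".join(w1[i:i + n])
    return ""


print(longest_common_ngram("this is a foo bar sentence .",
                           "what a kappa foo bar black sheep ?"))
```

This never cuts a word, but it rebuilds the n-gram sets for every n, which is the cost the question is hoping to avoid.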

9 Answers

1

You just need to add checks for the beginning and end of a word.

Then m only starts counting at a valid match beginning, and longest is only updated at a valid match end.

Like so:

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
              m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]
1

This question turned out to be more interesting than I first thought. When you think about it, there are four possible outcomes.

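  1. The trivial case: the whole string is matched, no boundaries involved (your first example)
  2. The match crosses a word boundary at the beginning (your second example)
  3. The match crosses a word boundary at the end
  4. The match has a word boundary at each end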

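Your code already handles the trivial case, so we can build on that; all that remains is to wrap the result in a few checks for the other cases. So what should these checks look like? Let's take your failure case: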

string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"

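From a string find perspective, we can find all of those letters in order in both string1 and string2, but if we split each string on spaces into a list and look for the words of the output in order in those lists, only string1 matches.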

I'm primarily a C guy, so I wanted to write this as a function:

def full_string(str1, str2, chkstr):
  l1 = str1.split()
  l2 = str2.split()
  chkl = chkstr.split()
  return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
          any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

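With this function we can check whether either of the two strings fails to contain, in order, all of the words of the result we got from longest_common_substring(s1, s2). Perfect. So the last step is to combine the two functions and check for each of the four cases listed above: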

def longest_whole_substring(s1, s2):
  subs = longest_common_substring(s1, s2)
  if not full_string(s1, s2, subs):
    if full_string(s1, s2, ' '.join(subs.split()[1:])):
      subs = ' '.join(subs.split()[1:])
    elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
      subs = ' '.join(subs.split()[:-1])
    else:
      subs = ' '.join(subs.split()[1:-1])
  return subs

Now the function longest_whole_substring(s1, s2) will provide the longest whole substring, without cutting off any words. Let's test it in each of the cases:

Trivial:

>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

Word boundary at the beginning:

>>> b = 's a foo bar'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Word boundary at the end:

>>> b = 'foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'foo bar '

Word boundary at both ends:

>>> b = 's a foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Looking good!

1

Just add an acceptance condition to your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me to follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)
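The elided checks above are left open by the answer. What follows is only one possible way to fill them in, written for Python 3 and passing the strings explicitly instead of reading globals. It assumes that in the loop x and y mark the end of the match, so the match covers s1[x - length:x] and s2[y - length:y]. One caveat: testing only the full run length m[x][y] would reject "a foo bar" without ever considering its suffix "foo bar", so this sketch also tries shorter suffixes of the current run. The helper names is_word_start, is_word_end and longest_common_word_aligned are hypothetical.

```python
def is_word_start(s, i):
    """True if s[i] is a non-space character that begins a word."""
    return not s[i].isspace() and (i == 0 or s[i - 1].isspace())


def is_word_end(s, i):
    """True if the slice ending just before index i ends on a whole word."""
    return not s[i - 1].isspace() and (i == len(s) or s[i].isspace())


def longest_common_word_aligned(s1, s2):
    # m[x][y] = length of the common run ending at s1[x-1] and s2[y-1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] != s2[y - 1]:
                continue  # m[x][y] stays 0
            m[x][y] = m[x - 1][y - 1] + 1
            # Acceptance condition, part 1: end on a word boundary in both.
            if not (is_word_end(s1, x) and is_word_end(s2, y)):
                continue
            # Part 2: find the longest suffix of this run that also starts
            # on a word boundary in both strings and beats the current best.
            for length in range(m[x][y], longest, -1):
                if is_word_start(s1, x - length) and is_word_start(s2, y - length):
                    longest, x_longest = length, x
                    break
    return s1[x_longest - longest:x_longest]


print(longest_common_word_aligned("this is a foo bar sentence .",
                                  "what a kappa foo bar black sheep ?"))
```

The inner suffix scan makes the worst case slower than the plain dynamic program, which seems an acceptable trade for a sketch; tracking the aligned length in a second table would avoid it.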
1

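My answer isn't based on any official source, just a simple observation: at least in my installation, there is a difference between the outputs of your LCS function on the pairs (s1, s2) and (s1, s3):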

In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '

You may notice that when complete words are matched, the surrounding whitespace is matched as well.
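For readers who want to reproduce this observation on Python 3, here is a straightforward port of the question's function (xrange replaced by range, print used as a function). Printing with repr makes the leading and trailing spaces visible, which a bare print hides.

```python
def longest_common_substring(s1, s2):
    # m[x][y] = length of the common run ending at s1[x-1] and s2[y-1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest, x_longest = m[x][y], x
    return s1[x_longest - longest:x_longest]


s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"
s3 = "what a kappa foo bar black sheep ?"

print(repr(longest_common_substring(s1, s2)))  # ' foo bar '
print(repr(longest_common_substring(s1, s3)))  # 'a foo bar '
```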

You can then modify the function just before it returns its output, like this:

answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

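I'm sure there are other edge cases, such as the substring appearing at the end of the string, making the recursive call with s2 instead, or whether to trim answer at the front or at the back. But at least in the cases you show, this simple modification does what you want: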

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '

Do you think this direction is worth exploring further?

10

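This one is simple enough to follow. I used your code to do 75% of the job. I first split the sentences into words, then pass the word lists to your function to get the longest common substring (in this case, the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that array to produce the desired result.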

Here is an online working copy that you can test, verify, and fiddle with.

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
>> 'foo bar'

Edge cases

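  1. In your example, symbols such as '.' and '?' are also treated as valid words, provided there is a space between them and the last word. Without the space, they are counted as part of the last word; in that case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide what to do with such characters before calling such a function. Something like: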

    import re
    s1 = re.sub('[.?]','', s1)
    s2 = re.sub('[.?]','', s2)

and then carry on as usual.
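As an alternative to deleting the punctuation, one could tokenize so that punctuation marks become tokens of their own. The sketch below is written for Python 3 and rests on an assumed tokenization rule (runs of word characters, or single non-space symbols) that may not suit every text. The names tokenize, longest_common_run and longest_common_words are hypothetical; longest_common_run is the same dynamic program as above, kept to two rows.

```python
import re


def tokenize(s):
    """Split into words and standalone punctuation marks (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", s)


def longest_common_run(a, b):
    """Longest run of consecutive equal items shared by sequences a and b."""
    # prev[y] = length of the common run ending at a[x-2] and b[y-1]
    prev = [0] * (1 + len(b))
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(a)):
        curr = [0] * (1 + len(b))
        for y in range(1, 1 + len(b)):
            if a[x - 1] == b[y - 1]:
                curr[y] = prev[y - 1] + 1
                if curr[y] > longest:
                    longest, x_longest = curr[y], x
        prev = curr
    return a[x_longest - longest:x_longest]


def longest_common_words(s1, s2):
    return " ".join(longest_common_run(tokenize(s1), tokenize(s2)))


# 'sentence.' and 'sentence?' now share the token 'sentence'.
print(longest_common_words("this is a foo bar sentence.",
                           "what a kappa foo bar sentence?"))
```

With a plain split(' ') the same pair of sentences would only yield 'foo bar', because 'sentence.' and 'sentence?' compare as different words.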
