Longest common substring without cutting a word - Python

Asked 2025-04-18 00:23

Given the following, I can find the longest common substring:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

print longest_common_substring(s1, s2)

[out]:

foo bar

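But how do I make sure that the longest common substring respects English word boundaries and does not cut a word in half? Take these sentences, for example: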

s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print longest_common_substring(s1, s2)

The output is the following, which is not what I want, because it cuts into the word kappa from s2:

a foo bar

The output I want is still:

foo bar

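I have also tried an ngram approach to get the longest common substring that respects word boundaries, but is there another way that works on the strings directly, without computing ngrams? (see answers)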

9 Answers

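You just need to add checks for the beginning and end of a word.

Then you update m only for valid match starts, and longest only for valid match ends.

Like this: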

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    # current character in s1
    x_char = s1[x - 1]
    # we are at the beginning of a word in s1 if
    #   (we are at the beginning of s1) or 
    #   (previous character is a space)
    x_word_begin = (x == 1) or (s1[x - 2] == " ")
    # we are at the end of a word in s1 if
    #   (we are at the end of s1) or 
    #   (next character is a space)
    x_word_end = (x == len(s1)) or (s1[x] == " ")
    for y in xrange(1, 1 + len(s2)):
      # current character in s2
      y_char = s2[y - 1]
      # we are at the beginning of a word in s2 if
      #   (we are at the beginning of s2) or 
      #   (previous character is a space)
      y_word_begin = (y == 1) or (s2[y - 2] == " ")
      # we are at the end of a word in s2 if
      #   (we are at the end of s2) or 
      #   (next character is a space)
      y_word_end = (y == len(s2)) or (s2[y] == " ")
      if x_char == y_char:
        # no match starting with x_char
        if m[x - 1][y - 1] == 0:
          # a match can start only with a space
          #   or at the beginning of a word
          if x_char == " " or (x_word_begin and y_word_begin):
            m[x][y] = m[x - 1][y - 1] + 1
        else:
          m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          # the match can end only with a space
          #   or at the end of a word
          if x_char == " " or (x_word_end and y_word_end):
            longest = m[x][y]
            x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]
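
For readers on Python 3, the same boundary checks can be sketched with two small helper predicates. This is an illustrative variant rather than the code above: the names is_word_begin, is_word_end and longest_common_word_substring are hypothetical, and unlike the version above it assumes a match may neither start nor end on a space, so the result comes back without surrounding whitespace.

```python
def is_word_begin(s, i):
    """True if s[i] is the first character of a space-delimited word."""
    return s[i] != " " and (i == 0 or s[i - 1] == " ")


def is_word_end(s, i):
    """True if s[i] is the last character of a space-delimited word."""
    return s[i] != " " and (i == len(s) - 1 or s[i + 1] == " ")


def longest_common_word_substring(s1, s2):
    # m[x][y] = length of the word-aligned match ending at s1[x-1], s2[y-1]
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(s1) + 1):
        for y in range(1, len(s2) + 1):
            if s1[x - 1] != s2[y - 1]:
                continue  # mismatch: m[x][y] stays 0
            if m[x - 1][y - 1] > 0:
                # extend a match that already started on a word boundary
                m[x][y] = m[x - 1][y - 1] + 1
            elif is_word_begin(s1, x - 1) and is_word_begin(s2, y - 1):
                # a new match may only start at the beginning of a word
                m[x][y] = 1
            # record the match only if it ends a word in both strings
            if (m[x][y] > longest
                    and is_word_end(s1, x - 1) and is_word_end(s2, y - 1)):
                longest, x_longest = m[x][y], x
    return s1[x_longest - longest:x_longest]


s1 = "this is a foo bar sentence ."
print(longest_common_word_substring(s1, "what a kappa foo bar black sheep ?"))
# prints: foo bar
```

Only single spaces are treated as separators here, matching the checks in the answer; tabs or other whitespace would need str.isspace() instead.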

Answered 2025-04-18 by Python大师

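This problem turned out to be far more interesting than I first thought. When you think about it, there are four possible outcomes.

The easy case: the whole string matches, with no boundaries involved (your first example)
The match crosses a word boundary at the beginning (your second example)
The match crosses a word boundary at the end
The match has a word boundary at both ends

Your code already handles the easy case, so we can build on that; all that is left is to wrap the result in a few checks that handle the other cases. So what should those checks look like? Let's take your failure case: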

string 1 = "this is a foo bar sentence ."
string 2 = "what a kappa foo bar black sheep ?"
output string = "a foo bar"

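From a string find perspective, we can find all of those letters in order in both string1 and string2, but if we split everything on spaces into lists and look only for those lists in order, only string1 will match.

I'm primarily a C guy, so I want to write that as a function: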

def full_string(str1, str2, chkstr):
  l1 = str1.split()
  l2 = str2.split()
  chkl = chkstr.split()
  return (any(l1[i:i+len(chkl)]==chkl for i in xrange(len(l1)-len(chkl)+1)) and
          any(l2[i:i+len(chkl)]==chkl for i in xrange(len(l2)-len(chkl)+1)))

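With this function we can check whether either of the two strings fails to contain, in order, all of the words in the result of longest_common_substring(s1, s2). Perfect. The last step is to combine the two functions and check each of the four cases listed above: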

def longest_whole_substring(s1, s2):
  subs = longest_common_substring(s1, s2)
  if not full_string(s1, s2, subs):
    if full_string(s1, s2, ' '.join(subs.split()[1:])):
      subs = ' '.join(subs.split()[1:])
    elif full_string(s1, s2, ' '.join(subs.split()[:-1])):
      subs = ' '.join(subs.split()[:-1])
    else:
      subs = ' '.join(subs.split()[1:-1])
  return subs

Now the function longest_whole_substring(s1, s2) returns the longest whole substring without cutting off any words. Let's test it in each of the cases:

Easy case:

>>> a = 'this is a foo bar bar foo string'
>>> b = 'foo bar'
>>> 
>>> longest_whole_substring(a,b)
'foo bar'

Word boundary at the beginning:

>>> b = 's a foo bar'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Word boundary at the end:

>>> b = 'foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'foo bar '

Word boundary at both ends:

>>> b = 's a foo bar f'
>>> 
>>> longest_whole_substring(a,b)
'a foo bar'

Looks good!

Answered 2025-04-18 by Python大师

Just add an acceptance condition to your code:

s1 = "this is a foo bar sentence ."
s2 = "what the foo bar blah blah black sheep is doing ?"

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest and word_aligned(x, y, m[x][y]):  # acceptance condition
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def word_aligned(x, y, length):
    """check that a match starting at s1[x - 1] and s2[y - 1] is aligned on a word boundary"""
    # check start of match in s1
    if s1[x - 1].isspace():
        # match doesn't start with a character, reject
        return False
    if x - 2 > 0 and not s1[x - 2].isspace():
        # char before match is not start of line or space, reject
        return False
    # check start of match in s2
    ... same as above ...
    # check end of match in s1
    ... your code is a bit hard for me follow, what is end of match? ...
    # check end of match in s2
    ... same as above ...
    return True

print longest_common_substring(s1, s2)
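
The elided checks above can be filled in several ways; the following Python 3 sketch is one possibility under stated assumptions, not the answerer's implementation. The names begins_word, aligned_length and longest_aligned_substring are hypothetical. It relies on the fact that m[x][y] is the length of the match ending at s1[x - 1] and s2[y - 1], so the match starts at x - length and y - length. One wrinkle: the longest match ending at a cell may begin mid-word even when a shorter tail of it is word-aligned ("a foo bar" versus "foo bar"), so instead of returning True or False this sketch returns the length of the word-aligned tail.

```python
def begins_word(s, i):
    """True if s[i] is a non-space character at the start of a word."""
    return not s[i].isspace() and (i == 0 or s[i - 1].isspace())


def aligned_length(s1, s2, x, y, length):
    """Length of the longest word-aligned tail of the match ending at
    s1[x - 1] and s2[y - 1]; 0 if the match does not end on a word end."""
    for s, end in ((s1, x), (s2, y)):
        ends_word = not s[end - 1].isspace() and (
            end == len(s) or s[end].isspace())
        if not ends_word:
            return 0
    # slide the start forward until it is a word start in both strings
    for skipped in range(length):
        if (begins_word(s1, x - length + skipped)
                and begins_word(s2, y - length + skipped)):
            return length - skipped
    return 0


def longest_aligned_substring(s1, s2):
    m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    longest, x_longest = 0, 0
    for x in range(1, len(s1) + 1):
        for y in range(1, len(s2) + 1):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                # acceptance condition: keep only the word-aligned part
                accepted = aligned_length(s1, s2, x, y, m[x][y])
                if accepted > longest:
                    longest, x_longest = accepted, x
    return s1[x_longest - longest:x_longest]


s1 = "this is a foo bar sentence ."
print(longest_aligned_substring(s1, "what a kappa foo bar black sheep ?"))
# prints: foo bar
```

The inner scan in aligned_length makes this slower than the plain dynamic program in the worst case, which seems an acceptable trade-off for a sketch on sentence-sized inputs.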

Answered 2025-04-18 by Python大师

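My answer is not based on any official source, just a simple observation: at least on my installation, your LCS function gives different kinds of output for the pairs (s1, s2) and (s1, s3).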

In [1]: s1 = "this is a foo bar sentence ."

In [3]: s2 = "what the foo bar blah blah black sheep is doing ?"

In [4]: s3 = "what a kappa foo bar black sheep ?"

In [12]: longest_common_substring(s1, s3)
Out[12]: 'a foo bar '

In [13]: longest_common_substring(s1, s2)
Out[13]: ' foo bar '

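You may notice that when whole words are matched, the surrounding spaces are matched as well.

You can then modify the function just before it returns its output, like this: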

answer = s1[x_longest - longest: x_longest]
if not (answer.startswith(" ") and answer.endswith(" ")):
    return longest_common_substring(s1, answer[1:])
else:
    return answer

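I'm sure there are other edge cases, such as the substring appearing at the end of a string, whether the recursive call should use s1 or s2, or whether to trim the leading or the trailing end of answer. But at least for the cases you show, this simple modification does what you want: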

In [20]: longest_common_substring(s1, s3)
Out[20]: ' foo bar '

Do you think this direction is worth exploring further?

Answered 2025-04-18 by Python大师

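This is quite simple and easy to follow. I used your code to do 75% of the job. First I split the sentences into words, then I pass those word lists to your function, which returns the longest common sublist (in this case, the longest run of consecutive words). Your function gives me ['foo', 'bar'], and I join the elements of that list to produce the desired result.

Here is an online working copy that you can test, verify and fiddle with.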

http://repl.it/RU0/1

def longest_common_substring(s1, s2):
  m = [[0] * (1 + len(s2)) for i in xrange(1 + len(s1))]
  longest, x_longest = 0, 0
  for x in xrange(1, 1 + len(s1)):
    for y in xrange(1, 1 + len(s2)):
      if s1[x - 1] == s2[y - 1]:
        m[x][y] = m[x - 1][y - 1] + 1
        if m[x][y] > longest:
          longest = m[x][y]
          x_longest = x
      else:
        m[x][y] = 0
  return s1[x_longest - longest: x_longest]

def longest_common_sentence(s1, s2):
    s1_words = s1.split(' ')
    s2_words = s2.split(' ')  
    return ' '.join(longest_common_substring(s1_words, s2_words))


s1 = 'this is a foo bar sentence .'
s2 = 'what a kappa foo bar black sheep ?'
common_sentence = longest_common_sentence(s1, s2)
print common_sentence
# prints: foo bar
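
As a cross-check on the word-list idea, the standard library's difflib.SequenceMatcher can find the longest matching block between two lists of words. This is a minimal Python 3 sketch, not part of the answer above; the function name longest_common_words is hypothetical, and it assumes whitespace splitting is an acceptable definition of a word.

```python
from difflib import SequenceMatcher


def longest_common_words(s1, s2):
    """Return the longest run of consecutive words shared by s1 and s2."""
    w1, w2 = s1.split(), s2.split()
    # autojunk=False switches off the popularity heuristic for long inputs
    matcher = SequenceMatcher(None, w1, w2, autojunk=False)
    match = matcher.find_longest_match(0, len(w1), 0, len(w2))
    return " ".join(w1[match.a:match.a + match.size])


s1 = "this is a foo bar sentence ."
s2 = "what a kappa foo bar black sheep ?"
print(longest_common_words(s1, s2))
# prints: foo bar
```

Because the comparison works on whole tokens, a word can never be cut in half, which is the same property the split-and-join approach above relies on.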

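Edge cases

In your example, symbols such as '.' and '?' are also treated as valid words, as long as there is a space between them and the last word. Without that space they count as part of the last word, in which case 'sheep' and 'sheep?' are no longer the same word. It is up to you to decide what to do with such characters before calling a function like this. For example: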

import re
s1 = re.sub('[.?]','', s1)
s2 = re.sub('[.?]','', s2)

Then continue as usual.
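
If the punctuation should be kept rather than deleted, another option is to tokenize so that each punctuation mark becomes its own word, whether or not a space precedes it. The sketch below is an assumption about how one might do that in Python 3; the helper name tokenize is hypothetical and the pattern is only one reasonable choice.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches any single character that is neither word nor space
    return re.findall(r"\w+|[^\w\s]", sentence)


print(tokenize("what a kappa foo bar black sheep?"))
# prints: ['what', 'a', 'kappa', 'foo', 'bar', 'black', 'sheep', '?']
```

With this tokenizer 'sheep?' and 'sheep ?' yield the same tokens, so the two spellings compare equal at the word level. Note that the pattern also splits on apostrophes and hyphens, which may or may not be what you want.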

Answered 2025-04-18 by Python大师
