SequenceMatcher可以处理多个输入吗，而不仅仅是两个？

4 投票

1 回答

813 浏览

提问于 2025-04-15 21:08

我在想怎么处理这个问题，想知道有没有什么好的库可以用（最好是Python，但如果需要的话我也可以灵活一些）。

我有一个文件，每一行都有一个字符串。我想找出这些字符串中最长的公共模式，以及它们在每一行中的位置。我知道可以用SequenceMatcher来比较第一行和第二行、第一行和第三行，依此类推，然后把结果关联起来，但有没有什么工具可以直接做到这一点呢？

理想情况下，这些匹配可以出现在每一行的任何位置，不过一开始我可以接受它们在每一行的相同位置上出现。像是一个压缩库，能方便地访问它的字符串表，可能是个不错的选择，但到目前为止我还没找到符合这个描述的东西。

比如说，给定这些行：

\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e

我希望能看到在所有行中，0-1和10-12在同一位置匹配，而line1[4,5]与line2[5,6]匹配，line3[7,8]也匹配。

谢谢，

数据处理字符串匹配模式识别压缩算法文本分析序列比对公共模式行匹配

1 个回答

如果你只是想找出每一行中相同位置的公共子串，你只需要像这样做：

matches = []
zipped_strings = zip(s1,s2,s3)
startpos = -1
for i in len(zipped_strings):
  c1,c2,c3 = zipped_strings[i]
  # if you're not inside a match, 
  #  look for matching characters and save the match start position
  if startpos==-1 and c1==c2==c3:
    startpos = i
  # if you are inside a match, 
  #  look for non-matching characters, save the match to matches, reset startpos
  elif startpos>-1 and not c1==c2==c3:
    matches.append((startpos,i,s1[startpos:i]))
    # matches will contain (startpos,endpos,matchstring) tuples
    startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos>-1:
  endpos = len(zipped_strings)
  matches.append((startpos,endpos,s1[startpos:endpos]))

如果你想找出最长的公共模式，不管它们的位置在哪里，使用SequenceMatcher确实是个不错的主意。不过，不用先把字符串1和字符串2比较，再把字符串1和字符串3比较，然后试着合并结果。你可以先找出字符串1和字符串2的所有公共子串（用get_matching_blocks），然后把这些结果逐个和字符串3比较，这样就能找到三个字符串之间的匹配。

回答于 2025-04-15 由 Python大师

分享举报

SequenceMatcher可以处理多个输入吗，而不仅仅是两个？

1 个回答

撰写回答