在有错误的子串中查找

2024-04-25 14:39:21 发布

您现在位置:Python中文网/ 问答频道 /正文

我需要在原谅一定数量的错误(n)的同时,找出A是否是B的子串。A和B都是字符串,n是整数。最大的问题是我的内部while循环。我不知道如何编码它,使它输出我想要的。它应该在B中循环寻找a与n个错误的匹配。如果A是“abcd”,B是“bdefabddghl”,那么输出应该是在位置4处发现A,有1处错误。 这是我现在的密码。我只需要知道如何编写内部while循环。非常感谢您的帮助。你知道吗

def subsequence(A,B,n):
answer = -1             # assume worst case for there-exists type of loop
j = 0                   # index b
while j<=len(B)-len(A) and answer==-1:  # there exists j in b loop
    bx = j              # assume a is found in b starting at b[j]
    i = 0               # assume best case for a for-all type of loop
    axy = n             # accumulator for n
        while i<len(A) and bx==j and axy > 0:   # for all i in a
        if A[i] == B[j-i] and axy > 0:
            bx = j
            axy = n         # accumulator for n
        if A[i] != B[j-i] and axy > 0:
            axy -= 1
        i+=1
        # while i
    j+=1
# while j
end = "A best match with " + str(n-axy) + " errors was found starting at position " + str(bx)."
return end
print subsequence("abcd","bcjabddec",3)

谢谢


Tags: andanswerinloopforlen错误there
2条回答

您可能应该看看difflib.get_close_matches函数。它的作用几乎是一样的。如果您真的想实现它,那么只需构建一个相同长度的所有子序列的列表并按匹配数对它们进行排序,就更容易实现了。你知道吗

def subsequence(a, b, n):

    sb = [(i, b[i:i+len(a)]) for i in xrange(len(b) - len(a) + 1)]

    def count_matches(s):
        return sum(int(x == y) for (x, y) in zip(a, s[1]))

    sb.sort(key=count_matches, reverse=True)

    best = sb[0]

    e = len(a) - count_matches(best)
    i = best[0]
    w = best[1]

    print 'A best match with %d errors was found starting at position %d' % (e, i)

我假设你正在做这个练习,或者类似的:http://www.cs.hofstra.edu/~cscccl/csc15p/dnalab.txt

我不确定开始的例子是否特别有用。它很难理解,因此很难按照指定的方式进行调整。我想就如何解决这些问题提出一些建议:

  1. 把问题分解成更容易解决的小问题。你知道吗
  2. 编写测试每个函数的代码,给出预期的结果。你知道吗
  3. 使用变量名和函数名来明确它们的含义,这样代码对于您自己和您展示给的人来说都更容易理解。你知道吗

你要做的是沿着线B滑动线A,在每个位置测试它的匹配度,记住你找到的最佳匹配。问题中一个容易区分的部分是测量两个字符串在特定对齐方式下的匹配程度。你知道吗

以下是我对解决方案的一般形式的建议:

def string_distance(A, B, index_b):
    '''Count how many substitutions are required to
    change string A into the string beginning at
    index_b in string B.'''
    return 999 # FIXME

def subsequence(A, B, n):
    '''Search for the substring of B that is equal to
    A except for the minimal number of substitutions,
    and return a tuple of the index of that substring
    in B and the number of substitutions required,
    unless there is no such substring with n or fewer
    substitutions, in which case return the tuple of
    (-1, -1).'''
    answer_distance = -1
    answer_index = -1
    index = 0
    while index <= len(B) - len(A):
        substitutions_required = string_distance(A, B, index)
        # Does a match at this location need n or fewer substitutions?
        is_close_enough = False # FIXME
        # Is a match at this location better than the best match so far?
        is_better = False # FIXME
        if is_close_enough and is_better:
            answer_distance = substitutions_required
            answer_index = index
        index += 1
    return answer_index, answer_distance

下面是一些基本测试的样子:

def test_string_distance():
    test_data = [
        # a         b          index_b   expected
        ("ABC",    "ABCDEF",   0,        0),
        ("ABX",    "ABCDEF",   0,        1),
        ("XYZ",    "ABCDEF",   0,        3),
        ("XBC",    "ABCDEF",   0,        1),
        ("CEE",    "ABCDEF",   2,        1),
        ("DEF",    "ABCDEF",   3,        0),
        ("AAAAA",  "BBBBBBBB", 3,        5),
        ("BAAAA",  "BBBBBBBB", 3,        4),
        ("ABAAB",  "BBBBBBBB", 3,        3),
    ]
    for a, b, index_b, expected in test_data:
        result = string_distance(a, b, index_b)
        if result != expected:
            print "string_distance({}, {}, {}) was {} but should be {}".format(
                    a, b, index_b, result, expected)

def test_subsequence():
    test_data = [
        # A         B               n   expected
        ("AXY",    "AYAXXXAAYYAX",  3,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  2,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  1,  (2,1)),
        ("AXY",    "AYAXXXAAYYAX",  0,  (-1,-1)),
        ("XAAY",   "AYAXXXAAYYAX",  2,  (5,0)),
        ("XAAY",   "XXAXAAXAAY",    2,  (6,0)),
        ("ABCDEF", "ZZAABAXCDEEEF", 3,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 2,  (5,2)),
        ("ABCDEF", "ZZAABAXCDEEEF", 1,  (-1,-1)),
        ("ABCDEF", "ZZAABXBCDEEEF", 3,  (5,2)),
    ]
    for a, b, n, expected in test_data:
        result = subsequence(a, b, n)
        if result != expected:
            print "test_subsequence({}, {}, {}) was {} but should be {}".format(
                    a, b, n, result, expected)

if __name__ == '__main__':
    test_string_distance()
    test_subsequence()

首先找出一个字符串距离的实现,它将通过这些测试。然后开始工作。你知道吗

相关问题 更多 >