Levenshtein distance in a file

Posted 2024-04-25 13:24:18


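The statement says:

Modify the program above so that, given the pattern GGCCTTGCCATTGG, it reports the following for each of the first 10 lines of the previous file:

· the edit distance of the substring of that line that is most similar to the pattern

· the substring of that line that has the minimum edit distance

The program above is as follows: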

import time

def levenshtein_distance(first, second):
    """Return the Levenshtein distance between two strings."""
    # Make `first` the shorter of the two strings.
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    # distance_matrix[i][j] is the distance between first[:i] and second[:j].
    distance_matrix = [[0] * second_length for _ in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            # Substitution costs nothing when the characters already match.
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

def dna(patro):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    with open("HUMAN-DNA.txt") as f:
        text = f.readlines()

    distanciaMin = 100000000
    distanciaPosicion = 0
    distanciaLinea = 0
    distanciaSubstring = ""
    numeroLinea = 0
    for line in text:
        numeroLinea = numeroLinea + 1
        # Slide a window of len(patro) characters along the line.
        for i in range(len(line)-len(patro)):
            cadena = line[i:i+len(patro)]
            distancia = levenshtein_distance(cadena, patro)
            if distancia < distanciaMin:
                distanciaMin = distancia
                distanciaPosicion = i  # start index of the best window
                distanciaLinea = numeroLinea
                distanciaSubstring = cadena

    t2 = time.perf_counter()

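Now I set the new pattern:

[code block missing from the source]

I get an edit distance, but I am not sure the distance is correct, nor the substring of that line (the second point of the statement). My question is: how can I count the first ten lines in the text?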
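One way to gain confidence in the distance values is to run a distance function on pairs whose Levenshtein distance is well known before applying it to the DNA file. The sketch below is a compact two-row variant written only for this check; the name `edit_distance` is illustrative and not part of the program above, and it assumes a cost of 1 for each insertion, deletion and substitution.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (ca != cb)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


# Pairs with well-known distances make a quick sanity check.
print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("flaw", "lawn"))       # 2
print(edit_distance("GGCC", "GGCC"))       # 0
```

If `levenshtein_distance` returns different values for the same pairs, the per-operation costs inside its inner loop are the first thing to check.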

Part of the file is:

CCCATCTCTTTCTCATTCCTTGGTTGAGAACACGAACTTCAGGACTTGCCTCACACTAGGGCCCATTCTT
TGTTTCCCAGAAAGAAGAGGCTCTCCACACAGAGTCCCATGTACACCAGGCTGTCAACAAACATGAATTG
AATGAAGGAGTGGATGGTTGGGTGGAAGTGATTTAAGAAATCCTAACTGGGGAATTTCACTGGAAACTTA
GGAAATTCAATTTATATAAAGTCTATGAATCGTCCATTTTTGTGTCCGCACATTCAAATGCTGTAGCTAA
TTTCCTGCTAAACAGTAGAAATTCAGTAAGTGTTCATGTTGAAAGGATGAAATTTGAGTGCTCTTGCATC
CTCAAAGAACTCTAGTAAAATAGAAATAAAGCTTTATTTGGAAGATTAAGTCATGAGCATAATTATGAGA
AGGCGGTCATTCTAATAATAGTGTCTTCACAAGTAGATGCTACATGCTGTGTAATATTTTGACTAAAAAA
AGTTCCTCTCAACATTTCTGAAGTGAGATAATGTACAACGATCCATGTTTTTAGCTACCTTGATAAGTTT
AGTGCATCCAGGGCTCCTTTCTTACCTGCTAACCGCCGAGTTTCAAATGCTAAGAAATTCTTCATTTCCT
AACACAAATATTCAATATAATTGCTGGTTGTTTGGGAGAAGAAAAATTTAGAATTCAGAAAGAAATACAG
AATGAAATGTTCTAATCAATCGAAAAAGGATTCTATAGACTTCGACGTTGTCTGGTTTACAAAGCAGTCT

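1 answer
User
#1 · Posted 2024-04-25 13:24:18

I don't understand your whole question, but I'll try to address "How can i count the first ten lines in the text?". You can use filehandler.readlines(). It loads the file into memory as a list in which each element is one line, ending in its newline character. You can then read 10 lines from that list. You could try something like this: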

>>> a = [0,1,2,3,4,5,6,7,8,9]  # stands in for the list of lines read from the file
>>> def line(a, jump=2):  # use jump=10 for your requirement
...     lines = len(a)
...     i = 0
...     while i < lines:  # stop at the end so no empty chunk is yielded
...         yield a[i:i+jump]
...         i += jump
...
>>> foo = line(a)
>>> next(foo)
[0, 1]
>>> next(foo)
[2, 3]
>>> next(foo)
[4, 5]
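If only the first ten lines are needed, the rest of the file does not have to be loaded at all. A minimal sketch using `itertools.islice` from the standard library; the helper name `first_lines` is made up for illustration.

```python
from itertools import islice


def first_lines(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as handle:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

With the file from the question this would be called as `first_lines("HUMAN-DNA.txt")`. Slicing the list returned by `readlines()` (for example `text[:10]`) gives the same lines, with their newlines kept, but reads the whole file first.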

For your code:

[code block missing from the source]
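A minimal sketch of how the two points of the statement might be handled, assuming unit-cost edits and windows of the same length as the pattern, as in the question's `dna` function. The names `levenshtein`, `best_window` and `report_first_lines` are hypothetical and do not come from the original post.

```python
from itertools import islice


def levenshtein(a, b):
    """Levenshtein distance with unit costs (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def best_window(line, pattern):
    """Return (distance, position, substring) of the closest window in line."""
    best = None
    for i in range(len(line) - len(pattern) + 1):
        window = line[i:i + len(pattern)]
        distance = levenshtein(window, pattern)
        if best is None or distance < best[0]:
            best = (distance, i, window)
    return best  # None when the line is shorter than the pattern


def report_first_lines(path, pattern, limit=10):
    """Collect (line number, distance, substring) for the first `limit` lines."""
    results = []
    with open(path) as handle:
        for number, raw in enumerate(islice(handle, limit), start=1):
            found = best_window(raw.strip(), pattern)
            if found is not None:
                distance, _, window = found
                results.append((number, distance, window))
    return results
```

Calling `report_first_lines("HUMAN-DNA.txt", "GGCCTTGCCATTGG")` would give one tuple per line. The main difference from `dna` is that the minimum is tracked per line instead of across the whole file. Because every window has the pattern's length, a substring of a different length that matches better is not considered.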
