查找python函数以查找字符串中最长的顺序重复子字符串

2024-05-16 08:20:03 发布

您现在位置:Python中文网/ 问答频道 /正文

我正在对DNA序列进行编码,我感兴趣的是一个寻找序列重复的函数(这可能表示引物可能“滑倒”或做坏事的地方)

我感兴趣的一个例子如下:

longest_repeat('ATTTTCCATGATGATG')

这将输出重复长度和坐标,在本例中为9长和7:15。函数应该在末尾拾取ATGATG,因为它比TTTT重复和TGA重复更长,所以它只报告ATGATG。就领带而言,我想知道它是否能报告所有的领带,或者至少其中一条

如果设置一个阈值,只报告超过特定长度的连续重复,也会很好

我在python方面有一些经验,但这个特定的问题一直困扰着我,因为如果我编写的代码效率低下,并且输入一个50个字符长的字符串,可能会花费很长时间。我感谢所有的帮助


Tags: 函数编码longest报告地方序列感兴趣例子
2条回答

以下内容将非常有效。它返回最长的序列、长度、起始索引和结束索引。如果存在多个最大长度的序列,结果将是它们的列表。函数longest(s,threshold)中的第二个参数是所需的阈值最小长度:

import numpy as np

def b(n): #it returns the factors of an integer. It will be used in next function
    r = np.arange(1, int(n ** 0.5) + 1)
    x = r[np.mod(n, r) == 0]
    return set(np.concatenate((x, n / x), axis=None))
   
def isseq(s): #it tests if a string is a sequence. Using the result from previous function it compares all smaller parts of the devided string to check if they are equal
    l=[int(p) for p in sorted(list(b(len(s))))[:-1]]
    for i in l:
        if len(set(s[k*i:i*(k+1)] for k in range(len(s)//i)))==1:
            return True
    return False

def longest(s, threshold): #the main function that returns the lenghtier sequense or a list of them if they are multiple, using a threshold as minimum length
    m=[]
    for i in range(len(s), threshold-1, -1):
        for k in range(len(s)-i+1):
            if isseq(s[k:k+i]):
                m.append([s[k:k+i], i, k, k+i-1])
        if len(m)>0:
            return m
    return False

示例:

>>>s='ATTTTCCATGATGATGGST'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15]]

>>> s='ATTTTCCATGATGATGGSTLWELWELWEGFRJGHIJH'
>>> longest(s, 1)
[['ATGATGATG', 9, 7, 15], ['LWELWELWE', 9, 19, 27]]


>>>s='ATTTTCCATGATGATGGSTWGTKWKWKWKWKWKWKWKWKWKWKWFRGWLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERFGTFRGFTRUFGFGRFGRGBHJ'
>>> longest(longs, 1)
[['LWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWERLWER', 64, 48, 111]]

以下是一个解决方案:

def longest_repeat(seq, threshold):
    results = []
    longest = threshold
    
    # starting position
    for i in range(len(seq)):
        
        # pattern period
        for p in range(1, (len(seq)-i)//2+1):
            # skip unecessary combinations
            if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
            
            # max possible number of repetitions
            repetitions = len(seq)//p
            
            # position within the pattern's period
            for k in range(p):
                # get the max repetitions the k-th character in the period can support
                m = 1
                while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
                    m += 1
                repetitions = min(m, repetitions)
                
                # check if we're already below the best result so far 
                if repetitions*p < longest:    break
            
            # save the result if it's good
            if repetitions > 1 and repetitions*p >= longest:
                # overwrite lesser results
                if repetitions*p > longest: results = []
                
                # store the current one (with ample information)
                results += [(i, seq[i:i+p], repetitions, repetitions*p)]
                longest = max(longest, repetitions*p)
    
    return results

逻辑是,您运行序列中的每个起始位置(i),检查每个合理的模式周期(p),并为该组合检查它们是否产生至少与迄今为止最好的子字符串一样好的子字符串(或者阈值,如果尚未找到结果)

结果是(starting index, period string, repetitions, total length)形式的元组列表。以身作则

threshold = 5
seq = 'ATTTCCATGATGATG'

t = time.time()
results = longest_repeat(seq, threshold)
print("execution time :", time.time()-t)

for t in results:
    print(t)

我们得到

exec : 0.00010848045349121094
(6, 'ATG', 3, 9)

从那里,获得完全匹配的字符串是很简单的(只需执行period_string * repetitions

对于700个字符的随机输入,执行时间约为6.8秒,而使用@IoaTzimas的答案时执行时间约为20.2秒

相关问题 更多 >