<p>以下是一个解决方案:</p>
<pre class="lang-py prettyprint-override"><code>def longest_repeat(seq, threshold):
results = []
longest = threshold
# starting position
for i in range(len(seq)):
# pattern period
for p in range(1, (len(seq)-i)//2+1):
# skip unecessary combinations
if results != [] and results[-1][0] == i and results[-1][3] % p == 0: continue
# max possible number of repetitions
repetitions = len(seq)//p
# position within the pattern's period
for k in range(p):
# get the max repetitions the k-th character in the period can support
m = 1
while i+k+m*p < len(seq) and seq[i+k] == seq[i+k+m*p]:
m += 1
repetitions = min(m, repetitions)
# check if we're already below the best result so far
if repetitions*p < longest: break
# save the result if it's good
if repetitions > 1 and repetitions*p >= longest:
# overwrite lesser results
if repetitions*p > longest: results = []
# store the current one (with ample information)
results += [(i, seq[i:i+p], repetitions, repetitions*p)]
longest = max(longest, repetitions*p)
return results
</code></pre>
<p>逻辑是,您运行序列中的每个起始位置(<code>i</code>),检查每个合理的模式周期(<code>p</code>),并为该组合检查它们是否产生至少与迄今为止最好的子字符串一样好的子字符串(或者阈值,如果尚未找到结果)</p>
<p>结果是<code>(starting index, period string, repetitions, total length)</code>形式的元组列表。以身作则</p>
<pre class="lang-py prettyprint-override"><code>threshold = 5
seq = 'ATTTCCATGATGATG'
t = time.time()
results = longest_repeat(seq, threshold)
print("execution time :", time.time()-t)
for t in results:
print(t)
</code></pre>
<p>我们得到</p>
<pre><code>exec : 0.00010848045349121094
(6, 'ATG', 3, 9)
</code></pre>
<p>从那里,获得完全匹配的字符串是很简单的(只需执行<code>period_string * repetitions</code>)</p>
<p>对于700个字符的随机输入,执行时间约为6.8秒,而使用@IoaTzimas的答案时执行时间约为20.2秒</p>