<p>An obviously quadratic-time approach, but with a relatively low constant factor because it builds no substring objects apart from those of length 1. The result is a 2-tuple</p>
<pre><code>bestlen, list_of_results
</code></pre>
<p>where <code>bestlen</code> is the length of the longest stretch of repeated adjacent blocks, and each result is a 3-tuple</p>
<pre><code>start_index, width, numreps
</code></pre>
<p>meaning that the substring being repeated is</p>
<pre><code>the_string[start_index : start_index + width]
</code></pre>
<p>and there are <code>numreps</code> adjacent copies of it. It's always the case that</p>
<pre><code>bestlen == width * numreps
</code></pre>
<p>The problem description leaves ambiguities. For example, consider this output:</p>
<pre><code>>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
</code></pre>
<p>So it found 5 ways to view "the longest" stretch as having length 6:</p>
<ul>
<li>the initial "a" repeated 6 times</li>
<li>the initial "aa" repeated 3 times</li>
<li>the leftmost instance of "ab" repeated 3 times</li>
<li>the leftmost instance of "ba" repeated 3 times</li>
<li>the initial "aaa" repeated 2 times</li>
</ul>
<p>It doesn't return the intro or outro, because those are trivial to deduce from what it does return:</p>
<ul>
<li>the intro is <code>the_string[: start_index]</code></li>
<li>the outro is <code>the_string[start_index + bestlen :]</code></li>
</ul>
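<p>Concretely, using the <code>crunch2("EEEFGAFFGAFFGAFCD")</code> result shown later in this answer, the slices work out like this (a quick sanity check, not part of the algorithm):</p>

```python
s = "EEEFGAFFGAFFGAFCD"
# result (12, [(3, 4, 3)]): bestlen 12, block "FGAF" repeated 3 times at index 3
bestlen, (start_index, width, numreps) = 12, (3, 4, 3)
assert bestlen == width * numreps
assert s[start_index : start_index + bestlen] == "FGAF" * 3
assert s[:start_index] == "EEE"               # the intro
assert s[start_index + bestlen :] == "CD"     # the outro
```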
<p>If there are no repeated adjacent blocks, it returns</p>
<pre><code>(0, [])
</code></pre>
<p>Other examples (from your post):</p>
<pre><code>>>> crunch2("EEEFGAFFGAFFGAFCD")
(12, [(3, 4, 3)])
>>> crunch2("ACCCCCCCCCA")
(9, [(1, 1, 9), (1, 3, 3)])
>>> crunch2("ABCD")
(0, [])
</code></pre>
<p>The key to how it works: suppose you have adjacent repeated blocks, each of width <code>W</code>. Then consider what happens when you compare the original string to the string shifted left by <code>W</code>:</p>
<pre><code>... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
</code></pre>
<p>Then you get <code>(N-1)*W</code> consecutive equal characters at the same positions. But it <em>also</em> works in the other direction: if you shift left by <code>W</code> and find <code>(N-1)*W</code> consecutive equal characters, then you can deduce:</p>
<pre><code>block1 == block2
block2 == block3
...
blockN-1 == blockN
</code></pre>
<p>so all <code>N</code> blocks must be repetitions of block1.</p>
<p>So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both, identifying the longest runs of equal characters. That only ever requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a <code>collections.deque</code>.</p>
<p>EDIT: <code>update()</code> did far too much futile work in many cases; replaced it.</p>
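<p>A tiny illustration of that shift argument (a throwaway check with <code>N = 3</code> copies of a hypothetical block):</p>

```python
N, W = 3, 4
s = "FGAF" * N                  # N adjacent blocks, each of width W
shifted = s[W:]                 # the string shifted left by W
equal = sum(a == b for a, b in zip(s, shifted))
assert equal == (N - 1) * W     # (N-1)*W consecutive equal characters
```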
<pre><code>def crunch2(s):
from collections import deque
# There are zcount equal characters starting
# at index starti.
def update(starti, zcount):
nonlocal bestlen
while zcount >= width:
numreps = 1 + zcount // width
count = width * numreps
if count >= bestlen:
if count > bestlen:
results.clear()
results.append((starti, width, numreps))
bestlen = count
else:
break
zcount -= 1
starti += 1
bestlen, results = 0, []
t = deque(s)
for width in range(1, len(s) // 2 + 1):
t.popleft()
zcount = 0
for i, (a, b) in enumerate(zip(s, t)):
if a == b:
if not zcount: # new run starts here
starti = i
zcount += 1
# else a != b, so equal run (if any) ended
elif zcount:
update(starti, zcount)
zcount = 0
if zcount:
update(starti, zcount)
return bestlen, results
</code></pre>
<h2>Using a regexp</h2>
<p>[removed due to the size limit]</p>
<h2>Using a suffix array</h2>
<p>This is the fastest I've found so far, although it can still be provoked into quadratic-time behavior.</p>
<p>Note that it doesn't much matter whether overlapping strings are found. As explained for the <code>crunch2()</code> program above (elaborated here in minor ways):</p>
<ul>
<li>given a string <code>s</code> of length <code>n = len(s)</code></li>
<li>given ints <code>i</code> and <code>j</code> with <code>0 <= i < j < n</code></li>
</ul>
<p>then, if <code>w = j-i</code>, and <code>c</code> is the count of leading characters common to <code>s[i:]</code> and <code>s[j:]</code>, the block <code>s[i:j]</code> (of length <code>w</code>) is repeated, starting at <code>s[i]</code>, a total of <code>1 + c // w</code> times.</p>
<p>The program below uses that directly to find all repeated adjacent blocks, and remembers those of maximal total length. It returns the same results as <code>crunch2()</code>, but sometimes in a different order.</p>
<p>A suffix array eases the search, but hardly eliminates it. A suffix array directly finds the <code>&lt;i, j&gt;</code> pairs with maximal <code>c</code>, but only limits the search to maximizing <code>w * (1 + c // w)</code>. Worst cases are strings of the form <code>letter * number</code>, like <code>"a" * 10000</code>.</p>
<p>I'm not giving code for the <code>sa</code> module below. It's long-winded, and any implementation of suffix arrays computes the same things. The outputs of <code>suffix_array()</code>:</p>
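<p>That inference can be spelled out in a few lines; <code>reps_at()</code> is a hypothetical helper, just to check the arithmetic against examples from above:</p>

```python
from os.path import commonprefix

def reps_at(s, i, j):
    # Given 0 <= i < j, with w = j-i and c = the number of leading
    # characters common to s[i:] and s[j:], the block s[i:j] is
    # repeated 1 + c // w times starting at s[i].
    w = j - i
    c = len(commonprefix([s[i:], s[j:]]))
    return 1 + c // w

s = "EEEFGAFFGAFFGAFCD"
assert s[3:7] == "FGAF"
assert reps_at(s, 3, 7) == 3            # matches the (3, 4, 3) result above
assert reps_at("ACCCCCCCCCA", 1, 2) == 9  # matches (1, 1, 9)
```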
<ul>
<li><p><code>sa</code> is the suffix array, the unique permutation of <code>range(n)</code> such that <code>s[sa[i-1]:] < s[sa[i]:]</code> for all <code>i</code> in <code>range(1, n)</code></p>
</li>
<li><p><code>rank</code> isn't used here</p>
</li>
<li><p>for <code>i</code> in <code>range(1, n)</code>, <code>lcp[i]</code> gives the length of the longest common prefix of the suffixes starting at <code>sa[i-1]</code> and <code>sa[i]</code></p>
</li>
</ul>
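<p>For reference, a naive stand-in with the same interface can be written in a few lines (far too slow for real use, roughly O(n**2 log n), but any proper implementation computes the same three arrays):</p>

```python
def suffix_array(s):
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # suffixes in sorted order
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r                             # rank is sa's inverse
    lcp = [0] * n
    for r in range(1, n):
        a, b = s[sa[r-1]:], s[sa[r]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[r] = k          # lcp of adjacent suffixes in sorted order
    return sa, rank, lcp

sa, rank, lcp = suffix_array("banana")
assert sa == [5, 3, 1, 0, 4, 2]   # a, ana, anana, banana, na, nana
assert lcp == [0, 1, 3, 0, 0, 2]
```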
<p>Why does it win? In part because it never has to search to find suffixes starting with the same letter (the suffix array, by construction, makes them adjacent), and checking a repeated block, and for whether it's a new best, takes small constant time regardless of how large the block is or how many times it's repeated. As above, that's just trivial arithmetic on <code>c</code> and <code>w</code>.</p>
<p>Disclaimer: suffix arrays/trees are like continued fractions for me: I can use them when I have to, and can marvel at the results, but they give me a headache. Touchy, touchy, touchy.</p>
<pre><code>def crunch4(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(n-1):
i = sa[sai]
c = n
for saj in range(sai + 1, n):
c = min(c, lcp[saj])
if not c:
break
j = sa[saj]
w = abs(i - j)
if c < w:
continue
numreps = 1 + c // w
assert numreps > 1
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
</code></pre>
<h2>Some timings</h2>
<p>I read a modest file of English words into a string, <code>xs</code>. One word per line:</p>
<pre><code>>>> len(xs)
209755
>>> xs.count('\n')
25481
</code></pre>
<p>So, about 25K words in about 210K bytes. These are quadratic-time algorithms, so I didn't expect it to go fast, but <code>crunch2()</code> was still running after hours - and <em>still</em> running when I let it go overnight.</p>
<p>Which caused me to realize its <code>update()</code> function could do an enormous amount of futile work, making the algorithm more like cubic-time overall. So I repaired that. Then:</p>
<pre><code>>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'
</code></pre>
<p>That took about 38 minutes, which was in the ballpark of what I expected.</p>
<p>The regexp version <code>crunch3()</code> took less than a tenth of a second!</p>
<pre><code>>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'
</code></pre>
<p>As explained before, the regexp version may not find the best answer, but something else is at work here too: by default "." doesn't match a newline, so the code was in fact doing a great many <em>tiny</em> searches. Each of the ~25K newlines in the file effectively ends the local search range. Compiling the regexp with the <code>re.DOTALL</code> flag instead (so newlines aren't treated specially):</p>
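<p>The newline effect is easy to see with a toy doubled-block pattern (not <code>crunch3()</code>'s actual regexp, just an illustration):</p>

```python
import re

s = "ab\nab\nab"
# By default '.' won't cross '\n', so each line is searched in isolation,
# and no doubled substring exists within any single line here.
assert re.search(r"(.+)\1", s) is None
# With DOTALL, the repeated adjacent block "ab\n" is found.
m = re.search(r"(.+)\1", s, re.DOTALL)
assert m.group(1) == "ab\n"
```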
<pre><code>>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])
</code></pre>
<p>A bit over 14 minutes.</p>
<p>And finally,</p>
<pre><code>>>> crunch4(xs)
(44, [(63750, 22, 2)])
</code></pre>
<p>A bit under 9 minutes. Building the suffix array was an insignificant part of that (under a second). That's actually pretty impressive, since the not-always-right brute-force regexp version is slower despite running almost entirely "at C speed".</p>
<p>But that's in a relative sense. In an absolute sense, all of these are still pig slow :-(</p>
<p>NOTE: the version in the next section cuts this to under 5 seconds(!)</p>
<h2>Enormously faster</h2>
<p>This one takes a completely different approach. For the largish dictionary example above, it gets the right answer in under 5 seconds.</p>
<p>I'm rather proud of it ;-) It was unexpected, and I haven't seen this approach before. It doesn't do any string searching, just integer arithmetic on sets of indices.</p>
<p>It remains dreadfully slow for inputs of the form <code>letter * largish_integer</code>. As is, it keeps going up by 1 so long as at least two copies of a substring (of the current length being considered) exist - not necessarily adjacent, or even non-overlapping! For example, in</p>
<pre><code>'x' * 1000000
</code></pre>
<p>it will try all substring sizes from 1 through 999999.</p>
<p>However, it seems that could be greatly improved by repeatedly doubling the current size (instead of just adding 1), saving the classes along the way, and doing a mixed form of binary search to locate the largest substring size for which a repetition exists.</p>
<p>Which I'll leave as a doubtless tedious exercise for the reader. My work here is done ;-)</p>
<pre><code>def crunch5(text):
from collections import namedtuple, defaultdict
# For all integers i and j in IxSet x.s,
# text[i : i + x.w] == text[j : j + x.w].
# That is, it's the set of all indices at which a specific
# substring of length x.w is found.
# In general, we only care about repeated substrings here,
# so weed out those that would otherwise have len(x.s) == 1.
IxSet = namedtuple("IxSet", "s w")
bestlen, results = 0, []
# Compute sets of indices for repeated (not necessarily
# adjacent!) substrings of length xs[0].w + ys[0].w, by looking
# at the cross product of the index sets in xs and ys.
def combine(xs, ys):
xw, yw = xs[0].w, ys[0].w
neww = xw + yw
result = []
for y in ys:
shifted = set(i - xw for i in y.s if i >= xw)
for x in xs:
ok = shifted & x.s
if len(ok) > 1:
result.append(IxSet(ok, neww))
return result
# Check an index set for _adjacent_ repeated substrings.
def check(s):
nonlocal bestlen
x, w = s.s.copy(), s.w
while x:
current = start = x.pop()
count = 1
while current + w in x:
count += 1
current += w
x.remove(current)
while start - w in x:
count += 1
start -= w
x.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
ch2ixs = defaultdict(set)
for i, ch in enumerate(text):
ch2ixs[ch].add(i)
size1 = [IxSet(s, 1)
for s in ch2ixs.values()
if len(s) > 1]
del ch2ixs
for x in size1:
check(x)
current_class = size1
# Repeatedly increase size by 1 until current_class becomes
# empty. At that point, there are no repeated substrings at all
# (adjacent or not) of the then-current size (or larger).
while current_class:
current_class = combine(current_class, size1)
for x in current_class:
check(x)
return bestlen, results
</code></pre>
<h2>And faster still</h2>
<p><code>crunch6()</code> drops the largish dictionary example to under 2 seconds on my box. It combines ideas from <code>crunch4()</code> (suffix and lcp arrays) and <code>crunch5()</code> (find all arithmetic progressions with a given stride in a set of indices).</p>
<p>Like <code>crunch5()</code>, this also loops around a number of times equal to one more than the length of the longest repeated substring (overlapping or not). For if there's no repeat of length <code>n</code>, there's none for any size larger than <code>n</code> either. That makes finding repeats without regard to overlap easier, because it's an exploitable limitation. When constraining "wins" to adjacent repeats, that breaks down: for example, there's no adjacent repeat of even length 1 in "abcabc", but there is one of length 3. That appears to make any form of direct binary search futile (the presence or absence of adjacent repeats of size <code>n</code> says nothing about the existence of adjacent repeats of any other size).</p>
<p>Inputs of the form <code>'x' * n</code> remain miserable. There are repeats of all lengths from 1 through <code>n-1</code>.</p>
<p>Observation: all the programs I've given generate all possible ways of breaking up maximal-length stretches of repeated adjacent blocks. For example, for a string of 9 "x", it says that can be gotten by repeating "x" 9 times, or by repeating "xxx" 3 times. So, surprisingly, they can all be used as factoring algorithms too ;-)</p>
<pre><code>def crunch6(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate maximal sets of indices s such that for all i and j
# in s the suffixes starting at s[i] and s[j] start with a
# common prefix of at least len minc.
def genixs(minc, sa=sa, lcp=lcp, n=n):
i = 1
while i < n:
c = lcp[i]
if c < minc:
i += 1
continue
ixs = {sa[i-1], sa[i]}
i += 1
while i < n:
c = min(c, lcp[i])
if c < minc:
yield ixs
i += 1
break
else:
ixs.add(sa[i])
i += 1
else: # ran off the end of lcp
yield ixs
# Check an index set for _adjacent_ repeated substrings
# w apart. CAUTION: this empties s.
def check(s, w):
nonlocal bestlen
while s:
current = start = s.pop()
count = 1
while current + w in s:
count += 1
current += w
s.remove(current)
while start - w in s:
count += 1
start -= w
s.remove(start)
if count > 1:
total = count * w
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((start, w, count))
c = 0
found = True
while found:
c += 1
found = False
for s in genixs(c):
found = True
check(s, c)
return bestlen, results
</code></pre>
<h2>Always fast, and published, but sometimes wrong</h2>
<p>In bioinformatics, it turns out this is studied under the names "tandem repeats", "tandem arrays", and "simple sequence repeats" (SSRs). You can search for those terms to find quite a few academic papers, some claiming worst-case linear-time algorithms.</p>
<p>But those seem to fall into two camps:</p>
<ol>
<li>Linear-time algorithms of the kind described here, which are actually wrong :-(</li>
<li>Algorithms so complicated that it would take a lot of dedication to even try to turn them into working code :-(</li>
</ol>
<p>In the first camp, there are several papers that boil down to <code>crunch4()</code> above, but <em>without</em> its inner loop. For example:</p>
<blockquote>
<p>"SA-SSR: a suffix array-based algorithm for exhaustive and efficient SSR discovery in large genetic sequences."</p>
<p>Pickett et alia</p>
<p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/" rel="nofollow noreferrer">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5013907/</a></p>
</blockquote>
<p><code>crunch4a()</code> is always fast, but sometimes wrong. In fact it finds at least one maximal repeated stretch for every example that's appeared here, solves the largish dictionary example in a fraction of a second, and has no problem with strings of the form <code>'x' * 1000000</code>. The bulk of the time is spent building the suffix and lcp arrays. But it can fail:</p>
<pre><code>>>> x = "bcdabcdbcd"
>>> crunch4(x) # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])
</code></pre>
<p>The problem is that there's no guarantee that the <em>relevant</em> suffixes are adjacent in the suffix array. The suffixes that start with "b" are ordered like so:</p>
<pre><code>bcd
bcdabcdbcd
bcdbcd
</code></pre>
<p>Finding the trailing repeated blocks by this approach requires comparing the first with the third. That's why <code>crunch4()</code> has an inner loop, to try <em>all</em> pairs starting with a common letter: the relevant pair can be separated by an arbitrary number of other suffixes in the suffix array. But that also makes the algorithm quadratic time.</p>
<pre><code># only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
from sa import suffix_array
sa, rank, lcp = suffix_array(s)
bestlen, results = 0, []
n = len(s)
for sai in range(1, n):
i, j = sa[sai - 1], sa[sai]
c = lcp[sai]
w = abs(i - j)
if c >= w:
numreps = 1 + c // w
total = w * numreps
if total >= bestlen:
if total > bestlen:
results.clear()
bestlen = total
results.append((min(i, j), w, numreps))
return bestlen, results
</code></pre>
<h2>O(n log n)</h2>
<p>This paper looks right to me, although I haven't coded it:</p>
<blockquote>
<p>"Simple and Flexible Detection of Contiguous Repeats Using a Suffix Tree"</p>
<p>Jens Stoye, Dan Gusfield</p>
<p><a href="https://csiflabs.cs.ucdavis.edu/%7Egusfield/tcs.pdf" rel="nofollow noreferrer">https://csiflabs.cs.ucdavis.edu/~gusfield/tcs.pdf</a></p>
</blockquote>
<p>However, playing along with a subquadratic algorithm requires accepting some compromises. For example, <code>"x" * n</code> has <code>n-1</code> substrings of the form <code>"x"*2</code>, <code>n-2</code> of the form <code>"x"*3</code>, and so on, so there are <code>O(n**2)</code> of those alone. Therefore any algorithm that finds <em>all</em> of them is necessarily also at best quadratic time.</p>
<p>Read the paper for the details ;-) One concept you're looking for is "primitive": I believe you only want repeats of the form <code>S*n</code> where <code>S</code> cannot itself be expressed as a repetition of a shorter string. For example, the accounting <code>"x" * 10</code> is primitive, but <code>"xx" * 5</code> is not (its base <code>"xx"</code> is itself <code>"x" * 2</code>).</p>
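<p>A primitivity test itself is a one-liner via the classic rotation trick (a hedged aside, not from the paper): <code>s</code> is non-primitive exactly when <code>s</code> occurs inside <code>s + s</code> at an index strictly between 0 and <code>len(s)</code>.</p>

```python
def is_primitive(s):
    # s is a repetition of a shorter string iff s occurs in (s + s)
    # at an index other than 0 and len(s).
    return (s + s).find(s, 1) == len(s)

assert is_primitive("x")        # base of the "x" * 10 accounting
assert not is_primitive("xx")   # "xx" is itself "x" * 2
assert is_primitive("ab") and not is_primitive("abab")
```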
<h2>One step toward <code>O(n log n)</code></h2>
<p><code>crunch9()</code> is an implementation of the "brute force" algorithm I mentioned in the comments, from:</p>
<blockquote>
<p>"The enhanced suffix array and its applications to genome analysis"</p>
<p>Ibrahim et alia</p>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf" rel="nofollow noreferrer">http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.2217&rep=rep1&type=pdf</a></p>
</blockquote>
<p>The implementation sketch there only finds "branching tandem" repeats; I've added code here to deduce repeats of any number of repetitions, and to include non-branching repeats too. While it's still <code>O(n**2)</code> worst case, it's faster than anything else here for the <code>seq</code> string you pointed to in the comments. As is, it reproduces (except for order) the same exhaustive accounting as most of the other programs here.</p>
<p>The paper goes on to work hard to cut the worst case to <code>O(n log n)</code>, but that slows it down a lot. So it works harder instead. I confess I lost interest ;-)</p>
<pre><code># Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
interval = i_c, lb, i - 1
yield interval
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch9(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# generate branching tandem repeats
def gen_btr(text=text, n=n, sa=sa):
for c, lb, rb in genlcpi(lcp):
i = sa[lb]
basic = text[i : i + c]
# Binary searches to find subrange beginning with
# basic+basic. A more gonzo implementation would do this
# character by character, never materializing the common
# prefix in `basic`.
rb += 1
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i + c] < basic:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if basic < text[i : i + c]:
rb = mid
else:
lo = mid + 1
lead = basic[0]
for sai in range(lb, rb):
i = sa[sai]
j = i + 2*c
assert j <= n
if j < n and text[j] == lead:
continue # it's non-branching
yield (i, c, 2)
for start, c, _ in gen_btr():
# extend left
numreps = 2
for i in range(start - c, -1, -c):
if all(text[i+k] == text[start+k] for k in range(c)):
start = i
numreps += 1
else:
break
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start:
if text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
else:
break
return bestlen, results
</code></pre>
<h2>Earning bonus points ;-)</h2>
<p>In some technical sense ;-) <code>crunch11()</code> is worst-case O(n log n). Besides the suffix and lcp arrays, this also needs the <code>rank</code> array, <code>sa</code>'s inverse:</p>
<pre><code>assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
</code></pre>
<p>As the code comments note, it also relies on Python 3 for speed (<code>range()</code> behavior). That's shallow, but would be tedious to rewrite.</p>
<p>The paper describing this has several errors, so don't flip out if this code doesn't exactly match what you read about. Implement exactly what they say instead, and it will fail.</p>
<p>That said, the code is getting uncomfortably complex, and I can't guarantee there aren't bugs. It works on everything I've tried.</p>
<p>Inputs of the form <code>'x' * 1000000</code> still aren't speedy, but clearly no longer quadratic-time. For example, a string repeating the same letter a million times completes in close to 30 seconds. Most of the other programs here would never end ;-)</p>
<p>EDIT: changed <code>genlcpi()</code> to use semi-open Python ranges; made mostly cosmetic changes to <code>crunch11()</code>; added an "early out" that saves about a third of the time in worst cases (like <code>'x' * 1000000</code>).</p>
<pre><code># Generate lcp intervals from the lcp array.
def genlcpi(lcp):
lcp.append(0)
stack = [(0, 0)]
for i in range(1, len(lcp)):
c = lcp[i]
lb = i - 1
while c < stack[-1][0]:
i_c, lb = stack.pop()
yield (i_c, lb, i)
if c > stack[-1][0]:
stack.append((c, lb))
lcp.pop()
def crunch11(text):
from sa import suffix_array
sa, rank, lcp = suffix_array(text)
bestlen, results = 0, []
n = len(text)
# Generate branching tandem repeats.
# (i, c, 2) is branching tandem iff
# i+c in interval with prefix text[i : i+c], and
# i+c not in subinterval with prefix text[i : i+c + 1]
# Caution: this pragmatically relies on that, in Python 3,
# `range()` returns a tiny object with O(1) membership testing.
# In Python 2 it returns a list - should still work, but very
# much slower.
def gen_btr(text=text, n=n, sa=sa, rank=rank):
from itertools import chain
for c, lb, rb in genlcpi(lcp):
origlb, origrb = lb, rb
origrange = range(lb, rb)
i = sa[lb]
lead = text[i]
# Binary searches to find subrange beginning with
# text[i : i+c+1]. Note we take slices of length 1
# rather than just index to avoid special-casing for
# i >= n.
# A more elaborate traversal of the lcp array could also
# give us a list of child intervals, and then we'd just
# need to pick the right one. But that would be even
# more hairy code, and unclear to me it would actually
# help the worst cases (yes, the interval can be large,
# but so can a list of child intervals).
hi = rb
while lb < hi: # like bisect.bisect_left
mid = (lb + hi) // 2
i = sa[mid] + c
if text[i : i+1] < lead:
lb = mid + 1
else:
hi = mid
lo = lb
while lo < rb: # like bisect.bisect_right
mid = (lo + rb) // 2
i = sa[mid] + c
if lead < text[i : i+1]:
rb = mid
else:
lo = mid + 1
subrange = range(lb, rb)
if 2 * len(subrange) <= len(origrange):
# Subrange is at most half the size.
# Iterate over it to find candidates i, starting
# with wa. If i+c is also in origrange, but not
# in subrange, good: then i is of the form wwx.
for sai in subrange:
i = sa[sai]
ic = i + c
if ic < n:
r = rank[ic]
if r in origrange and r not in subrange:
yield (i, c, 2, subrange)
else:
# Iterate over the parts outside subrange instead.
# Candidates i are then the trailing wx in the
# hoped-for wwx. We win if i-c is in subrange too
# (or, for that matter, if it's in origrange).
for sai in chain(range(origlb, lb),
range(rb, origrb)):
ic = sa[sai] - c
if ic >= 0 and rank[ic] in subrange:
yield (ic, c, 2, subrange)
for start, c, numreps, irange in gen_btr():
# extend left
crange = range(start - c, -1, -c)
if (numreps + len(crange)) * c < bestlen:
continue
for i in crange:
if rank[i] in irange:
start = i
numreps += 1
else:
break
# check for best
totallen = c * numreps
if totallen < bestlen:
continue
if totallen > bestlen:
bestlen = totallen
results.clear()
results.append((start, c, numreps))
# add non-branches
while start and text[start - 1] == text[start + c - 1]:
start -= 1
results.append((start, c, numreps))
return bestlen, results
</code></pre>