How do I find common substrings in Python?

Posted 2024-04-25 17:03:00

I'm writing a Python script that has to find common substrings across many string sequences. For example:

sequence1 = 'mweitngaomjksjasper;36nnG1bmaso3th7a\-'
sequence2 = 'asngiqbwebs7-236jasper;u52dsv--4512G1b'
sequence3 = 'asvjaspermininwqmamnf-121xvxnesgq232'

jasper appears 3 times: once each in sequence1, sequence2, and sequence3. G1b appears 2 times: once in sequence1 and once in sequence2.

For every substring that appears two or more times, I need to add it to a dictionary as substring => count. In this case, my dictionary would be:

dict = { 'jasper': '3', 'G1b': '2'}

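I'll be filling this dictionary from thousands of sequences, and any substring that appears two or more times in any of the sequences needs to be added to it. What is the best way to do this without bringing the system down?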


2 Answers

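Here is one approach:

from collections import Counter


def all_prefixes(x, minlen):
  # Yield every prefix of x with at least minlen characters.
  # The upper bound is len(x) + 1 so that x itself is included.
  for i in range(minlen, len(x) + 1):
    yield x[:i]


def all_substrings(x, minlen=1):
  # Every substring of x is a prefix of one of its suffixes.
  # Note: this recurses once per character of x, so very long strings
  # can exceed Python's recursion limit.
  if len(x) < minlen:
    return
  yield from all_prefixes(x, minlen)
  yield from all_substrings(x[1:], minlen)


# Raw strings keep the trailing backslash-dash literal.
words = [
  r'mweitngaomjksjasper;36nnG1bmaso3th7a\-',
  r'asngiqbwebs7-236jasper;u52dsv 4512G1b',
  r'asvjaspermininwqmamnf-121xvxnesgq232'
]
# Count every substring of length >= 3, then keep those seen at least twice.
counts = Counter(x for w in words for x in all_substrings(w, minlen=3))
print({k: v for k, v in counts.items() if v >= 2})

This prints the count of every substring of length at least 3 that occurs at least twice (key order as on Python 3.7+, where dicts preserve insertion order):

{'jas': 3, 'jasp': 3, 'jaspe': 3, 'jasper': 3, 'jasper;': 2, 'asp': 3, 'aspe': 3, 'asper': 3, 'asper;': 2, 'spe': 3, 'sper': 3, 'sper;': 2, 'per': 3, 'per;': 2, 'er;': 2, 'G1b': 2}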
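The result above still lists every fragment of jasper, whereas the question only wants entries like jasper and G1b. One possible way to get closer is to drop any substring that sits inside a longer substring with the same count. The following is only a sketch: `keep_maximal` is a hypothetical helper name, and the `counts` dict is a hand-copied sample rather than the full output.

```python
def keep_maximal(counts):
    """Drop any substring contained in a longer one that has the same count."""
    return {
        s: n
        for s, n in counts.items()
        # s != t together with s in t means t is strictly longer than s.
        if not any(s != t and s in t and n == m for t, m in counts.items())
    }


# Hand-copied sample of counts for the example sequences (an assumption,
# not the complete result).
counts = {
    'jas': 3, 'jasp': 3, 'jaspe': 3, 'jasper': 3, 'jasper;': 2,
    'asper': 3, 'asper;': 2, 'per;': 2, 'er;': 2, 'G1b': 2,
}
print(keep_maximal(counts))
# {'jasper': 3, 'jasper;': 2, 'G1b': 2}
```

Note that 'jasper;' survives because its count (2) differs from that of 'jasper' (3). The filter compares every pair of entries, so it is quadratic in the number of entries and is best applied after the counts have already been cut down to the repeated substrings.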

First, we'll write a quick little generator that takes a string and yields every substring of it:

from collections import Counter
import itertools

def substrings(s):
    # Yield s[i:j] for every start i and every end j > i.
    for i in range(len(s)):
        for j in range(i+1, len(s)+1):
            yield s[i:j]

# Raw strings keep the trailing backslash-dash literal.
sequences = [r'mweitngaomjksjasper;36nnG1bmaso3th7a\-',
             r'asngiqbwebs7-236jasper;u52dsv 4512G1b',
             r'asvjaspermininwqmamnf-121xvxnesgq232']

# Chain the substrings of all sequences into one stream and count them.
c = Counter(itertools.chain.from_iterable(map(substrings, sequences)))

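Then we can use itertools.takewhile to keep only the substrings that occur more than once. This works because most_common() returns the entries sorted by descending count, so takewhile can stop at the first entry with a count of 1: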

print(list(itertools.takewhile(lambda x: x[1] > 1, c.most_common())))

This prints (the order among entries with equal counts can differ between Python versions):

[('s', 10), ('a', 9), ('n', 8), ('2', 6), ('e', 6), ('as', 6), ('m', 6), ('1', 5), ('-', 5), ('3', 4), ('i', 4), ('j', 4), ('b', 4), ('q', 3), ('er', 3), ('r', 3), ('asper', 3), ('g', 3), ('per', 3), ('v', 3), ('jaspe', 3), ('ja', 3), ('sp', 3), ('spe', 3), ('aspe', 3), ('sper', 3), ('jas', 3), ('asp', 3), ('w', 3), ('jasper', 3), ('p', 3), ('pe', 3), ('jasp', 3), ('o', 2), ('ma', 2), ('r;', 2), ('23', 2), ('12', 2), ('jasper;', 2), ('1b', 2), ('G1b', 2), ('asper;', 2), ('t', 2), ('sv', 2), ('5', 2), ('36', 2), ('per;', 2), ('x', 2), ('in', 2), ('6', 2), ('G1', 2), ('G', 2), ('7', 2), ('er;', 2), ('we', 2), (';', 2), ('ng', 2), ('sper;', 2)]
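Both answers count every occurrence, so a substring repeated inside a single sequence is counted more than once (for example, 'in' reaches 2 from sequence3 alone). If what matters is how many sequences contain a substring, as the question's wording suggests, one option is to deduplicate per sequence with a set before counting. This is a minimal sketch under that assumption; `count_shared`, `minlen`, and `min_sequences` are illustrative names that do not appear in the answers.

```python
from collections import Counter


def substrings(s, minlen=1):
    # Yield every substring of s that has at least minlen characters.
    for i in range(len(s) - minlen + 1):
        for j in range(i + minlen, len(s) + 1):
            yield s[i:j]


def count_shared(sequences, minlen=3, min_sequences=2):
    # Count, for each substring, the number of sequences that contain it.
    c = Counter()
    for seq in sequences:
        # set() makes each substring count at most once per sequence.
        c.update(set(substrings(seq, minlen)))
    return {sub: n for sub, n in c.items() if n >= min_sequences}


sequences = [r'mweitngaomjksjasper;36nnG1bmaso3th7a\-',
             r'asngiqbwebs7-236jasper;u52dsv 4512G1b',
             r'asvjaspermininwqmamnf-121xvxnesgq232']

shared = count_shared(sequences)
print(shared['jasper'], shared['G1b'])
# 3 2
```

A sequence of length n has n(n+1)/2 substrings, so with thousands of long sequences the counter can grow large. Raising `minlen`, or also capping the maximum substring length, is one way to keep memory in check.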
