Identifying similar strings in Python

Posted 2024-05-14 20:15:14

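I have generated an edited DNA sequencing file that has individual reads on separate lines, and I want to eliminate any read that matches an earlier line to within one character.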

Input file:

AAAAAAAAAAAA    #Start checking at line 1
TTTTTTTTTTTT    #Diff by >1 char: Keep
AAAAACAAAAAA    #Diff by 1 char: Delete
AAAAACAAACAA    #Diff by 2 char: Keep
AAAAAAAAAAAA    #Diff by <1 char: Delete

Output file:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA

What I have so far:

with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]

        if not (lineChars in lineCharsList):    #exactly matches lines, need partial matching
            lineCharsList.append(lineChars)
            outLines.append(line)
            print(line)

2 answers

You already have a good answer.

Here is my implementation in basic Python:

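with open(current_file, 'r') as f:
    outlines = []
    for line in f:
        read = line.rstrip('\n')
        # Count mismatched positions against each kept read separately;
        # keep the read only if every kept read differs in more than one position.
        if all(sum(a != b for a, b in zip(read, kept)) > 1 for kept in outlines):
            outlines.append(read)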

Install the package with pip install python-levenshtein and use the function Levenshtein.hamming() to compare strings. Its documentation reads:

hamming(string1, string2)

Compute Hamming distance of two strings.

The Hamming distance is simply the number of differing characters. That means the length of the strings must be the same.

Examples:

>>> hamming('Hello world!', 'Holly grail!')
7
>>> hamming('Brian', 'Jesus')
5
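
If installing a third-party package is not an option, the definition quoted above is short enough to sketch in plain Python. The function name hamming_distance below is made up for illustration and is not part of python-levenshtein; raising ValueError on unequal lengths is also just an assumption about how the caller would want that case handled.

```python
def hamming_distance(first: str, second: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(first) != len(second):
        # Hamming distance is only defined for strings of equal length.
        raise ValueError("Hamming distance needs strings of equal length")
    # zip pairs the characters position by position; each mismatch counts as 1.
    return sum(a != b for a, b in zip(first, second))


print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAAAAA"))  # one differing position
print(hamming_distance("AAAAAAAAAAAA", "AAAAACAAACAA"))  # two differing positions
```

For reads of the same length, a distance below 2 means the two reads are identical or differ in a single character, which is the condition the question wants to filter on.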

The code is:

import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",    # Diff by >1 char: Keep
    "AAAAACAAAAAA",    # Diff by 1 char: Delete
    "AAAAACAAACAA",    # Diff by 2 char: Keep
    "AAAAAAAAAAAA",    # Diff by <1 char: Delete
    ]
output_lines = []

for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break
    else:
        output_lines.append(current_line)

print('\n'.join(output_lines))

Output:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
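
Since the question reads from a file rather than a list, the same loop could be wrapped in a small helper that accepts any iterable of lines, such as an open file object. This is only a sketch under a few assumptions: filter_similar_reads and max_distance are hypothetical names, blank lines are skipped, reads of different lengths are treated as dissimilar, and the distance is computed in plain Python so the example runs without the Levenshtein package.

```python
import io


def hamming_distance(first, second):
    """Count differing positions; callers must pass equal-length strings."""
    return sum(a != b for a, b in zip(first, second))


def filter_similar_reads(lines, max_distance=1):
    """Keep each read unless it is within max_distance of an already kept read."""
    kept = []
    for line in lines:
        read = line.strip()
        if not read:
            continue  # assumption: blank lines carry no read
        # Assumption: reads of different lengths are never considered similar.
        if all(
            len(read) != len(previous)
            or hamming_distance(read, previous) > max_distance
            for previous in kept
        ):
            kept.append(read)
    return kept


# io.StringIO stands in for the opened sequencing file from the question.
sample_file = io.StringIO(
    "AAAAAAAAAAAA\n"
    "TTTTTTTTTTTT\n"
    "AAAAACAAAAAA\n"
    "AAAAACAAACAA\n"
    "AAAAAAAAAAAA\n"
)
print("\n".join(filter_similar_reads(sample_file)))
```

As in the answer's loop, each read is compared only against reads that were kept, so the result depends on line order. Every read is checked against every kept read, which may be slow for very large files.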
