Identifying similar strings in Python

AAAAAAAAAAAA    #Start checking at line 1
TTTTTTTTTTTT    #Diff by >1 char: Keep
AAAAACAAAAAA    #Diff by 1 char: Delete
AAAAACAAACAA    #Diff by 2 char: Keep
AAAAAAAAAAAA    #Diff by <1 char: Delete

with open(current_file, 'r') as f:
    lineCharsList = []
    outLines = []
    for line in f:
        lineChars = line[:]
        if lineChars not in lineCharsList:  # exactly matches lines, need partial matching
            lineCharsList.append(lineChars)
            outLines.append(line)
            print(line)

2 Answers

User

#1 · edited 2024-05-14 20:15:14

You already have a good answer.

Here is my implementation in basic Python:

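with open(current_file, 'r') as f:
    outlines = []
    for line in f:
        line = line.rstrip('\n')
        # Compare against each kept line separately; matching per position
        # against all kept lines at once can mix characters from different lines.
        # Assumes all lines have the same length (zip stops at the shorter one).
        if all(sum(a != b for a, b in zip(line, kept)) > 1 for kept in outlines):
            outlines.append(line)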

User

#2 · edited 2024-05-14 20:15:14

Run pip install python-levenshtein and use the hamming function to compare strings.

hamming(string1, string2)
    Compute Hamming distance of two strings.

    The Hamming distance is simply the number of differing characters.
    That means the length of the strings must be the same.

    Examples:
    >>> hamming('Hello world!', 'Holly grail!')
    7
    >>> hamming('Brian', 'Jesus')
    5
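
For reference, the same distance can be computed with the standard library alone. This is a minimal sketch, not the package's implementation: `hamming_distance` is a hypothetical helper name, and raising `ValueError` on unequal lengths is an assumption about how the error case should be handled.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        # Hamming distance is only defined for strings of equal length.
        raise ValueError("strings must have the same length")
    # True counts as 1 when summed, so this adds up the mismatched positions.
    return sum(a != b for a, b in zip(s1, s2))


print(hamming_distance('Hello world!', 'Holly grail!'))
print(hamming_distance('Brian', 'Jesus'))
```

The two calls reuse the examples from the docstring quoted above.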

The code:

import Levenshtein

input_lines = [
    "AAAAAAAAAAAA",
    "TTTTTTTTTTTT",    # Diff by >1 char: Keep
    "AAAAACAAAAAA",    # Diff by 1 char: Delete
    "AAAAACAAACAA",    # Diff by 2 char: Keep
    "AAAAAAAAAAAA",    # Diff by <1 char: Delete
    ]
output_lines = []

for current_line in input_lines:
    for previous_line in output_lines:
        if Levenshtein.hamming(previous_line, current_line) < 2:
            break
    else:
        output_lines.append(current_line)

print('\n'.join(output_lines))

Output:

AAAAAAAAAAAA
TTTTTTTTTTTT
AAAAACAAACAA
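
When the lines come from a file, as in the question, they end in a newline and may not all have the same length. The sketch below wraps the same filtering loop in a function that accepts any iterable of lines. The name `filter_similar`, the `min_distance` parameter, and the choice to treat lines of different length as dissimilar are all assumptions made for illustration; they do not come from the answer above.

```python
def filter_similar(lines, min_distance=2):
    """Keep each line that differs from every kept line in at least
    min_distance positions."""
    kept = []
    for raw in lines:
        line = raw.rstrip("\n")
        for previous in kept:
            # Assumption: lines of different length are never "similar".
            if len(previous) != len(line):
                continue
            differing = sum(a != b for a, b in zip(previous, line))
            if differing < min_distance:
                break  # too close to a line we already kept
        else:
            # No kept line was too close, so this one is kept as well.
            kept.append(line)
    return kept


sample = [
    "AAAAAAAAAAAA\n",
    "TTTTTTTTTTTT\n",
    "AAAAACAAAAAA\n",
    "AAAAACAAACAA\n",
    "AAAAAAAAAAAA\n",
]
print("\n".join(filter_similar(sample)))
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines.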
