Python: using "in" to compare strings with different numbers of words

4 votes

3 answers

2617 views

Asked 2025-04-17 16:17

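I'm working with a database of names that may contain duplicate entries, and I want to find out which names are duplicated. The data is formatted inconsistently, though: some entries include first, middle, and last names, while others have only the first and last name.

I need a way to decide whether, say, 'John Marvulli' matches 'John Michael Marvulli', and then do something with the matching names. But if you try: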

>>> 'John Marvulli' in 'John Michael Marvulli'
False

it returns False. Is there a simple way to compare two strings and see whether one name is contained in the other?

Tags: data processing, string comparison, data cleaning, text analysis, name matching, duplicate detection, string containment

3 Answers

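import re

n1 = "john   Miller"         # extra whitespace between the names
n2 = "johnas Miller"         # different first name, must not match
n3 = "john doe Miller"       # one middle name
n4 = "john doe paul Miller"  # two middle names

# First name, any number of middle names, then the last name.
regex = r"john\s+(?:\w+\s+)*Miller"
compiled = re.compile(regex)

# Each line prints True when the name does NOT match the pattern.
print(compiled.search(n1) is None)
print(compiled.search(n2) is None)
print(compiled.search(n3) is None)
print(compiled.search(n4) is None)

'''
output:

False
True
False
False
'''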
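The pattern above hard-codes a single name. Below is a minimal sketch of how it might be generalized, assuming the first and last words of the shorter entry are the first and last name. `build_name_pattern` and `names_match` are hypothetical helper names, not part of the original answer, and the case-insensitive matching is an assumption about the data.

```python
import re

def build_name_pattern(first, last):
    """Compile a pattern: first name, optional middle names, then last name."""
    # re.escape keeps characters such as "." in a name from acting as regex syntax.
    return re.compile(
        r"^\s*" + re.escape(first) + r"\s+(?:\S+\s+)*" + re.escape(last) + r"\s*$",
        re.IGNORECASE,
    )

def names_match(short_name, long_name):
    """Return True if long_name is short_name with optional middle names added."""
    parts = short_name.split()
    if len(parts) < 2:
        return False  # assumption: a usable entry has at least a first and last name
    pattern = build_name_pattern(parts[0], parts[-1])
    return pattern.match(long_name) is not None

print(names_match("John Marvulli", "John Michael Marvulli"))  # True
print(names_match("John Marvulli", "Johnas Marvulli"))        # False
```

Note that any middle names in the shorter entry are ignored by this sketch; only its first and last words are used.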


Answered 2025-04-17 by Python大师


You need to split both strings into words and check that every word of the shorter name appears in the longer one:

>>> all(x in 'John Michael Marvulli'.split() for x in 'John Marvulli'.split())
True
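To reuse this check across a whole list, it could be wrapped in a small function. The sketch below is one possible version: `is_name_subset` is a hypothetical name, and lowercasing the words is an assumption about how the data should be normalized.

```python
def is_name_subset(short_name, long_name):
    """Return True if every word of short_name also occurs in long_name."""
    short_tokens = set(short_name.lower().split())
    long_tokens = set(long_name.lower().split())
    # An empty name would be a subset of everything, so reject it explicitly.
    return bool(short_tokens) and short_tokens <= long_tokens

print(is_name_subset("John Marvulli", "John Michael Marvulli"))    # True
print(is_name_subset("John Marvulli", "Johnas Michael Marvulli"))  # False
```

Because it compares sets, this version ignores word order, so 'Marvulli John' would also count as a match.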

Answered 2025-04-17 by Python大师


I recently discovered how powerful the difflib module is.
I think it will help you here:

import difflib

datab = ['Pnk Flooyd', 'John Marvulli',
         'Ld Zeppelin', 'John Michael Marvulli',
         'Led Zepelin', 'Beetles', 'Pink Fl',
         'Beatlez', 'Beatles', 'Poonk LLoyds',
         'Pook Loyds']
print(datab)
print()

li = []  # groups of similar names
s = difflib.SequenceMatcher()

def yield_ratios(s, iterable):
    # Compare each name in iterable (as seq1) with the matcher's current seq2.
    for x in iterable:
        s.set_seq1(x)
        yield s.ratio()

for text_item in datab:
    s.set_seq2(text_item)
    for gathered in li:
        # Join the first existing group that holds a sufficiently similar name.
        if any(r > 0.45 for r in yield_ratios(s, gathered)):
            gathered.append(text_item)
            break
    else:
        # No group matched, so start a new one.
        li.append([text_item])

for el in li:
    print(el)

Result

['Pnk Flooyd', 'Pink Fl', 'Poonk LLoyds', 'Pook Loyds']
['John Marvulli', 'John Michael Marvulli']
['Ld Zeppelin', 'Led Zepelin']
['Beetles', 'Beatlez', 'Beatles']
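If you only want to list which pairs look alike rather than build groups, a pairwise variant is possible. This is a rough sketch, not the code above: `similar_pairs` is a hypothetical helper, and the 0.45 threshold is simply carried over from the grouping loop and may need tuning for real data.

```python
import difflib
from itertools import combinations

def similar_pairs(names, threshold=0.45):
    """Yield (name_a, name_b, ratio) for every pair scoring above threshold."""
    for a, b in combinations(names, 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio > threshold:
            yield a, b, ratio

names = ["John Marvulli", "John Michael Marvulli", "Beatles"]
for a, b, ratio in similar_pairs(names):
    print(f"{a!r} ~ {b!r}: {ratio:.2f}")
```

This compares every pair, so the work grows quadratically with the number of names; for a large database some pre-filtering (for example by last name) would probably be needed.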

Answered 2025-04-17 by Python大师

