If a capitalized word follows a pattern, grab one or two words and match the result against another list

Posted 2024-06-02 07:14:21

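I need to extract unique names that carry a title, such as Lord | Baroness | Lady | Baron, from a text and match them against another list. I am struggling to get the correct result and hope the community can help. Thank you.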

import re
def get_names(text):
    # find noble titles and grab each one with the name that follows it
    match = re.compile(r'(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+) ([A-Z][a-z]+)')
    names = list(set(match.findall(text)))
    # remove duplicates based on the index in tuples
    names_ = list(dict((v[1],v) for v in sorted(names, key=lambda names: names[0])).values())
    names_lst = list(set([' '.join(map(str, name)) for name in names_]))
    return names_lst

text = 'Baroness Firstname Surname and Baroness who is also known as Lady Anothername and Lady Surname or Lady Firstname.'
names_lst = get_names(text)
print(names_lst)

Current output: ['Baroness Firstname Surname']

Desired output: ['Baroness Firstname Surname', 'Lady Anothername'], but not 'Lady Surname' or 'Lady Firstname'.

I then need to match the result against this list:

other_names = ['Firstname Surname', 'James', 'Simon Smith']

and remove the element 'Firstname Surname' from it, because it matches the Baroness's first name and surname in the desired output.


1 answer
User
Reply #1 · Posted 2024-06-02 07:14:21

I suggest the following solution:

import re

def get_names(text):
    # find noble titles and grab each one with the name that follows it
    match = re.compile(r'(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+)[ ]?([A-Z][a-z]+)?')
    names = list(match.findall(text))
    # keep only the first name encountered for each title
    d = {}
    for name in names:
        if name[0] not in d:
            d[name[0]] = ' '.join(name[1:3]).strip()
    return d

text = 'Baroness Firstname Surname and Baroness who is also known as Lady Anothername and Lady Surname or Lady Firstname.'
other_names = ['Firstname Surname', 'James', 'Simon Smith']

names_dict = get_names(text)
print(names_dict)
#  {'Baroness': 'Firstname Surname', 'Lady': 'Anothername'}
print([' '.join([k,v]) for k,v in names_dict.items()])
# ['Baroness Firstname Surname', 'Lady Anothername']

other_names_dropped = [name for name in other_names if name not in names_dict.values()]
print(other_names_dropped)
# ['James', 'Simon Smith']
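
Note that the dictionary above is keyed by title, so it keeps one name per title: a text that mentions two different Lords would lose the second one. The following is a minimal sketch of an alternative, assuming the intent in the question is to drop a later mention such as 'Lady Surname' because 'Surname' already appeared as part of an earlier match. The names `TITLE_PATTERN` and `get_titled_names` are hypothetical, and the pattern still treats any capitalized word after a name as a second name part, which may not hold for real text.

```python
import re

# Hypothetical pattern name; the title list is taken from the question.
# \b stops the title from matching inside a longer word, and the optional
# non-capturing group consumes the space only when a second capitalized
# word actually follows it.
TITLE_PATTERN = re.compile(
    r'\b(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+)(?: ([A-Z][a-z]+))?'
)

def get_titled_names(text):
    """Return (title, name) pairs in order of appearance, skipping any
    match that reuses a name part seen in an earlier match."""
    seen_parts = set()
    result = []
    for title, first, second in TITLE_PATTERN.findall(text):
        # findall yields '' for the optional group when it did not match
        parts = [p for p in (first, second) if p]
        # e.g. skip 'Lady Surname' once 'Firstname Surname' was recorded
        if any(p in seen_parts for p in parts):
            continue
        seen_parts.update(parts)
        result.append((title, ' '.join(parts)))
    return result

text = 'Baroness Firstname Surname and Baroness who is also known as Lady Anothername and Lady Surname or Lady Firstname.'
other_names = ['Firstname Surname', 'James', 'Simon Smith']

titled = get_titled_names(text)
full_names = [f'{title} {name}' for title, name in titled]
print(full_names)
# ['Baroness Firstname Surname', 'Lady Anothername']

# Compare against the names without their titles.
bare_names = {name for _, name in titled}
remaining = [name for name in other_names if name not in bare_names]
print(remaining)
# ['James', 'Simon Smith']
```

Because the result is a list of pairs rather than a dictionary keyed by title, two people who share a title are both kept, provided their names do not overlap.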
