How do I return a list of dictionaries from regex findall matches?

Posted 2024-05-16 10:56:37


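I am processing several hundred documents, and I am writing a function that finds specific words and their values and returns a list of dictionaries.

I am looking for one specific piece of information (the "city" and the number that refers to it). However, some documents contain a single city while others may contain 20 or even 100, so I need something very generic.

An example text (the parentheses really are this messy):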

text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'

Or

^{pr2}$

Using regex, I found the string I am looking for:

p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)

This returns the whole span of text as a one-element list.

[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']

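Now, this is where I am stuck, and I don't know how to proceed. Should I use regex.findall or regex.finditer?

Since the number of cities varies from document to document, I would like to get back a list of dictionaries. If I run it on text two, I would get: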

d = [{'cities': 'Eger', 'population': '32,352'}] 

If I run it on text one:

d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]

I would really appreciate your help, guys!


2 answers

@Wiktor's answer is great. Since I spent some time on this, I am posting my answer as well:

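import re

d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
oo = []
# Every city entry ends with ")", so splitting on it yields one chunk per city
for i in d[0].split(")"):
    jj = re.search("[0-9,]+", i)  # the population figure, e.g. '1,590,316'
    words = i.split()
    # Skip chunks without a number (such as the trailing '. ') and empty chunks
    if jj and words:
        # The first word of the chunk is the city name
        oo.append({"cities": words[0], "population": jj.group()})
print(oo)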

#Result > [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}, {'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
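Since the question asks about findall specifically, here is a minimal sketch of the same result with re.findall. It assumes every entry has the exact form "Name (population was: N)" and that city names are a single word; the names span, pairs and keys are only illustrative. With two capture groups, findall returns a list of tuples, which can be zipped with the key names:

```python
import re

# The extracted span from the question (the text between "City:" and "However")
span = ' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). '

# Two capture groups: the city name, then the number after the colon
pairs = re.findall(r'(\w+)\s*\(population was:\s*([\d,]+)\)', span)

# findall returns (city, population) tuples; pair each one with the key names
keys = ('cities', 'population')
result = [dict(zip(keys, pair)) for pair in pairs]
print(result)
```

A multi-word city name would need a wider first group than \w+.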

You can use re.finditer with a regex that has named capture groups (named after your keys), and call x.groupdict() on each match to get the resulting dictionaries:

import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])

# => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]

See the Python 3 demo online.

The second regex, p2, is

(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)

See the regex demo.

Details:
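For running this over several hundred documents, the two steps can be wrapped in one function. This is only a sketch: the name extract_cities, the constants SECTION and ENTRY, and the sample sentences are made up for illustration, and re.DOTALL is an added assumption so that the city list may span several lines.

```python
import re

# Step 1: isolate the text between "City:" and "However"
SECTION = re.compile(r'City:\s*(.*?)However', re.DOTALL)
# Step 2: one match per "Name (... number" entry inside that span
ENTRY = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

def extract_cities(text):
    """Return one dict per city entry, or an empty list if no section is found."""
    section = SECTION.search(text)
    if section is None:
        return []
    return [m.groupdict() for m in ENTRY.finditer(section.group(1))]

# Made-up sample sentences, not taken from the question's documents
one_city = extract_cities('According to census City: Eger (population was: 32,352). However etc')
no_city = extract_cities('No city section here')
print(one_city)
print(no_city)
```

Returning an empty list for documents without a city section keeps the return type the same for every document.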

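  • (?P<city>\w+) - group "city": 1+ word characters
  • \s*\( - 0+ whitespace characters and a (
  • [^()\d]* - any 0+ characters other than (, ) and digits
  • (?P<population>\d[\d,]*) - group "population": a digit followed by 0+ digits and/or commas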
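The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. A small sketch, where the name p2_verbose and the one-entry test string are illustrative:

```python
import re

p2_verbose = re.compile(r'''
    (?P<city>\w+)                # group "city": one or more word characters
    \s*\(                        # optional whitespace, then an opening parenthesis
    [^()\d]*                     # anything except parentheses and digits
    (?P<population>\d[\d,]*)     # group "population": a digit, then digits or commas
''', re.VERBOSE)

m = p2_verbose.search('Szeged (population was: 104,867)')
print(m.groupdict())
```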

You can try running the p2 regex against the whole original string (see demo), but it may overmatch.
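To illustrate the overmatching risk, here is a sketch with a made-up sample string (not from the question's documents) that contains an unrelated parenthesised number before the city list. Running p2 on the whole string picks up that parenthesis as well, while narrowing to the City ... However span first does not:

```python
import re

p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')

# Made-up text: "(see page 12)" is unrelated to the city list
sample = 'Intro (see page 12) comes first. City: Eger (population was: 32,352). However etc'

# p2 on the whole string also matches the unrelated parenthesis
whole = [x.groupdict() for x in p2.finditer(sample)]

# p2 on the narrowed span only sees the city entries
m = p.search(sample)
narrowed = [x.groupdict() for x in p2.finditer(m.group(1))] if m else []

print(whole)
print(narrowed)
```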
