Extracting keys and values from text using a regular expression

Posted 2024-04-29 13:25:01


I need to parse a large number of strings. These strings contain information arranged as key-value pairs.

Sample input text:

Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim: ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim: ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur

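Key facts:

  • A key starts either at the beginning of the string or after "\. " (a period followed by a space)
  • A key always ends with ":"
  • A key is immediately followed by a value
  • The value continues until the next key or the last character of the string
  • There are multiple key-value pairs, and I do not know them in advance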

Expected output

{
    "Nemo enim": "ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem",
    
    "Ut enim": "ad minima veniam, *31.12.2012, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur. Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur"
}

So far, the regex I have been using is ([üöä\w\s]*)\: (.*?)\. and, needless to say, it does not produce the expected output.


3 Answers

This regex does the job: ([^:.]+):\s*([^:]+)(?=\.\s+|$)
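A minimal sketch of how this pattern might be applied from Python. The `text` string is a shortened stand-in for the sample input, and the names `pattern` and `pairs` are illustrative rather than part of the answer. Group 1 can pick up the space that follows the preceding period, so the keys are stripped here.

```python
import re

# Shortened stand-in for the sample input; not the original text.
text = (
    "Sed ut perspiciatis unde omnis. Nemo enim: ipsam voluptatem quia. "
    "Neque porro quisquam est. Ut enim: ad minima veniam, *31.12.2012, quis nostrum"
)

pattern = re.compile(r"([^:.]+):\s*([^:]+)(?=\.\s+|$)")

# Group 1 is the key (possibly with a leading space), group 2 is the value.
pairs = {m.group(1).strip(): m.group(2) for m in pattern.finditer(text)}

for key, value in pairs.items():
    print(f"{key!r}: {value!r}")
```

On this stand-in text, the first value spans two sentences and ends just before ". Ut enim", which is the behavior the question asks for.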


Just for fun, here is a non-regex solution in Python:

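latin = """[the sample input text]"""
# Split before every ':' so each piece after the first starts with ':'
pieces = latin.replace(":", "xxx:").split("xxx")
for i, piece in enumerate(pieces):
    if ":" in piece:
        prev = pieces[i - 1]
        # The key is the text after the last '. ' of the previous piece
        prev_break = prev.rfind(". ")
        key = prev[prev_break + 2:] if prev_break != -1 else prev
        # The value ends at the last '. ' of this piece, except in the final
        # piece, where it runs to the end of the string
        cur_break = piece.rfind(". ")
        last = i == len(pieces) - 1
        value = piece if last or cur_break == -1 else piece[:cur_break]
        print(key + value)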

The output is two blocks of text, each starting with its key.

You can match the following regular expression, which saves the key and the value in capture groups 1 and 2:

r'(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)'


Python's regex engine performs the following operations:

(?<![^.])    : negative lookbehind asserts current location is not
               preceded by a character other than '.'
\ *          : match 0+ spaces
(            : begin capture group 1
  [^.]+?     : match 1+ characters other than '.', lazily
  :          : match ':'
)            : end capture group 1
\ *          : match 0+ spaces
(            : begin capture group 2
  (?:        : begin non-capture group
    (?!\. )  : negative lookahead asserts current position is not
               followed by a period followed by a space
    .        : match any character other than a line terminator
  )+         : end non-capture group and execute 1+ times
)            : end capture group 2
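As a rough illustration, the pattern can be run with `re.findall` on a shortened stand-in for the sample input (the `text` string and the variable names are assumptions for this sketch, not part of the answer). Note that group 1 keeps the trailing colon, and that group 2 stops at the first period followed by a space.

```python
import re

# Shortened stand-in for the sample input; not the original text.
text = (
    "Sed ut perspiciatis unde omnis. Nemo enim: ipsam voluptatem quia. "
    "Neque porro quisquam est. Ut enim: ad minima veniam, *31.12.2012, quis nostrum"
)

pattern = re.compile(r"(?<![^.]) *([^.]+?:) *((?:(?!\. ).)+)")

# findall returns one (group 1, group 2) tuple per match.
matches = pattern.findall(text)

for key, value in matches:
    print(key, "->", value)
```

On this stand-in text the first value is cut at "quia", because the tempered token refuses to step over ". ". A value made of several sentences would therefore need the key stripped of its colon and the value extended by other means.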

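The regex uses the tempered greedy token technique, which matches a sequence of single characters, none of which starts an unwanted string. For example, given the string "concatenate", (?:(?!cat).)+ matches the first three letters but stops before the second 'c', so the match is 'con'.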
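The "concatenate" example can be checked directly in Python. This is only a small sketch, and the names `tempered` and `match` are illustrative.

```python
import re

# At each position, first assert that "cat" does not start here,
# then consume exactly one character.
tempered = re.compile(r"(?:(?!cat).)+")

match = tempered.match("concatenate")
print(match.group())  # con
```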
