How to use Python to extract entity names that span several lines of a large text file

import re

with open(filepath) as f:
    count = 0
    for line in f:
        if line.find("----") == -1 and line != '\n' and re.search(
                "Company|Rent", line) is None:
            if re.match('^[a-zA-Z]', line) is not None:
                name = re.findall(r'\b([a-zA-Z]+)\b', line)
                name = ' '.join(name)
                print('name', name)
            elif re.match('^[0-9]', line) is not None:
                number = line.split(' ', 1)[0]
                out = str(number) + ', ' + str(name)
                out = out.split(', ')
                print(out)

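2 answers

User

Answer 1 · edited 2024-04-28 15:46:43

You can use the following regular expression with the flags /gmi (global, multiline, case-insensitive). In Python, that corresponds to re.finditer with re.MULTILINE | re.IGNORECASE.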

^Company\s+Rent\r?\n   *\s+-*\r?\n([a-z]+(?: [a-z]+)*).*\r?\n(?:([a-z]+(?: [a-z]+)*).*\r?\n)?(\d+)\s*\r?\n([a-z]+(?: [a-z]+)*).*\r?\n(?:([a-z]+(?: [a-z]+)*).*\r?\n)?(\d+)


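This regular expression has six capture groups:

Company name, line 1
Company name, line 2 (optional)
Numeric identifier following the company name
Group name, line 1
Group name, line 2 (optional)
Numeric identifier following the group name

If the company (group) name is on a single line only, capture group 2 (5) will be empty (None in Python). If, as in the example, the company name always spans two lines and the group name always spans one, the regex can be simplified accordingly. If a company or group name can span more than two lines, the regex must be modified accordingly.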
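As a rough illustration of how the six groups might be consumed from Python, here is a minimal sketch. The pattern is the one given above; the SAMPLE text and the extract_records helper are assumptions made up for this example, since the original input file is not shown.

```python
import re

# The regex from the answer, split over several raw strings for readability.
PATTERN = re.compile(
    r"^Company\s+Rent\r?\n   *\s+-*\r?\n"
    r"([a-z]+(?: [a-z]+)*).*\r?\n"
    r"(?:([a-z]+(?: [a-z]+)*).*\r?\n)?"
    r"(\d+)\s*\r?\n"
    r"([a-z]+(?: [a-z]+)*).*\r?\n"
    r"(?:([a-z]+(?: [a-z]+)*).*\r?\n)?"
    r"(\d+)",
    re.MULTILINE | re.IGNORECASE,  # the "m" and "i" of /gmi; finditer supplies "g"
)

# Hypothetical input: the layout is a guess, not the asker's real file.
SAMPLE = (
    "Company        Rent\n"
    "    -----------\n"
    "Andy Candy\n"
    "Store\n"
    "2135\n"
    "Moody Group\n"
    "4512\n"
)


def extract_records(text):
    """Return (company, company_id, group, group_id) for each match."""
    records = []
    for m in PATTERN.finditer(text):
        # Groups 2 and 5 are None when the name fits on a single line.
        company = " ".join(p for p in (m.group(1), m.group(2)) if p)
        group = " ".join(p for p in (m.group(4), m.group(5)) if p)
        records.append((company, m.group(3), group, m.group(6)))
    return records


print(extract_records(SAMPLE))
```

On this sample the call returns [('Andy Candy Store', '2135', 'Moody Group', '4512')]. Note that the whole text has to be in memory for a multi-line match like this, which may matter for a very large file.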

The regular expression performs the following operations:

^
Company\s+Rent\r?\n # match line
   *\s+-*\r?\n   # match line

(               # begin cap grp 1 (company name 1)
  [a-z]+        # match 1+ ltrs 
  (?: [a-z]+)   # match 1 space, 1+ ltrs in non-cap grp
  *             # execute non-cap grp 0+ times
)               # end cap grp 1 
.*\r?\n         # match remainder of line

(?:             # begin non-cap grp
  (             # begin cap grp  2  (opt. company name 2)             
    [a-z]+      # match 1+ ltrs
    (?: [a-z]+) # match 1 space, 1+ ltrs in non-cap grp
    *           # execute non-cap grp 0+ times
  )             # end cap grp 2
  .*\r?\n       # match remainder of line
)               # end non-cap group 
?               # optionally match non-cap grp

(\d+)           # match 1+ digits in cap grp 3 (company id)
\s*\r?\n        # match remainder of line

(               # begin cap grp 4 (group name 1)
  [a-z]+        # match 1+ ltrs
  (?: [a-z]+)   # match 1 space, 1+ ltrs in non-cap grp
  *             # execute non-cap grp 0+ times
)               # end cap grp 4
.*\r?\n         # match remainder of line

(?:             # begin non-cap grp
  (             # begin cap grp 5 (opt. group name 2)
    [a-z]+      # match 1+ ltrs
    (?: [a-z]+) # match 1 space, 1+ ltrs in non-cap grp
    *           # execute non-cap grp 0+ times
  )             # end cap grp 5
  .*\r?\n       # match remainder of line
)               # end non-cap grp
?               # optionally match non-cap grp

(\d+)           # match 1+ digits in cap grp 6 (group id)
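The annotated layout above cannot be pasted into Python unchanged: under re.VERBOSE, unescaped spaces in the pattern are ignored, so each literal space has to be written as [ ] (or escaped). The sketch below shows a commented pattern for just one name-plus-identifier block under that assumption; the name ENTITY and the test string are illustrative only.

```python
import re

# One "name (1 or 2 lines) + numeric id" block, written in verbose mode.
ENTITY = re.compile(
    r"""
    ^([a-z]+(?:[ ][a-z]+)*)   # cap grp 1: name, line 1 (literal space as [ ])
    .*\r?\n                   # remainder of line
    (?:
      ([a-z]+(?:[ ][a-z]+)*)  # cap grp 2: optional name, line 2
      .*\r?\n                 # remainder of line
    )?
    (\d+)                     # cap grp 3: numeric identifier
    """,
    re.VERBOSE | re.MULTILINE | re.IGNORECASE,
)

m = ENTITY.search("Andy Candy\nStore\n2135\n")
print(m.groups())
```

For this input the groups are ('Andy Candy', 'Store', '2135').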

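As far as I know, the regex engine in Python's built-in re module does not support subroutines. That is unfortunate, because subroutines would simplify the regular expression considerably. For example, the PCRE (PHP) engine would allow every instance of ([a-z]+(?: [a-z]+)*) after the first to be replaced with ((?1)).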
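Without subroutine calls, one possible workaround in Python is to assemble the pattern from reusable string fragments, which removes the repetition from the source even though the compiled regex is unchanged. The fragment names below (NAME, NAME_LINE, ENTITY and so on) are made up for this sketch.

```python
import re

# Reusable fragments; each use of NAME adds one capture group.
NAME = r"([a-z]+(?: [a-z]+)*)"
REST_OF_LINE = r".*\r?\n"
NAME_LINE = NAME + REST_OF_LINE
OPTIONAL_NAME_LINE = "(?:" + NAME_LINE + ")?"

# A name on one or two lines, followed by its numeric identifier.
ENTITY = NAME_LINE + OPTIONAL_NAME_LINE + r"(\d+)"

HEADER = r"^Company\s+Rent\r?\n   *\s+-*\r?\n"

# Header, company block, end of the id line, then group block.
COMPOSED = HEADER + ENTITY + r"\s*\r?\n" + ENTITY

pattern = re.compile(COMPOSED, re.MULTILINE | re.IGNORECASE)
print(pattern.groups)  # number of capture groups in the composed pattern
```

The composed string is character-for-character the same as the single-line regex given earlier, so the group numbering (1 to 6) is preserved.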

User

Answer 2 · edited 2024-04-28 15:46:43

I only modified your code slightly:

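import re

with open(filepath) as f:
    name = ''
    for line in f:
        # Skip lines containing a double space and the "Company"/"Rent" header.
        if line.find("  ") == -1 and re.search(
                "Company|Rent", line) is None:
            if re.match('^[a-zA-Z]', line) is not None:
                # Name line: append its words to the name collected so far,
                # separated by a space.
                names = re.findall(r'\b([a-zA-Z]+)\b', line)
                name = (name + ' ' + ' '.join(names)).strip()
            elif re.match('^[0-9]', line) is not None:
                # Number line: it ends the current entity, so emit and reset.
                number = line.split(maxsplit=1)[0]
                print([number, name])
                name = ''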

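This assumes that you have already separated out the junk lines correctly and that the logic itself is otherwise sound. The main fix is to concatenate the name parts found on consecutive lines.

Using the file contents above (with the junk lines replaced by something that does not match the regular expressions), I get: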

['2135', 'Andy Candy Store']
['4512', 'Moody Group']
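To try the line-by-line approach without a file on disk, the same idea can be wrapped in a generator and fed an in-memory string. This is only a sketch under assumptions: the SAMPLE text and the iter_entities name are invented here, and the junk filter (blank lines, dashes, the Company/Rent header) is a guess at what the real file needs.

```python
import io
import re

# Hypothetical input, shaped to reproduce the output shown above.
SAMPLE = (
    "Company        Rent\n"
    "    -----------\n"
    "Andy Candy\n"
    "Store\n"
    "2135\n"
    "\n"
    "Moody Group\n"
    "4512\n"
)


def iter_entities(lines):
    """Yield [number, name] pairs, joining names that span several lines."""
    parts = []  # name words collected since the last number line
    for line in lines:
        stripped = line.strip()
        # Skip blank lines, dashed separators and the header line.
        if not stripped or "--" in stripped or re.search("Company|Rent", stripped):
            continue
        if stripped[0].isalpha():
            parts.extend(re.findall(r"[a-zA-Z]+", stripped))
        elif stripped[0].isdigit():
            yield [stripped.split()[0], " ".join(parts)]
            parts = []


for entity in iter_entities(io.StringIO(SAMPLE)):
    print(entity)
```

Because the function only ever holds one line and the current name in memory, an open file object can be passed in place of the StringIO when the input is large.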

