Extracting substrings between multiple specific words using regex in Python

Posted 2024-03-29 04:37:52


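I want to extract the phone, fax, and mobile numbers from a string; if one of them is missing, an empty string can be returned for it. I want three lists (phone, fax, mobile) from any given text string, such as the examples below.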

ex1 = "miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom"
ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"

My current regex looks like this:

phone_regex  = re.match(".*phone(.*)fax(.*)mobile(.*)",ex1)
phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
mobile = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][2]
fax = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][1]

Result from ex1:
phone = 6035550160
fax = 6035550161
mobile = 6035550178

ex2 has no mobile entry, so I get:

Traceback (most recent call last):
phone = [re.sub("[^0-9]", "", x) for x in phone_regex.groups()][0]
AttributeError: 'NoneType' object has no attribute 'groups'

Question
I need a better regex solution, since I am new to regular expressions, or a solution that catches the AttributeError and assigns an empty string.
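
The second option asked for here (catching the AttributeError and falling back to an empty string) could be sketched roughly as follows. The helper name digits_after and the pattern [\d\s]* are assumptions made for illustration, not part of the original code; the idea is to search for each keyword separately so that a missing keyword affects only its own result.

```python
import re

def digits_after(keyword, text):
    """Return the digits that follow `keyword`, or '' if the keyword is absent."""
    # Capture any run of digits and whitespace right after the keyword.
    match = re.search(r"\b" + re.escape(keyword) + r"\b([\d\s]*)", text)
    try:
        return re.sub(r"[^0-9]", "", match.group(1))
    except AttributeError:
        # re.search returned None, so the keyword is not in the text.
        return ""

ex2 = ("david packard electrical engineering  350 serra mall room 170 "
       "phone 650 7259327  stanford university fax 650 723 1882 "
       "stanford california 943059505 ulateecestanfordedu")

print(digits_after("phone", ex2))   # 6507259327
print(digits_after("fax", ex2))     # 6507231882
print(repr(digits_after("mobile", ex2)))  # ''
```

Testing the match with "if match is None" would work just as well as the try/except; the exception form is shown only because the question asks for it.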


3 Answers

You can use a simple re.findall like this:

dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))

The resulting regex looks like

\b(phone|fax|mobile)\s*(\d+)

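See the regex demo online.

Pattern details

  • \b - a word boundary
  • (phone|fax|mobile) - Group 1: one of the listed words
  • \s* - zero or more whitespace characters
  • (\d+) - Group 2: one or more digits

See the Python demo: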

import re
exs = ["miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom",
   "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu", 
   "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"]
keys = ['phone', 'fax', 'mobile']
for ex in exs:
    res = dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
    print(res)

Output:

{'fax': '6035550161', 'phone': '6035550160', 'mobile': '6035550178'}
{'fax': '650', 'phone': '650'}
{'phone': '9162210411'}
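
Note that for the second string only '650' is captured, because \d+ stops at the first space inside the number. A possible variation, sketched here under the assumption that a number may consist of several digit groups separated by whitespace, extends Group 2 and pre-fills every key with an empty string. The names KEYS, PATTERN, and extract_numbers are hypothetical and not part of the answer above.

```python
import re

KEYS = ["phone", "fax", "mobile"]
# Group 2 now accepts digit groups separated by whitespace, e.g. "650 723 1882".
PATTERN = re.compile(r"\b({})\s*(\d+(?:\s+\d+)*)".format("|".join(KEYS)))

def extract_numbers(text):
    # Start with empty strings so missing keywords are still present in the result.
    result = {key: "" for key in KEYS}
    for key, number in PATTERN.findall(text):
        result[key] = re.sub(r"\s+", "", number)  # drop the inner spaces
    return result

ex2 = ("david packard electrical engineering  350 serra mall room 170 "
       "phone 650 7259327  stanford university fax 650 723 1882 "
       "stanford california 943059505 ulateecestanfordedu")

print(extract_numbers(ex2))
# {'phone': '6507259327', 'fax': '6507231882', 'mobile': ''}
```

The trade-off is that any standalone digit group directly following a number would be merged into it, so this only fits inputs where numbers are followed by a word, as in the three samples.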

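I think I know what you want. It comes down to capturing exactly the first match after each keyword. What you need in this case is the question mark ?:

'?' is also a quantifier, short for {0,1}. It means "match zero or one of the group preceding this question mark", which can also be read as: the part before the question mark is optional. When '?' directly follows another quantifier, as in .*?, it instead makes that quantifier lazy (it matches as few characters as possible), which is what the code below relies on.

Here is some code that should work, in case the definition is not enough: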

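import re

# ex1 is the first sample string defined in the question
res_dict = {}
list_keywords = ['phone', 'mobile', 'fax']
for i_key in list_keywords:
    # The lazy (.*?) captures as little as possible, up to the next " <letter>"
    temp_res = re.findall(i_key + '(.*?) [a-zA-Z]', ex1)
    res_dict[i_key] = temp_res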
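
To make the role of the question mark more concrete, here is a small sketch contrasting a greedy capture, a lazy capture, and an optional token. The shortened sample text and the variable names are assumptions for illustration only.

```python
import re

text = "phone 6035550160 fax 6035550161 mobile 6035550178"

# Greedy: .* takes as much as it can, so the capture runs up to the LAST " <letter>".
greedy = re.search(r"phone(.*) [a-z]", text).group(1)

# Lazy: .*? takes as little as it can, so the capture stops at the FIRST " <letter>".
lazy = re.search(r"phone(.*?) [a-z]", text).group(1)

# Optional: ? after a plain token means zero or one occurrence of that token.
optional = [bool(re.fullmatch(r"phones?", w)) for w in ("phone", "phones", "phoness")]

print(repr(greedy))   # ' 6035550160 fax 6035550161'
print(repr(lazy))     # ' 6035550160'
print(optional)       # [True, True, False]
```

Because the pattern requires a space and a letter after the number, a keyword whose number is the last thing in the string (as in the third sample) would not be matched by this approach.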

Use re.search

Demo:

import re

ex1 = "miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom"
ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"

for i in [ex1, ex2, ex3]:
    # The lookbehind anchors on the keyword; the lazy .*? stops before the next letter or at the end of the string
    phone = re.search(r"(?P<phone>(?<=\bphone\b).*?(?=([a-z]|$)))", i)
    if phone:
        print("Phone: ", phone.group("phone"))

    fax = re.search(r"(?P<fax>(?<=\bfax\b).*?(?=([a-z]|$)))", i)
    if fax:
        print("Fax: ", fax.group("fax"))

    mob = re.search(r"(?P<mob>(?<=\bmobile\b).*?(?=([a-z]|$)))", i)
    if mob:
        print("mob: ", mob.group("mob"))
    print("-----")

Output:

Phone:   6035550160 
Fax:   6035550161 
mob:   6035550178  
-----
Phone:   650 7259327  
Fax:   650 723 1882 
-----
Phone:   9162210411
-----
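
The three near-identical searches can be folded into one helper that also trims the surrounding whitespace and returns an empty string when the keyword is missing. This is only a sketch of that idea; find_after is a hypothetical name, and the pattern is the same lookbehind/lookahead construction, built per keyword.

```python
import re

def find_after(keyword, text):
    """Return the text between `keyword` and the next letter, or '' if absent."""
    # The lookbehind anchors on the keyword; the lazy .*? stops before a letter or at end of string.
    pattern = r"(?<=\b{}\b).*?(?=[a-z]|$)".format(re.escape(keyword))
    match = re.search(pattern, text)
    return match.group(0).strip() if match else ""

ex2 = ("david packard electrical engineering  350 serra mall room 170 "
       "phone 650 7259327  stanford university fax 650 723 1882 "
       "stanford california 943059505 ulateecestanfordedu")

result = {key: find_after(key, ex2) for key in ("phone", "fax", "mobile")}
print(result)
# {'phone': '650 7259327', 'fax': '650 723 1882', 'mobile': ''}
```

The inner spaces of each number are preserved; applying re.sub(r"\s+", "", value) to each result would remove them if a compact number is preferred.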
