Search for a pattern in a string and add characters if it is found

Posted 2024-03-29 01:17:23


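I'm working on some address cleaning/geocoding software, and I recently ran into a particular address format that is causing me problems.

My external geocoding module has trouble finding addresses such as 30 w 60th new york (30 w 60th street new york is the proper format for the address).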

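What I need to do is check the string:

  1. Is there a number followed by th, st, nd or rd (plus a trailing space)? E.g. 33rd, 34th, 21st, 24th
  2. If so, is it followed by the word street?

If yes, do nothing.

If no, add the word street immediately after that pattern.

Is a regular expression the best way to handle this?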

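Further clarification: I don't have any problem with other address suffixes such as avenue, road, etc. I analyze very large datasets (I run about 12,000 addresses through my application every day), and the instances where street is left out are what cause me the biggest headache. I have looked into address-parsing modules such as usaddress, smartystreets and others. I really just need a clean (hopefully regex?) solution to the specific problem I've described.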

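What I was thinking:

  1. Convert the string to a list.
  2. Find the index of the element in the list that meets the criteria I've explained.
  3. Check whether the next element is street. If it is, do nothing.
  4. If not, rebuild the list with [:targetword + len(targetword)] + 'street' + [:targetword + len(targetword)]. (targetword would be 47th or whatever is in the string.)
  5. Join the list back into a string.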
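
The five steps above can be sketched without any string slicing by working on whitespace-separated tokens. This is only an illustration, not code from the thread: the helper name add_street_by_tokens, the ORDINAL pattern and the KNOWN_SUFFIXES set are assumptions made for the example (the suffix set is taken from the suffixes the question mentions).

```python
import re

# Assumption: an "ordinal" token is digits followed by st, nd, rd or th,
# as in the examples from the question (33rd, 34th, 21st, 24th).
ORDINAL = re.compile(r'^\d+(st|nd|rd|th)$')

# Assumption: suffixes that mean the address needs no fixing.
KNOWN_SUFFIXES = {'street', 'avenue', 'road'}


def add_street_by_tokens(address):
    tokens = address.split()                     # step 1: string -> list
    for i, token in enumerate(tokens):           # step 2: find the ordinal
        if ORDINAL.match(token):
            next_token = tokens[i + 1] if i + 1 < len(tokens) else ''
            if next_token not in KNOWN_SUFFIXES:  # step 3: suffix present?
                # step 4: rebuild the list with 'street' after the ordinal
                tokens = tokens[:i + 1] + ['street'] + tokens[i + 1:]
            break                                # only look at the first ordinal
    return ' '.join(tokens)                      # step 5: list -> string
```

Note that split() followed by join() collapses runs of whitespace, and that an ordinal at the very end of the string also gets street appended, which may or may not be what is wanted.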

I'm not very good with regex, so I'm looking for some advice.

Thanks.


3 Answers

It looks like you're looking for a regexp. =P

Here is some code I built specifically for you:

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r"(?P<number>[\d]{1,3}(st|nd|rd|th)\s)(?P<following>.*)")

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # then check if not followed by 'street'
        if re.match('street', has_number.group('following')) is None:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
        else:
            return True # the format is good (followed by 'street')
    else:
        return True # there is no number like 'th, st, nd, rd'

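I'm a Python learner, so thanks for letting me know whether it solves your problem.

Tested on a small list of addresses.

Hope it helps.

Thank you!

EDIT

This version also leaves the address alone when the number is followed by "avenue" or "road", as well as "street":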

import re


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,3}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return True # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,3}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return True # there is no number like 'th, st, nd, rd'

RE-EDIT

I made some improvements based on your needs and added a usage example:

import re


# build the original address list, including badly formatted entries
address_list = [
    '30 w 60th new york',
    '30 w 60th new york',
    '30 w 21st new york',
    '30 w 23rd new york',
    '30 w 1231st new york',
    '30 w 1452nd new york',
    '30 w 1300th new york',
    '30 w 1643rd new york',
    '30 w 22nd new york',
    '30 w 60th street new york',
    '30 w 60th street new york',
    '30 w 21st street new york',
    '30 w 22nd street new york',
    '30 w 23rd street new york',
    '30 w brown street new york',
    '30 w 1st new york',
    '30 w 2nd new york',
    '30 w 116th new york',
    '30 w 121st avenue new york',
    '30 w 121st road new york',
    '30 w 123rd road new york',
    '30 w 12th avenue new york',
    '30 w 151st road new york',
    '30 w 15th road new york',
    '30 w 16th avenue new york'
]


def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # return original address
        # else add the "street" word
        else:
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address)
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd' -> return original address


# initialisation of the new list
new_address_list = []

# build the new clean list
for address in address_list:
    new_address_list.append(check_th_add_street(address))
    # or you could use it straight here, i.e.:
    # address = check_th_add_street(address)
    # print(address)

# use the new list to do your work
for address in new_address_list:
    print("Formated address is : %s" % address)  # or whatever you want to do with 'address'

This will output:

Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w 1231st street new york
Formated address is : 30 w 1452nd street new york
Formated address is : 30 w 1300th street new york
Formated address is : 30 w 1643rd street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 60th street new york
Formated address is : 30 w 21st street new york
Formated address is : 30 w 22nd street new york
Formated address is : 30 w 23rd street new york
Formated address is : 30 w brown street new york
Formated address is : 30 w 1st street new york
Formated address is : 30 w 2nd street new york
Formated address is : 30 w 116th street new york
Formated address is : 30 w 121st avenue new york
Formated address is : 30 w 121st road new york
Formated address is : 30 w 123rd road new york
Formated address is : 30 w 12th avenue new york
Formated address is : 30 w 151st road new york
Formated address is : 30 w 15th road new york
Formated address is : 30 w 16th avenue new york

RE-EDIT

Final function: added the count parameter to re.sub()

def check_th_add_street(address):
    # compile regexp rule
    has_th_st_nd_rd = re.compile(r'(?P<number>[\d]{1,4}(th|st|nd|rd)\s)(?P<following>.*)')

    # first check if the address has number followed by something like 'th, st, nd, rd'
    has_number = has_th_st_nd_rd.search(address)
    if has_number is not None:
        # check if followed by "avenue" or "road" or "street"
        if re.match(r'(avenue|road|street)', has_number.group('following')):
            return address # do nothing
        # else add the "street" word
        else:
            # then add the 'street' word
            new_address = re.sub(r'(?P<number>[\d]{1,4}(st|nd|rd|th)\s)', r'\g<number>street ', address, count=1)  # count is the maximum number of pattern occurrences to be replaced
            return new_address
    else:
        return address # there is no number like 'th, st, nd, rd'
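
For comparison, the search-then-substitute logic above could probably be collapsed into a single substitution with a negative lookahead. This is a sketch under assumptions rather than a drop-in replacement: the names MISSING_STREET and add_street are made up for the example, and the leading \b (so the digits must start a word) is an addition that the function above does not have.

```python
import re

# Assumed pattern: an ordinal such as 60th or 21st plus one whitespace
# character, NOT already followed by one of the suffixes handled above.
MISSING_STREET = re.compile(
    r'\b(\d{1,4}(?:st|nd|rd|th)\s)(?!avenue|road|street)'
)


def add_street(address):
    # count=1 mirrors the final function: at most one ordinal is changed
    return MISSING_STREET.sub(r'\1street ', address, count=1)
```

One behavioural difference to be aware of: the function above only inspects the first ordinal in the address, while this sketch changes the first ordinal that lacks a suffix, so the two can disagree on addresses containing more than one ordinal.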

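While you can certainly use regex to solve this kind of problem, I can't help thinking that there is most likely a Python library that has already solved it for you. I have never used any of these, but a quick search turns up: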

https://github.com/datamade/usaddress

https://pypi.python.org/pypi/postal-address

https://github.com/SwoopSearch/pyaddress

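PyParsing also has an address example you can look at: http://pyparsing.wikispaces.com/file/view/streetAddressParser.py

You can also take a look at this earlier question: is there a library for parsing US addresses?

Is there any reason you can't just use a third-party library to solve the problem?

EDIT: pyparsing's URL: https://github.com/pyparsing/pyparsing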

You can do this by converting each of these strings to a list and looking for particular groups of characters in those lists. For example:

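def check_th(address):
    # scan character by character for "th" and return the two digits before it
    addressList = list(address)
    for charIndex, character in enumerate(addressList):
        # enumerate gives the position of *this* 't'; list.index() would
        # always return the position of the first 't' in the address
        if character == 't' and addressList[charIndex + 1:charIndex + 2] == ['h']:
            numberList = addressList[max(charIndex - 2, 0):charIndex]
            if len(numberList) == 2 and all(c.isdigit() for c in numberList):
                return int(''.join(numberList))
    return None  # no two-digit number followed by "th" was found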

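This looks messy, but it should get the job done as long as the number is two digits long. However, if there are many things you need to look for, you should probably find a more convenient and simpler way to do this.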
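
One possibly simpler way to pull out the numbers, without the two-digit restriction, is re.findall. This is a minimal sketch; the names ORDINAL_NUMBER and find_ordinals are hypothetical, and the pattern assumes the suffixes st, nd, rd and th mentioned in the question.

```python
import re

# Assumed pattern: a run of digits immediately followed by st, nd, rd or th.
ORDINAL_NUMBER = re.compile(r'\b(\d+)(?:st|nd|rd|th)\b')


def find_ordinals(address):
    # Return every ordinal number in the address as an int, whatever its length
    return [int(number) for number in ORDINAL_NUMBER.findall(address)]
```

Because the pattern has a single capturing group, findall returns just the digit strings, which are then converted to integers.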
