Fuzzy smart number parsing in Python

2 answers

User

Answer 1 · edited 2024-05-12 17:37:29

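I reworked your code. Together with the valid_number function below, it should do the trick.

The main reason I took the time to write this awful piece of code, though, is to show future readers how ugly parsing gets if you don't know how to use regexps (like me).

Hopefully someone who knows regexps better than I do can show us how it should be done :)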

Constraints

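The period (.), comma (,) and apostrophe (') are accepted as both thousands separators and decimal separators
No more than two different separators
At most one separator may occur more than once
A single separator that occurs only once is treated as the decimal separator (i.e. 123,456 is interpreted as 123.456, not 123456)
Strings are split into a list of numbers on double spaces ('  ')
Every group of a thousands-separated number except the first must be exactly 3 digits long (123,456.00 is considered valid, but 2345,11.00 is not)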
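
As a rough sketch, the separator rules above could be checked like this. `is_valid_number` is a hypothetical helper written for illustration, not the answer's own valid_number. It assumes the input has already been stripped down to an optional sign, digits and separators, and that when several separators are present the last one is the decimal separator.

```python
import re

SEPARATORS = ",.'"


def is_valid_number(s):
    """Hypothetical check of the separator constraints (illustration only)."""
    # Optional sign followed by digits and separators, nothing else.
    m = re.fullmatch(r"[+-]?([0-9,.']+)", s)
    if m is None:
        return False
    body = m.group(1)
    if not any(ch.isdigit() for ch in body):
        return False

    seps = [ch for ch in body if ch in SEPARATORS]
    distinct = set(seps)
    # No more than two different separators.
    if len(distinct) > 2:
        return False
    # At most one separator may occur more than once.
    if sum(1 for ch in distinct if seps.count(ch) > 1) > 1:
        return False
    # Zero separators, or a single one that acts as the decimal separator.
    if len(seps) <= 1:
        return True

    # Assumption: the last separator is the decimal one, so everything
    # before it is a thousands-grouped integer.
    *integer_groups, _decimal = re.split(r"[,.']", body)
    first, rest = integer_groups[0], integer_groups[1:]
    return len(first) > 0 and all(len(group) == 3 for group in rest)
```

With these rules, '123,456.00' and '-1.000,22' pass, while '2345,11.00' and '111,145,234.345.345' are rejected.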

Code

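import re
from itertools import combinations


def extract_number(value):
    # Numeric input is passed straight through as a float.
    if isinstance(value, (int, float)):
        yield float(value)
    else:
        # Strip the string for leading and trailing whitespace
        value = value.strip()
        if len(value) == 0:
            # A plain return ends the generator; raising StopIteration
            # here becomes a RuntimeError on Python 3.7+ (PEP 479).
            return
        # Individual numbers are separated by a double space
        for s in value.split('  '):
            # Drop HTML numeric character references such as &#8364
            s = re.sub(r'&#\d+', '', s)
            # Blank out everything except sign, whitespace, digits, ',' and '.'
            s = re.sub(r'[^\-\s0-9\,\.]', ' ', s)
            s = s.replace(' ', '')
            if len(s) == 0:
                continue
            if not valid_number(s):
                continue
            if not sum(s.count(sep) for sep in [',', '.', '\'']):
                # No separator at all: a plain integer string
                yield float(s)
            else:
                # Unify all separators, then treat the last one as the decimal point
                s = s.replace('.', '@').replace('\'', '@').replace(',', '@')
                integer, decimal = s.rsplit('@', 1)
                integer = integer.replace('@', '')
                s = '.'.join([integer, decimal])
                yield float(s)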

And here is some code that could probably be replaced by a few regexp statements.

^{pr2}$

Output

extract_number('2'                  ):  [2.0]
extract_number('.2'                 ):  [0.2]
extract_number(2                    ):  [2.0]
extract_number(0.2                  ):  [0.2]
extract_number('EUR 200'            ):  [200.0]
extract_number('EUR 200.00  -11.2'  ):  [200.0, -11.2]
extract_number('EUR 200  EUR 300'   ):  [200.0, 300.0]
extract_number('$ -1.000,22'        ):  [-1000.22]
extract_number('EUR 100.2345,3443'  ):  []
extract_number('111,145,234.345.345'):  []
extract_number('20,5  20,8'         ):  [20.5, 20.8]
extract_number('20.345.32.231,50'   ):  []

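User

Answer 2 · edited 2024-05-12 17:37:29

You can do this with a suitably fancy regular expression. Here is my best attempt. I use named capturing groups because, with a pattern this complex, numbered groups would be much more confusing to use in backreferences.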
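
To illustrate the idea on its own, here is a small sketch of a named group plus a backreference to it. The name `same_sep` is made up for this example and is not part of the answer's code; the pattern only checks that a thousands-grouped integer uses the same separator throughout.

```python
import re

# (?P<sep>...) names the first separator; (?P=sep) must match the same
# character again, so mixed separators are rejected.
same_sep = re.compile(r"\d{1,3}(?P<sep>[ ,.])\d{3}(?:(?P=sep)\d{3})*")

for text in ["1,234,567", "1 234 567", "1,234.567"]:
    match = same_sep.fullmatch(text)
    print(repr(text), "->", match.group("sep") if match else None)
```

The first two strings match (with ',' and ' ' as the captured separator), while the third does not because the second separator differs from the first.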

First, the regexp pattern:

_pattern = r"""(?x)       # enable verbose mode (which ignores whitespace and comments)
    ^                     # start of the input
    [^\d+,.-]*            # prefixed junk (anything except digits, signs and separators)
    (?P<number>           # capturing group for the whole number
        (?P<sign>[+-])?       # sign group (optional)
        (?P<integer_part>         # capturing group for the integer part
            \d{1,3}               # leading digits in an int with a thousands separator
            (?P<sep>              # capturing group for the thousands separator
                [ ,.]                 # the allowed separator characters
            )
            \d{3}                 # exactly three digits after the separator
            (?:                   # non-capturing group
                (?P=sep)              # the same separator again (a backreference)
                \d{3}                 # exactly three more digits
            )*                    # repeated 0 or more times
        |                     # or
            \d+                   # simple integer (just digits with no separator)
        )?                    # integer part is optional, to allow numbers like ".5"
        (?P<decimal_part>     # capturing group for the decimal part of the number
            (?P<point>            # capturing group for the decimal point
                (?(sep)               # conditional pattern, only tested if sep matched
                    (?!                   # a negative lookahead
                        (?P=sep)              # backreference to the separator
                    )
                )
                [.,]                  # the accepted decimal point characters
            )
            \d+                   # one or more digits after the decimal point
        )?                    # the whole decimal part is optional
    )
    [^\d]*                # suffixed junk
    $                     # end of the input
"""

And here is a function to use it:

^{pr2}$
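
As a minimal sketch of how such a pattern could be consumed, assuming the group names shown above: the version below is an illustration written for this page, not the answer's original function. It carries a condensed copy of the pattern so that it runs on its own, returns an int when there is no decimal part, a float when there is one, and None when nothing numeric is found.

```python
import re

# Condensed copy of the pattern above (same groups, fewer comments).
_number_re = re.compile(r"""(?x)
    ^ [^\d+,.-]*                         # prefixed junk
    (?P<number>
        (?P<sign>[+-])?
        (?P<integer_part>
            \d{1,3} (?P<sep>[ ,.]) \d{3} (?: (?P=sep) \d{3} )*
          | \d+
        )?
        (?P<decimal_part>
            (?P<point> (?(sep) (?! (?P=sep) ) ) [.,] )
            \d+
        )?
    )
    [^\d]* $                             # suffixed junk
""")


def parse_number(text):
    match = _number_re.match(text)
    # The pattern can match with an empty number, so check the parts too.
    if match is None or not (match.group("integer_part")
                             or match.group("decimal_part")):
        return None

    sign = -1 if match.group("sign") == "-" else 1
    integer_part = match.group("integer_part") or "0"
    sep = match.group("sep")
    if sep:
        # Remove the thousands separators.
        integer_part = integer_part.replace(sep, "")

    decimal_part = match.group("decimal_part")
    if decimal_part:
        # Drop the matched decimal point and use "." instead.
        return sign * float(integer_part + "." + decimal_part[1:])
    return sign * int(integer_part)
```

Returning None for unparseable input is a design choice of this sketch; raising an exception would work just as well.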

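Some number strings with a single comma or period followed by exactly three digits (such as "1,234" and "1.234") are ambiguous. This code parses both of them as integers with a thousands separator (1234) rather than as floating-point values (1.234), regardless of which separator character is actually used. If you want a different result for these numbers (for example, if you would prefer "1.234" to become a float), you can handle them with a special case.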
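
One possible shape for that special case, as a rough sketch: test for the ambiguous form first and only fall back to the general parser when it does not apply. `parse_ambiguous_as_float` and `_AMBIGUOUS` are hypothetical names, and this version deliberately ignores strings with surrounding junk such as currency symbols.

```python
import re

# One to three digits, a single comma or period, then exactly three digits.
_AMBIGUOUS = re.compile(r"[+-]?\d{1,3}[.,]\d{3}")


def parse_ambiguous_as_float(text):
    """Return a float for strings like "1,234" or "1.234", else None."""
    stripped = text.strip()
    if _AMBIGUOUS.fullmatch(stripped):
        # Treat the lone separator as a decimal point.
        return float(stripped.replace(",", "."))
    return None
```

A caller would try this helper first and call parse_number only when it returns None.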

Some test output:

>>> test_cases = ["2", "2.3", "2,35", "-2 000,5", "EUR 1.000,74 €",
...               "20,5 20,8", "20.345.32.231,50", "1.234"]
>>> for s in test_cases:
...     print("{!r:20}: {}".format(s, parse_number(s)))
...

'2'                 : 2
'2.3'               : 2.3
'2,35'              : 2.35
'-2 000,5'          : -2000.5
'EUR 1.000,74 €'    : 1000.74
'20,5 20,8'         : None
'20.345.32.231,50'  : None
'1.234'             : 1234
