Python: detect number separator symbols and parse into a float without locale

8 votes
2 answers
3746 views
Asked 2025-04-18 14:39

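I have a dataset with millions of text files in which numbers are saved as strings, formatted according to a variety of locales. What I am trying to do is guess which symbol is the decimal separator and which is the thousands separator.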

This shouldn't be too hard, but the question doesn't seem to have been asked yet, so for posterity it should be asked and answered here.

What I do know is that there is always a decimal separator in the string, and that it is always the last non-digit symbol in the string.

As you can see, simply calling numStr.replace(',', '.') to normalize the decimal separator conflicts with possible thousands separators.
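
To make the conflict concrete, here is a minimal sketch of the naive replacement applied to a few strings from the dataset. The helper name naive_parse is made up for illustration; it is not part of the question.

```python
def naive_parse(num_str):
    # Hypothetical helper: treat every comma as a decimal point.
    return float(num_str.replace(',', '.'))


for num_str in ['1,0000', '10,000.0000', '1,000,000.0000']:
    try:
        print(num_str, '->', naive_parse(num_str))
    except ValueError:
        # The thousands commas also became dots, so more than one '.'
        # is left and float() rejects the string.
        print(num_str, '-> not a valid float after replace')
```

Only the first string parses; the other two end up with several dots.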

I have seen ways of doing this when the locale is known, but in this case I do not know the locale.

Dataset:

1.0000 //1.0
1,0000 //1.0
10,000.0000 //10000.0
10.000,0000 //10000.0
1,000,000.0000 // 1000000.0
1.000.000,0000 // 1000000.0

//also possible

1 000 000.0000 //1000000.0 with spaces as thousand separators

2 Answers

2

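Another approach, which also checks for malformed number formats, flags possible misinterpretations, and is faster than the current solution (performance reports below):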

import re

pattern_comma_thousands_dot_decimal = re.compile(r'^[-+]?((\d{1,3}(,\d{3})*)|(\d*))(\.|\.\d*)?$')
pattern_dot_thousands_comma_decimal = re.compile(r'^[-+]?((\d{1,3}(\.\d{3})*)|(\d*))(,|,\d*)?$')
pattern_confusion_dot_thousands = re.compile(r'^(?:[-+]?(?=.*\d)(?=.*[1-9]).{1,3}\.\d{3})$')  # for numbers like '100.000' (is it 100.0 or 100000?)
pattern_confusion_comma_thousands = re.compile(r'^(?:[-+]?(?=.*\d)(?=.*[1-9]).{1,3},\d{3})$')  # for numbers like '100,000' (is it 100.0 or 100000?)


def parse_number_with_guess_for_separator_chars(number_str: str, max_val=None):
    """
    Tries to guess the thousands and decimal characters (comma or dot) and converts the string number accordingly.
    The return also indicates if the correctness of the result is certain or uncertain
    :param number_str: a string with the number to convert
    :param max_val: an optional parameter determining the allowed maximum value.
                     This helps prevent mistaking the decimal separator as a thousands separator.
                     For instance, if max_val is 101 then the string '100.000' which would be
                     interpreted as 100000.0 will instead be interpreted as 100.0
    :return: a tuple with the number as a float and a flag (`True` if certain and `False` if uncertain)
    """
    number_str = number_str.strip().lstrip('0')
    certain = True
    if pattern_confusion_dot_thousands.match(number_str) is not None:
        number_str = number_str.replace('.', '')  # assume dot is thousands separator
        certain = False
    elif pattern_confusion_comma_thousands.match(number_str) is not None:
        number_str = number_str.replace(',', '')  # assume comma is thousands separator
        certain = False
    elif pattern_comma_thousands_dot_decimal.match(number_str) is not None:
        number_str = number_str.replace(',', '')
    elif pattern_dot_thousands_comma_decimal.match(number_str) is not None:
        number_str = number_str.replace('.', '').replace(',', '.')
    else:
        raise ValueError()  # For stuff like '10,000.000,0' and other nonsense

    number = float(number_str)
    if not certain and max_val is not None and number > max_val:
        number /= 1000  # Change previous assumption to decimal separator, so '100.000' goes from 100000.0 to 100.0 (dividing avoids the rounding error of the inexact 0.001)
        certain = True  # Since this uniquely satisfies the given constraint, it should be a certainly correct interpretation

    return number, certain
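
The max_val disambiguation described in the docstring can be looked at in isolation. The following is a simplified sketch, not the answer's code: the names AMBIGUOUS and resolve_ambiguous are invented here, and it only handles strings with a single separator followed by exactly three digits.

```python
import re

# Hypothetical pattern: one separator followed by exactly three digits,
# e.g. '100.000' or '100,000', which could mean 100.0 or 100000.
AMBIGUOUS = re.compile(r'^[-+]?\d{1,3}[.,]\d{3}$')


def resolve_ambiguous(number_str, max_val=None):
    """Return (value, certain) for a string with one ambiguous separator."""
    if AMBIGUOUS.match(number_str) is None:
        raise ValueError(f'not an ambiguous number: {number_str!r}')
    # First guess: the separator groups thousands, so drop it.
    value = float(number_str.replace('.', '').replace(',', ''))
    if max_val is not None and value > max_val:
        # The guess exceeds the allowed maximum, so reread the separator
        # as a decimal point instead.
        return float(number_str.replace(',', '.')), True
    return value, False


print(resolve_ambiguous('100.000'))               # (100000.0, False)
print(resolve_ambiguous('100,000', max_val=101))  # (100.0, True)
```

Without max_val the result stays flagged as uncertain, which leaves the decision to the caller.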

Worst-case performance:

python -m timeit "parse_number_with_guess_for_separator_chars('10,043,353.23')"
100000 loops, best of 5: 2.01 usec per loop

python -m timeit "John1024_solution('10.089.434,54')"
100000 loops, best of 5: 3.04 usec per loop

Best-case performance:

python -m timeit "parse_number_with_guess_for_separator_chars('10.089')"       
500000 loops, best of 5: 946 nsec per loop

python -m timeit "John1024_solution('10.089')"       
100000 loops, best of 5: 1.97 usec per loop
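
To reproduce measurements like these, the timed statement has to be able to see the function, for example through a -s setup import on the command line or through the timeit module's globals argument. A minimal sketch of the latter follows; placeholder_parse is a stand-in made up for this example, to be swapped for whichever function is being measured.

```python
import timeit


def placeholder_parse(number_str):
    # Stand-in only: replace with the function you actually want to time.
    return float(number_str.replace(',', ''))


loops = 100_000
# globals makes the function visible inside the timed statement.
runs = timeit.repeat(
    "placeholder_parse('10,043,353.23')",
    globals={'placeholder_parse': placeholder_parse},
    number=loops,
    repeat=5,
)
best = min(runs) / loops
print(f'{loops} loops, best of 5: {best * 1e6:.2f} usec per loop')
```

Absolute numbers depend on the machine and Python version, so only comparisons made on the same setup are meaningful.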

7

One approach:

import re
with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        separators = re.sub('[0-9]', '', line)
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)

On your sample input (with the comments removed), the output is:

1.0000
1.0000
10000.0000
10000.0000
1000000.0000
1000000.0000
1000000.0000
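
If floats are needed rather than printed strings, the same loop body could be wrapped in a function. This is a sketch under that assumption; the name last_separator_to_float is invented here, and it relies on the question's guarantee that the last non-digit symbol is the decimal separator.

```python
import re


def last_separator_to_float(text):
    """Hypothetical wrapper: treat the last non-digit as the decimal point."""
    text = text.strip()
    # Everything that is not a digit is some kind of separator.
    separators = re.sub(r'\d', '', text)
    # Drop every separator except the last one (thousands grouping).
    for sep in separators[:-1]:
        text = text.replace(sep, '')
    # The last separator is assumed to be the decimal separator.
    if separators:
        text = text.replace(separators[-1], '.')
    return float(text)


for sample in ['1,0000', '10.000,0000', '1 000 000.0000']:
    print(sample, '->', last_separator_to_float(sample))
```

Note that a leading sign would be treated as a separator and dropped, so signed input would need extra handling.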

Update: handling Unicode

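As NeoZenith points out in the comments, with modern Unicode text the traditional character class [0-9] may not be reliable, since digits need not be ASCII. The following can be used instead: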

import re
with open('numbers') as fhandle:
    for line in fhandle:
        line = line.strip()
        separators = re.sub(r'\d', '', line, flags=re.U)
        for sep in separators[:-1]:
            line = line.replace(sep, '')
        if separators:
            line = line.replace(separators[-1], '.')
        print(line)

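In Python 2, without the re.U flag, \d is equivalent to [0-9]. With the flag, \d matches any character classified as a decimal digit in the Unicode character properties database. In Python 3, str patterns are Unicode-aware by default, so re.U is redundant there and re.ASCII is what restricts \d to [0-9]. Alternatively, if you need to handle unusual digit characters, consider unicode.translate (str.translate in Python 3).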
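
As a small illustration under Python 3, the sketch below uses Arabic-Indic digits as an example of non-ASCII digits. The sample string and the to_ascii table are made up for this example.

```python
import re

# Arabic-Indic digits for "123.4", written with escapes to avoid
# right-to-left rendering issues in editors.
sample = '\u0661\u0662\u0663.\u0664'

# In Python 3, \d in a str pattern matches Unicode decimal digits by default.
print(re.sub(r'\d', '', sample))  # prints: .

# re.ASCII restricts \d to [0-9], so nothing is removed here.
print(re.sub(r'\d', '', sample, flags=re.ASCII) == sample)  # prints: True

# str.translate can normalise digits: map each Arabic-Indic digit
# (U+0660..U+0669) to its ASCII counterpart.
to_ascii = {0x0660 + i: str(i) for i in range(10)}
print(sample.translate(to_ascii))  # prints: 123.4
```

A translation table like this has to be extended for every digit script that can appear in the data.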
