NLTK regexp tokenizer not playing nice with decimal points

Asked 2025-04-17 20:47

I'm trying to write a text normalizer, and one of the basic cases it needs to handle is turning a number like 3.14 into three point one four or three point fourteen.

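The pattern I'm currently using is \$?\d+(\.\d+)?%? with nltk.regexp_tokenize, which I believe should handle numbers as well as currency and percentages. However, while something like $23.50 is handled perfectly (it parses to ['$23.50']), 3.14 parses to ['3', '14']: the decimal point is dropped.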

I tried adding a separate pattern \d+.\d+ to my regexp, but that didn't help (and shouldn't my current pattern already match that?)

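Edit 2: I also just discovered that the % part doesn't seem to work correctly either: 20% returns only ['20']. I feel like something is wrong with my regexp, but I tested it in Pythex and it seems fine?

Edit: Here is my code.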

import nltk

pattern = r'''(?x)    # set flag to allow verbose regexps
            ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
            | \w+([-']\w+)*        # words w/ optional internal hyphens/apostrophe
            | \$?\d+(\.\d+)?%?  # numbers, incl. currency and percentages
            | [+/\-@&*]         # special characters with meanings
            '''

line = "3.14 is pi."  # one of the test strings below
words = nltk.regexp_tokenize(line, pattern)
words = [w.lower() for w in words]  # lowercase every token
print(words)

Here are some of my test strings:

32188
2598473
26 letters from A to Z
3.14 is pi.                         <-- ['3', '14', 'is', 'pi']
My weight is about 68 kg, +/- 10 grams.
Good muffins cost $3.88 in New York <-- ['good', 'muffins', 'cost', '$3.88', 'in', 'new', 'york']


2 Answers

Try this regex:

\b\$?\d+(\.\d+)?%?\b

I surrounded your original regex with word boundaries, using \b.
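To see what the word boundaries do with the question's inputs, here is a minimal sketch that runs this pattern on its own with re.findall. The names BOUNDED and find_numbers are made up for illustration, and the groups are written as non-capturing (?:...) so that findall returns whole matches. Because $ and % are not word characters, \b cannot match right before a $ that follows a space or right after a trailing %, so those symbols may fall outside the match.

```python
import re

# The suggested pattern, with a non-capturing group so findall returns whole matches.
BOUNDED = r'\b\$?\d+(?:\.\d+)?%?\b'

def find_numbers(text):
    """Return every substring of `text` matched by the bounded number pattern."""
    return re.findall(BOUNDED, text)

print(find_numbers("3.14 is pi."))    # the decimal stays in one piece
print(find_numbers("cost $3.88 in"))  # check whether the $ is kept
print(find_numbers("20%"))            # check whether the % is kept
```

This only tests the number pattern in isolation; inside the full tokenizer pattern the order of the alternatives still matters.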

Answered 2025-04-17 by Python大师


The culprit is:

\w+([-']\w+)*

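The \w+ in this alternative also matches digits, and because nothing in it allows for a ., it matches only 3 rather than 3.14. Alternatives are tried from left to right, so the word pattern consumes the 3 before the number pattern gets a chance. You can reorder the alternatives so that \$?\d+(\.\d+)?%? comes before the regex part above (so that the number format is tried first):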

(?x)([A-Z]\.)+|\$?\d+(\.\d+)?%?|\w+([-']\w+)*|[+/\-@&*]

regex101 demo

Or in expanded form:

pattern = r'''(?x)               # set flag to allow verbose regexps
              ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
              | \$?\d+(\.\d+)?%? # numbers, incl. currency and percentages
              | \w+([-']\w+)*    # words w/ optional internal hyphens/apostrophe
              | [+/\-@&*]        # special characters with meanings
            '''
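The effect of the reordering can be checked with a short sketch. It assumes plain re.findall as a stand-in for nltk.regexp_tokenize, and it rewrites the groups as non-capturing (?:...) so that findall returns whole tokens; the names ORIGINAL, REORDERED and tokenize are illustrative only.

```python
import re

# Alternatives in the question's order: plain words come before numbers.
ORIGINAL = r'''(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \w+(?:[-']\w+)*         # words with optional internal hyphens/apostrophes
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | [+/\-@&*]               # special characters with meanings
'''

# Same alternatives, but numbers are tried before plain words.
REORDERED = r'''(?x)
      (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?      # numbers, incl. currency and percentages
    | \w+(?:[-']\w+)*         # words with optional internal hyphens/apostrophes
    | [+/\-@&*]               # special characters with meanings
'''

def tokenize(text, pattern):
    """Find all non-overlapping matches of `pattern` and lowercase them."""
    return [token.lower() for token in re.findall(pattern, text)]

for sample in ["3.14 is pi.", "20%", "Good muffins cost $3.88 in New York"]:
    print(tokenize(sample, ORIGINAL), tokenize(sample, REORDERED))
```

With the number alternative first, 3.14 and 20% each come back as a single token. One trade-off of this ordering is that a token that starts with digits and continues with letters, such as 2nd, is split after the digits.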

Answered 2025-04-17 by Python大师
