NLTK regular-expression tokenization

Published 2024-04-25 19:19:51


I am trying to implement a regular-expression tokenizer with NLTK in Python, but the result is:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]

But the desired result is:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where did I go wrong?


1 Answer

You should convert all the capturing groups into non-capturing groups:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The problem is that regexp_tokenize appears to use re.findall, which returns a list of capture-group tuples when the pattern defines multiple capturing groups. See this nltk.tokenize package reference:
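The difference is easy to reproduce with the standard library's re.findall alone (a minimal sketch using a shortened two-alternative pattern, not the full tokenizer pattern):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With capturing groups, re.findall returns one tuple of group
# captures per match -- the matched text itself is lost.
capturing = re.findall(r'([A-Z]\.)+|\w+(-\w+)*', text)
# e.g. 'poster-print' shows up only as ('', '-print')

# With non-capturing groups, findall returns the full matched strings.
non_capturing = re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', text)
# e.g. ['That', 'U.S.A.', 'poster-print', ...]
```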

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)

Also, I am not sure you meant `:-_` to match a range of characters (one that happens to span all the uppercase letters); to match a literal hyphen, put the `-` at the end of the character class.
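A quick check of what that accidental range actually matches (illustrative only):

```python
import re

# In '[:-_]' the '-' creates a range from ':' (0x3A) to '_' (0x5F),
# which includes every uppercase letter.
in_range = bool(re.match(r'[:-_]', 'A'))      # 'A' is 0x41, inside the range

# With '-' at the end of the class, it is a literal hyphen.
fixed_hyphen = bool(re.match(r'[:_-]', '-'))  # hyphen still matches
fixed_letter = bool(re.match(r'[:_-]', 'A'))  # letters no longer match
```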

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
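Since regexp_tokenize builds on re.findall, the corrected pattern can be verified with the standard library alone; nltk.regexp_tokenize(text, pattern) should return the same list:

```python
import re

pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # these are separate tokens; includes ], [
'''

# All groups are now non-capturing, so findall returns whole matches.
tokens = re.findall(pattern, 'That U.S.A. poster-print costs $12.40...')
# tokens == ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```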
