NLTK regular-expression tokenization

Published 2024-04-25 19:19:51


I am trying to implement a regular-expression tokenizer with NLTK in Python, but the result is:

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)    # set flag to allow verbose regexps
...     ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
...   | \w+(-\w+)*        # words with optional internal hyphens
...   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.            # ellipsis
...   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '.40'), ('', '', '')]

But the desired result is:

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

Why? Where did I go wrong?


1 Answer

You should convert all the capturing groups into non-capturing groups:

  • ([A-Z]\.)+ -> (?:[A-Z]\.)+
  • \w+(-\w+)* -> \w+(?:-\w+)*
  • \$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?

The problem is that regexp_tokenize appears to use re.findall, which returns a list of capture-group tuples when the pattern defines multiple capturing groups. See this nltk.tokenize package reference:
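The difference is easy to reproduce with the standard library's re.findall alone (a minimal sketch using a shortened two-alternative pattern, not the full tokenizer pattern):

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With capturing groups, re.findall returns one tuple of group
# captures per match -- the matched text itself is lost.
capturing = re.findall(r'([A-Z]\.)+|\w+(-\w+)*', text)
# e.g. 'poster-print' shows up only as ('', '-print')

# With non-capturing groups, findall returns the full matched strings.
non_capturing = re.findall(r'(?:[A-Z]\.)+|\w+(?:-\w+)*', text)
# e.g. ['That', 'U.S.A.', 'poster-print', ...]
```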

pattern (str) – The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; Use non-capturing parentheses, e.g. (?:...), instead)

Also, I am not sure you meant `:-_` to match a range of characters (one that happens to span all the uppercase letters); to match a literal hyphen, put the `-` at the end of the character class.
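A quick check of what that accidental range actually matches (illustrative only):

```python
import re

# In '[:-_]' the '-' creates a range from ':' (0x3A) to '_' (0x5F),
# which includes every uppercase letter.
in_range = bool(re.match(r'[:-_]', 'A'))      # 'A' is 0x41, inside the range

# With '-' at the end of the class, it is a literal hyphen.
fixed_hyphen = bool(re.match(r'[:_-]', '-'))  # hyphen still matches
fixed_letter = bool(re.match(r'[:_-]', 'A'))  # letters no longer match
```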

So, use:

pattern = r'''(?x)          # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
      | \w+(?:-\w+)*        # words with optional internal hyphens
      | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
      | \.\.\.              # ellipsis
      | [][.,;"'?():_`-]    # these are separate tokens; includes ], [
    '''
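Since regexp_tokenize builds on re.findall, the corrected pattern can be verified with the standard library alone; nltk.regexp_tokenize(text, pattern) should return the same list:

```python
import re

pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # these are separate tokens; includes ], [
'''

# All groups are now non-capturing, so findall returns whole matches.
tokens = re.findall(pattern, 'That U.S.A. poster-print costs $12.40...')
# tokens == ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```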
