使用正则表达式进行句子分割

2 投票

5 回答

2083 浏览

提问于 2025-04-16 21:49

我有一些短信内容，想用句号（'.'）来分割它们。但是我遇到了一些特殊类型的消息，处理起来有点困难。我该怎么用Python中的正则表达式来分割这些消息呢？

分割前：

'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'

分割后：

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u'
'no of beds 8' 'please inform person in-charge' 'tq'

每一行都是一条独立的消息

更新：

我正在做自然语言处理，觉得把 '16.8mmmol/l' 和 'no of beds 8.2 cups of tea.' 当作相同的处理是可以的。对我来说，80%的准确率就足够了，但我想尽量减少 假阳性 的情况。

正则表达式文本处理自然语言处理语言模型句子分割消息解析

5 个回答

你可以使用一种叫做“负向前瞻”的技巧，来匹配一个“.”后面没有数字的情况，然后用 re.split 来处理这个匹配的结果：

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']

回答于 2025-04-16 由 Python大师

分享举报

几周前，我在找一种正则表达式，想要能够识别字符串中所有表示数字的部分，不管数字是以什么形式写的，包括科学计数法的数字，还有用逗号分隔的印度数字。你可以看看这个讨论。

我在下面的代码中使用了这个正则表达式，来解决你的问题。

和其他答案不同的是，在我的解决方案中，像'8.'这样的点不会被当作需要分割的点，因为它可以被理解为一个浮点数，后面没有数字。

import re

regx = re.compile('(?<![\d.])(?!\.\.)'
                  '(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  '([+-]?)'
                  '(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  '(?:0|,(?=0)|(?<!\d),)*'
                  '(?:'
                  '((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|\.(0)'
                  '|((?<!\.)\.\d+?)'
                  '|([\d,]+\.\d+?))'
                  '0*'
                  '' #---------------------------------
                  '(?:'
                  '([eE][+-]?)(?:0|,(?=0))*'
                  '(?:'
                  '(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  '|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  '|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  '0*'
                  ')?'
                  '' #---------------------------------
                  '(?![.,]?\d)')



simpler_regex = re.compile('(?<![\d.])0*(?:'
                           '(\d+)\.?|\.(0)'
                           '|(\.\d+?)|(\d+\.\d+?)'
                           ')0*(?![\d.])')


def split_outnumb(string, regx=regx, a=0):
    excluded_pos = [x for mat in regx.finditer(string) for x in range(*mat.span()) if string[x]=='.']
    li = []
    for xdot in (x for x,c in enumerate(string) if c=='.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print 'sentence         =',sentence
    print 'splitted eyquem  =',split_outnumb(sentence)
    print 'splitted eyqu 2  =',split_outnumb(sentence,regx=simpler_regex)
    print 'splitted gurney  =',re.split(r"\.(?!\d)", sentence)
    print 'splitted stema   =',re.split('(?<!\d)\.|\.(?!\d)',sentence)
    print

结果

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

编辑 1

我添加了一个simpler_regex，用于检测数字，这个正则表达式来自我在这个讨论中的一个帖子。

这个正则表达式不能识别印度数字和科学计数法的数字，但实际上它给出的结果是一样的。

回答于 2025-04-16 由 Python大师

分享举报

那这个呢

re.split('(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

这个“前后查找”确保在某一边或者另一边都不是数字。所以这也包括了 16.8 这种情况。如果两边都是数字，这个表达式就不会进行分割。

回答于 2025-04-16 由 Python大师

分享举报

使用正则表达式进行句子分割

5 个回答

编辑 1

撰写回答