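Splitting sentences using regular expressions
I have some text messages (SMS) that I want to split on the period ('.'). Some special kinds of messages are giving me trouble, though. How can I split these messages with a regex in Python?
Before splitting: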
'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u' 'no of beds 8.please inform person in-charge.tq'
After splitting:
'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u' 'no of beds 8' 'please inform person in-charge' 'tq'
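Each line is a separate message.
Update:
I'm doing natural language processing, and I think it's fine to treat '16.8mmol/l' and 'no of beds 8.2 cups of tea.' the same way. 80% accuracy is good enough for me, but I want to keep false positives as low as possible.
5 Answers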
2
You can use a negative lookahead to match a '.' that is not followed by a digit, and pass that pattern to re.split:
>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']
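One thing to watch with this pattern: a dot at the very end of a message also matches, so re.split leaves an empty string behind, and fragments keep any surrounding whitespace. A minimal sketch of a wrapper, assuming you want those cleaned up (the name split_message is made up for illustration and is not part of the answer):

```python
import re

# Dot that is NOT followed by a digit (same pattern as in the answer above)
SPLITTER = re.compile(r"\.(?!\d)")

def split_message(message):
    """Split on sentence-ending dots, dropping empty or whitespace-only pieces."""
    return [part.strip() for part in SPLITTER.split(message) if part.strip()]

print(split_message('no of beds 8.2 cups of tea.'))
print(split_message('plz review b4 5pm. thank u.'))
```

Filtering after the split keeps the regex itself unchanged, so the behaviour on '16.8' and '8.2' stays the same as in the session above.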
5
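A few weeks ago I was looking for a regex that could catch every substring representing a number in a string, whatever form the number is written in, including numbers in scientific notation and Indian numbers with comma grouping. See this thread.
I use that regex in the code below to solve your problem.
Unlike the other answers, my solution does not treat the dot in something like '8.' as a split point, because '8.' can be read as a float with no digits after the decimal point.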
import re

# Detects numbers in many notations: plain floats, scientific notation,
# and Indian-style comma grouping.
regx = re.compile(r'(?<![\d.])(?!\.\.)'
                  r'(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  r'' #---------------------------------
                  r'([+-]?)'
                  r'(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  r'(?:0|,(?=0)|(?<!\d),)*'
                  r'(?:'
                  r'((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|\.(0)'
                  r'|((?<!\.)\.\d+?)'
                  r'|([\d,]+\.\d+?))'
                  r'0*'
                  r'' #---------------------------------
                  r'(?:'
                  r'([eE][+-]?)(?:0|,(?=0))*'
                  r'(?:'
                  r'(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  r'|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  r'0*'
                  r')?'
                  r'' #---------------------------------
                  r'(?![.,]?\d)')

# Simpler number detector: no scientific notation, no comma grouping.
simpler_regex = re.compile(r'(?<![\d.])0*(?:'
                           r'(\d+)\.?|\.(0)'
                           r'|(\.\d+?)|(\d+\.\d+?)'
                           r')0*(?![\d.])')

def split_outnumb(string, regx=regx, a=0):
    # Positions of dots that lie inside a detected number; never split there.
    excluded_pos = [x for mat in regx.finditer(string)
                    for x in range(*mat.span()) if string[x] == '.']
    li = []
    # Split at every remaining dot; `a` tracks the start of the current piece.
    for xdot in (x for x, c in enumerate(string)
                 if c == '.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li

for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print('sentence =', sentence)
    print('splitted eyquem =', split_outnumb(sentence))
    print('splitted eyqu 2 =', split_outnumb(sentence, regx=simpler_regex))
    print('splitted gurney =', re.split(r"\.(?!\d)", sentence))
    print('splitted stema =', re.split(r'(?<!\d)\.|\.(?!\d)', sentence))
    print()
Result
sentence = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2 = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
sentence = no of beds 8.please inform person in-charge.tq
splitted eyquem = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2 = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema = ['no of beds 8', 'please inform person in-charge', 'tq']
sentence = no of beds 8.2 cups of tea.tarabada
splitted eyquem = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2 = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema = ['no of beds 8.2 cups of tea', 'tarabada']
sentence = this number .977 is a float
splitted eyquem = ['this number .977 is a float']
splitted eyqu 2 = ['this number .977 is a float']
splitted gurney = ['this number .977 is a float']
splitted stema = ['this number ', '977 is a float']
sentence = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2 = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']
sentence = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2 = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
sentence = no dot in this sentence
splitted eyquem = ['no dot in this sentence']
splitted eyqu 2 = ['no dot in this sentence']
splitted gurney = ['no dot in this sentence']
splitted stema = ['no dot in this sentence']
sentence =
splitted eyquem = ['']
splitted eyqu 2 = ['']
splitted gurney = ['']
splitted stema = ['']
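Edit 1
I added simpler_regex, a number-detecting regex taken from one of my posts in this thread.
It doesn't recognize Indian-style numbers or numbers in scientific notation, but in practice it gives the same results on these examples.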
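To see why the dot in '8.' is never a split point with this approach, it can help to look at what the number detector actually matches. A small sketch that copies simpler_regex from the code above; the helper name number_spans is hypothetical and only there for illustration:

```python
import re

# Copy of simpler_regex from the answer, written with raw strings
simpler_regex = re.compile(r'(?<![\d.])0*(?:'
                           r'(\d+)\.?|\.(0)'
                           r'|(\.\d+?)|(\d+\.\d+?)'
                           r')0*(?![\d.])')

def number_spans(text, regx=simpler_regex):
    """Return the substrings that the regex treats as numbers."""
    return [m.group() for m in regx.finditer(text)]

print(number_spans('no of beds 8.please inform person in-charge.tq'))
print(number_spans('hyper count 16.8mmol/l.plz review b4 5pm'))
```

In the first message the detector swallows '8.' as a whole, so that dot ends up in the excluded positions. That is what makes this answer more conservative than the lookaround-based ones on inputs like 'no of beds 8.please'.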
1
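How about this:
re.split(r'(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')
The lookarounds ensure that at least one side of the dot is not a digit. That covers the 16.8 case too: when there are digits on both sides, the expression does not split.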
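The difference between the lookaround variants shows up on inputs such as '.977' or '8.please'. A quick comparison sketch: the first two patterns come from the answers in this thread, while the third (no digit on either side of the dot) is not from any answer and is only an assumption about what a stricter splitter could look like if false positives matter most:

```python
import re

PATTERNS = {
    'lookahead only': r'\.(?!\d)',            # split unless a digit follows
    'either side':    r'(?<!\d)\.|\.(?!\d)',  # split unless digits are on both sides
    'both sides':     r'(?<!\d)\.(?!\d)',     # hypothetical: split only if no digit touches the dot
}

def compare(text):
    """Split text with every pattern and return the pieces keyed by pattern name."""
    return {name: re.split(pattern, text) for name, pattern in PATTERNS.items()}

for sample in ('this number .977 is a float', 'no of beds 8.please inform'):
    print(sample)
    for name, parts in compare(sample).items():
        print('  %-15s %r' % (name, parts))
```

The stricter pattern leaves 'no of beds 8.please inform' unsplit, so it trades missed splits for fewer wrong ones. Which side of that trade-off is acceptable depends on the data.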