Fixing incorrect separators in a string

Posted 2024-05-13 18:41:17


Given an incorrectly spaced string:

s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

I want to output the corrected string, like this:

s="rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed"

If I try to remove all the separators with:

re.sub("\\s*","",s)

it gives me "rateimpliesdepreciation.Thestraightlinesshoweffectivelineartimetrendsinthenominal(dashed", which is not what I want.
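
To make the problem concrete, here is a minimal sketch using only the standard `re` module. It shows that deleting whitespace removes the legitimate word boundaries too, while collapsing whitespace keeps them but cannot tell that "Th e" was meant to be one word. The variable names `no_spaces` and `collapsed` are illustrative and not part of the original question.

```python
import re

s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

# Deleting every run of whitespace glues all the words together.
no_spaces = re.sub(r"\s+", "", s)

# Collapsing runs of whitespace to one space keeps real word boundaries,
# but the stray space inside "Th e" and "eff ective" is still there.
collapsed = re.sub(r"\s+", " ", s).strip()

print(no_spaces)
print(collapsed)
```

A regular expression alone has no notion of which spaces are spurious, which is why some form of dictionary lookup is needed.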


1 answer
User
#1 · posted 2024-05-13 18:41:17

You can try checking the spelling of each word, for example with pyspellchecker

(pip install pyspellchecker)

from spellchecker import SpellChecker
spell = SpellChecker()

s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
splitted_s = s.split(' ')
splitted_s = list(filter(None, splitted_s))  # drop the empty strings produced by consecutive spaces
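
As a side note, the split-and-filter step above can probably be written more compactly. The sketch below assumes the input contains only ordinary spaces as separators; `str.split()` with no argument also splits on tabs and newlines, so the two forms can differ on other inputs. The names `tokens_filtered` and `tokens_default` are illustrative.

```python
s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

# The approach from the answer: split on single spaces, then drop empty strings.
tokens_filtered = list(filter(None, s.split(' ')))

# str.split() with no argument splits on runs of whitespace and
# never yields empty strings, so no filtering is needed.
tokens_default = s.split()

print(tokens_filtered == tokens_default)
```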

Then check whether a word is not in the dictionary while the previous word + word is:

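    valid_s = [splitted_s[0]]
    for i in range(1, len(splitted_s)):
        word = splitted_s[i]
        # Use the last kept token so a word split into 3+ pieces can still merge.
        previous_word = valid_s[-1]
        valid_s.append(word)
        # spell.unknown() returns the subset of the given words missing from the dictionary.
        if spell.unknown([word]):
            if not spell.unknown([(previous_word + word).lower()]):
                # Replace the two fragments with the joined word.
                valid_s.pop()
                valid_s.pop()
                valid_s.append(previous_word + word)

    print(' '.join(valid_s))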

 >>>rate implies depreciation. Th e straight lines show effective linear time trends in the nominal (dashed

But here, because "e" exists in the dictionary as a word, "Th" and "e" are not joined.

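So you can also compare word frequencies, and join the previous word with the word when previous word + word is (far) more frequent in the dictionary than the word alone: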

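    valid_s = [splitted_s[0]]
    for i in range(1, len(splitted_s)):
        word = splitted_s[i]
        # Use the last kept token so a word split into 3+ pieces can still merge.
        previous_word = valid_s[-1]
        valid_s.append(word)
        # Join when the combined word is more frequent than the fragment alone.
        # (Newer pyspellchecker releases may expose this as word_usage_frequency.)
        if spell.word_probability(word.lower()) < spell.word_probability((previous_word + word).lower()):
            valid_s.pop()
            valid_s.pop()
            valid_s.append(previous_word + word)

    print(' '.join(valid_s))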

 >>>rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed
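
The frequency rule can also be packaged as a reusable function. The following is a minimal sketch, not the answer's exact code: `merge_split_words`, the `min_ratio` parameter, and the `toy_frequency` table are all hypothetical names introduced here, and the frequencies are made-up numbers chosen only so the example runs without any third-party package. Unlike the loops above, it always compares against the last merged token.

```python
from typing import Callable, List


def merge_split_words(
    tokens: List[str],
    frequency: Callable[[str], float],
    min_ratio: float = 1.0,
) -> List[str]:
    """Join a token to the previous one when the joined word is more frequent."""
    merged: List[str] = []
    for token in tokens:
        if merged:
            candidate = merged[-1] + token
            # Compare lower-cased forms, as the answer's loops do.
            if frequency(candidate.lower()) > min_ratio * frequency(token.lower()):
                merged[-1] = candidate
                continue
        merged.append(token)
    return merged


# Hypothetical toy frequencies for illustration only (not real corpus data).
toy_frequency = {"the": 0.05, "e": 0.0001, "effective": 0.0002, "show": 0.001}


def toy_lookup(word: str) -> float:
    # Words missing from the toy table count as never seen.
    return toy_frequency.get(word, 0.0)


tokens = "Th e straight lines show eff ective".split()
result = " ".join(merge_split_words(tokens, toy_lookup))
print(result)
```

With pyspellchecker installed, the frequency method used in the answer (for example `spell.word_probability`) could presumably be passed in place of `toy_lookup`. Raising `min_ratio` above 1 is one way to express the "(far) more frequent" condition, at the cost of missing merges whose fragments are themselves common words.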
