KeyError: 'Corrections' when parsing text data in a pandas column with GingerIt in Python

Posted 2024-04-26 03:06:44


##!pip install gingerit

from gingerit.gingerit import GingerIt
jd = []
for txt in list(data['Job Description']):
   jd.append(GingerIt().parse(txt)['result'])
data['jd'] = jd

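I want to correct spelling and grammar errors in a text feature/column of a pandas DataFrame with about 3,000 rows. Each row contains 4-5 sentences. So I used GingerIt() from gingerit.gingerit, and I got this error: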

KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
     5           jd = []
     6           for txt in list(datajd['Job Description']):
---->7           jd.append(GingerIt().parse(txt)['result'])


/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
      26         )
      27         data = request.json()
 ---> 28         return self._process_data(text, data)
      29 
      30     @staticmethod

 /usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
      38         corrections = []
      39 
 ---> 40         for suggestion in reversed(data["Corrections"]):
      41             start = suggestion["From"]
      42             end = suggestion["To"]

 KeyError: 'Corrections'

1 answer
User
#1 · Posted 2024-04-26 03:06:44

GingerIt has a paid premium service based on an API key, so the free version cannot handle sentences longer than 300 characters.

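You can use any sentence splitter you like; here I use pysbd, the Pragmatic Sentence Boundary Disambiguation module (install it with pip install pysbd). Then run each sentence shorter than 300 characters through Ginger and join the results.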

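If your data can contain longer sentences and you still want to process them, split those sentences further. Here I suggest a regex like [^;:\n•]+[;,:\n•]?\s*, which breaks the text at ;, :, a newline, or a bullet point (•); you can add more characters as needed.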
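To see how that pattern cuts a long sentence, here is a small sketch that uses only the standard library. The sample string is made up for illustration and is not taken from the asker's data.

```python
import re

# Same pattern as suggested above: a run of characters that are not
# ; : newline or a bullet, then at most one delimiter, then trailing whitespace.
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

# Made-up sample text, purely for illustration.
sample = "Requirements: Python; SQL\nGood communication skills"

pieces = re.findall(subsegment_re, sample)
for piece in pieces:
    print(repr(piece))

# Each piece keeps its delimiter and trailing whitespace, so for this
# sample an empty-string join gives back the original text.
print("".join(pieces) == sample)
```

One thing to watch for: a delimiter that directly follows another delimiter (for example ;;) or that starts the text is not captured by this pattern, so it is dropped when the pieces are joined back together.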

from gingerit.gingerit import GingerIt # pip install gingerit
import pandas as pd
import pysbd, re # pip install pysbd

file  = r'test.csv'

segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)

subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            subsegments = re.findall(subsegment_re, sentence)
            # Give up if the sentence could not be split, or if any piece is still too long
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # print(f'Skipped: {sentence}')  # no grammar check possible
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)
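If some rows still raise KeyError: 'Corrections' after splitting, one option is to fall back to the original text instead of stopping the whole run. The following is only a sketch under assumptions: correct_or_keep, its parse parameter, and fake_parse are hypothetical names introduced here, and fake_parse is a stand-in so the example runs without any network call.

```python
def correct_or_keep(text, parse, limit=300):
    """Return parse(text)['result'], or the original text if it cannot be checked.

    `parse` is any callable shaped like GingerIt().parse; `limit` is the
    300-character threshold mentioned in the answer. Both names are
    hypothetical and exist only for this illustration.
    """
    if len(text) >= limit:
        return text  # too long for the free service, leave unchanged
    try:
        return parse(text)['result']
    except KeyError:
        # The response had no 'Corrections' key, as in the traceback above.
        return text


# Stand-in parser for a dry run (hypothetical, not part of gingerit).
def fake_parse(text):
    if "boom" in text:
        raise KeyError('Corrections')
    return {'result': text.capitalize()}


print(correct_or_keep("hello world", fake_parse))  # corrected by the stand-in
print(correct_or_keep("boom", fake_parse))         # falls back to the input
```

With the real library, GingerIt().parse would be passed as parse. Whether silently keeping the uncorrected text is acceptable depends on the use case, so logging the skipped rows may be worth adding.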
