KeyError: 'Corrections' when parsing text data in pandas with GingerIt in Python

KeyError                                  Traceback (most recent call last)
<ipython-input-25-ea5c757d88d2> in <module>()
      5 jd = []
      6 for txt in list(datajd['Job Description']):
----> 7     jd.append(GingerIt().parse(txt)['result'])

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in parse(self, text, verify)
     26         )
     27         data = request.json()
---> 28         return self._process_data(text, data)
     29
     30     @staticmethod

/usr/local/lib/python3.7/dist-packages/gingerit/gingerit.py in _process_data(self, text, data)
     38         corrections = []
     39
---> 40         for suggestion in reversed(data["Corrections"]):
     41             start = suggestion["From"]
     42             end = suggestion["To"]

KeyError: 'Corrections'

1 Answer

User

#1 · Posted 2024-04-26 03:06:44

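GingerIt is backed by a paid premium service that requires an API key, so the free version cannot process sentences longer than 300 characters. For longer input the response presumably comes back without the 'Corrections' key, which is what raises the KeyError in the traceback.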

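You can split the text with any sentence splitter you like; here I use [pysbd, the Pragmatic Sentence Boundary Disambiguation module][1] (install it with pip install pysbd). Then run only the sentences shorter than 300 characters through Ginger and join the results.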

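If your text can contain longer sentences that you still want to check, subdivide them further. Here I suggest a regex such as [^;:\n•]+[;,:\n•]?\s*, which splits at ;, :, newlines and bullet points; add more characters as needed.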
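
To see what the suggested pattern actually produces, here is a minimal sketch that applies it to a short string. The helper name split_subsegments and the sample text are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Pattern from the answer: a run of non-delimiter characters, then an
# optional delimiter, then any trailing whitespace.
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'


def split_subsegments(sentence):
    """Return the pieces of `sentence` matched by subsegment_re (hypothetical helper)."""
    return re.findall(subsegment_re, sentence)


pieces = split_subsegments("Requirements: Python; SQL")
print(pieces)  # ['Requirements: ', 'Python; ', 'SQL']
```

Each piece keeps its trailing delimiter and whitespace, so joining the pieces with an empty string normally rebuilds the sentence. One caveat: re.findall skips a delimiter that directly follows another one (for example a bullet right after a newline), so in that case the joined result drops the second delimiter.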

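from gingerit.gingerit import GingerIt  # pip install gingerit
import pandas as pd
import pysbd  # pip install pysbd
import re

file = r'test.csv'

# Sentence splitter; clean=False leaves the input text unmodified
segmentor = pysbd.Segmenter(language="en", clean=False)
data = pd.read_csv(file)

# Splits an over-long sentence at ;, :, newlines and bullet points
subsegment_re = r'[^;:\n•]+[;,:\n•]?\s*'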

def runGinger(par):
    fixed = []
    for sentence in segmentor.segment(par):
        if len(sentence) < 300:
            fixed.append(GingerIt().parse(sentence)['result'])
        else:
            subsegments = re.findall(subsegment_re, sentence)
            if len(subsegments) == 1 or any(len(v) >= 300 for v in subsegments):
                # Could not split into short enough pieces: no grammar check possible
                fixed.append(sentence)
            else:
                res = []
                for s in subsegments:
                    res.append(GingerIt().parse(s)['result'])
                fixed.append("".join(res))
    return " ".join(fixed)

data['jd'] = data['Job Description'].apply(runGinger)
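
As an extra safeguard, the call can also be wrapped so that a KeyError like the one in the question leaves the text unchanged instead of aborting the whole loop. This is only a sketch: correct_or_keep is a hypothetical helper, and it assumes parse behaves like GingerIt().parse, that is, it returns a dict with a 'result' key.

```python
def correct_or_keep(parse, text, limit=300):
    """Return the corrected text, or `text` unchanged if it cannot be checked.

    `parse` is any callable shaped like GingerIt().parse (assumption: it
    returns a dict with a 'result' key). `limit` is the 300-character
    threshold discussed in the answer.
    """
    if len(text) >= limit:
        # Too long for the free service: skip the request entirely
        return text
    try:
        return parse(text)['result']
    except KeyError:
        # The response lacked an expected key (e.g. 'Corrections')
        return text
```

Inside runGinger this would be called as correct_or_keep(GingerIt().parse, sentence) in place of GingerIt().parse(sentence)['result'].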
