Error when looping over texts in a function that aggregates text values

def preprocess (texts):
    case = truecase.get_true_case(texts)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    doc = preprocess(texts)
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) +1
    for token in doc.ents:
        entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-200347d5cac5> in <module>()
      4     "Play it again, Sam"
      5 ]
----> 6 summarize_texts(docs)

5 frames
<ipython-input-16-08c879553d6e> in summarize_texts(texts)
      1 def summarize_texts(texts):
----> 2     doc = preprocess(texts)
      3     actions = {}
      4     entities = {}
      5     for token in doc:

<ipython-input-12-fccf767830b1> in preprocess(texts)
      1 def preprocess (texts):
----> 2     case = truecase.get_true_case(texts)
      3     doc = nlp(case)
      4     return doc

/usr/local/lib/python3.6/dist-packages/truecase/__init__.py in get_true_case(sentence, out_of_vocabulary_token_option)
     14     return get_truecaser().get_true_case(
     15         sentence,
---> 16         out_of_vocabulary_token_option=out_of_vocabulary_token_option)

/usr/local/lib/python3.6/dist-packages/truecase/TrueCaser.py in get_true_case(self, sentence, out_of_vocabulary_token_option)
     97             as-is: Returns OOV tokens as is
     98         """
---> 99         tokens = self.tknzr.tokenize(sentence)
    100
    101         tokens_true_case = []

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in tokenize(self, text)
    293         """
    294         # Fix HTML character entities:
--> 295         text = _replace_html_entities(text)
    296         # Remove username handles
    297         if self.strip_handles:

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text, keep, remove_illegal, encoding)
    257         return "" if remove_illegal else match.group(0)
    258
--> 259     return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
    260
    261

TypeError: expected string or bytes-like object

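2 answers

User

Answer 1 · edited 2024-05-15 10:18:16

Try looping over the texts with a for loop and calling preprocess on each item, rather than passing the whole list at once: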

for i in texts:
    doc = preprocess(i)

User

Answer 2 · edited 2024-05-15 10:18:16

It looks like the problem is that truecase.get_true_case(texts) expects a string or bytes-like argument, but you are passing it a list of strings.
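The last frame of the traceback is a regular-expression substitution (ENT_RE.sub) inside NLTK's tokenizer, and Python's re module only accepts a str or bytes subject. A minimal sketch that reproduces the same kind of failure with the standard library alone; the pattern and the helper name below are made up for illustration and are not NLTK's actual ones:

```python
import re

# Hypothetical stand-in for a compiled pattern such as NLTK's ENT_RE.
ENTITY_PATTERN = re.compile(r"&\w+;")

def strip_entities(text):
    # Pattern.sub() requires a str (or bytes) subject; a list raises TypeError.
    return ENTITY_PATTERN.sub("", text)

# A single string works.
print(strip_entities("Play it again, Sam &amp; co"))

# A list of strings does not.
try:
    strip_entities(["Play it again, Sam"])
except TypeError as err:
    print("TypeError:", err)
```

The exact wording of the message varies slightly between Python versions, but the exception type is the same as in the traceback above.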

You need to iterate over texts and preprocess each item in the list separately:

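import truecase

# `nlp` is assumed to be the spaCy pipeline already loaded in the question's notebook.

def preprocess(text):
    # get_true_case expects a single string, not a list of strings
    case = truecase.get_true_case(text)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                # read the count under the same key (the lemma) that is written
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for ent in doc.ents:
            # note: this keeps only the most recent entity text for each label
            entities[ent.label_] = [ent.text]
    return {
        'actions': actions,
        'entities': entities
    }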
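Note that assigning entities[label] = [text] keeps only the last entity seen for each label. If the goal is to aggregate across all texts, one option is a Counter for the verbs and a defaultdict(list) for the entities. The sketch below is an assumption about the intended output, not part of the original answer. It takes the preprocess function as a parameter so it can be tried without truecase or spaCy installed; FakeDoc and fake_preprocess are hypothetical stand-ins that only mimic the pos_, lemma_, ents, label_, and text attributes used above.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace

def summarize_texts(texts, preprocess):
    """Aggregate verb lemmas and entity texts over every item in `texts`."""
    actions = Counter()
    entities = defaultdict(list)
    for text in texts:
        doc = preprocess(text)  # one string at a time
        for token in doc:
            if token.pos_ == "VERB":
                actions[token.lemma_] += 1
        for ent in doc.ents:
            entities[ent.label_].append(ent.text)
    return {"actions": dict(actions), "entities": dict(entities)}

class FakeDoc(list):
    """Hypothetical stand-in for a spaCy Doc: iterable tokens plus .ents."""
    def __init__(self, tokens, ents):
        super().__init__(tokens)
        self.ents = ents

def fake_preprocess(text):
    # Demonstration only: treats "play" as the only verb and "Sam" as the only entity.
    words = text.replace(",", "").split()
    tokens = [
        SimpleNamespace(pos_="VERB" if w.lower() == "play" else "X", lemma_=w.lower())
        for w in words
    ]
    ents = [SimpleNamespace(label_="PERSON", text=w) for w in words if w == "Sam"]
    return FakeDoc(tokens, ents)

summary = summarize_texts(["Play it again, Sam", "Sam, play it"], fake_preprocess)
print(summary)
# {'actions': {'play': 2}, 'entities': {'PERSON': ['Sam', 'Sam']}}
```

With the real pipeline, the preprocess function from the answer would be passed in place of fake_preprocess.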
