Error when looping over a function that aggregates text values

Posted 2024-03-28 13:17:45


I'm having trouble with my function. It is designed to aggregate word tokens into dictionaries.

The code is as follows:

def preprocess (texts):
   case = truecase.get_true_case(texts)
   doc = nlp(case)
   return doc

def summarize_texts(texts):
    doc = preprocess(texts) 
    actions = {}
    entities = {}
    for token in doc:
        if token.pos_ == "VERB":
            actions[token.lemma_] = actions.get(token.text, 0) +1
    for token in doc.ents:
         entities[token.label_] = [token.text]
    return {
        'actions': actions,
        'entities': entities
    }

The problem I'm running into is that the function works as expected for a single input:

summarize_texts("Play something by Billie Holiday")

{'actions': {'play': 1}, 'entities': {'PERSON': ['Billie']}}

But the goal is to be able to pass it a list or a CSV file and have it aggregate everything.

When I try:

docs = [
    "Play something by Billie Holiday",
    "Set a timer for five minutes",
    "Play it again, Sam"
]
summarize_texts(docs)

I get an error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-200347d5cac5> in <module>()
      4     "Play it again, Sam"
      5 ]
----> 6 summarize_texts(docs)

5 frames
<ipython-input-16-08c879553d6e> in summarize_texts(texts)
      1 def summarize_texts(texts):
----> 2     doc = preprocess(texts)
      3     actions = {}
      4     entities = {}
      5     for token in doc:

<ipython-input-12-fccf767830b1> in preprocess(texts)
      1 def preprocess (texts):
----> 2    case = truecase.get_true_case(texts)
      3    doc = nlp(case)
      4    return doc

/usr/local/lib/python3.6/dist-packages/truecase/__init__.py in get_true_case(sentence, out_of_vocabulary_token_option)
     14     return get_truecaser().get_true_case(
     15         sentence,
---> 16         out_of_vocabulary_token_option=out_of_vocabulary_token_option)

/usr/local/lib/python3.6/dist-packages/truecase/TrueCaser.py in get_true_case(self, sentence, out_of_vocabulary_token_option)
     97             as-is: Returns OOV tokens as is
     98         """
---> 99         tokens = self.tknzr.tokenize(sentence)
    100 
    101         tokens_true_case = []

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in tokenize(self, text)
    293         """
    294         # Fix HTML character entities:
--> 295         text = _replace_html_entities(text)
    296         # Remove username handles
    297         if self.strip_handles:

/usr/local/lib/python3.6/dist-packages/nltk/tokenize/casual.py in _replace_html_entities(text, keep, remove_illegal, encoding)
    257         return "" if remove_illegal else match.group(0)
    258 
--> 259     return ENT_RE.sub(_convert_entity, _str_to_unicode(text, encoding))
    260 
    261 

TypeError: expected string or bytes-like object

I expected the following result:

{'actions': {'play': 2, 'set': 1}, 'entities': {'PERSON': ['Billie', 'Sam'], 'TIME': ['five minutes']}}

I'm not sure what is wrong with my function.


2 Answers

Try using a for loop over texts before calling preprocess:

for i in texts:
    doc = preprocess(i) 
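
A loop like this only works if texts is a list of strings, and the counting has to happen inside the loop as well, otherwise only the last doc is kept. If the function should accept a single string, a list, or a CSV column, the input can be normalized first. A minimal sketch, assuming helper names as_text_list and read_text_column and a column called "utterance", none of which come from the original post:

```python
import csv
import io


def as_text_list(texts):
    """Return a list whether given one string or an iterable of strings."""
    if isinstance(texts, (str, bytes)):
        return [texts]
    return list(texts)


def read_text_column(csv_file, column):
    """Collect the values of one named column from an open CSV file."""
    return [row[column] for row in csv.DictReader(csv_file)]


# Demo with an in-memory CSV; the column name "utterance" is made up.
sample = io.StringIO(
    "utterance\n"
    "Play something by Billie Holiday\n"
    "Set a timer for five minutes\n"
)
print(as_text_list("Play it again, Sam"))
print(read_text_column(sample, "utterance"))
```

Either result can then be iterated, passing one string at a time to preprocess.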

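It looks like your problem is that truecase.get_true_case(texts) expects a string or bytes-like argument, and you are passing it a list of strings.

You need to iterate over texts and preprocess each item in the list separately: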

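# Assumes truecase is imported and nlp is a loaded spaCy pipeline, as in the question.
def preprocess(text):
    # get_true_case expects a single string, not a list
    case = truecase.get_true_case(text)
    doc = nlp(case)
    return doc

def summarize_texts(texts):
    actions = {}
    entities = {}
    for text in texts:
        doc = preprocess(text)
        for token in doc:
            if token.pos_ == "VERB":
                # read and write the same key (the lemma) so counts accumulate
                actions[token.lemma_] = actions.get(token.lemma_, 0) + 1
        for ent in doc.ents:
            # append instead of assigning, so earlier entities are kept
            entities.setdefault(ent.label_, []).append(ent.text)
    return {
        'actions': actions,
        'entities': entities
    }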
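
To get the result the question expects, verb counts have to be read and written under the same key, and entities have to be appended to a list per label rather than reassigned. The aggregation step can be tried without loading spaCy or truecase. This is only a sketch: aggregate is a hypothetical helper, and the sample data is hand-written to stand in for (lemma, part of speech) and (text, label) pairs, not real model output.

```python
from collections import Counter, defaultdict


def aggregate(analyzed):
    """Merge per-text (tokens, entities) pairs into one summary dict."""
    actions = Counter()
    entities = defaultdict(list)
    for tokens, ents in analyzed:
        for lemma, pos in tokens:
            if pos == "VERB":
                actions[lemma] += 1
        for text, label in ents:
            entities[label].append(text)
    return {"actions": dict(actions), "entities": dict(entities)}


# Hand-written stand-ins for analyzed texts; not real spaCy output.
analyzed = [
    ([("play", "VERB"), ("something", "PRON")], [("Billie", "PERSON")]),
    ([("set", "VERB"), ("timer", "NOUN")], [("five minutes", "TIME")]),
    ([("play", "VERB"), ("it", "PRON")], [("Sam", "PERSON")]),
]
print(aggregate(analyzed))
# {'actions': {'play': 2, 'set': 1}, 'entities': {'PERSON': ['Billie', 'Sam'], 'TIME': ['five minutes']}}
```

With real input, the pairs would come from token.lemma_, token.pos_, ent.text and ent.label_ on each doc; which tokens and entities the model actually tags depends on the pipeline in use.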
