Catching exceptions and filling in a value with the pandas apply() function?

from newspaper import Article
import numpy as np
import requests

def text(link):
    article = Article(link)
    try:
        article.download()
        article = article.parse()
    except requests.exceptions.HTTPError:
        return np.nan
    return article

df['text'] = df['links'].apply(text)

title                                                           Link
Inside tiny tubes, water turns solid when it should be boiling  http://news.mit.edu/2016/carbon-nanotubes-water-solid-boiling-1128
Four MIT students named 2017 Marshall Scholars                  http://news.mit.edu/2016/four-mit-students-marshall-scholars-11282
Saharan dust in the wind                                        http://news.mit.edu/2016/saharan-dust-monsoons-11231
The science of friction on graphene                             http://news.mit.edu/2016/sliding-flexible-graphene-surfaces-1123

import numpy as np
from newspaper import Article, ArticleException
import requests

def text_extractor2(link):
    article = Article(link)
    try:
        article.download()
    except ArticleException:
        article = article.parse()
        return np.nan
    return article

df['text'] = df['Link'].apply(text_extractor2)
df

   title                                              Link                                               text
0  Inside tiny tubes, water turns solid when it s...  http://news.mit.edu/2016/carbon-nanotubes-wate...  <newspaper.article.Article object at 0x10c8a0320>
1  Four MIT students named 2017 Marshall Scholars     http://news.mit.edu/2016/four-mit-students-mar...  <newspaper.article.Article object at 0x1070df0f0>
2  Saharan dust in the wind                           http://news.mit.edu/2016/saharan-dust-monsoons...  <newspaper.article.Article object at 0x107b035c0>
3  The science of friction on graphene                http://news.mit.edu/2016/sliding-flexible-grap...  <newspaper.article.Article object at 0x10c8bf8d0>

1 answer

User

#1 · Posted 2024-05-16 06:27:27

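As I understand it, you want the rows that correspond to broken links to have a NaN value in the text column. If you have not imported numpy yet, add that first: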

import numpy as np

I am assuming the exception raised is HTTPError, and I will use NumPy for its NaN value:

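from newspaper import Article
import requests

def text(link):
    article = Article(link)
    try:
        article.download()
        article.parse()  # parses in place; the result is stored on the article
    except requests.exceptions.HTTPError:
        # Broken link: mark the row as missing
        return np.nan
    return article.text  # the extracted article body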

Then use pandas apply:

df['text'] = df['links'].apply(text)

The text column should then contain missing values for broken links and the article text for valid links.
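The try/except-inside-apply pattern described above can be tried offline with a stand-in for the download step. This is only a minimal sketch: `fake_fetch`, `text_or_nan`, and the example.com URLs are hypothetical names made up for illustration, and `ValueError` stands in for whatever exception the real download raises.

```python
import numpy as np
import pandas as pd


def fake_fetch(link):
    """Hypothetical stand-in for a real download; fails for 'broken' links."""
    if "broken" in link:
        raise ValueError(f"cannot fetch {link}")
    return f"text of {link}"


def text_or_nan(link, fetch=fake_fetch, errors=(ValueError,)):
    """Return the fetched text, or NaN when fetching raises one of `errors`."""
    try:
        return fetch(link)
    except errors:
        return np.nan


df = pd.DataFrame(
    {"links": ["http://example.com/ok", "http://example.com/broken"]}
)
# apply calls text_or_nan once per row; failures become NaN instead of
# aborting the whole column
df["text"] = df["links"].apply(text_or_nan)
print(df["text"].isna().tolist())  # [False, True]
```

Because the exception is caught inside the function that apply calls, one bad link only affects its own row. Rows that failed can then be dropped with `df.dropna(subset=["text"])` if needed.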

Without newspaper, you can change your function to catch the exception raised by ur.urlopen(url).read(), for example:

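# Assumption: `ur` is an alias for urllib.request (Python 3)
import urllib.request as ur

import numpy as np
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

def text_extractor(url):
    try:
        html = ur.urlopen(url).read()
    except ur.HTTPError:
        # Broken link: mark the row as missing
        return np.nan

    soup = BeautifulSoup(html, 'lxml')
    # Drop script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()
    # Extract the text once, after all script/style tags are removed
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = ' '.join(chunk for chunk in chunks if chunk)
    sentences = ', '.join(sent_tokenize(str(text.strip('\'"'))))
    return sentences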
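The whitespace cleanup in the middle of the function above can be looked at in isolation. The sketch below pulls those three lines into a helper; the name `collapse_whitespace` and the sample string are made up for illustration and are not part of the original answer.

```python
def collapse_whitespace(text):
    """Strip each line, split on runs of two spaces, and drop empty pieces."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return " ".join(chunk for chunk in chunks if chunk)


raw = "  Title  \n\n  First sentence.    Second sentence.\n"
print(collapse_whitespace(raw))  # Title First sentence. Second sentence.
```

Blank lines and the empty strings produced by splitting on double spaces are filtered out by the `if chunk` test, so the result is a single line with one space between pieces.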
