Web scraping with BeautifulSoup: 'NoneType' object has no attribute 'get_text'

Posted 2024-03-29 13:50:40


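I'm trying to learn BeautifulSoup by scraping the text of New York Times politics articles. With my current code it does scrape two paragraphs, but after that it fails with AttributeError: 'NoneType' object has no attribute 'get_text'. I've looked up this error, and some threads claim it comes from using legacy functions from BeautifulSoup 3. That doesn't seem to be the problem here. Any ideas?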

Code:

import requests
from urllib import request, response, error, parse
from urllib.request import urlopen
from bs4 import BeautifulSoup




url = "https://www.nytimes.com/2020/02/10/us/politics/trump-manchin-impeachment.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")


title = soup.title
titleText = title.get_text()

body = soup.find('article', class_='css-1vxca1d')

section = soup.find('section', class_="css-1r7ky0e")
for elem in section:
    div1 = elem.findAll('div')
    for x in div1:
        div2 = elem.findAll('div')
        for i in div2:
            text = i.find('p').get_text()
            print (text)
            print("----------")

Output:

WASHINGTON — Senator Joe Manchin III votes with President Trump more than any other Democrat in the Senate. But his vote last week to convict Mr. Trump of impeachable offenses has eclipsed all of that, earning him the rage of a president who coveted a bipartisan acquittal.
----------
“Munchkin means that you’re small, right?” he said. “I’m bigger than him — of course he has me by weight, now, he has more volume than I have by about 30 or 40 pounds. I’m far from being weak and pathetic, and I’m far from being a munchkin, and I still want him to succeed as president of the United States.”
----------
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/project2/webscrapper.py", line 25, in <module>
    text = i.find('p').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

Process finished with exit code 1


1 answer
User
#1 · Posted 2024-03-29 13:50:40

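As I mentioned in the comments, when you run text = i.find('p').get_text(), you are actually performing two operations.

First you look up a <p> tag with i.find('p'), and then you read its text with get_text(). At some point i.find('p') returns None, so the call becomes None.get_text(), which raises the error.

You can see this because the error message tells you 'NoneType' object has no attribute 'get_text'.

From the docs: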

If find_all() can’t find anything, it returns an empty list. If find() can’t find anything, it returns None
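To make the difference concrete, here is a minimal sketch. The HTML string is a made-up stand-in for the article markup (it is not the real New York Times page), in which the second <div> has no <p> child:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only: the second <div> has no <p>.
html = (
    "<section>"
    "<div><p>First paragraph.</p></div>"
    "<div><span>No paragraph here.</span></div>"
    "</section>"
)
soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")
first_hit = divs[0].find("p")        # a Tag, so get_text() works
missing = divs[1].find("p")          # None, so missing.get_text() would raise AttributeError
missing_all = divs[1].find_all("p")  # an empty list, safe to loop over

print(first_hit.get_text())
print(missing is None)
print(len(missing_all))
```

Calling get_text() on missing would reproduce the same AttributeError as in the question.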

A quick fix is to check that i.find('p') did not return None before calling get_text():

# ...
for elem in section:
    div1 = elem.findAll('div')
    for x in div1:
        div2 = elem.findAll('div')
        for i in div2:
            p = i.find('p')
            # find() returns None when this <div> has no <p>, so guard the call
            if p is not None:
                text = p.get_text()
                print(text)
                print("     ")

Also note that find() returns only the first <p> it finds; any other <p> tags inside the same element are ignored.
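If the goal is to get every paragraph rather than only the first one per <div>, one possible approach is to call find_all('p') on the section directly. The sketch below again uses made-up markup, not the real page, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """<section>
  <div><p>One.</p><p>Two.</p></div>
  <div><span>No paragraph.</span></div>
  <div><p>Three.</p></div>
</section>"""
soup = BeautifulSoup(html, "html.parser")

section = soup.find("section")
# find() may return None, so guard before searching inside the section.
if section is not None:
    # find_all() searches all descendants and returns an empty list if nothing matches.
    paragraphs = [p.get_text() for p in section.find_all("p")]
else:
    paragraphs = []

for text in paragraphs:
    print(text)
    print("----------")
```

Because find_all() never returns None, the loop needs no per-paragraph check; only the initial find() for the section has to be guarded.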
