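I'm trying to learn BeautifulSoup by scraping the text of a New York Times politics article. With the code I have now it does scrape two paragraphs, but after that it throws AttributeError: 'NoneType' object has no attribute 'get_text'. I've looked up this error, and some threads claim it comes from using legacy functions from BeautifulSoup 3. That doesn't seem to be the problem here, though. Any ideas?
Code: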
import requests
from urllib import request, response, error, parse
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/2020/02/10/us/politics/trump-manchin-impeachment.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
title = soup.title
titleText = title.get_text()
body = soup.find('article', class_='css-1vxca1d')
section = soup.find('section', class_="css-1r7ky0e")
for elem in section:
    div1 = elem.findAll('div')
    for x in div1:
        div2 = elem.findAll('div')
        for i in div2:
            text = i.find('p').get_text()
            print (text)
            print("----------")
Output:
WASHINGTON — Senator Joe Manchin III votes with President Trump more than any other Democrat in the Senate. But his vote last week to convict Mr. Trump of impeachable offenses has eclipsed all of that, earning him the rage of a president who coveted a bipartisan acquittal.
----------
“Munchkin means that you’re small, right?” he said. “I’m bigger than him — of course he has me by weight, now, he has more volume than I have by about 30 or 40 pounds. I’m far from being weak and pathetic, and I’m far from being a munchkin, and I still want him to succeed as president of the United States.”
----------
Traceback (most recent call last):
File "/Users/user/PycharmProjects/project2/webscrapper.py", line 25, in <module>
text = i.find('p').get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Process finished with exit code 1
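As I mentioned in the comments, when you run text = i.find('p').get_text() you are actually doing two things: first looking up a <p> tag, then getting its text. At some point i.find('p') returns None, so the call becomes None.get_text(), which raises an error. You can see this in the error message itself: 'NoneType' object has no attribute 'get_text'.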
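From the docs: find() returns None when it can't find anything.
The quick fix is to check that i.find('p') does not return None before calling get_text() on the result.
Also note that find() only returns the first <p> and ignores any others that may be present.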
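The None check described above might look something like the sketch below. It runs on a small inline HTML string rather than the live article, so sample_html, first_paragraphs and all_paragraphs are made-up names for illustration and not part of the original code. It also shows find_all('p') as one possible way to pick up every paragraph instead of only the first one in each div.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the article page (an assumption,
# not the real NYT structure): one div has no <p>, another has two.
sample_html = """
<section>
  <div><p>First paragraph.</p></div>
  <div><span>No paragraph here.</span></div>
  <div><p>Second paragraph.</p><p>Third paragraph.</p></div>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")
section = soup.find("section")

# Guarded version of the original loop: skip divs where find() returns None.
first_paragraphs = []
for div in section.find_all("div"):
    p = div.find("p")  # None when this div contains no <p>
    if p is not None:
        first_paragraphs.append(p.get_text())

# find() only yields the first <p> per div; find_all() returns every match
# (and an empty list, never None, when there are no matches).
all_paragraphs = [p.get_text() for p in section.find_all("p")]

print(first_paragraphs)
print(all_paragraphs)
```

With this sample, the guarded loop skips the div that has no <p>, while the find_all('p') version also picks up the third paragraph that find() would miss.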