Why doesn't Beautiful Soup print text from a website (Reuters), even though the text is clearly present on the page?

Posted 2024-05-12 15:31:01


I am scraping the date from this page: https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508

When I try to get the date shown in the header/grey text area, it does not print:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508")
soup = BeautifulSoup(page.content, 'lxml')

headlines = soup.find_all('time')
for headline in headlines:
    headline_text = headline.get_text(strip=True)
    print("done:", headline_text)

This code outputs:

done: 
done: 
done: Updated

The picture below shows that the text is clearly there, so why is "June 1, 2018" not printed?

[screenshot of the webpage]

I have tried both html.parser and lxml, but neither works.


2 Answers

The website loads its content in a different way:

View URL is: https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508

This view URL loads its content dynamically, but by using the browser developer tools ("Network" tab) I could see that there is also an instant-article URL path.


Actual URL to use: https://www.reuters.com/article/instant-article/idUSKCN1IX508

So what has to be done is to take the last part of the view URL, i.e. idUSKCN1IX508, and use it in the actual URL for the get() request. The change is as follows:
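Incidentally, extracting that last part does not have to be done by hand. A minimal sketch (pure string handling, no network request, and assuming the article ID is always the final hyphen-separated token of the view URL):

```python
# View URL of the article as given in the question.
view_url = "https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508"

# The article ID is the text after the final "-" in the URL.
article_id = view_url.rsplit("-", 1)[-1]   # "idUSKCN1IX508"

# Build the instant-article URL from it.
instant_url = f"https://www.reuters.com/article/instant-article/{article_id}"
print(instant_url)
```

This keeps the scraper working for other Reuters articles without hard-coding each instant-article URL, as long as the assumption about the URL shape holds.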

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/article/instant-article/idUSKCN1IX508")
soup = BeautifulSoup(page.content, "html.parser")
soup.find_all("time")
Out[12]: 
[<time class="op-published" datetime="2018-06-01T14:21:41Z"></time>,
 <time class="op-modified" datetime="2018-06-01T14:19:02Z"></time>]

Additionally, to get the times as text (assigning the find_all() result to obj_list first):

obj_list = soup.find_all("time")
for item in obj_list:
    print("DateTime of the Article   {}".format(item.get("datetime")))

DateTime of the Article   2018-06-01T14:21:41Z
DateTime of the Article   2018-06-01T14:19:02Z
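Since the datetime attribute is an ISO-8601 timestamp, it can also be parsed into a proper datetime object instead of being printed as a raw string. A small sketch using one of the values above (note: datetime.fromisoformat() only accepts the trailing "Z" from Python 3.11 onward, hence the replace):

```python
from datetime import datetime

# One of the ISO-8601 strings from the "datetime" attribute above.
raw = "2018-06-01T14:21:41Z"

# Swap "Z" for an explicit UTC offset so older Python versions can parse it.
published = datetime.fromisoformat(raw.replace("Z", "+00:00"))

# Render it in the human-readable form shown on the page.
print(published.strftime("%B %d, %Y"))  # June 01, 2018
```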

The website is loaded dynamically, so requests alone cannot handle it. We can use Selenium as an alternative way to scrape the page.

Install it with: pip install selenium

Download the correct version of ChromeDriver here.

from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508"

driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(URL)

# Wait for the page to fully render
sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")

for tag in soup.find_all("time"):
    print(tag.get_text(strip=True))

driver.quit()

Output:

June 1, 2018
9:21 AM
Updated 2 years ago
