Why doesn't Beautiful Soup print text from a website (Reuters), even though the text is clearly present on the page?

Posted 2024-05-12 15:31:01


I am scraping the date from this page: https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508

When I try to get the date shown in the header/grey text area, it does not print:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508")
soup = BeautifulSoup(page.content, 'lxml')

headlines = soup.find_all('time')
for headline in headlines:
    headline_text = headline.get_text(strip=True)
    print("done:", headline_text)

This code outputs:

done: 
done: 
done: Updated

The picture below shows that the text is clearly there, so why is "June 1, 2018" not printed?

[screenshot of the webpage]

I have tried both html.parser and lxml, but neither works.


2 Answers

The website loads its content in a different way:

View URL is: https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508

This view URL loads its content dynamically, but by using the browser developer tools ("Network" tab) I could see that there is also an instant-article URL path.


Actual URL to use: https://www.reuters.com/article/instant-article/idUSKCN1IX508

So what has to be done is to take the last part of the view URL, i.e. idUSKCN1IX508, and use it in the actual URL for the get() request. The change is as follows:
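Incidentally, extracting that last part does not have to be done by hand. A minimal sketch (pure string handling, no network request, and assuming the article ID is always the final hyphen-separated token of the view URL):

```python
# View URL of the article as given in the question.
view_url = "https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508"

# The article ID is the text after the final "-" in the URL.
article_id = view_url.rsplit("-", 1)[-1]   # "idUSKCN1IX508"

# Build the instant-article URL from it.
instant_url = f"https://www.reuters.com/article/instant-article/{article_id}"
print(instant_url)
```

This keeps the scraper working for other Reuters articles without hard-coding each instant-article URL, as long as the assumption about the URL shape holds.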

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/article/instant-article/idUSKCN1IX508")
soup = BeautifulSoup(page.content, "html.parser")
soup.find_all("time")
Out[12]: 
[<time class="op-published" datetime="2018-06-01T14:21:41Z"></time>,
 <time class="op-modified" datetime="2018-06-01T14:19:02Z"></time>]

Additionally, to get the times as text (assigning the find_all() result to obj_list first):

obj_list = soup.find_all("time")
for item in obj_list:
    print("DateTime of the Article   {}".format(item.get("datetime")))

DateTime of the Article   2018-06-01T14:21:41Z
DateTime of the Article   2018-06-01T14:19:02Z
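Since the datetime attribute is an ISO-8601 timestamp, it can also be parsed into a proper datetime object instead of being printed as a raw string. A small sketch using one of the values above (note: datetime.fromisoformat() only accepts the trailing "Z" from Python 3.11 onward, hence the replace):

```python
from datetime import datetime

# One of the ISO-8601 strings from the "datetime" attribute above.
raw = "2018-06-01T14:21:41Z"

# Swap "Z" for an explicit UTC offset so older Python versions can parse it.
published = datetime.fromisoformat(raw.replace("Z", "+00:00"))

# Render it in the human-readable form shown on the page.
print(published.strftime("%B %d, %Y"))  # June 01, 2018
```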

The website is loaded dynamically, so requests alone cannot handle it. We can use Selenium as an alternative way to scrape the page.

Install it with: pip install selenium

Download the correct version of ChromeDriver here.

from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

URL = "https://www.reuters.com/article/us-usa-banks-conference-jpmorgan/jpmorgan-ceo-dimon-sees-u-s-economic-expansion-continuing-idUSKCN1IX508"

driver = webdriver.Chrome(r"c:\path\to\chromedriver.exe")
driver.get(URL)

# Wait for the page to fully render
sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")

for tag in soup.find_all("time"):
    print(tag.get_text(strip=True))

driver.quit()

Output:

June 1, 2018
9:21 AM
Updated 2 years ago
