How can I fetch data from an online New York Times article with Python?

Posted 2024-04-18 22:48:35


Here is the URL of a New York Times article that has a comments tab: http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html

My goal is to scrape all of the comments from that page using Python's BeautifulSoup library.

Below is my code, but the result comes back empty. I suspect the problem is that it doesn't tell the program where to find the source of the comments. Can anyone fix it? Thank you!

import bs4
import requests

session = requests.Session()
url = "http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html"
page = session.get(url).text
soup = bs4.BeautifulSoup(page, 'html.parser')  # name the parser explicitly
comments = soup.find_all(class_='comments-panel')
for e in comments:
    print(e.string)  # print each matched element, not the whole result set

Tags: com, http, url, world, www, comments, twitter, tags
1 Answer
Anonymous user
#1 · Posted 2024-04-18 22:48:35

The comments tab containing all the comments is hidden and is only revealed by a JavaScript event. Following @eLRuLL's suggestion, you can use selenium to open the comments tab and retrieve the comments as follows (in Python 3):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path='.../geckodriver') # adapt the path to the geckodriver

# set the browser window size to desktop view
driver.set_window_size(2024, 1000)

url = "http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html"
driver.get(url)

# wait until the page has fully loaded
time.sleep(5)

# select the link 'SEE ALL COMMENTS' and click it
driver.find_element_by_css_selector('li.comment-count').click()

# get source code and close the browser
page  = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly

comments = soup.find_all('div', class_='comments-panel')
print(comments[0].prettify())

Edit:

To retrieve all the comments and all the replies to comments, you need to 1) select the 'READ MORE' and 'SEE ALL REPLIES' elements, and 2) iterate over them and click them. I have modified the code example accordingly:

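A rough sketch of those two steps, under the assumption that the expand buttons carry classes like `comments-expand` and `comments-see-all-replies` (hypothetical names; check the real class names in the page source before using this). The `extract_comments` helper then pulls the comment text out of the saved page source, again assuming a hypothetical `comment-body` class on each comment paragraph:

```python
import time

from bs4 import BeautifulSoup


def expand_comments(driver, pause=1.0):
    # Keep clicking 'READ MORE' / 'SEE ALL REPLIES' buttons until none remain.
    # The CSS selectors below are assumptions, not the verified NYT markup.
    while True:
        buttons = driver.find_elements_by_css_selector(
            'button.comments-expand, button.comments-see-all-replies')
        if not buttons:
            break
        for button in buttons:
            try:
                button.click()
            except Exception:
                pass  # the button may have gone stale after the page re-rendered
        time.sleep(pause)  # give the newly revealed comments time to load


def extract_comments(page_source):
    # Pull the visible text of each comment body out of the saved HTML.
    soup = BeautifulSoup(page_source, 'html.parser')
    return [p.get_text(strip=True)
            for p in soup.select('div.comments-panel p.comment-body')]
```

After calling `expand_comments(driver)` you would pass `driver.page_source` to `extract_comments` and close the browser, as in the example above. Sleeping a fixed interval is crude; selenium's `WebDriverWait` with an expected condition is the more robust way to wait for the new comments to appear.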
