How can I fetch data from an online New York Times article with Python?

Posted 2024-04-18 22:48:35


Here is the URL of a New York Times article that has a comments tab: http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html

My goal is to scrape all of the comments from that page using Python's BeautifulSoup library.

Below is my code, but the result comes back empty. I suspect the problem is that it doesn't tell the program where to find the source of the comments. Can anyone fix it? Thank you!

import bs4
import requests

session = requests.Session()
url = "http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html"
page = session.get(url).text
soup = bs4.BeautifulSoup(page, 'html.parser')  # name the parser explicitly
comments = soup.find_all(class_='comments-panel')
for e in comments:
    print(e.string)  # print each matched element, not the whole result set

Tags: com, http, url, world, www, comments, twitter, tags
1 Answer
Anonymous user
#1 · Posted 2024-04-18 22:48:35

The comments tab containing all the comments is hidden and is only revealed by a JavaScript event. Following @eLRuLL's suggestion, you can use selenium to open the comments tab and retrieve the comments as follows (in Python 3):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path='.../geckodriver') # adapt the path to the geckodriver

# set the browser window size to desktop view
driver.set_window_size(2024, 1000)

url = "http://www.nytimes.com/2017/01/04/world/asia/china-xinhua-donald-trump-twitter.html"
driver.get(url)

# wait until the page has fully loaded
time.sleep(5)

# select the link 'SEE ALL COMMENTS' and click it
driver.find_element_by_css_selector('li.comment-count').click()

# get source code and close the browser
page  = driver.page_source
driver.close()

soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly

comments = soup.find_all('div', class_='comments-panel')
print(comments[0].prettify())

Edit:

To retrieve all the comments and all the replies to comments, you need to 1) select the 'READ MORE' and 'SEE ALL REPLIES' elements, and 2) iterate over them and click them. I have modified the code example accordingly:

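A rough sketch of those two steps, under the assumption that the expand buttons carry classes like `comments-expand` and `comments-see-all-replies` (hypothetical names; check the real class names in the page source before using this). The `extract_comments` helper then pulls the comment text out of the saved page source, again assuming a hypothetical `comment-body` class on each comment paragraph:

```python
import time

from bs4 import BeautifulSoup


def expand_comments(driver, pause=1.0):
    # Keep clicking 'READ MORE' / 'SEE ALL REPLIES' buttons until none remain.
    # The CSS selectors below are assumptions, not the verified NYT markup.
    while True:
        buttons = driver.find_elements_by_css_selector(
            'button.comments-expand, button.comments-see-all-replies')
        if not buttons:
            break
        for button in buttons:
            try:
                button.click()
            except Exception:
                pass  # the button may have gone stale after the page re-rendered
        time.sleep(pause)  # give the newly revealed comments time to load


def extract_comments(page_source):
    # Pull the visible text of each comment body out of the saved HTML.
    soup = BeautifulSoup(page_source, 'html.parser')
    return [p.get_text(strip=True)
            for p in soup.select('div.comments-panel p.comment-body')]
```

After calling `expand_comments(driver)` you would pass `driver.page_source` to `extract_comments` and close the browser, as in the example above. Sleeping a fixed interval is crude; selenium's `WebDriverWait` with an expected condition is the more robust way to wait for the new comments to appear.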
