How do I scrape web content with BeautifulSoup when the response text does not contain everything that my browser displays?

from bs4 import BeautifulSoup
import requests
import os

url = 'https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
title = soup.find_all('div', class_='content_title')

# This outputs exactly what I need, but when I try to do the same
# for the paragraph text (see the code below), it outputs an empty list.
results = soup.find_all('div', class_='article_teaser_body')

1 answer

Answer #1 · posted 2024-05-13 09:07:44

You don't need Beautiful Soup to scrape this URL:

https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

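When I checked the browser's Network tab, I found that this page actually loads the article text as JSON from an API request. That is why the teaser text is missing from response.text, and it means you can fetch the same data directly with the requests library.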
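If you want to confirm this kind of problem before opening the Network tab, you can count how often a class appears in the raw HTML. The snippet below is a minimal sketch using only the standard library; `ClassCounter`, `count_class`, and `sample_html` are names made up for illustration, and `sample_html` is a hypothetical stand-in for `response.text`, not the real NASA markup.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the raw HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical stand-in for response.text: the title is in the served HTML,
# but the teaser would only be filled in later by JavaScript.
sample_html = """
<div class="content_title"><a href="/news/1">Some headline</a></div>
<div id="teaser-slot"></div>
<script>/* teaser text is loaded here after an API call */</script>
"""

print(count_class(sample_html, "content_title"))
print(count_class(sample_html, "article_teaser_body"))
```

A count of zero for a class that you can see in the browser's element inspector suggests the content is added after page load, so the data probably comes from a separate request.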

You can try the following code to fetch the articles:

import requests
import json  # used to pretty-print the JSON response

headers = {
  'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' # to mimic regular browser user-agent
}

params = (
    ('page', '0'),
    ('per_page', '40'),  # you can raise this value (e.g. to 1000) to get more data from a single request
    ('order', 'publish_date desc,created_at desc'),
    ('search', ''),
    ('category', '19,165,184,204'),
    ('blank_scope', 'Latest'),
)

# params is equivalent to page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

r = requests.get('https://mars.nasa.gov/api/v1/news_items/', params=params, headers=headers).json()

print(json.dumps(r, indent=4))  # prints the raw JSON response

'''
The article data is stored under the key `"items"`, so we can iterate over
`"items"` and print each article's title and body. Check the raw JSON
response to see what other data is included alongside the title and body.
You only need the key name to read those values, as in the code below.
'''

for article in r["items"]:
    print("Title :", article["title"])
    print("Body :", article["body"])

You can see this code in action here.
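If you would rather page through the results than request everything at once, the loop over `"items"` can be wrapped as follows. This is only a sketch: `extract_articles`, `fetch_all`, and `fake_fetch` are hypothetical helpers, and it assumes (without having verified it against the real API) that the `page` parameter counts up from 0 and that an empty `"items"` list marks the end. The in-memory `fake_pages` data stands in for real responses so that the example runs offline.

```python
def extract_articles(payload):
    """Return (title, body) pairs from a parsed JSON response.

    Missing keys fall back to an empty string instead of raising KeyError.
    """
    articles = []
    for item in payload.get("items", []):
        articles.append((item.get("title", ""), item.get("body", "")))
    return articles


def fetch_all(fetch_page, max_pages=5):
    """Call fetch_page(page) for page 0, 1, ... until a page has no items
    or max_pages pages have been requested."""
    collected = []
    for page in range(max_pages):
        batch = extract_articles(fetch_page(page))
        if not batch:
            break
        collected.extend(batch)
    return collected


# Hypothetical in-memory pages standing in for real API responses
fake_pages = [
    {"items": [{"title": "A", "body": "first"}, {"title": "B", "body": "second"}]},
    {"items": [{"title": "C"}]},  # no "body" key on purpose
    {"items": []},
]


def fake_fetch(page):
    """Return the fake response for a page, or an empty page past the end."""
    return fake_pages[page] if page < len(fake_pages) else {"items": []}


for title, body in fetch_all(fake_fetch):
    print("Title :", title)
    print("Body :", body)
```

To use it against the real endpoint, `fetch_page` could be a small function that calls `requests.get` with the same `params` and `headers` as above, sets `page` to the argument, and returns `.json()`.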
