How do I scrape web content with BeautifulSoup when the response text doesn't contain everything shown in my browser?

Posted 2024-05-13 09:07:44


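Using Jupyter Notebook (ipynb), I'm trying to scrape web content with BeautifulSoup, but the response text doesn't contain everything shown in my browser. I want to pull the article titles and the paragraph text, but I can't get the paragraph text because it is missing from the response, even though it shows in my browser.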

url = https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

I opened the URL in my browser and saw the content I'm looking for:

<div class="article_teaser_body">New evidence suggests salty, shallow ponds once dotted a Martian crater — a sign of the planet's drying climate.</div>

However, when I print the response text, that content isn't in it.

from bs4 import BeautifulSoup
import requests
import os

url = 'https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest'

response = requests.get(url)

soup = BeautifulSoup(response.text, 'lxml')

title = soup.find_all('div', class_='content_title')

# This returns exactly what I need, but when I try the same thing
# for the paragraph text (see the code below), it returns an empty list.

results = soup.find_all('div', class_='article_teaser_body')

The results list is empty.


1 answer
Anonymous user
Answer 1 · Posted 2024-05-13 09:07:44

You don't need BeautifulSoup to scrape this URL:

https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

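When I checked the Network tab, I found that the page actually fetches the article bodies as JSON from an API request. That is why they are missing from the HTML that requests receives, and it means you can get them easily by calling the same API with the requests library.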
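One way to confirm this diagnosis before reaching for the API is to count how often a class appears in the raw HTML. The sketch below is illustrative only: `ClassCounter` and `count_class` are hypothetical helpers, and both HTML strings (including the `item_list` class) are made-up stand-ins for what the server sends and what the browser shows after JavaScript runs.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up stand-ins: the server's HTML has an empty list, while the
# browser's DOM has the list filled in by JavaScript.
raw_html = '<div class="content_title">Title</div><ul class="item_list"></ul>'
rendered_html = (
    '<div class="content_title">Title</div>'
    '<ul class="item_list"><li>'
    '<div class="article_teaser_body">Teaser</div>'
    '</li></ul>'
)

print(count_class(raw_html, "article_teaser_body"))       # 0
print(count_class(rendered_html, "article_teaser_body"))  # 1
```

With the real page, passing `response.text` from the question to `count_class` should show whether `article_teaser_body` is present in what requests received.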

To fetch the data from the API, you can try the code below:

import requests
import json  # used to pretty-print the JSON response

headers = {
  'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' # to mimic regular browser user-agent
}

params = (
    ('page', '0'),
    ('per_page', '40'),  # you can raise this value (to 1000 or maybe more) to get more data from a single request
    ('order', 'publish_date desc,created_at desc'),
    ('search', ''),
    ('category', '19,165,184,204'),
    ('blank_scope', 'Latest'),
)

# params is equivalent to page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest

r = requests.get('https://mars.nasa.gov/api/v1/news_items/', params=params, headers=headers).json()

print(json.dumps(r, indent=4))  # prints the raw JSON response

# The article data is stored under the key "items". We can iterate over
# r["items"] and print each article's title and body. Check the raw JSON
# response to see what other data is included alongside the title and
# body; you only need the key name to read those values, as in the loop
# below.

for article in r["items"]:
    print("Title:", article["title"])
    print("Body:", article["body"])
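If some entries might lack a field, the loop over `"items"` can be made more defensive. This is a minimal sketch, assuming the response is shaped like `{"items": [{"title": ..., "body": ...}, ...]}` as the code above implies; `extract_articles` is a hypothetical helper and `sample` is made-up data, not real API output.

```python
def extract_articles(payload):
    """Return (title, body) pairs from a dict shaped like {"items": [...]}."""
    articles = []
    for item in payload.get("items", []):
        title = item.get("title")
        body = item.get("body")
        if title is None or body is None:
            continue  # skip entries that lack either field
        articles.append((title, body))
    return articles


# Made-up sample standing in for the parsed JSON response.
sample = {
    "items": [
        {"title": "Example title", "body": "Example body text."},
        {"title": "Entry without a body"},
    ]
}

for title, body in extract_articles(sample):
    print("Title:", title)
    print("Body:", body)
```

Returning a list of pairs instead of printing directly makes it easier to load the results into another structure later.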

You can see the answer's code in action here.
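To check that the answer's `params` tuple really corresponds to the query string in the original URL, it can be encoded with the standard library. This is only a sketch using `urllib.parse.urlencode`. It is meant to illustrate the encoding (spaces become `+`, commas become `%2C`) and does not go through requests itself.

```python
from urllib.parse import urlencode

# Same key/value pairs as the params tuple in the answer.
params = (
    ("page", "0"),
    ("per_page", "40"),
    ("order", "publish_date desc,created_at desc"),
    ("search", ""),
    ("category", "19,165,184,204"),
    ("blank_scope", "Latest"),
)

query = urlencode(params)
print(query)
# page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest
```

Writing the values unencoded in `params` and letting the library encode them avoids double-encoding sequences such as `%2C`.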
