How to use Python to web-scrape specific <p> tags inside a <div> from HTML

Release date: December 6, 2011 Last updated: January 10, 2012 Vulnerability identifier: APSA11-04 CVE number: CVE-2011-2462 Platform: All *Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.

2 Answers

User

#1 · edited 2024-05-15 02:49:26

If you know that you always want the first 4 <p> tags that follow the <h2> tag, you can use the following example:

import requests
from bs4 import BeautifulSoup

url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# "h2 ~ p" selects every <p> that is a later sibling of an <h2>; keep the first 4.
txt = "\n".join(
    p.get_text(strip=True, separator=" ") for p in soup.select("h2 ~ p")[:4]
)
print(txt)

Prints:

Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
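
To try the selector without hitting the network, the same idea can be run against an inline HTML string. The following is a minimal sketch only: the markup in SAMPLE_HTML is a hypothetical stand-in for the advisory page (its real source may differ), and the helper name first_paragraphs_after_h2 is invented for illustration. It also splits each line at the first colon to build a dictionary, which is one possible follow-up step rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the advisory page; the real page may differ.
SAMPLE_HTML = """
<div>
  <h2>Security Advisory</h2>
  <p><strong>Release date:</strong> December 6, 2011</p>
  <p><strong>Last updated:</strong> January 10, 2012</p>
  <p><strong>Vulnerability identifier:</strong> APSA11-04</p>
  <p><strong>CVE number:</strong> CVE-2011-2462</p>
  <p>Platform: All</p>
</div>
"""


def first_paragraphs_after_h2(html, count=4):
    """Return the text of the first `count` <p> siblings that follow an <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    # "h2 ~ p" matches every <p> that is a later sibling of an <h2>.
    paragraphs = soup.select("h2 ~ p")[:count]
    return [p.get_text(strip=True, separator=" ") for p in paragraphs]


lines = first_paragraphs_after_h2(SAMPLE_HTML)

# Split each "key: value" line once, at the first colon.
fields = {}
for line in lines:
    key, _, value = line.partition(":")
    fields[key.strip()] = value.strip()

print(fields)
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selector against saved copies of a page before pointing it at the live URL.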

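User

#2 · edited 2024-05-15 02:49:26

Rather than retrieving the whole set, I would use :nth-of-type to filter down to the first 4 sibling p tags inside the selector itself, which is more efficient: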

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get('https://www.adobe.com/support/security/advisories/apsa11-04.html')
soup = bs(r.content, 'html.parser')
# Keep only <p> siblings of the <h2> that are among the first 4 <p> elements of their parent.
pprint([i.text for i in soup.select('h2 ~ p:nth-of-type(-n+4)')])
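
One thing worth checking before relying on :nth-of-type: it counts <p> elements from the start of their parent, not from the <h2>. The sketch below uses hypothetical markup (not the real advisory page) in which one <p> sits before the <h2>, to show how the two approaches can then differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> precedes the <h2> inside the same parent.
html = """
<div>
  <p>intro</p>
  <h2>Heading</h2>
  <p>one</p>
  <p>two</p>
  <p>three</p>
  <p>four</p>
  <p>five</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# :nth-of-type counts <p> elements from the start of the parent, so "intro"
# uses up position 1 and only three paragraphs after the <h2> qualify.
by_position = [p.text for p in soup.select("h2 ~ p:nth-of-type(-n+4)")]

# Slicing the full match list counts from the <h2> instead.
by_slice = [p.text for p in soup.select("h2 ~ p")[:4]]

print(by_position)  # ['one', 'two', 'three']
print(by_slice)     # ['one', 'two', 'three', 'four']
```

If the <h2> is the first element that matters in its parent and no <p> comes before it, both selectors should return the same four paragraphs.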

You can also use the limit argument:

pprint([i.text for i in soup.select('h2 ~ p', limit=4)])
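
As a quick offline check of the limit argument, here is a minimal sketch against generated markup. The HTML is hypothetical and only serves to show that limit=4 stops after four matches and gives the same elements as slicing the full result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <h2> followed by seven sibling <p> elements.
html = (
    "<div><h2>Heading</h2>"
    + "".join(f"<p>item {i}</p>" for i in range(1, 8))
    + "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# limit=4 asks select() to return at most four matches.
limited = [p.text for p in soup.select("h2 ~ p", limit=4)]

# Slicing collects all seven matches first, then keeps the first four.
sliced = [p.text for p in soup.select("h2 ~ p")[:4]]

print(limited)  # ['item 1', 'item 2', 'item 3', 'item 4']
```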
