How to scrape specific <p> tags inside a <div> from HTML with Python

Posted 2024-04-29 10:52:03


The data I want to extract comes from https://www.adobe.com/support/security/advisories/apsa11-04.html and I only want to extract the following:

Release date: December 6, 2011 Last updated: January 10, 2012 Vulnerability identifier: APSA11-04 CVE number: CVE-2011-2462

Code:

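import requests
from bs4 import BeautifulSoup

url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
div = soup.find("div", attrs={"id": "L0C1-body"})
for p in div.find_all("p"):
    if p.find("strong"):  # only paragraphs that carry a bold label
        print(p.text)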

Output:

Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
Platform: All
*Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.

I don't want the following information. How should I filter it out?

Platform: All *Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.


2 Answers

If you know that you always want the first 4 <p> tags after the <h2> tag, you can use this example:

import requests
from bs4 import BeautifulSoup


url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

txt = "\n".join(
    map(lambda x: x.get_text(strip=True, separator=" "), soup.select("h2 ~ p")[:4])
)
print(txt)

Prints:

Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
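
If the position of the paragraphs is not guaranteed, one alternative to the positional approach above is to filter on the bold label itself. This is only a minimal sketch: the `HTML` string is hypothetical markup standing in for the advisory page (the real structure may differ), and the `WANTED` set and `wanted_fields` helper are names made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the advisory page; the real page may differ.
HTML = """
<div id="L0C1-body">
  <h2>Security Advisory</h2>
  <p><strong>Release date:</strong> December 6, 2011</p>
  <p><strong>Last updated:</strong> January 10, 2012</p>
  <p><strong>Vulnerability identifier:</strong> APSA11-04</p>
  <p><strong>CVE number:</strong> CVE-2011-2462</p>
  <p><strong>Platform:</strong> All</p>
</div>
"""

# Labels to keep (assumed to match the <strong> text on the page).
WANTED = {"Release date", "Last updated", "Vulnerability identifier", "CVE number"}


def wanted_fields(html):
    """Return the text of each <p> whose <strong> label is in WANTED."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.select("p"):
        strong = p.find("strong")
        if strong is None:
            continue  # paragraph has no bold label, skip it
        label = strong.get_text(strip=True).rstrip(":")
        if label in WANTED:
            lines.append(p.get_text(" ", strip=True))
    return lines


fields = wanted_fields(HTML)
print("\n".join(fields))
```

Filtering by label keeps working if the page gains or loses paragraphs around the ones you need, at the cost of having to list the labels up front.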

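Rather than retrieving the whole set, I would use :nth-of-type to restrict the match to the first 4 sibling p tags in the selector itself, which is more efficient: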

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
    
r = requests.get('https://www.adobe.com/support/security/advisories/apsa11-04.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('h2 ~ p:nth-of-type(-n+4)')])

You can also use the limit argument:

pprint([i.text for i in soup.select('h2 ~ p', limit = 4)])
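
The two variants above are not always interchangeable, which a small offline comparison can show. The sketch below uses hypothetical markup (the real page may differ) and a made-up helper, `select_text`. `:nth-of-type(-n+4)` counts among all <p> children of the parent, so a <p> placed before the <h2> uses up one of the four slots, while `limit=4` simply stops after four matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the advisory page; the real page may differ.
HTML = """
<div>
  <h2>Security Advisory</h2>
  <p><strong>Release date:</strong> December 6, 2011</p>
  <p><strong>Last updated:</strong> January 10, 2012</p>
  <p><strong>Vulnerability identifier:</strong> APSA11-04</p>
  <p><strong>CVE number:</strong> CVE-2011-2462</p>
  <p><strong>Platform:</strong> All</p>
</div>
"""


def select_text(html, selector, **kwargs):
    """Run a CSS selector and return the normalized text of each match."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(" ", strip=True) for p in soup.select(selector, **kwargs)]


# With no <p> before the <h2>, both variants return the same four paragraphs.
by_nth = select_text(HTML, "h2 ~ p:nth-of-type(-n+4)")
by_limit = select_text(HTML, "h2 ~ p", limit=4)

# Add a <p> before the <h2>: it becomes the first <p> of its type in the parent.
shifted = HTML.replace("<h2>", "<p>Intro</p><h2>")
by_nth_shifted = select_text(shifted, "h2 ~ p:nth-of-type(-n+4)")
by_limit_shifted = select_text(shifted, "h2 ~ p", limit=4)

print(len(by_nth), len(by_limit))                  # same count on the plain markup
print(len(by_nth_shifted), len(by_limit_shifted))  # counts diverge once a <p> precedes the <h2>
```

If the page might contain paragraphs ahead of the heading, `limit=4` (or slicing the result list) is the safer of the two.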
