如何使用Python和Beautiful Soup抓取BBC文章的标题？

1 投票

1 回答

46 浏览

提问于 2025-04-12 16:29

我之前做过一个抓取BBC网站内容的程序，它可以从特定的文章中抓取标题，比如说这篇文章。不过，BBC最近改了他们的网站，所以我需要对我的抓取程序进行修改，这变得有点困难。比如说，我想从刚才提到的那篇文章中抓取标题。我用Firefox检查网页的HTML代码，找到了对应的HTML属性，就是data-component="headline-block"（在图片中蓝色标记的那一行）。

如果我想提取这个标签，我会这样做：

import requests

from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/world-africa-68504329'

# extract html
html = requests.get(url).text

# parse html
soup = BeautifulSoup(html, 'html.parser')

# extract headline from soup
head = soup.find(attrs = {'data-component': 'headline-block'})

但是当我打印head的值时，它返回None，这意味着Beautiful Soup找不到这个标签。我漏掉了什么呢？我该如何解决这个问题？

数据提取网页抓取 html解析 beautiful soup 网络爬虫错误调试网站结构标签选择

1 个回答

你在页面上看到的数据是以Json格式存储在页面内部的（所以beautifulsoup无法直接看到这些数据）。如果你想获取标题和文章内容，可以参考这个例子：

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news/world-africa-68504329"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# print(json.dumps(data, indent=4))

page = next(
    v for k, v in data["props"]["pageProps"]["page"].items() if k.startswith("@")
)
for c in page["contents"]:
    match c["type"]:
        case "headline":
            print(c["model"]["blocks"][0]["model"]["text"])
            print()
        case "text":
            print(c["model"]["blocks"][0]["model"]["text"], end=" ")

print()

输出结果是：

Kuriga kidnap: More than 280 Nigerian pupils abducted

More than 280 Nigerian school pupils have been abducted in the north-western town of Kuriga, officials say.  The pupils were in the assembly ground around 08:30 (07:30 GMT) when dozens of gunmen on motorcycles rode through the school, one witness said. The students, between the ages of eight and 15, were taken away, along with a teacher, they added. Kidnap gangs, known as bandits, have seized thousands of people in recent years, especially the north-west. However, there had been a reduction in the mass abduction of children over the past year until this week. Those kidnapped are usually freed after a ransom is paid. The mass abduction was

...

回答于 2025-04-12 由 Python大师

分享举报

如何使用Python和Beautiful Soup抓取BBC文章的标题？

1 个回答

撰写回答