如何使用Python和Beautiful Soup抓取BBC文章的标题?

1 投票
1 回答
46 浏览
提问于 2025-04-12 16:29

我之前做过一个抓取BBC网站内容的程序,它可以从特定的文章中抓取标题,比如说这篇文章。不过,BBC最近改了他们的网站,所以我需要对我的抓取程序进行修改,这变得有点困难。比如说,我想从刚才提到的那篇文章中抓取标题。我用Firefox检查网页的HTML代码,找到了对应的HTML属性,就是data-component="headline-block"(在图片中蓝色标记的那一行)。

查看蓝色标记的那一行。

如果我想提取这个标签,我会这样做:

import requests

from bs4 import BeautifulSoup

url = 'https://www.bbc.com/news/world-africa-68504329'

# extract html
html = requests.get(url).text

# parse html
soup = BeautifulSoup(html, 'html.parser')

# extract headline from soup
head = soup.find(attrs = {'data-component': 'headline-block'})

但是当我打印head的值时,它返回None,这意味着Beautiful Soup找不到这个标签。我漏掉了什么呢?我该如何解决这个问题?

1 个回答

1

你在页面上看到的数据是以Json格式存储在页面内部的(所以无法直接看到这些数据)。如果你想获取标题和文章内容,可以参考这个例子:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.bbc.com/news/world-africa-68504329"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# print(json.dumps(data, indent=4))

page = next(
    v for k, v in data["props"]["pageProps"]["page"].items() if k.startswith("@")
)
for c in page["contents"]:
    match c["type"]:
        case "headline":
            print(c["model"]["blocks"][0]["model"]["text"])
            print()
        case "text":
            print(c["model"]["blocks"][0]["model"]["text"], end=" ")

print()

输出结果是:

Kuriga kidnap: More than 280 Nigerian pupils abducted

More than 280 Nigerian school pupils have been abducted in the north-western town of Kuriga, officials say.  The pupils were in the assembly ground around 08:30 (07:30 GMT) when dozens of gunmen on motorcycles rode through the school, one witness said. The students, between the ages of eight and 15, were taken away, along with a teacher, they added. Kidnap gangs, known as bandits, have seized thousands of people in recent years, especially the north-west. However, there had been a reduction in the mass abduction of children over the past year until this week. Those kidnapped are usually freed after a ransom is paid. The mass abduction was

...

撰写回答