如何使用Python和Beautiful Soup抓取BBC文章的标题?
我之前做过一个抓取BBC网站内容的程序,它可以从特定的文章中抓取标题,比如说这篇文章。不过,BBC最近改了他们的网站,所以我需要对我的抓取程序进行修改,这变得有点困难。比如说,我想从刚才提到的那篇文章中抓取标题。我用Firefox检查网页的HTML代码,找到了对应的HTML属性,就是data-component="headline-block"
(在图片中蓝色标记的那一行)。
如果我想提取这个标签,我会这样做:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bbc.com/news/world-africa-68504329'
# extract html
html = requests.get(url).text
# parse html
soup = BeautifulSoup(html, 'html.parser')
# extract headline from soup
head = soup.find(attrs = {'data-component': 'headline-block'})
但是当我打印head
的值时,它返回None
,这意味着Beautiful Soup找不到这个标签。我漏掉了什么呢?我该如何解决这个问题?
1 个回答
1
你在页面上看到的数据是以Json格式存储在页面内部的(所以beautifulsoup无法直接看到这些数据)。如果你想获取标题和文章内容,可以参考这个例子:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.com/news/world-africa-68504329"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)
# print(json.dumps(data, indent=4))
page = next(
v for k, v in data["props"]["pageProps"]["page"].items() if k.startswith("@")
)
for c in page["contents"]:
match c["type"]:
case "headline":
print(c["model"]["blocks"][0]["model"]["text"])
print()
case "text":
print(c["model"]["blocks"][0]["model"]["text"], end=" ")
print()
输出结果是:
Kuriga kidnap: More than 280 Nigerian pupils abducted
More than 280 Nigerian school pupils have been abducted in the north-western town of Kuriga, officials say. The pupils were in the assembly ground around 08:30 (07:30 GMT) when dozens of gunmen on motorcycles rode through the school, one witness said. The students, between the ages of eight and 15, were taken away, along with a teacher, they added. Kidnap gangs, known as bandits, have seized thousands of people in recent years, especially the north-west. However, there had been a reduction in the mass abduction of children over the past year until this week. Those kidnapped are usually freed after a ransom is paid. The mass abduction was
...