Getting the footer from a Chinese website with Python BeautifulSoup

0 votes

1 answer

1028 views

Asked 2025-04-18 01:11

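I'm trying to scrape data from a Chinese website. I've found where the data sits in the page source, but I need help extracting the text. This is the code I have so far: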

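from bs4 import BeautifulSoup
import requests

page = 'http://sbj.speiyou.com/search/index/subject:/grade:12/gtype:time'
r = requests.get(page)

r.encoding = 'utf-8'  # decode the response body as UTF-8
soup = BeautifulSoup(r.text, 'html.parser')  # name the parser explicitly

# the pagination block that contains the "1/16" text
div = soup.find('div', class_='pagination mtop40')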

The value I want is the 16 in 1/16.

beautifulsoup web-scraping data-extraction chinese-website parsing

1 Answer

Running a regular expression over div.text is one option. The pattern below matches one or more digits, followed by a slash, followed by one or more digits.

import re
pattern = re.compile(r'\d+/\d+')
matches = re.search(pattern, div.text)
num = matches.group(0)  # num is '1/16' here
print(num.split('/')[1])  # prints the part after the slash

Or

import re
pattern = re.compile(r'\d+/(\d+)')  # capture the needed number in a group
matches = re.search(pattern, div.text)
print(matches.group(1))  # group(1) is the captured number; group(0) would be the whole '1/16'
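
Both snippets above assume that the div was found and that its text contains a match. soup.find returns None when no element matches, and re.search returns None when the pattern does not match, so either case would raise an AttributeError. A minimal sketch of a more defensive version follows; the helper name total_pages and the sample string are made up for illustration and are not part of the original code or the site's actual markup.

```python
import re

# Hypothetical helper (name chosen for illustration only).
# Matches "current/total", tolerating spaces around the slash.
PAGE_PATTERN = re.compile(r'(\d+)\s*/\s*(\d+)')


def total_pages(text):
    """Return the total page count from text like '1/16', or None if absent."""
    if not text:
        return None
    match = PAGE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(2))  # second group is the number after the slash


# Made-up sample text standing in for div.text
print(total_pages('1/16'))
```

With the question's code this could be called as total_pages(div.text if div is not None else ''), which yields None instead of an exception when the pagination block is missing.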

Answered 2025-04-18 by Python大师
