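How can I scrape a table from a web page with Python?
I'm using Python to scrape information from this web page. However, the content that appears when I click the "Previous" button is not available to my scraper. I considered using Selenium to work around this, but I couldn't get it to run in headless mode.
Here is the code I use to scrape the match links: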
import urllib2
import re
site_url = 'http://us.soccerway.com'
national_league_div_sub_matches_url = 'http://us.soccerway.com/national/england/premier-league/20132014/regular-season/r21322/'
national_league_div_sub_matches_url_source = urllib2.urlopen(national_league_div_sub_matches_url).read()
match_links = re.findall('(/matches/[0-9][0-9][0-9][0-9]/.*?ICID.*?)">', national_league_div_sub_matches_url_source)
match_links = map(lambda x: ''.join([site_url, x]), match_links)
for x in match_links:
    print x
1 Answer
When you click "previous" in the browser, JavaScript requests a long URL to fetch JSON data from the server, so your script needs to request that same URL.
import requests, json
url = 'http://us.soccerway.com/a/block_competition_matches_summary?block_id=page_competition_1_block_competition_matches_summary_6&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_308124%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22competition_summary_block_competitionmatchessummary%22%2C%22round_id%22%3A21322%2C%22outgroup%22%3Afalse%2C%22view%22%3A2%7D&action=changePage&params=%7B%22page%22%3A-1%7D'
r = requests.get(url)
#print r.text
data = json.loads(r.text)
print data
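To see what that long URL actually carries, it can help to decode its query string. The following is a rough Python 3 sketch, not part of the original answer. It assumes the last query parameter is named params, and decode_block_url is a helper name made up for this illustration.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The URL from the answer, split across lines for readability.
# Assumption: the final query parameter is named "params".
URL = (
    'http://us.soccerway.com/a/block_competition_matches_summary'
    '?block_id=page_competition_1_block_competition_matches_summary_6'
    '&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B'
    '%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_308124%22%2C'
    '%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C'
    '%22block_service_id%22%3A%22competition_summary_block_competitionmatchessummary%22%2C'
    '%22round_id%22%3A21322%2C%22outgroup%22%3Afalse%2C%22view%22%3A2%7D'
    '&action=changePage&params=%7B%22page%22%3A-1%7D'
)


def decode_block_url(url):
    """Return the query parameters, JSON-decoding the values that hold JSON."""
    query = parse_qs(urlsplit(url).query)  # percent-decoding happens here
    decoded = {}
    for key, values in query.items():
        value = values[0]  # each parameter appears only once in this URL
        try:
            decoded[key] = json.loads(value)
        except ValueError:
            decoded[key] = value  # plain string such as the action name
    return decoded


if __name__ == '__main__':
    for key, value in decode_block_url(URL).items():
        print(key, '=', value)
```

Decoded this way, callback_params and params turn out to be small JSON objects, and params holds the page number being requested.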
Now data holds a dict, so you need to dig through it to find the content you need.
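The layout of that dict isn't shown here, so the Python 3 sketch below avoids guessing key names. It walks every string inside the decoded JSON and pulls out anything that looks like a match link. The sample dict, the helper names and the href pattern are all assumptions made for illustration. Print the real response first and adjust.

```python
import re

# Assumption: match links appear inside HTML fragments as href="/matches/...".
MATCH_LINK = re.compile(r'href="(/matches/[^"]+)"')


def iter_strings(obj):
    """Yield every string value nested anywhere inside dicts and lists."""
    if isinstance(obj, str):
        yield obj
    elif isinstance(obj, dict):
        for value in obj.values():
            yield from iter_strings(value)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from iter_strings(item)


def find_match_links(data, site_url='http://us.soccerway.com'):
    """Collect absolute match URLs from all strings in data, without duplicates."""
    links = []
    for text in iter_strings(data):
        for path in MATCH_LINK.findall(text):
            url = site_url + path
            if url not in links:
                links.append(url)
    return links


if __name__ == '__main__':
    # Hypothetical stand-in for the real response; the key names are invented.
    sample = {
        'block': {'content': '<a href="/matches/2013/08/17/example/?ICID=PL_MS_01">1 - 0</a>'},
        'page': -1,
    }
    print(find_match_links(sample))
```

Once you know which key actually holds the HTML, it is cleaner to read that key directly and hand the fragment to an HTML parser instead of a regex.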
The URL may change each time you click "previous" in the browser, so to fetch older data you need to reproduce those changed URLs as well.
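One way to avoid pasting a new URL for every click is to rebuild the query string yourself. This Python 3 sketch assumes that only the page value in params changes between clicks, which is a guess and should be checked against the requests shown in the browser's network tab. The CALLBACK_PARAMS values are decoded from the URL above, and build_page_url is a hypothetical helper.

```python
import json
from urllib.parse import urlencode, quote

BASE_URL = 'http://us.soccerway.com/a/block_competition_matches_summary'
BLOCK_ID = 'page_competition_1_block_competition_matches_summary_6'

# Decoded from the callback_params value in the URL above.
CALLBACK_PARAMS = {
    'page': 0,
    'bookmaker_urls': {
        '13': [{'link': 'http://www.bet365.com/home/?affiliate=365_308124',
                'name': 'Bet 365'}],
    },
    'block_service_id': 'competition_summary_block_competitionmatchessummary',
    'round_id': 21322,
    'outgroup': False,
    'view': 2,
}


def build_page_url(callback_params, page, block_id=BLOCK_ID):
    """Build the request URL for one page (assumption: only 'page' varies)."""
    query = {
        'block_id': block_id,
        # Compact separators keep the JSON free of extra spaces.
        'callback_params': json.dumps(callback_params, separators=(',', ':')),
        'action': 'changePage',
        'params': json.dumps({'page': page}, separators=(',', ':')),
    }
    # quote_via=quote encodes spaces as %20, as in the URL above.
    return BASE_URL + '?' + urlencode(query, quote_via=quote)


if __name__ == '__main__':
    for page in (-1, -2, -3):
        print(build_page_url(CALLBACK_PARAMS, page))
```

Each generated URL can then be passed to requests.get as in the answer. Whether the server treats page as absolute or as relative to the page inside callback_params isn't clear from the URL alone, so compare the output with what the browser sends before relying on it.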