Python BeautifulSoup does not scrape the complete table
I'm not sure whether mechanize is the reason the complete table is not being scraped.
This code works:
from bs4 import BeautifulSoup
import requests
page = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
r = requests.get(page)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text)
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.split())
But this one does not:
import mechanize
from bs4 import BeautifulSoup
import requests
URL='http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
control_year=['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']
control_month=['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
br = mechanize.Browser()
r=br.open(URL)
br.select_form("exl")
control_m = br.form.find_control('month')
control_y = br.form.find_control('year')
br[control_m.name]=['06']
br[control_y.name]=['2012']
response = br.submit()
soup = BeautifulSoup(response,'html.parser')
#div = soup.find('div', class_='mainRight')
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.strip())
The mechanize version prints only the output below, even though I can see all of the tr and td elements in Firefox's developer tools.
Jun 2012
% change vs Jun 2011
% change vs May 2012
Cumulative Jun 2012
% cumulative change
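Developer tools show the DOM after the browser has repaired the markup, so one way to tell whether rows are missing from the response or lost during parsing is to count the opening tags in the raw HTML and compare that with what BeautifulSoup finds. The following is only a rough sketch: count_open_tags is a hypothetical helper, the sample markup is a stand-in rather than the real page, and a regex count will also pick up tags inside comments or scripts.

```python
import re


def count_open_tags(html, tag):
    """Count opening tags named `tag` in raw markup, ignoring case."""
    # \b keeps 'tr' from matching longer tag names such as 'track'.
    pattern = r'<%s\b' % re.escape(tag)
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


# Stand-in markup (an assumption, not the actual Air China response).
sample = (
    '<table>'
    '<tr><td>Jun 2012</td><td>% change vs Jun 2011</td></tr>'
    '<TR><TD>row</TD><td>1.0</td></TR>'
    '</table>'
)

n_tr = count_open_tags(sample, 'tr')
n_td = count_open_tags(sample, 'td')
print('tr: %d, td: %d' % (n_tr, n_td))
```

With mechanize, the raw markup could be obtained with html = response.read() (decoded to text if necessary) and that same string then passed to BeautifulSoup. If the raw counts are much higher than len(soup.find_all('tr')), the rows were probably dropped by the parser rather than by the form submission.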
1 Answer
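It works when I combine the two (mechanize for the form submission, BeautifulSoup without the explicit 'html.parser' argument), so the problem is probably related to the html.parser you are using.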
import mechanize
from bs4 import BeautifulSoup
URL = ('http://www.airchina.com.cn/www/jsp/airlines_operating_data/'
       'exlshow_en.jsp')
control_year = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
                '2014']
control_month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                 '11', '12']
br = mechanize.Browser()
r = br.open(URL)
br.select_form("exl")
control_m = br.form.find_control('month')
control_y = br.form.find_control('year')
br[control_m.name] = ['06']
br[control_y.name] = ['2012']
response = br.submit()
soup = BeautifulSoup(response)
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.split())
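When no parser is named, BeautifulSoup picks the best one installed (lxml, then html5lib, then html.parser), so the code above may behave differently on another machine. A minimal sketch of making that choice explicit follows. make_soup and the PARSERS preference order are assumptions for illustration, not part of the original code, and the sample markup is a stand-in.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Preference order is an assumption; the more lenient parsers come first.
PARSERS = ('lxml', 'html5lib', 'html.parser')


def make_soup(markup, parsers=PARSERS):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError('none of the requested parsers is installed')


# Stand-in markup (an assumption, not the actual Air China response).
soup, used = make_soup('<table><tr><td>Jun 2012</td></tr></table>')
print(used)
print([td.get_text() for td in soup.find_all('td')])
```

In the mechanize script this could be called as soup, used = make_soup(response.read()). Printing used shows which parser actually built the tree, which helps when the same script gives different results in different environments.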