BeautifulSoup: extracting data from multiple tables
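I'm trying to use BeautifulSoup to extract some data from an html file that contains two html tables.
This is my first time using BeautifulSoup. I've looked through a lot of questions and examples, but none of them quite fit my case. The first table holds the headings of the first column (these are always text), while the second table holds the data for the remaining columns. On top of that, the tables mix text, numbers and symbols, which makes everything more complicated for a beginner like me. Here is the html layout, copied from the browser.
All I've managed to extract so far is the html of every row in the first table, so I'm not really getting any data, only the contents of the first column.
The output I'd like is one string per row with the "joined" information from both tables (Col1=text, Col2=number, Col3=number, Col4=number, Col5=number), for example: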
Canada, 6, 5, 2, 1
Here is the XPath of each item:
"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1]
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3]
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]
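I'd also be happy with a string in "rough" html form, as long as there is one string per row, because I can then parse it further with methods I already know. Here is my code so far. Thanks!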
from BeautifulSoup import BeautifulSoup
html="""
my html code
"""
soup = BeautifulSoup(html)
table=soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col
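3 Answers
I'd like to add my own version here. I don't really understand why people still use BeautifulSoup for web scraping when using XPath directly in lxml is much simpler. Here is the same problem, solved in a way that may be easier to read and update: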
from lxml import html, etree
# The two sibling <div> elements: the first holds the names, the second the numbers
tree = html.parse("sample.html").xpath('//body/div/div')
# encoding='unicode' makes tostring() return text instead of bytes
lxml_getValue = lambda x: etree.tostring(x, method="text", encoding='unicode').strip()
# Cells 0, 2, 4 and 6 of a row hold the four values
lxml_getData = lambda x: ", ".join(lxml_getValue(td) for td in x.xpath('.//td')[0:7:2])
locations = tree[0].xpath('.//tbody')[1].xpath('./tr')
locations.pop(0)  # Don't need first row
data = tree[1].xpath('.//tbody')[1].xpath('./tr')
data.pop(0)  # Don't need first row
for f, b in zip(locations, data):
    print(lxml_getValue(f), lxml_getData(b))
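To try the lxml approach without the original page, here is a minimal sketch run against a made-up two-table snippet. The SAMPLE_HTML markup and the cell_text helper are assumptions for illustration only; the real page has extra header rows and a second tbody that this sketch leaves out. It joins the name and the values with commas, which is the format the question asks for.

```python
from lxml import html

# Hypothetical, simplified stand-in for the real page (not the actual markup):
# the first inner <div> holds the names, the second holds the numbers,
# and every second <td> in the data table is an empty spacer cell.
SAMPLE_HTML = """<html><body><div>
<div><table><tbody>
<tr><td><div><a>Canada</a></div></td></tr>
</tbody></table></div>
<div><div><table><tbody>
<tr><td>6</td><td></td><td>5</td><td></td><td>2</td><td></td><td>1</td></tr>
</tbody></table></div></div>
</div></body></html>"""

tree = html.fromstring(SAMPLE_HTML)
name_rows = tree.xpath("//body/div/div[1]//tr")
data_rows = tree.xpath("//body/div/div[2]//tr")


def cell_text(element):
    """Return the stripped text of an element, including its descendants."""
    return element.text_content().strip()


lines = []
for name_row, data_row in zip(name_rows, data_rows):
    cells = data_row.xpath("./td")[::2]  # skip the spacer cells
    lines.append(", ".join([cell_text(name_row)] + [cell_text(c) for c in cells]))

for line in lines:
    print(line)
```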
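It looks like you're scraping data from http://www.appannie.com.
Here is some code that gets the data. I'm sure parts of it could be improved or written in a more Pythonic way, but it does what you need. Note that I used Beautiful Soup 4 rather than 3.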
from bs4 import BeautifulSoup
with open('test2.html') as html_file:
    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html_file, 'html.parser')
# Table 1 holds the country names, table 3 holds the numbers
tables = soup.find_all('table', attrs={'class':'data-table table-rank'})
countries = []
countries_body = tables[1].find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)
data = []
data_body = tables[3].find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:  # only every second cell is used
        sublist.append(td.text)
    data.append(sublist)
for element in zip(countries, data):
    print(element)
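The zip(countries, data) loop above yields (name, list of cell texts) pairs rather than the single string per row that the question asks for. One possible way to flatten them is sketched below, using sample values taken from the output quoted in this thread; format_row is a hypothetical helper name, not part of bs4.

```python
def format_row(country, values):
    """Join a country name and its cell values into one comma-separated string."""
    return ", ".join([country.strip()] + [value.strip() for value in values])


# Sample values for illustration; in the real script these lists
# would come from the scraping code.
countries = ["Canada", "Brazil"]
data = [["6", "5", "2", "1"], ["7", "5", "2", "1"]]

lines = [format_row(country, values) for country, values in zip(countries, data)]
for line in lines:
    print(line)
```

The strip() calls are there because cell text taken from real markup often carries surrounding whitespace.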
Hope this helps :)
This uses bs4, but it should work all the same:
from bs4 import BeautifulSoup as bsoup
with open("htmlsample.html") as ofile:
    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = bsoup(ofile, "html.parser")
tables = soup.find_all("tbody")
storeTable = tables[0].find_all("tr")
storeValueRows = tables[2].find_all("tr")
# Row labels from the first table
storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())
# Values from the second table; only every second cell is used
storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))
for record in zip(storeRank, storeMatrix):
    print(" ".join(record))
The code above prints:
# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]
If you change storeTable to use tables[1] and storeValueRows to use tables[3], the output becomes:
Country
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]
The code isn't perfect and there is room for improvement, but the logic is sound.
Hope this helps.
Edit:
If you want the format South Korea, 46, 27, 2, 5 rather than South Korea 46, 27, 2, 5 (note the , after the country name), just change this:
storeRank.append(row.get_text().strip())
to this:
storeRank.append(row.get_text().strip() + ",")
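For anyone who wants to try the pairing logic without the original page, here is a minimal self-contained sketch. SAMPLE_HTML is a made-up, heavily simplified stand-in for the real markup (one tbody per table, no header rows), so the table indexes differ from the code above; only the idea of zipping name rows with every second data cell is the same.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only: one table with the names,
# one with the numbers, and an empty spacer <td> between the values.
SAMPLE_HTML = """
<div>
  <div><table><tbody>
    <tr><td><div><a>Canada</a></div></td></tr>
    <tr><td><div><a>Bulgaria</a></div></td></tr>
  </tbody></table></div>
  <div><div><table><tbody>
    <tr><td>6</td><td></td><td>5</td><td></td><td>2</td><td></td><td>1</td></tr>
    <tr><td>10</td><td></td><td>6</td><td></td><td>2</td><td></td><td>-</td></tr>
  </tbody></table></div></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
name_table, data_table = soup.find_all("table")

# One name per row of the first table
names = [tr.get_text(strip=True) for tr in name_table.find_all("tr")]

# Every second cell of each row of the second table
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")[::2]]
    for tr in data_table.find_all("tr")
]

# Join name and values with ", " so each row becomes a single string
records = [", ".join([name] + cells) for name, cells in zip(names, rows)]
for record in records:
    print(record)
```

Because the name is joined with the same ", " separator as the values, no trailing comma needs to be appended to it.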