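Getting data from multiple tags and attributes with BeautifulSoup
I want to use BeautifulSoup to extract several elements, selected by their id attributes, from the HTML below:
1) the div with id home_1039509
2) the div with id guest_1039509
3) the element with id odds_3_1039509
4) the element with id gs_1039509
5) the element with id hs_1039509
6) the element with id time_1039509
The HTML is: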
<tr align="center" height="15" id="tr_1039509" bgcolor="#F7F3F7" index="0">
<td width="10">
<img src="images/lclose.gif" onclick="hidematch(0)" style="cursor:pointer;">
</td>
<td width="63" bgcolor="#d15023">
<font color="#ffffff">U18<br>
<span id="t_1039509">14:05</span>
</font>
</td>
<td width="115" style="text-align:left;">
<div id="home_1039509">
<a href="javascript:Team(19195)">U18()</a>
</div>
<div class="oddsAns">
[
<a href="javascript:AsianOdds('1039509')">A</a>
-
<a href="javascript:EuropeOdds(1039509)" target="_self">B</a>
-
</div>
<div id="guest_1039509">
<a href="javascript:Team(11013)">U18</a>
</div>
</td>
<td width="30">
<div id="gs_1039509" class="score">2</div>
<div id="time_1039509">
42
<img src="images/in.gif" border="0">
</div>
<div id="hs_1039509" class="score">1</div></td>
<td width="90" id="odds_1_1039509" title=""></td>
<td width="90" id="odds_4_1039509" title=""></td>
<td width="90" id="odds_3_1039509" title="">
<a class="sb" href="javascript:" onclick="ChangeDetail3(1039509,'3')">0.94</a>
<img src="images/t3.gif">
<br>
<a class="pk" href="javascript:" onclick="ChangeDetail3(1039509,'3')">2.5/3</a>
<br>
0.86
</td>
<td width="90" id="odds_31_1039509" title="nothing"></td>
</tr>
My code so far:
rows = table.findAll("tr", {"id" : re.compile('tr_*\d')})
for tr in rows:
    cols = tr.findAll("span", {"id" : re.compile('t_*\d')}) &
    cols = tr.findAll("div", {"id" : re.compile('home_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('guest_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('odds_3_*\d')}) &
    cols = tr.findAll("span", {"id" : re.compile('hs_*\d')})
    for td in cols:
        t = td.find(text=True)
        if t:
            text = t + ';' # concat
            print text,
    print
2 Answers
1
You can get the list of matching elements like this:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# odds_3_... is a td in the HTML above, so td has to be in the tag list
soup.find_all(["div", "span", "td"], id=re.compile(r'^(?:home|guest|odds_3|gs|hs|time)_\d+$'))
The regular expression above is only an example.
In your case, it could be:
cols = tr.find_all(["div", "span", "td"], id=re.compile(r'^(?:home|guest|odds_3|gs|hs|time)_\d+$'))
for tag in cols:
    # find(string=True) returns only the first text node, which is often
    # just whitespace, so collect every text node instead
    t = tag.find_all(string=True)
    if t:
        # find_all returns a list, so the pieces need to be joined
        text = ''.join(t).strip() + ';'
        print(text)
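When several id prefixes are matched with one regular expression, the prefixes need to sit in a group joined by |. Square brackets define a character class instead, which matches a single character from the set. The following is a minimal sketch of the difference, using only the standard re module and the id values from the question's HTML; the variable names are illustrative assumptions.

```python
import re

# id values that appear in the question's HTML
ids = [
    "tr_1039509", "t_1039509", "home_1039509", "guest_1039509",
    "gs_1039509", "time_1039509", "hs_1039509",
    "odds_1_1039509", "odds_4_1039509", "odds_3_1039509", "odds_31_1039509",
]

# Character class: ONE character out of the set, then "_" and digits.
char_class = re.compile(r"[home|guest|odds_3|gs|hs|time]_\d+")

# Alternation: one whole prefix, anchored to the start and end of the id.
alternation = re.compile(r"^(?:home|guest|odds_3|gs|hs|time)_\d+$")

loose = [i for i in ids if char_class.search(i)]
strict = [i for i in ids if alternation.search(i)]

print(loose)
print(strict)
```

The character-class pattern also accepts t_1039509 and every odds_* id, because a single trailing letter followed by an underscore and digits is enough for it to match. The anchored alternation keeps only the six ids the question asks for.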
3
You can pass a function as the id filter and check whether the id starts with home_, guest_, and so on:
from bs4 import BeautifulSoup
# the filter receives None for tags without an id, hence the "x and" guard
f = lambda x: x and x.startswith(('home_', 'guest_', 'odds_', 'gs_', 'hs_', 'time_'))
with open('test.html') as fh:
    soup = BeautifulSoup(fh, 'html.parser')
print([element.get_text(strip=True) for element in soup.find_all(id=f)])
The output is:
['U18()', 'U18', '2', '42', '1', '', '', '0.942.5/30.86', '']
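Note that startswith() accepts a tuple of strings to check against. The 'odds_' prefix matches all four odds_* cells, which is why the output contains empty strings; use 'odds_3_' to keep only that cell.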
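To get the six fields per row rather than one flat list, the same function-based filter can feed a dictionary keyed by the id prefix. This is only a sketch under some assumptions: the HTML is a trimmed copy of the row from the question, the names PREFIXES, is_wanted and extract_row are made up for illustration, and the built-in html.parser is used.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question (illustrative only).
HTML = """
<table>
<tr id="tr_1039509">
  <td><div id="home_1039509"><a href="#">U18()</a></div>
      <div id="guest_1039509"><a href="#">U18</a></div></td>
  <td><div id="gs_1039509" class="score">2</div>
      <div id="time_1039509"> 42 <img src="images/in.gif"></div>
      <div id="hs_1039509" class="score">1</div></td>
  <td id="odds_3_1039509"><a class="sb">0.94</a><br><a class="pk">2.5/3</a><br> 0.86 </td>
  <td id="odds_31_1039509"></td>
</tr>
</table>
"""

# "odds_3_" (with the trailing underscore) excludes odds_31_...
PREFIXES = ("home_", "guest_", "odds_3_", "gs_", "hs_", "time_")


def is_wanted(id_value):
    """Filter for find_all(id=...); id_value is None for tags without an id."""
    return id_value is not None and id_value.startswith(PREFIXES)


def extract_row(tr):
    """Map each wanted id prefix in one <tr> to its text content."""
    fields = {}
    for element in tr.find_all(id=is_wanted):
        # "odds_3_1039509" -> "odds_3", "home_1039509" -> "home"
        key = element["id"].rsplit("_", 1)[0]
        # A separator keeps "0.94", "2.5/3" and "0.86" from running together.
        fields[key] = element.get_text(" ", strip=True)
    return fields


soup = BeautifulSoup(HTML, "html.parser")
rows = [extract_row(tr) for tr in soup.find_all("tr")]
print(rows)
```

Passing a separator to get_text() is a design choice here: without it, the three values in the odds_3 cell are concatenated into a single string, as in the output shown above.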