Scraping a table with BeautifulSoup

3 votes
1 answer
2932 views
Asked 2025-04-16 00:44

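I have a question that I think should be fairly simple. I want to collect the information in the last table on this page (if you scroll all the way down, it is in the box labeled "Procedure"):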

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN

The HTML of the table I want to scrape looks like this:

<tbody><tr class="doc_title">
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top">
<table border="0" cellpadding="3" cellspacing="0" width="50">
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr>

<tr class="contents" valign="top"><td colspan="2">
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0">
<tbody><tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Title</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">References</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">16.2.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee responsible</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">ECON</p>

<p style="">19.10.2009</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">19.10.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">CONT</p>

<p style="">1.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">JURI</p>

<p style="">5.10.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p>

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3">
<p style="">Theodor Dumitru Stolojan</p>

<p style="">21.7.2009</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">10.11.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">1.12.2009</p>
</td>
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">21.1.2010</p>
</td>
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Date adopted</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2">
<p style="">27.1.2010</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2">
<p style="">&nbsp;</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1">
<p style="">&nbsp;</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Result of final vote</span></p>
</td>
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1">
<p style="">+:</p>

<p style="">–:</p>

<p style="">0:</p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6">
<p style="">39</p>

<p style="">0</p>

<p style="">1</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p>
</td>
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1">
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p>
</td>
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7">
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p>
</td>
<td style="" rowspan="1" colspan="1"></td></tr>

<tr style="">
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td>
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td>
<td style="" rowspan="1" colspan="1"></td></tr>
</tbody></table>
</td></tr>
</tbody>

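The problem I am running into is that the table tags have no identifiers (as far as I can tell), so I do not know how to select this table and scrape its contents. I have been using BeautifulSoup to get other information from the site, but I do not know how to handle this table.

If anyone can tell me how to proceed, I would be very grateful!

Best,

Thomas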

1 Answer

3

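With a little ingenuity, you can find elements by their other attributes. I had a go at scraping your data; this may not be the best approach, but it gets close to the goal.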

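I noticed that the data you want comes after the second occurrence of the word "PROCEDURE" (the first is a link, the second is the heading). So I split the page at that point: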

data = html.split("PROCEDURE", 2)[2]
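To make the split concrete, here is a minimal sketch on a made-up string. The `html` value below is a hypothetical stand-in for the downloaded page, not the real markup; it only mimics the "link first, heading second" pattern described above.

```python
# Hypothetical stand-in for the downloaded page: the word appears twice,
# first inside a navigation link and then in the heading above the table.
html = (
    '<a href="#title3">PROCEDURE</a>'
    '<span style="font-weight: bold;">PROCEDURE</span>'
    '<table><tr><td rowspan="1">Title</td></tr></table>'
)

# maxsplit=2 produces at most three pieces; index 2 holds everything
# that follows the second occurrence of the word.
parts = html.split("PROCEDURE", 2)
data = parts[2]
print(data)
```

If the word occurs fewer than two times, `parts` has fewer than three items and `parts[2]` raises `IndexError`, so checking `len(parts)` first is probably worthwhile when looping over many pages.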

Then I looked for <td> elements with rowspan=1:

import BeautifulSoup  # BeautifulSoup 3: the module and the class share a name

bs = BeautifulSoup.BeautifulSoup(data)
tds = bs.findAll("td", { "rowspan": 1 })

Getting closer...

>>> tds[0].text
u'Title'
>>> tds[1].text
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures'
>>> tds[3].text
u'References'
>>> tds[4].text
u'COM(2009)00282009/0007(CNS)2009 a>'
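For anyone on the newer `bs4` package rather than BeautifulSoup 3, roughly the same filter might look like the sketch below. This is an assumption about how the approach carries over, not what was run above: the sample markup is a trimmed, hypothetical version of the table, and `find_all` and `get_text` are the `bs4` spellings of `findAll` and `.text`.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical excerpt shaped like the table in the question.
data = """
<tr class="contents"><td colspan="2">
<table>
<tr><td rowspan="1" colspan="1"><p><span>Title</span></p></td>
<td rowspan="1" colspan="7"><p>Mutual assistance for the recovery of claims</p></td>
<td rowspan="1" colspan="1"></td></tr>
</table>
</td></tr>
"""

soup = BeautifulSoup(data, "html.parser")

# Attribute values are strings in the parsed tree, so match "1", not 1.
# The outer <td colspan="2"> has no rowspan and is therefore skipped.
tds = soup.find_all("td", attrs={"rowspan": "1"})

# strip=True trims surrounding whitespace from each text fragment.
texts = [td.get_text(strip=True) for td in tds]
print(texts)
```

The last entry in `texts` is an empty string, which corresponds to the blank cell at the end of each row.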

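Note that I skipped index 2 in tds because the page uses a blank element there (it is simply empty). Still, it is a start. I have found that the real trick with BeautifulSoup is to give it only the region of data you know you are looking for, which cuts down on what you have to wade through. It is also good at handling poorly formed input, so do not be afraid to feed it messy data.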
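Another way to narrow the region, sketched here with `bs4`, is to start from the heading and walk to the row that follows it instead of splitting the raw string. The markup is a hypothetical reduction of the excerpt in the question, keeping only the `doc_title` and `contents` row classes; that the heading is the only `<span>` whose text is exactly "PROCEDURE" is an assumption that should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the page: a heading row followed by a
# contents row that wraps the table we actually want.
html = """
<table>
<tr class="doc_title"><td><span style="font-weight: bold;">PROCEDURE</span></td></tr>
<tr class="contents"><td colspan="2">
<table>
<tr><td>Title</td><td>Mutual assistance</td></tr>
</table>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading span by its exact text (find returns None if absent).
heading = soup.find("span", string="PROCEDURE")

# Climb to the heading's row, then step to the next row at the same level.
title_row = heading.find_parent("tr")
contents_row = title_row.find_next_sibling("tr")

# The first table inside that row is the one holding the data.
inner = contents_row.find("table")
print(inner.find("td").get_text(strip=True))
```

On a real page each of these lookups can come back as `None`, so guarding every step before chaining the next one is likely necessary.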

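I explored further down the list of elements, but the results were not perfect. You will need to refine the search further, because the page nests additional <td> elements inside <td> elements to hold the values.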
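One possible refinement, again assuming `bs4`, is to go row by row and treat the first non-empty cell as the label and the remaining non-empty cells as values, reading each `<p>` separately. The helper name `cell_lines` and the sample rows are made up for illustration; the rows imitate two rows from the question's table.

```python
from bs4 import BeautifulSoup

# Two hypothetical rows modeled on the table in the question.
table_html = """
<table>
<tr>
<td rowspan="1" colspan="1"><p><span>Committee responsible</span></p>
<p>&nbsp;&nbsp;Date announced in plenary</p></td>
<td rowspan="1" colspan="7"><p>ECON</p>
<p>19.10.2009</p></td>
<td rowspan="1" colspan="1"></td>
</tr>
<tr>
<td rowspan="1" colspan="1"><p><span>Date adopted</span></p></td>
<td rowspan="1" colspan="2"><p>27.1.2010</p></td>
<td rowspan="1" colspan="2"><p>&nbsp;</p></td>
<td rowspan="1" colspan="1"></td>
</tr>
</table>
"""


def cell_lines(td):
    """Return the non-empty paragraph texts of one cell (hypothetical helper)."""
    lines = [p.get_text(strip=True) for p in td.find_all("p")]
    return [line for line in lines if line]


soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # Drop cells that are empty or hold only a non-breaking space.
    cells = [cell_lines(td) for td in tr.find_all("td")]
    cells = [cell for cell in cells if cell]
    if cells:
        # First cell: label lines. Remaining cells: one list of lines each.
        rows.append((cells[0], cells[1:]))

for labels, values in rows:
    print(labels, values)
```

Keeping the paragraphs as separate list items preserves the pairing between a label such as "Date announced in plenary" and the date in the same position of the value cell.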
