Handling invalid HTML documents with BeautifulSoup

0 votes
2 answers
1413 views
Asked 2025-04-18 04:37

I am trying to parse this document: http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm. I want to extract everything that appears before "Commission:".

(I need to use BeautifulSoup because the second step is to extract the countries and the names of the people.)

If I do this:

import urllib
import re
from bs4 import BeautifulSoup
url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))
print soup.find_all(text=re.compile("Commission"))

the only result I get is:

[u'The Governments of the Member States and the European Commission were represented as follows:']

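This is the first occurrence of the word, but not the line I am looking for. I suspect this is because the document is malformed, but I am not sure. If I look at the page source, I see: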

<B><U><P>Commission</B></U>:</P>

But if I print soup, I can see the text, with the tags reordered:

<u><b>Commission</b></u>

How can I get this element, "Commission:"?

I am using Python 2.7 and BeautifulSoup 4.3.2.


Edit: solved!

Following alecxe's suggestion, I replaced this line:

soup=BeautifulSoup(urllib.urlopen(url))

with

BeautifulSoup(urllib.urlopen(url), 'html.parser')

and now it works :). Thanks, everyone.
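
A minimal offline sketch of why naming the parser matters. When no parser is named, BeautifulSoup picks one from whatever is installed (lxml, html5lib, or the built-in html.parser), and each repairs invalid markup differently. The `html` string below is a made-up stand-in for the page that only reuses the mis-nested tags quoted above, and the example assumes Python 3 with a recent bs4 rather than the Python 2.7 setup in the question:

```python
import re
from bs4 import BeautifulSoup

# Made-up stand-in for the page: one normal paragraph plus the
# mis-nested markup quoted in the question.
html = (
    "<p>The Governments of the Member States and the European "
    "Commission were represented as follows:</p>"
    "<B><U><P>Commission</B></U>:</P>"
)

# Name the parser explicitly so the result does not depend on
# which optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html, "html.parser")

# string= needs bs4 4.4.0 or later; older releases call it text=.
matches = soup.find_all(string=re.compile("Commission"))
for m in matches:
    print(repr(m))
```

With the parser fixed this way, the "Commission" heading text shows up among the matches next to the sentence from the first paragraph.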


Edit: similar questions

I found similar questions whose solution is the same:

Beautiful Soup 4 find_all doesn't find links that Beautiful Soup 3 finds

Beautiful Soup findAll doesn't find them all

2 Answers

1

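If you only want everything before the "Commission:" label, you don't actually need BeautifulSoup. You can treat the page as an ordinary string, search for the keyword you want, and discard the rest.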
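
A rough sketch of that string-only approach, using only the standard library. The `raw_html` sample, the `marker` value, and the helper name `text_before` are all made up for illustration. The marker includes the opening tag because the bare word "Commission" already appears in the first paragraph:

```python
# Made-up stand-in for the raw page source; the real document is longer.
raw_html = (
    "<P>The Governments of the Member States and the European "
    "Commission were represented as follows:</P>\n"
    "<P>Belgium:</P>\n"
    "<B><U><P>Commission</B></U>:</P>\n"
    "<P>Text after the marker</P>\n"
)

# The bare word "Commission" also occurs in the first paragraph, so
# search for the tag-prefixed form that starts the heading instead.
marker = "<P>Commission"


def text_before(source, marker):
    """Return everything that precedes the first occurrence of marker."""
    head, found, _tail = source.partition(marker)
    if not found:
        raise ValueError("marker not found: %r" % marker)
    return head


before = text_before(raw_html, marker)
print(before)
```

Note that the slice ends with the dangling `<B><U>` opening tags, so any later parsing step has to tolerate them.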

However, when I run your code, I get the following result:

[u'The Governments of the Member States and the European Commission were represe
nted as follows:', u'Commission', u'The Council held an orientation debate on ke
y economic policy issues with a view to giving guidance to the Commission on the
 questions Ministers wish to be addressed in the broad economic policy guideline
s 1998/99 for which the Commission will present its recommandation later in the
Spring. It was noted that the forthcoming guidelines are of particular importanc
e given the start of stage 3 of EMU.', u'The debate was based on an assessment o
f the economic situation and outlook in the Community carried out by the Commiss
ion and the Economic Policy and Monetary Committees.', u"The Council held an ori
entation debate on the Commission's Communication setting out a possible Communi
ty framework allowing Member States to experiment with reduced VAT rates for lab
our-intensive services in order to boost employment in small businesses without
distorting international competition. ", u'This Communication was tabled by the
Commission as a follow-up to the Employment European Council of last November in
 Luxembourg, which concluded that, in order to make the taxation system more emp
loyment-friendly, "Member States will examine, without obligation, the advisabil
ity of reducing the rate of VAT on labour-intensive services not exposed to cros
s-border competition".', u"In conclusion, the Council invited Coreper to examine
 the technical questions arising from today's debate and to report back to it wi
th a view to deciding on a possible request to the Commission to submit a propos
al in this area. ", u"This technical examination should be carried out, taking i
nto account the criteria indicated in the Commission's Communication for a reduc
ed VAT rate, on the following questions :", u'An initial trial period running un
til the year 2002 should identify the best method for allocating FISIM. At the e
nd of this period, the Commission will assess the results of the trial period an
d decide, by means of a comitology procedure, on the final methodology to be app
lied. However, a unanimous decision by the Council would be needed in order to u
se the new methodology in budgetary calculations on other Community policies and
 notably concerning "own resources".']

0

Loop over all the p elements until you find one whose text starts with Commission:

import urllib
from bs4 import BeautifulSoup

url="http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/5923en8.htm"
soup=BeautifulSoup(urllib.urlopen(url))

for item in soup.find_all('p'):
    if item.text.startswith('Commission'):
        break
    else:
        print item.text

It prints everything before Commission:

The Governments of the Member States and the European Commission were represented as follows:
Belgium:
...
Ms Helen LIDDELL            Economic Secretary to the Treasury
* * *
