Beautiful Soup's extract() gives unexpected results

Asked 2025-04-15 11:33

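I'm developing some screen-scraping software and have run into a problem with Beautiful Soup. I'm using Python 2.4.3 and Beautiful Soup 3.0.7a.

I need to remove an <hr> tag, but the tag can carry many different attributes, so a simple replace() won't do.

Given the following HTML: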

<h1>foo</h1>
<h2><hr/>bar</h2>

and the following code:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(string)  # string holds the HTML above

# Remove every <hr> tag, whatever its attributes
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i
    print i.string

The output is:

<h1>foo</h1>
foo
<h2>bar</h2>
None

Am I misunderstanding the extract function, or is this a bug in Beautiful Soup?


2 Answers

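I ran into the same problem.
I don't know the exact cause, but my guess is that it has to do with empty elements created by BS (Beautiful Soup).

For example, given the following code: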

from bs4 import BeautifulSoup

html ='            \
<a>                \
    <b test="help">            \
        hello there!  \
        <d>        \
        now what?  \
        </d>    \
        <e>        \
            <f>        \
            </f>    \
        </e>    \
    </b>        \
    <c>            \
    </c>        \
</a>            \
'

soup = BeautifulSoup(html,'lxml')
#print(soup.find('b').attrs)

print(soup.find('b').contents)

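# findAll() with no arguments returns every tag nested under <b>: <d>, <e> and <f>
t = soup.find('b').findAll()
#t.reverse()
for c in t:
    gb = c.extract()

print(soup.find('b').contents)

soup.find('b').text.strip()  # this call raises the error shown below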

I get the following error:

AttributeError: 'NoneType' object has no attribute 'next_element'

The first print gives:

>>> print(soup.find('b').contents)
[u' ', <d> </d>, u' ', <e> <f> </f> </e>, u' ']

and the second print gives:

>>> print(soup.find('b').contents)
[u' ', u' ', u' ']

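I'm fairly sure the empty element in the middle is what causes the problem.

One workaround I found is to re-create the "soup" from its own serialized output:

soup = BeautifulSoup(str(soup), 'lxml')
soup.find('b').text.strip()

Now it prints: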

>>> soup.find('b').text.strip()
u'hello there!'
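
A possible alternative to re-parsing, sketched under the assumption that the trouble comes from extracting tags (such as <f>) whose parent has already been detached: extract only the direct children of <b>, so nested tags leave together with their parent. This uses bs4's find_all(recursive=False) and the built-in html.parser. The one-line HTML string is a condensed stand-in for the markup above, and I have not checked it against the bs4 version that produced the error.

```python
from bs4 import BeautifulSoup

# Condensed stand-in for the markup used above (an assumption for brevity).
html = "<a><b test='help'> hello there! <d> now what? </d> <e> <f> </f> </e> </b> <c> </c></a>"
soup = BeautifulSoup(html, "html.parser")

b = soup.find("b")

# recursive=False limits the search to direct children (<d> and <e>),
# so <f> is never extracted separately from its parent <e>.
for child in b.find_all(recursive=False):
    child.extract()

# get_text(strip=True) strips each text node and skips the empty ones.
remaining_text = b.get_text(strip=True)
remaining_tags = b.find_all()
print(remaining_text)
```

If the nested tags need to be processed individually, this approach does not fit, and re-parsing as shown above may be the simpler route.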

Hope this helps.

Answered 2025-04-15 by Python大师


This may be a bug. Fortunately, there are other ways to get at the string:

from BeautifulSoup import BeautifulSoup

string = \
"""<h1>foo</h1>
<h2><hr/>bar</h2>"""

soup = BeautifulSoup(string)

# Remove every <hr> tag with a plain loop
for tag in soup.findAll('hr'):
    tag.extract()

for i in soup.findAll(['h1', 'h2']):
    print i, i.next

# <h1>foo</h1> foo
# <h2>bar</h2> bar
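
For readers on bs4 rather than Beautiful Soup 3, here is a minimal sketch of the same cleanup. As far as I know, bs4 computes .string from the tag's current contents, so it should reflect the removal. get_text() is shown as well, since it does not depend on the tag having exactly one child. The variable names and the choice of html.parser are my own assumptions, not part of the question.

```python
from bs4 import BeautifulSoup

html = "<h1>foo</h1>\n<h2><hr/>bar</h2>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <hr>, whatever attributes it carries.
# decompose() detaches the tag and discards it; extract() would return it instead.
for tag in soup.find_all("hr"):
    tag.decompose()

# Collect (tag name, .string, get_text()) for each heading.
headings = [(h.name, h.string, h.get_text()) for h in soup.find_all(["h1", "h2"])]
for name, string, text in headings:
    print(name, string, text)
```

Use extract() instead of decompose() if the removed tags are still needed afterwards.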

Answered 2025-04-15 by Python大师

