Extracting the content inside a tag with BeautifulSoup

50 votes

4 answers

111632 views

Asked 2025-04-16 17:37

I want to extract the content Hello world. Note that the page contains multiple <table> elements and similar <td colspan="2"> tags:

<table border="0" cellspacing="2" width="800">
  <tr>
    <td colspan="2"><b>Name: </b>Hello world</td>
  </tr>
  <tr>
...

I tried the following:

hello = soup.find(text='Name: ')
hello.findPreviousSiblings

But it returned nothing.

I am also having trouble extracting My home address:

<td><b>Address:</b></td>

<td>My home address</td>

I am using the same method to search for text="Address: ", but how do I move down to the next line and extract the content of the <td>?

Tags: text parsing, beautifulsoup, web parsing, data scraping, HTML content, tag extraction, web content processing

4 Answers

The code below extracts the text content of an HTML tag with Python's BeautifulSoup library.

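from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

s = '<td>Example information</td>'  # your raw HTML
soup = BeautifulSoup(s, 'html.parser')  # parse the HTML with BeautifulSoup
td = soup.find('td')  # the tag of interest: <td>Example information</td>
td.text  # 'Example information' (the text with the markup stripped)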

Answered 2025-04-16 by Python大师


Use next instead:

>>> s = '<table border="0" cellspacing="2" width="800"><tr><td colspan="2"><b>Name: </b>Hello world</td></tr><tr>'
>>> soup = BeautifulSoup(s)
>>> hello = soup.find(text='Name: ')
>>> hello.next
u'Hello world'

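next and previous let you move through the document's elements in the order the parser processed them, whereas the sibling methods work on the parse tree.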
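
A minimal sketch of that distinction, assuming the bs4 package (BeautifulSoup 4) with the built-in html.parser, where the attribute is spelled next_element. The two rows are copied from the question; wrapping them in a single table is an assumption made for illustration:

```python
from bs4 import BeautifulSoup

html = (
    '<table><tr><td colspan="2"><b>Name: </b>Hello world</td></tr>'
    '<tr><td><b>Address:</b></td><td>My home address</td></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Document order: the node parsed right after the 'Name: ' string
# is the text that follows the closing </b>.
label = soup.find(string='Name: ')
name = label.next_element

# Parse tree: the 'Name: ' string has no siblings inside <b>, but the
# <b> tag itself is followed by a sibling text node in the same <td>.
name_via_sibling = label.parent.next_sibling

# For the address, climb from <b> to its <td>, then take the next <td>.
address_label = soup.find('b', string='Address:')
address = address_label.find_parent('td').find_next_sibling('td').get_text()
```

The search string has to match the text node exactly: the Name label ends with a space inside <b>, while the Address label in the question's markup does not, so searching for "Address: " with a trailing space would find nothing.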

Answered 2025-04-16 by Python大师


The contents attribute works well for extracting text from <tag>text</tag>. It returns a list of the tag's children.

For example, take <td>My home address</td>:

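from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

s = '<td>My home address</td>'
soup = BeautifulSoup(s, 'html.parser')
td = soup.find('td')  # <td>My home address</td>
td.contents  # ['My home address'] (a list of child nodes)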

As another example, take <td><b>Address:</b></td>:

s = '<td><b>Address:</b></td>'
soup = BeautifulSoup(s, 'html.parser')
b = soup.find('td').find('b')  # <b>Address:</b>
b.contents  # ['Address:'] (a list of child nodes)
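
Because contents returns a list of child nodes rather than a bare string, it may help to compare it with .string and get_text(). A short sketch on the first cell from the question, assuming bs4 and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td colspan="2"><b>Name: </b>Hello world</td>', 'html.parser')
td = soup.find('td')

children = td.contents       # two children: the <b> tag and a text node
last_text = td.contents[-1]  # the trailing text node after </b>
only_string = td.string      # None, because <td> has more than one child
all_text = td.get_text()     # text of every descendant, concatenated
```

Indexing into contents is handy when the wanted text sits next to a label tag, while get_text() is the simpler choice when the label should be kept as well.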

Answered 2025-04-16 by Python大师
