Extracting text between line breaks (e.g. <br /> tags) with BeautifulSoup

22 votes

4 answers

42393 views

Asked 2025-04-16 13:31

I have the following HTML, which is part of a larger document.

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

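I am already using BeautifulSoup to extract other elements from the HTML, but I have not found a way to get the important text between the <br /> tags. I can locate each <br /> element, but not the text between them. Any help would be greatly appreciated. Thanks.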

4 Answers

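This is a small improvement on Ken Kinder's answer. You can access the stripped_strings attribute of a BeautifulSoup element directly. For example, suppose your HTML fragment sits inside a span tag: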


x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

First, parse x with BeautifulSoup. Then find the element (a span in this example) and access its stripped_strings attribute, like this:

from bs4 import BeautifulSoup

# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")
text = list(span.stripped_strings)

Now print(text) outputs the following:

['Important Text 1',
 'Not Important Text',
 'Important Text 2',
 'Important Text 3',
 'Non Important Text',
 'Important Text 4']
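
The fragment in the question has no wrapping element, so as a minimal sketch (assuming bs4 with the built-in "html.parser", and using a shortened copy of the sample; the variable names are illustrative only), stripped_strings can also be read from the parsed document object itself:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's fragment, with no wrapping <span>.
fragment = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""

soup = BeautifulSoup(fragment, "html.parser")

# The BeautifulSoup object behaves like a tag, so it exposes
# stripped_strings too; whitespace-only strings are skipped.
texts = list(soup.stripped_strings)
print(texts)
```

In a larger document this would return every string on the page, so narrowing to the enclosing element first, as in the answer above, is usually the safer choice.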

Answered 2025-04-16 by Python大师


OK, for testing purposes, let's assume this HTML is inside a span tag:

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

Now I'll parse it and find my span tag:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)
soup = BeautifulSoup(x)  # parse the fragment before searching it
y = soup.find('span')

If you loop over the generator returned by y.childGenerator(), you get all the line breaks and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....: 
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 1

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Not Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 2

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 3

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Non Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 4

<type 'instance'> <br />
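
The session above uses BeautifulSoup 3 on Python 2. A rough equivalent for the current bs4 package on Python 3 might look like the sketch below; it assumes the built-in "html.parser" and a shortened copy of the sample, and iterates over .children rather than childGenerator():

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened copy of the sample fragment, wrapped in a <span>.
html = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br /></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span")

texts = []
for child in span.children:
    # Direct children alternate between <br> tags and text nodes;
    # keep only the text nodes that are not pure whitespace.
    if isinstance(child, NavigableString):
        stripped = child.strip()
        if stripped:
            texts.append(stripped)

print(texts)
```

Collecting the stripped text nodes this way gives the same result as stripped_strings for this flat fragment, but leaves room to inspect each child while looping.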

Answered 2025-04-16 by Python大师


If you just want any text that sits between two <br /> tags, you can do this:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag

input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(input)

for br in soup.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s

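Then again, maybe I misunderstood your question? The problem you describe doesn't quite match the "important" and "not important" labels in your sample data, so I went with your description ;)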
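
For readers on the current bs4 package, the same sibling-based check could be sketched roughly as follows. This is a port under assumptions, not the original answer's code: it uses Python 3, the built-in "html.parser", the snake_case names find_all and next_sibling, a shortened copy of the sample, and it collects results in a list (the name found is illustrative) instead of printing them:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened copy of the question's fragment.
html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""

soup = BeautifulSoup(html, "html.parser")

found = []
for br in soup.find_all("br"):
    next_s = br.next_sibling
    # Skip a <br> that is not followed by a text node.
    if not isinstance(next_s, NavigableString):
        continue
    next2_s = next_s.next_sibling
    # Keep the text only if another <br> closes it off.
    if isinstance(next2_s, Tag) and next2_s.name == "br":
        text = next_s.strip()
        if text:
            found.append(text)

print(found)
```

Unlike stripped_strings, this only keeps text that is directly enclosed by two <br /> tags, so text before the first or after the last <br /> is ignored.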

Answered 2025-04-16 by Python大师

