Extracting the text between line breaks (e.g. <br /> tags) with BeautifulSoup
I have the following HTML, which is part of a larger document:
<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />
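I am already using BeautifulSoup to pull other elements out of the HTML, but I have not found a way to extract the important text between the <br /> tags. I can locate each <br /> element, but I cannot get at the text between them. Any help would be greatly appreciated. Thanks.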
4 Answers
3
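This is a small improvement on Ken Kinder's answer. You can read the stripped_strings attribute of a BeautifulSoup element directly. For example, suppose your HTML fragment sits inside a span tag: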
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
First, parse x with BeautifulSoup. Then find the element, in this case the span, and read its stripped_strings attribute, like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(x, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
span = soup.find("span")
text = list(span.stripped_strings)
Now print(text) outputs the following:
['Important Text 1',
'Not Important Text',
'Important Text 2',
'Important Text 3',
'Non Important Text',
'Important Text 4']
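The question says the fragment is part of a larger document and does not mention a wrapping span, so here is a minimal sketch of the same idea without the wrapper. It assumes bs4 is installed and uses the standard-library "html.parser"; the variable names html and texts and the shortened fragment are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question, with no wrapping element.
html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings is also available on the soup object itself; it yields every
# string with surrounding whitespace removed and skips whitespace-only strings.
texts = list(soup.stripped_strings)
print(texts)
```

In a real page you would probably call stripped_strings on the closest enclosing element instead of the whole soup, so that unrelated text elsewhere in the document is not picked up.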
8
Well, for the sake of testing, let's assume this HTML is inside a span tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
Now I'll parse it and find my span tag:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)
y = soup.find('span')
If you loop over the generator returned by y.childGenerator(), you get all of the line breaks and the text:
In [4]: for a in y.childGenerator(): print type(a), str(a)
....:
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 1
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Not Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 2
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 3
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Non Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 4
<type 'instance'> <br />
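The session above uses BeautifulSoup 3 on Python 2. As a rough sketch of the same walk with bs4 on Python 3, the .children iterator can be used in place of childGenerator(). The fragment is shortened, and the kinds list is a hypothetical helper added here only to make the result easy to inspect.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened version of the span-wrapped fragment used in this answer.
html = "<span><br />\nImportant Text 1\n<br />\n<br />\nNot Important Text\n<br /></span>"

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span")

# Record each direct child as either a tag (by name) or a text node (stripped).
kinds = []
for child in span.children:
    if isinstance(child, Tag):
        kinds.append(("tag", child.name))
    elif isinstance(child, NavigableString):
        kinds.append(("text", str(child).strip()))

for kind, value in kinds:
    print(kind, repr(value))
```

As in the original output, the whitespace between two adjacent <br /> tags shows up as its own text node, here as an empty string after stripping, so it has to be filtered out if only real text is wanted.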
35
If you just want any text that sits between two <br /> tags, you could do something like this:
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''
soup = BeautifulSoup(input)
for br in soup.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s
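But perhaps I have misunderstood your question? Your description of the problem doesn't seem to match the "important" / "not important" labels in your example data, so I went with the description ;)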
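For readers on Python 3, the sibling walk in this answer might look like the following with bs4. This is a sketch under the assumption that bs4 is installed; find_all and next_sibling are the bs4 spellings of findAll and nextSibling, and the found list and shortened fragment are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""

soup = BeautifulSoup(html, "html.parser")

found = []
for br in soup.find_all("br"):
    text_node = br.next_sibling
    # Skip <br /> tags that are not directly followed by a text node
    # (this also covers the last <br />, whose next sibling is None).
    if not isinstance(text_node, NavigableString):
        continue
    after = text_node.next_sibling
    # Keep the text only when the node after it is another <br />.
    if isinstance(after, Tag) and after.name == "br":
        text = str(text_node).strip()
        if text:  # ignore the bare newline between two adjacent <br /> tags
            found.append(text)

print(found)
```

Like the original, this collects every non-empty piece of text enclosed by two <br /> tags and makes no attempt to tell "important" text from the rest.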