BeautifulSoup parsing problem


Asked 2025-04-17 00:27

<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>

I want to extract the text that starts with "Mr. Paul". On some pages there is a <p> tag in front of "Mr. Paul", so I can find it with findNext('p').

However, some pages have no such tag, as in the example above.

This is my code for the case where the tag is present:

background = bs2.find(text=re.compile("BACKGROUND"))
bb= background.findNext('p').contents

But how do I extract the information when there is no tag?


2 Answers

"Some web pages have a <p> tag in front of Mr. Paul, so I can use findNext('p') to find it. However, some web pages have no tag, like the example above."

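The information you've given isn't enough to tell how your string can be identified. It could be by:

A fixed node structure, for example something like getChildren()[1].getChildren()[0].text.
A magic string such as 'BACKGROUND' that precedes it, as in your code. In that case your approach of finding the next node looks fine, as long as you don't assume the tag name is always 'p'.
A regular expression, for example "(Mr.|Ms.) ...".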
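
The regular-expression option could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name find_honorific_lines, the exact pattern, and the rule that the text runs up to the next "<" or line break are not from the answer. The dots after the honorifics are escaped here so that they match a literal period.

```python
import re

# Assumed pattern (not from the original answer): an honorific, then
# everything up to the next "<" or line break.
HONORIFIC_LINE = re.compile(r"\b(?:Mr|Ms)\.\s+[^<\n]+")


def find_honorific_lines(html):
    """Return every stretch of text that starts with 'Mr.' or 'Ms.'."""
    return [match.group(0).strip() for match in HONORIFIC_LINE.finditer(html)]


# The snippet from the question, including its stray closing </span>.
sample = '''<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>'''

print(find_honorific_lines(sample))
```

A regex over raw HTML is brittle (it will also match honorifics inside attributes or scripts), so it is probably best kept as a fallback when the page structure gives nothing to anchor on.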

Can you show us an HTML example where the name has no tag in front of it?

Answered 2025-04-17 by Python大师


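It's a little hard to tell from the example you gave, but it looks like you can simply take the next node after the <h2> tag. In this example, Lewis Carroll has a <p> paragraph tag, while your friend Paul has only a closing </span> tag: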

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla

Following up on the comments:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
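
The transcripts above use Python 2 and the old BeautifulSoup 3 package. For readers on Python 3, the same idea might be sketched with bs4 as follows. This is an assumption-laden sketch rather than the answerer's code: the function name background_texts is made up for illustration, the built-in "html.parser" backend is chosen arbitrarily, and the loop walks the heading's siblings instead of using parent.next.next, because in bs4 find_all('h2', string=...) returns the <h2> tags themselves rather than their text nodes.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

HTML = '''
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
'''


def background_texts(html):
    """Collect the first text after each BACKGROUND heading,
    whether it sits in a <p> or is a bare text node."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for heading in soup.find_all("h2", string="BACKGROUND"):
        for sibling in heading.next_siblings:
            if isinstance(sibling, NavigableString):
                # Bare text directly after the heading (the "Paul" case).
                text = sibling.strip()
            elif isinstance(sibling, Tag) and sibling.name == "p":
                # Text wrapped in a paragraph (the "Lewis Carroll" case).
                text = sibling.get_text(strip=True)
            else:
                # Some other tag (e.g. the <div>) came first: give up.
                break
            if text:
                results.append(text)
                break
    return results


print(background_texts(HTML))
```

Stopping at the first non-<p> tag keeps a heading without text from picking up a paragraph that belongs to a later section, which findNext('p') alone would do.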

Answered 2025-04-17 by Python大师

