Scraping multiple paragraphs with BeautifulSoup

10 votes
3 answers
8003 views
Asked on 2025-04-17 07:25

I'm trying to scrape the text of a speech from a website using BeautifulSoup. I'm running into trouble, though, because the speech is split across many different paragraphs, and as a programming newcomer I'm not sure how to handle them. The page's HTML looks roughly like this:

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is    
at war; our economy is in recession; and the civilized world faces unprecedented dangers. 
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, 
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and  
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, 
saved a people from starvation, and freed a country from brutal oppression. 
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied 
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to 
sacrifice their lives are running for their own.

After that there are many more paragraph tags with content like this. I want to extract all of the text inside them.

I've tried several different approaches to get at the text, but none of them have worked.

The first approach I tried was:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string

The result was:

Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

That gave me the text up to the first paragraph tag. Then I tried:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
    paragraph = section.findNext('p')
    if paragraph and paragraph.string:
        print '>', paragraph.string
    else:
        print '>', section.parent.next.next.strip()

This time I got the text between the first paragraph tag and the second one. So what I'm looking for is a way to get the entire text, rather than just pieces of it.
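For context on why .string keeps stopping short: the page's <p> tags are never closed, so tree-based navigation sees a malformed nesting. An event-driven parser that simply collects every text node between the opening and closing span sidesteps the nesting entirely. A minimal sketch using only the standard library's html.parser (not BeautifulSoup), on a made-up snippet shaped like the page above:

```python
from html.parser import HTMLParser

class SpanTextParser(HTMLParser):
    """Collect every text node inside <span class="displaytext">,
    ignoring the (possibly unclosed) <p> tags entirely."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "displaytext") in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)

html = ('<span class="displaytext">First part.'
        '<p>Second paragraph. <p>Third paragraph.</span>')
parser = SpanTextParser()
parser.feed(html)
print(" ".join(p.strip() for p in parser.parts))
# -> First part. Second paragraph. Third paragraph.
```

Because html.parser only fires start-tag, end-tag, and data events, the unclosed <p> tags never confuse it; everything between the span boundaries is captured.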


3 Answers

0

You could try:

soup.span.renderContents()
2

Here's how you could do it with lxml:

import lxml.html as lh

tree = lh.parse('http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW')

text = tree.xpath("//span[@class='displaytext']")[0].text_content()

Also, the answers to this question cover how to achieve the same thing with BeautifulSoup: BeautifulSoup - easy way to obtain HTML-free contents

The helper function from the accepted answer:

def textOf(soup):
    return u''.join(soup.findAll(text=True))
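Both lxml's text_content() and the textOf helper above do the same thing: walk the subtree and concatenate every text node. For well-formed markup, the standard library's ElementTree expresses the same idea via itertext(); a minimal sketch on a hypothetical, well-formed version of the markup (note that ElementTree, unlike lxml's forgiving HTML parser, cannot cope with the page's actual unclosed <p> tags):

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the speech markup (hypothetical snippet;
# the real page's <p> tags are unclosed, which ElementTree rejects).
markup = ('<span class="displaytext">Opening.'
          '<p>Paragraph one.</p><p>Paragraph two.</p></span>')
span = ET.fromstring(markup)

# itertext() walks the subtree and yields every text node in order,
# which is exactly what text_content() / textOf() do.
text = "".join(span.itertext())
print(text)  # -> Opening.Paragraph one.Paragraph two.
```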
8

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())

span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x.contents[0] for x in span.findAllNext("p")]  # this gives you the rest
# use .contents[0] instead of .string to deal with last para that's not well formed

print "%s\n\n%s" % (span.string, "\n\n".join(paras))

As mentioned in the comments, the above doesn't work well if the <p> tags contain further nested tags. That case can be handled with:

paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]

However, that still doesn't work well for the last <p>, which has no closing tag. A hacky workaround would be to treat it differently. For example:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"})  
paras = [x for x in span.findAllNext("p")]

start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s" % (start, middle, last)
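The start/middle/last special-casing above is needed only because the page never closes its <p> tags. An alternative sketch, swapping in the standard library's html.parser for BeautifulSoup: treat every <p> start tag as a paragraph boundary, so the last paragraph needs no special handling at all (the HTML snippet here is made up, shaped like the page's):

```python
from html.parser import HTMLParser

class ParagraphSplitter(HTMLParser):
    """Split the text inside <span class="displaytext"> into paragraphs,
    starting a new paragraph at each (possibly unclosed) <p> tag."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "displaytext") in attrs:
            self.inside = True
            self.paragraphs.append([])   # text before the first <p>
        elif tag == "p" and self.inside:
            self.paragraphs.append([])   # each <p> starts a new paragraph

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.paragraphs[-1].append(data)

html = ('<span class="displaytext">Opening remarks.'
        '<p>First paragraph. <p>Last paragraph.</span>')
splitter = ParagraphSplitter()
splitter.feed(html)
paras = ["".join(chunk).strip() for chunk in splitter.paragraphs]
print("\n\n".join(paras))
```

Since paragraph boundaries are defined by start tags alone, the unclosed final <p> falls out naturally instead of needing its own code path.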
