Python and Beautiful Soup - search for tag a until tag A is found, then return the following tag b

1 vote
2 answers
2345 views
Asked 2025-04-17 01:34

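I have two variables: the "last volume" number and the "last issue" number.

The HTML I am processing contains a list of all volumes and issues, newest first.

I need to find every link that is newer than the volume and issue I already have.

For example, if my last volume is 13 and my last issue is 1, I need to return the link for Volume 13, Issue 2 as well as the links for Volume 14.

This is tricky for me because the volume numbers are listed in separate elements...

Here is what I have so far: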

HTML:

<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>          
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>

Script snippet:

results = soup.find('ul', attrs={'class' : 'bobby'})

#temp until I get it reading from file
lastVol = '13'
#find the last volume
findlastVol = results.findNext('strong', text= re.compile('Volume ' + lastVol))

#temp until I get it reading from file
lastIss = '2'
#find the last issue
findlastIss = findlastVol.findNext('a', text= re.compile('Issue ' + lastIss))

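I can find the tags for the last volume and issue in the file, but after several attempts I have not managed to search backwards from there and stop at the first issue...

Alternatively, I could start from the top and work downwards until the volume and issue criteria are met...

Can anyone help? Thanks.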

2 Answers

0
from BeautifulSoup import BeautifulSoup
content = '''<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
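soup = BeautifulSoup(content)  # BeautifulSoup 3 API, as imported above
last_vol = 13
last_issue = 1

res = soup.find('ul', {"class": "bobby"})
vol = None  # volume of the most recent <strong> heading seen
for li in res.findAll('li'):
    if li.find('strong') is not None:
        # Heading item: "Volume 14" -> 14
        vol = int(li.contents[0].string[7:])
        continue
    link = li.contents[1]  # contents[0] is the whitespace before <a>
    # Relies on the fixed href layout: characters 33 onwards hold the issue number
    issue = int(link['href'][33:])
    if vol > last_vol or (vol == last_vol and issue > last_issue):
        print("Volume\t:%d" % vol)
        print(link.string)
        print("href\t:%s" % link['href'])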

This gives:

Volume  :14  
Issue 1, September 2011  
href    :/content/ben/cchts/2011/00000014/00000001  
Volume  :13  
Issue 2, December 2010  
href    :/content/ben/cchts/2010/00000013/00000002 
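
The `[33:]` slice above only works while every href has exactly the same prefix length. As a hedged alternative, here is a minimal sketch that pulls the numbers out with a regular expression instead. It assumes the links always end in `/<year>/<volume>/<issue>`, which is inferred from the sample HTML only; `HREF_PATTERN` and `parse_href` are hypothetical names introduced for illustration.

```python
import re

# Assumed URL layout, inferred from the sample links: .../<year>/<volume>/<issue>
HREF_PATTERN = re.compile(r"/(\d{4})/(\d+)/(\d+)/?$")


def parse_href(href):
    """Return (year, volume, issue) as ints, or None if the href does not match."""
    match = HREF_PATTERN.search(href)
    if match is None:
        return None
    # int() drops the zero padding, e.g. "00000013" -> 13
    year, volume, issue = (int(part) for part in match.groups())
    return year, volume, issue


print(parse_href("/content/ben/cchts/2010/00000013/00000002"))
```

With this, the loop could compare the parsed issue against `last_issue` without depending on character offsets. A link that does not follow the assumed layout yields `None` rather than a wrong number.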
1

I think what you are looking for is findPrevious, which you can use like this:

import BeautifulSoup
import re

content='''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>          
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''

last_volume = 13
last_issue = 1

soup = BeautifulSoup.BeautifulSoup(content)
results = soup.find('ul', attrs={'class': 'bobby'})
# In BeautifulSoup 3, findAll with text= returns the matching strings, not the <a> tags
for a_string in results.findAll('a', text=re.compile('Issue')):
    # The nearest preceding "Volume N" string is the volume this issue belongs to
    volume = a_string.findPrevious(text=re.compile('Volume'))
    volume = int(re.search(r'(\d+)', volume).group(1))
    issue = int(re.search(r'(\d+)', a_string).group(1))
    href = a_string.parent['href']  # the string's parent is the <a> tag
    if volume > last_volume or (volume == last_volume and issue > last_issue):
        print(volume, issue, href)

This yields:

(14, 1, u'/content/ben/cchts/2011/00000014/00000001')
(13, 2, u'/content/ben/cchts/2010/00000013/00000002')
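
The question also asks about starting from the top and stopping once the last known volume and issue is reached. Because the list is newest first, that can be sketched with a tuple comparison and `itertools.takewhile`. This is only an illustration: it assumes the `(volume, issue, href)` tuples have already been extracted as in the loop above, and `newer_entries` is a hypothetical helper name.

```python
from itertools import takewhile


def newer_entries(entries, last_volume, last_issue):
    """Return the leading entries that are newer than (last_volume, last_issue).

    entries: iterable of (volume, issue, href) tuples, assumed to be newest first.
    """
    last_seen = (last_volume, last_issue)
    # Tuples compare element by element, so (13, 2) > (13, 1) and (14, 1) > (13, 2).
    # takewhile stops at the first entry that is not newer.
    return list(takewhile(lambda entry: entry[:2] > last_seen, entries))


entries = [
    (14, 1, "/content/ben/cchts/2011/00000014/00000001"),
    (13, 2, "/content/ben/cchts/2010/00000013/00000002"),
    (13, 1, "/content/ben/cchts/2011/00000014/00000001"),
]
for volume, issue, href in newer_entries(entries, 13, 1):
    print(volume, issue, href)
```

Stopping early only makes sense if the newest-first ordering really holds for the whole page. If that is in doubt, filtering every entry with the same tuple comparison is the safer choice.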
