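Python and Beautiful Soup - search for tag a until tag A is found, return the subsequent b tags
I have two variables: a "last volume" number and a "last issue" number.
The HTML I'm working with contains a list of all the volumes and issues, newest first.
I need to find the links to every issue that is newer than the volume and issue I already have.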
For example, if my last volume is 13 and my last issue is 1, I need to return the links for Volume 14, Issue 1 and for Volume 13, Issue 2.
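This is tricky for me because the volume numbers are listed on their own, separately from the issue links...
Here is what I have so far: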
HTML:
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
Script snippet:
results = soup.find('ul', attrs={'class' : 'bobby'})
#temp until I get it reading from file
lastVol = '13'
#find the last volume
findlastVol = results.findNext('strong', text= re.compile('Volume ' + lastVol))
#temp until I get it reading from file
lastIss = '2'
#find the last issue
findlastIss = findlastVol.findNext('a', text= re.compile('Issue ' + lastIss))
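I can find the tags for the last volume and the last issue in the file, but after several attempts I still haven't managed to work backwards from there and stop at the first issue...
Or, alternatively, to start from the top and work down until the volume and issue criteria are met...
Can anyone help? Thanks.
2 Answers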
0
from BeautifulSoup import BeautifulSoup
content = '''<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
soup = BeautifulSoup(content)
last_vol = 13
last_issue = 1
res = soup.find('ul', {"class": "bobby"})
lis = res.findAll('li')
for j in lis:
    if j.find('strong') is not None:
        # Volume heading such as "Volume 14": remember the current volume number
        vol = int(j.contents[0].string[7:])
    elif (vol > last_vol) or (vol == last_vol and int(j.contents[1]['href'][33:]) > last_issue):
        # Issue link: the last path segment of the href is the issue number
        print "Volume\t:%d" % vol
        print j.contents[1].string
        print "href\t:%s" % j.contents[1]['href']
This gives:
Volume :14
Issue 1, September 2011
href :/content/ben/cchts/2011/00000014/00000001
Volume :13
Issue 2, December 2010
href :/content/ben/cchts/2010/00000013/00000002
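The condition in the loop above can also be written as a tuple comparison, and because the list is newest-first, the scan can stop at the first entry that is not newer. The following is only a sketch of that idea in Python 3 using the standard library: the `entries` list stands in for data already extracted from the HTML, and the name `newer_than` is hypothetical, not part of Beautiful Soup.

```python
from itertools import takewhile

# Hypothetical pre-parsed entries, newest first, as (volume, issue, href).
# The hrefs are copied as they appear in the question's HTML.
entries = [
    (14, 1, "/content/ben/cchts/2011/00000014/00000001"),
    (13, 2, "/content/ben/cchts/2010/00000013/00000002"),
    (13, 1, "/content/ben/cchts/2011/00000014/00000001"),
]


def newer_than(entries, last_volume, last_issue):
    """Return the leading entries that are newer than (last_volume, last_issue).

    Assumes the entries are sorted newest-first, as in the question.
    Tuples compare element by element, so (13, 2) > (13, 1) and (14, 1) > (13, 2).
    """
    last_seen = (last_volume, last_issue)
    return list(takewhile(lambda e: (e[0], e[1]) > last_seen, entries))


for volume, issue, href in newer_than(entries, 13, 1):
    print(volume, issue, href)
```

If the list were not guaranteed to be sorted, a plain list comprehension with the same comparison would be the safer choice, since `takewhile` stops at the first non-matching entry.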
1
I think what you're looking for is findPrevious. You can use it like this:
import BeautifulSoup
import re
content='''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
last_volume = 13
last_issue = 1
soup = BeautifulSoup.BeautifulSoup(content)
results = soup.find('ul', attrs={'class': 'bobby'})
for a_string in results.findAll('a', text=re.compile('Issue')):
    # With text=..., findAll yields the matching strings rather than the <a> tags
    volume = a_string.findPrevious(text=re.compile('Volume'))
    volume = int(re.search(r'(\d+)', volume).group(1))
    issue = int(re.search(r'(\d+)', a_string).group(1))
    href = a_string.parent['href']
    if (volume > last_volume) or (volume == last_volume and issue > last_issue):
        print(volume, issue, href)
This yields:
(14, 1, u'/content/ben/cchts/2011/00000014/00000001')
(13, 2, u'/content/ben/cchts/2010/00000013/00000002')
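Both answers target the old BeautifulSoup 3 package under Python 2. For readers on Python 3, the same findPrevious idea might look roughly like the sketch below. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses its `find_previous` method; the function name `newer_issues` and the constant `CONTENT` are illustrative only.

```python
import re

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

CONTENT = '''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''


def newer_issues(html, last_volume, last_issue):
    """Collect (volume, issue, href) for every issue newer than the given one."""
    soup = BeautifulSoup(html, "html.parser")
    listing = soup.find("ul", class_="bobby")
    found = []
    for link in listing.find_all("a"):
        issue_match = re.search(r"\d+", link.get_text())
        # The nearest <strong> before the link holds the volume heading
        heading = link.find_previous("strong")
        if issue_match is None or heading is None:
            continue
        volume = int(re.search(r"\d+", heading.get_text()).group())
        issue = int(issue_match.group())
        if (volume, issue) > (last_volume, last_issue):
            found.append((volume, issue, link["href"]))
    return found


for row in newer_issues(CONTENT, 13, 1):
    print(row)
```

Unlike the BeautifulSoup 3 code above, this version iterates over the `<a>` tags themselves, so the href is read directly from the link instead of from the string's parent.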