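How do I access HTML attributes of nested tags in Python's Mechanize?
Hi everyone. I'm having trouble handling links in nested HTML with Python's Mechanize library. Here is my current code (I've tried many approaches; this is just the latest version, and it still isn't quite right). Please excuse the variable names (thing, stuff):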
soup = BeautifulSoup(resultsPage)
if not soup.find(attrs={'class' : 'paging'}):
    print "Only one producted listed!"
else:
    stuff = soup.find('div', attrs={'class' : 'paging'}).ul.li
    for thing in stuff:
        print thing
Here is the HTML I'm looking at:
<div class="paging">
<ul>
<li>&lt;&lt;
</li>
<li class='on'>
1-10
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=2">11-20</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=3">21-30</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=4">31-40</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=5">41-50</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=6">51-60</a>
</li>
<li>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=7">>></a>
</li>
</ul>
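I need to determine whether there are any <li> tags that contain a hyperlink; if there are, I need to store them so I can click them later. This code comes from the following page, in case you're interested: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1 I'm working on a project that scrapes product information from food websites, and I need to be able to navigate through the search results.
I also have a side question. Is it bad practice to chain tags and searches together like this?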
ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next
I'm just starting to learn Python, but this feels a bit clumsy, and I'd like to hear what you think. Here is a sample of the HTML I'm scraping:
<table>
<tr>
<td>
<div id="contHeader" class="TitleAndDescription">
<h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
<div class="textArea">
<strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
<strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
<br/>
<!--<br/>-->
<br/>
</div>
</div>
...
</td>
...
</tr>
...
</table>
Sorry for the wall of text. Let me know if you need more information.
Thanks.
2 Answers
0
If I understand correctly, you want a list of all the <li> elements that contain an <a> tag:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(resultsPage)
# Keep only the <li> tags that have at least one <a> descendant
list_items = [list_item for list_item in soup.findAll('li')
              if list_item.findAll('a')]
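To get from those <li> tags to URLs you can open later, something along these lines might work. Treat it as a sketch under a few assumptions: it uses the newer bs4 package (BeautifulSoup 4) and Python 3 rather than the BeautifulSoup 3 import above, the sample markup is a trimmed copy of the paging HTML from the question with shortened query strings, and the names sample_html and page_urls are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 (bs4) is installed

# Trimmed, illustrative copy of the paging markup from the question.
sample_html = """
<div class="paging">
<ul>
<li class='on'>1-10</li>
<li class=''><a href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?pageno=2">11-20</a></li>
<li><a href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?pageno=7">&gt;&gt;</a></li>
</ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
paging = soup.find("div", attrs={"class": "paging"})

page_urls = []
if paging is not None:
    # Look at every <li> inside the paging block, not just the first one.
    for list_item in paging.find_all("li"):
        link = list_item.find("a")
        # A tag's attributes can be read like dictionary entries.
        if link is not None and link.get("href"):
            page_urls.append(link["href"])

print(page_urls)
```

Each stored URL could then be passed to your Mechanize browser's open() method when you want to visit that results page. If page_urls ends up empty, there is only one page of results.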
1
Python's "HTMLParser" module may be one way to solve this. For more details, see this link: http://docs.python.org/library/htmlparser.html
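Here is a rough sketch of what that could look like. It assumes Python 3, where the module lives at html.parser (in Python 2 it was called HTMLParser). The class name ListLinkCollector and the sample markup are made up for illustration, and the sample URLs are shortened versions of the ones in the question.

```python
from html.parser import HTMLParser


class ListLinkCollector(HTMLParser):
    """Collect href values of <a> tags that appear inside <li> elements."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0  # number of <li> elements currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


# Illustrative markup: two paging links plus one link outside any list.
sample_html = """
<div class="paging">
<ul>
<li class='on'>1-10</li>
<li class=''><a href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?pageno=2">11-20</a></li>
<li><a href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?pageno=7">&gt;&gt;</a></li>
</ul>
</div>
<a href="http://example.com/outside">not in a list</a>
"""

collector = ListLinkCollector()
collector.feed(sample_html)
print(collector.links)
```

This version only tracks whether the parser is currently inside an <li>, so it would also pick up links from lists other than the paging block. Restricting it to the div with class "paging" would need extra state.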