Python and BeautifulSoup: unable to find 'a' tags
Here is a piece of HTML (from delicious):
<h4>
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers & Anti-Bot Protection</a>
<span class="saverem">
<em class="bookmark-actions">
<strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&jump=%2Fdux&key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&original_user=dux&copyuser=dux&copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong>
</em>
</span>
</h4>
I want to find all links whose class is "inlinesave action". Here is my code:
sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)
But it finds nothing: len(tags) is 0.
Any ideas?
Thanks!
4 Answers
0
Using plain Python string methods:
html=open("file").read()
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print url_with_junk[:m]
0
You might make some headway using pyparsing:
from pyparsing import makeHTMLTags, withAttribute
htmlsrc="""<h4>... etc."""
atag = makeHTMLTags("a")[0]
atag.setParseAction(withAttribute(("class","inlinesave action")))
for result in atag.searchString(htmlsrc):
    print result.href
This produces (the long output is elided at the '...'):
/save?url=http%3A%2F%2Fimfy.us%2F&title=Genera...+anonymous+links
1
If you want to match a link that has exactly those two classes, I think you will need a regular expression, something like this:
tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})
Keep in mind that this regex will not match if the class names appear in the opposite order (for example, class="action inlinesave").
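The order dependence can be checked without BeautifulSoup at all. The sketch below applies the same pattern directly to two class strings; the helper name matches is made up for illustration, and it assumes the pattern is applied with search semantics (found anywhere in the attribute value).

```python
import re

# The pattern from the answer: "inlinesave" must appear before "action".
ordered = re.compile(r'\binlinesave\b.*\baction\b')


def matches(pattern, class_value):
    """Hypothetical helper: True if the pattern is found in the class string."""
    return pattern.search(class_value) is not None


# Same two classes, written in both orders.
print(matches(ordered, "inlinesave action"))
print(matches(ordered, "action inlinesave"))
```

Only the first string matches, which is the limitation described above.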
The following should work in all cases (though I think it looks a bit ugly):
soup.findAll('a',
    attrs={'class':
        re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
    })
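As a rough sanity check, the alternation pattern can be run against plain class strings and compared with a set-based test. This is only a sketch: has_classes is a hypothetical helper, not part of BeautifulSoup, and the sample class values are made up.

```python
import re

# The two-order pattern from the answer above.
both_orders = re.compile(
    r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')


def has_classes(class_value, required=("inlinesave", "action")):
    """Hypothetical helper: True if every required class name is present."""
    return set(required) <= set(class_value.split())


samples = (
    "inlinesave action",
    "action inlinesave",
    "inlinesave",
    "inlinesave action-bar",  # made-up class, not from the original page
)
for value in samples:
    # Print the class string, the regex verdict, and the set-based verdict.
    print("%r regex=%s set=%s" % (
        value, bool(both_orders.search(value)), has_classes(value)))
```

Both approaches accept either order. They differ on the last sample: \b treats the hyphen as a word boundary, so the regex also accepts "action-bar", while splitting on whitespace compares whole class names. Whether that matters depends on the class names the page actually uses.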