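How to parse a script tag using Python and BeautifulSoup
I am trying to extract the attributes of a frame tag that sits inside a document.write call on a page. The page source looks like this: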
<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>');
if (anchor != "") {
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
} else {
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>');
}
document.write('</frameset>');
// end hiding -->
</script>
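The findAll('frame') method did not help. Is there any way to read the contents of the frame tags?
I am using Python 2.5 and BeautifulSoup 3.0.8.
I am also open to using Python 3.1 with BeautifulSoup 3.1, as long as I can get results.
Thanks
2 Answers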
1
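Pyparsing can help you handle this mix of JS and HTML. The parser below looks for document.write statements whose argument is either a single quoted string or a string expression built from several quoted strings and identifiers added together. It partially evaluates that string expression (identifiers are replaced with placeholder values), parses the embedded <frame> tag out of the result, and returns the frame's attributes as a pyparsing ParseResults object. You can then read the named attributes either as object attributes or as dict keys, whichever you prefer.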
jssrc = """
<script language="javascript">
.
.
.
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>');
if (anchor != "")
{ document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); }
else
{ document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>'); }
document.write('</frameset>');
// end hiding -->
</script>"""
from pyparsing import (Suppress, QuotedString, Word, ZeroOrMore,
                       alphas, makeHTMLTags)
# define some basic punctuation, and quoted string
LPAR,RPAR,PLUS = map(Suppress,"()+")
qs = QuotedString("'")
# use pyparsing helper to define an expression for opening <frame>
# tags, which includes support for attributes also
frameTag = makeHTMLTags("frame")[0]
# some of our document.write statements contain not a string literal,
# but an expression of strings and vars added together; define
# an identifier expression, and add a parse action that converts
# a var name to a likely value
ident = Word(alphas).setParseAction(lambda toks: evalvars[toks[0]])
evalvars = { 'cusip' : "CUSIP", 'anchor' : "ANCHOR" }
# now define the string expression itself, as a quoted string,
# optionally followed by identifiers and quoted strings added
# together; identifiers will get translated to their defined values
# as they are parsed; the first parse action on stringExpr concatenates
# all the tokens; then the second parse action actually parses the
# body of the string as a <frame> tag and returns the results of parsing
# the tag and its attributes; if the parse fails (that is, if the
# string contains something that is not a <frame> tag), the second
# parse action will throw an exception, which will cause the stringExpr
# expression to fail
stringExpr = qs + ZeroOrMore( PLUS + (ident | qs))
stringExpr.setParseAction(lambda toks : ''.join(toks))
stringExpr.addParseAction(lambda toks:
    frameTag.parseString(toks[0], parseAll=True))
# finally, define the overall document.write(...) expression
docWrite = "document.write" + LPAR + stringExpr + RPAR
# scan through the source looking for document.write commands containing
# <frame> tags using scanString; print the original source fragment,
# then access some of the attributes extracted from the <frame> tag
# in the quoted string, using either object-attribute notation or
# dict index notation
# single-argument print(...) behaves the same on Python 2 and Python 3
for dw, locstart, locend in docWrite.scanString(jssrc):
    print(jssrc[locstart:locend])
    print(dw.name)
    print(dw["src"])
    print("")
Output:
document.write('<frame name="nav" src="/nav/index_nav.html" marginwidth="0" marginheight="0" scrolling="no" frameborder="0" border = "no" noresize>')
nav
/nav/index_nav.html
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html?ANCHOR
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html" marginwidth="0" marginheight="0" scrolling="auto" frameborder="0" noresize>')
body
http://content.members.fidelity.com/mfl/summary/0,,CUSIP,00.html
2
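You cannot do this with BeautifulSoup alone. BeautifulSoup parses the HTML as the browser receives it (before any rewriting or DOM manipulation), and it does not parse (let alone execute) JavaScript.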
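To illustrate the point, here is a minimal sketch that uses the Python 3 standard-library html.parser module as a stand-in (it is not BeautifulSoup, and the shortened PAGE string and the TagCollector class are assumptions made up for this example). It shows an HTML parser handing back the body of a script element as plain text, so the frame tag inside document.write never shows up as a tag:

```python
from html.parser import HTMLParser

# Shortened, hypothetical version of the page from the question.
PAGE = """<html><head>
<script language="javascript">
document.write('<frame name="nav" src="/nav/index_nav.html" noresize>');
</script>
</head></html>"""


class TagCollector(HTMLParser):
    """Record every start tag and all text found inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, not as parsed tags.
        if self._in_script:
            self.script_text.append(data)


collector = TagCollector()
collector.feed(PAGE)
collector.close()
script_source = "".join(collector.script_text)

print(collector.tags)
print("frame" in collector.tags)
```

The script text collected this way is still available as a string, which is what makes a text-level approach possible.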
In this particular case, you may want to use a simple regular expression.
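A rough sketch of what such a regular-expression approach might look like, using only the standard-library re module. The answer does not give a pattern, so everything here is an assumption: the shortened JS_SRC sample, the JS_VARS placeholder values, and the helper names evaluate and frame_attributes. It also assumes each document.write call sits on one line, uses single-quoted JS strings, and contains no ");" inside a string.

```python
import re

# Shortened, hypothetical sample of the JavaScript from the question.
JS_SRC = """
document.write('<frame name="nav" src="/nav/index_nav.html" scrolling="no" noresize>');
if (anchor != "") {
document.write('<frame name="body" src="http://content.members.fidelity.com/mfl/summary/0,,' + cusip + ',00.html?' + anchor + '" scrolling="auto" noresize>');
}
document.write('</frameset>');
"""

# Placeholder values for the JavaScript variables (assumed, not real data).
JS_VARS = {"cusip": "CUSIP", "anchor": "ANCHOR"}

# The argument of each document.write(...) call, up to the closing ");".
DOC_WRITE = re.compile(r"document\.write\((.*?)\);")
# Either a single-quoted JS string literal or a bare identifier.
TOKEN = re.compile(r"'([^']*)'|([A-Za-z_]\w*)")
# name="value" pairs inside the reconstructed tag.
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def evaluate(expr):
    """Concatenate the literals and substituted variables of a '+' expression."""
    parts = []
    for literal, ident in TOKEN.findall(expr):
        parts.append(JS_VARS.get(ident, ident) if ident else literal)
    return "".join(parts)


def frame_attributes(js_source):
    """Return one attribute dict per <frame> tag written by document.write."""
    frames = []
    for match in DOC_WRITE.finditer(js_source):
        html = evaluate(match.group(1))
        if html.lower().startswith("<frame "):
            frames.append(dict(ATTR.findall(html)))
    return frames


frames = frame_attributes(JS_SRC)
for frame in frames:
    print(frame["name"], frame["src"])
```

Valueless attributes such as noresize are ignored by the ATTR pattern. This is fragile compared with a real parser, but it may be enough when the page layout is fixed and known in advance.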