用BeautifulSoup从JavaScript(编码)变量中抓取数据
我正在抓取一个网页,但无法获取某个字段,因为它存储在一个javascript变量里。
我的问题是,如何抓取以下代码,解码它,并保存<li>
标签的内容?我想使用BeautifulSoup和其他Python工具。
这里是<script>
标签里的代码:
<script type="text/javascript">
var html_audition_details_sidebar = ' \u003Cdiv id\u003D\u0022apply_wrapper\u0022\u003E \u003Cdiv class\u003D\u0022header\u0022\u003E \u003Cp\u003EAudition Information\u003C/p\u003E \u003C/div\u003E \u003Cdiv class\u003D\u0022text \u0022\u003E \u003Cdiv class\u003D\u0022roleContainer \u0022 style\u003D\u0022color: #999\u003B font\u002Dsize: 14px\u003B\u0022\u003E \u003Cp\u003EOnly official members can see audition information for this job\u003C/p\u003E \u003C/div\u003E \u003C/div\u003E \u003Cdiv class\u003D\u0022applyButton\u0022\u003E \u003Cp\u003E\u003Ca class\u003D\u0022applyLink\u0022 href\u003D\u0022/accounts/login/apply/41680/\u0022\u003ESubscribe Now \u003C/a\u003E\u003C/p\u003E \u003C/div\u003E \u003C/div\u003E';
var html_additional_requirements = '';
var html_role_listing = '\u003Cdiv class\u003D\u0022text callListing loggedout \u0022\u003E \u003Cp class\u003D\u0022title\u0022\u003E\u003Ca name\u003D\u0022roles\u0022\u003E\u003C/a\u003ESeeking Talent \u003Cspan class\u003D\u0022optional\u0022\u003ESelect a role below for more information and submission instructions.\u003C/span\u003E\u003C/p\u003E \u003Cdiv class\u003D\u0022castingRoles\u0022\u003E \u003Cul\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/martinique\u002D159296/\u0022\u003E Martinique (Lead): \u003Cspan class\u003D\u0022roletag\u0022\u003E Female, 18\u002D25, Caucasian \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E native French speaker. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/justin\u002D159297/\u0022\u003E Justin (Lead): \u003Cspan class\u003D\u0022roletag\u0022\u003E Male, 20\u002D25, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E comedy and improv skills, hopeless romantic. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/flower\u002Dshop\u002Dsalesperson\u002D159299/\u0022\u003E Flower Shop Salesperson : \u003Cspan class\u003D\u0022roletag\u0022\u003E Males \u0026amp\u003B Females, 30+, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E impatient. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/models\u002D159300/\u0022\u003E Models (Supporting): \u003Cspan class\u003D\u0022roletag\u0022\u003E Female, 18\u002D35, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E small roles, under five lines. \u003C/p\u003E \u003C/li\u003E \u003C/ul\u003E \u003C/div\u003E\u003C/div\u003E';
</script>
格式很糟糕,抱歉。
我需要保存所有li
标签中,仅属于变量html_role_listing
的链接。
举个例子:
把这个:
\u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/martinique\u002D159296/\u0022\u003E
变成这个:
/casting/untitled-comedy-short-41680/martinique-159296/
提前感谢你的帮助!
2 个回答
0
不需要JS解析器!
我假设你知道你的<script>
标签里的内容会像你给的例子那样格式化,并且你已经把这个标签的内容提取到一个叫script_text
的变量里了。
首先,我们需要获取html_role_listing
的值,这可以通过一个老办法——正则表达式来实现:
>>> import re
>>> html_role_listing_match = re.search(r'var html_role_listing = \'(.+)\';$', script_text, re.MULTILINE)
>>> html_role_listing = html_role_listing_match.group(1)
接着,我们利用\u003C
和类似的转义序列在Python的Unicode字符串中也是有效的(就像在JS字符串中一样),然后用更安全的eval
来解析它们:
>>> import ast
>>> roles_html = ast.literal_eval("u'%s'" % html_role_listing)
为了让你自己确认,你可以检查这个字符串的前几个字符,看看它们是否被正确解析:
>>> print roles_html[:10]
<div class
现在,你有了一个可以用BeautifulSoup解析的HTML字符串:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(roles_html)
然后可以提取这些链接的href
属性。
>>> links = soup.select('li a')
>>> for link in links:
... print link.attrs['href']
...
/casting/untitled-comedy-short-41680/martinique-159296/
/casting/untitled-comedy-short-41680/justin-159297/
/casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/
/casting/untitled-comedy-short-41680/models-159300/
0
如果你想以一种通用的方式来处理这个问题,你需要一个可以解析JavaScript的库。在这个例子中,我会用slimit
来实现。
首先,加载你的数据:
from bs4 import BeautifulSoup as Soup
import slimit
from slimit.parser import Parser
from slimit.visitors import nodevisitor
a = """<script type="text/javascript">
var html_audition_details_sidebar = ' \u003Cdiv id\u003D\u0022apply_wrapper\u0022\u003E \u003Cdiv class\u003D\u0022header\u0022\u003E \u003Cp\u003EAudition Information\u003C/p\u003E \u003C/div\u003E \u003Cdiv class\u003D\u0022text \u0022\u003E \u003Cdiv class\u003D\u0022roleContainer \u0022 style\u003D\u0022color: #999\u003B font\u002Dsize: 14px\u003B\u0022\u003E \u003Cp\u003EOnly official members can see audition information for this job\u003C/p\u003E \u003C/div\u003E \u003C/div\u003E \u003Cdiv class\u003D\u0022applyButton\u0022\u003E \u003Cp\u003E\u003Ca class\u003D\u0022applyLink\u0022 href\u003D\u0022/accounts/login/apply/41680/\u0022\u003ESubscribe Now \u003C/a\u003E\u003C/p\u003E \u003C/div\u003E \u003C/div\u003E';
var html_additional_requirements = '';
var html_role_listing = '\u003Cdiv class\u003D\u0022text callListing loggedout \u0022\u003E \u003Cp class\u003D\u0022title\u0022\u003E\u003Ca name\u003D\u0022roles\u0022\u003E\u003C/a\u003ESeeking Talent \u003Cspan class\u003D\u0022optional\u0022\u003ESelect a role below for more information and submission instructions.\u003C/span\u003E\u003C/p\u003E \u003Cdiv class\u003D\u0022castingRoles\u0022\u003E \u003Cul\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/martinique\u002D159296/\u0022\u003E Martinique (Lead): \u003Cspan class\u003D\u0022roletag\u0022\u003E Female, 18\u002D25, Caucasian \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E native French speaker. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/justin\u002D159297/\u0022\u003E Justin (Lead): \u003Cspan class\u003D\u0022roletag\u0022\u003E Male, 20\u002D25, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E comedy and improv skills, hopeless romantic. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/flower\u002Dshop\u002Dsalesperson\u002D159299/\u0022\u003E Flower Shop Salesperson : \u003Cspan class\u003D\u0022roletag\u0022\u003E Males \u0026amp\u003B Females, 30+, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E impatient. \u003C/p\u003E \u003C/li\u003E \u003Cli \u003E \u003Ca href\u003D\u0022/casting/untitled\u002Dcomedy\u002Dshort\u002D41680/models\u002D159300/\u0022\u003E Models (Supporting): \u003Cspan class\u003D\u0022roletag\u0022\u003E Female, 18\u002D35, All Ethnicities \u003C/span\u003E \u003Cspan class\u003D\u0022applyNow\u0022\u003E \u003C/span\u003E \u003C/a\u003E \u003Cp class\u003D\u0022role\u002Ddesc\u0022 style\u003D\u0022border\u002Dbottom: none\u003B padding\u002Dbottom: 0px\u003B margin\u002Dbottom: 0px\u003B\u0022\u003E small roles, under five lines. \u003C/p\u003E \u003C/li\u003E \u003C/ul\u003E \u003C/div\u003E\u003C/div\u003E';
</script>"""
soup = Soup(a)
js_content = soup.findAll('script')[0].text
现在你可以选择:用正则表达式来解析js_content
,或者使用一个词法分析器(在这个例子中就是slimit)。
par = Parser()
tree = par.parse(js_content)
encoded_html = [n.value for n in nodevisitor.visit(tree) if isinstance(n, slimit.ast.String)][2]
最后,你可以轻松地解码那个字符串,它是有效的HTML:
decoded_html = encoded_html.decode('unicode_escape')
print(decoded_html)
所以重新解析这个HTML:
role_listing = Soup(decoded_html)
output = [ anchor.attrs['href'] for anchor in role_listing.select('li a') ]
print('---')
print("\n".join(output))
输出看起来是这样的:
'<div class="text callListing loggedout "> <p class="title"><a name="roles"></a>Seeking Talent <span class="optional">Select a role below for more information and submission instructions.</span></p> <div class="castingRoles"> <ul> <li > <a href="/casting/untitled-comedy-short-41680/martinique-159296/"> Martinique (Lead): <span class="roletag"> Female, 18-25, Caucasian </span> <span class="applyNow"> </span> </a> <p class="role-desc" style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;"> native French speaker. </p> </li> <li > <a href="/casting/untitled-comedy-short-41680/justin-159297/"> Justin (Lead): <span class="roletag"> Male, 20-25, All Ethnicities </span> <span class="applyNow"> </span> </a> <p class="role-desc" style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;"> comedy and improv skills, hopeless romantic. </p> </li> <li > <a href="/casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/"> Flower Shop Salesperson : <span class="roletag"> Males & Females, 30+, All Ethnicities </span> <span class="applyNow"> </span> </a> <p class="role-desc" style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;"> impatient. </p> </li> <li > <a href="/casting/untitled-comedy-short-41680/models-159300/"> Models (Supporting): <span class="roletag"> Female, 18-35, All Ethnicities </span> <span class="applyNow"> </span> </a> <p class="role-desc" style="border-bottom: none; padding-bottom: 0px; margin-bottom: 0px;"> small roles, under five lines. </p> </li> </ul> </div></div>'
---
/casting/untitled-comedy-short-41680/martinique-159296/
/casting/untitled-comedy-short-41680/justin-159297/
/casting/untitled-comedy-short-41680/flower-shop-salesperson-159299/
/casting/untitled-comedy-short-41680/models-159300/