import json
import re

import requests
from bs4 import BeautifulSoup as bs

# Setup.
site = 'http://www.some-site.com/page'
# Raw string so the regex escapes are not interpreted by Python first.
exp = r'^\s+sessionStorage\.setItem\(.*JSON\.stringify\((?P<content>{.*})\)\)'

r = requests.get(site)
if r.status_code == 200:
    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    soup = bs(r.text, 'html.parser')
    # Extract all <script> tags from the full HTML.
    scripts = soup.find_all('script')
    # Keep the text of every inline <script> that mentions sessionStorage.
    # s.string is None for tags without a single text child (e.g. src-only scripts).
    script = [s.string for s in scripts if s.string and 'sessionStorage' in s.string]
    # Use regex (with a named capture group) to extract the JSON data.
    m = re.match(exp, script[0]) if script else None
    if m:
        content = m['content']
        # Convert scraped JSON data to a dict.
        data = json.loads(content)
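IIUC: The code above was written to extract the sessionStorage attribute value from a web page into a Python dict.
Note: the regex pattern may need to be modified to suit your (the user's) specific use case.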
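To make the note about adapting the pattern concrete, here is a minimal offline sketch. The sample script strings, the helper name extract_session_json, and the RELAXED pattern are all assumptions made for illustration; a real page will almost certainly need its own pattern.

```python
import json
import re

# The pattern from the snippet above, written as a raw string.
EXP = r'^\s+sessionStorage\.setItem\(.*JSON\.stringify\((?P<content>{.*})\)\)'

# Hypothetical variant: allows the call to start at the very first character.
RELAXED = r'^\s*sessionStorage\.setItem\(.*JSON\.stringify\((?P<content>{.*})\)\)'

# Hypothetical inline <script> bodies, not taken from any real site.
SAMPLE_SCRIPT = """
    sessionStorage.setItem('user', JSON.stringify({"name": "Ada", "roles": ["admin"]}));
"""
NO_INDENT = "sessionStorage.setItem('k', JSON.stringify({\"a\": 1}));"


def extract_session_json(script_text, pattern=EXP):
    """Return the object passed to JSON.stringify as a dict, or None if no match."""
    m = re.match(pattern, script_text)
    if m is None:
        return None
    return json.loads(m['content'])


print(extract_session_json(SAMPLE_SCRIPT))       # matches: leading whitespace present
print(extract_session_json(NO_INDENT))           # None: EXP requires leading whitespace
print(extract_session_json(NO_INDENT, RELAXED))  # matches with the relaxed pattern
```

Because "." does not match a newline by default, both patterns assume the whole setItem call sits on one line. A page that spreads the JSON over several lines would need a different pattern.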
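TL;DR (background):
I came across this question while looking for a more elegant solution than the code above.
In my case, I am writing unit tests for a site and need to get the sessionStorage attribute from a specific web page to test whether it contains the expected elements. Because the data is in JSON format, this code extracts the JSON data and converts it to a Python dict for inspection.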
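The unit-test use case described here could look roughly like the sketch below. It is not the author's actual test: the script text is a hard-coded stand-in for what would be scraped from the page, and the names session_data, SessionStorageTest, run_tests, and the keys 'items' and 'currency' are hypothetical.

```python
import json
import re
import unittest

EXP = r'^\s+sessionStorage\.setItem\(.*JSON\.stringify\((?P<content>{.*})\)\)'

# Stand-in for the <script> text scraped from the page under test (hypothetical).
SCRIPT_TEXT = """
    sessionStorage.setItem('cart', JSON.stringify({"items": 2, "currency": "USD"}));
"""


def session_data(script_text):
    """Parse the JSON passed to JSON.stringify; return {} if the pattern fails."""
    m = re.match(EXP, script_text)
    return json.loads(m['content']) if m else {}


class SessionStorageTest(unittest.TestCase):
    def test_contains_expected_keys(self):
        data = session_data(SCRIPT_TEXT)
        # Keys assumed for this example only.
        for key in ('items', 'currency'):
            self.assertIn(key, data)


def run_tests():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SessionStorageTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

In a real suite, SCRIPT_TEXT would come from the requests and BeautifulSoup steps shown earlier, and the test would normally be collected by a test runner instead of being started through run_tests.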