How can I extract JSON data from HTML data using Python?

{ "vpn_detail": { "username":"harnishs", "tokens": [ "85188605", "00422786", ], "cluster_name":"*******.com" } }

b'  { "vpn_detail": { "username":"harnishs&q;= uot;, "tokens": = ; [ = ;"85188605", = ;"00422786", = ;"94548619", = ; ], "cluster_name":"***********.com" } } '

1 Answer

Anonymous user

#1 · Posted 2024-05-16 14:36:31

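The html2text library simplifies this task considerably: it does almost all of the work. You only need to remove the stray punctuation and replace the garbled quote with a real ":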

import re, json, html2text

MyStr = b'<html>\r\n<head>\r\n<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=\r\n1">\r\n<style type=3D"text/css" style=3D"display:none;"><!  P {margin-top:0;margi=\r\nn-bottom:0;}  ></style>\r\n</head>\r\n<body dir=3D"ltr">\r\n<div id=3D"divtagdefaultwrapper" dir=3D"ltr" style=3D"font-size: 12pt; colo=\r\nr: rgb(0, 0, 0); font-family: Calibri, Helvetica, sans-serif, &quot;EmojiFo=\r\nnt&quot;, &quot;Apple Color Emoji&quot;, &quot;Segoe UI Emoji&quot;, NotoCo=\r\nlorEmoji, &quot;Segoe UI Symbol&quot;, &quot;Android Emoji&quot;, EmojiSymb=\r\nols;">\r\n<p style=3D"margin-top:0; margin-bottom:0"></p>\r\n<div>\r\n<div>{<br>\r\n&quot;vpn_detail&quot;:<br>\r\n&nbsp;&nbsp; &nbsp;{<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;username&quot;:&quot;kushpate&q=\r\nuot;,&nbsp;&nbsp; &nbsp;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;tokens&quot;:&nbsp;&nbsp; &nbsp=\r\n;<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;[<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;85188605&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;00422786&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;94548619&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;51249494&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;HHEF0EA5&quot;,<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp=\r\n;&quot;2E09A81E&quot;<br>\r\n&nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;],<br>\r\n&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&quot;cluster_name&quot;:&quot;bgl13-=\r\nvpn-cluster-2.cisco.com&quot;<br>\r\n&nbsp;&nbsp; &nbsp;}<br>\r\n<br>\r\n}</div>\r\n</div>\r\n<br>\r\n<p></p>\r\n</div>\r\n</body>\r\n</html>\r\n'
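# The input is a bytes object, so decode it to str before handing it to html2text.
MyStrTxt = html2text.html2text(MyStr.decode("utf8"))
# Turn the broken entity `&q;= uot;` into a real double quote and drop the stray `= ;` fragments.
clean_string = re.sub(r'(&q;=\s*uot;)|=\s*;\s*', lambda x: '"' if x.group(1) else '', MyStrTxt)
js = json.loads(clean_string)
print(js['vpn_detail']['username'])
# => kushpate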
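To see what the re.sub step does in isolation, here is a minimal sketch that needs only the standard library. The `sample` string is a hypothetical, shortened imitation of the text html2text produces for this message (it is not the real output), and `clean_artifacts` is an illustrative helper name, not part of any library.

```python
import json
import re

# Hypothetical sample resembling the html2text output shown in the question:
# `&q;= uot;` is a broken `&quot;` entity, `= ;` is the tail of a broken `&nbsp;`.
sample = (
    '{ "vpn_detail": { "username":"kushpate&q;= uot;, "tokens": = ; [ = ;"85188605", '
    '= ;"00422786" ], "cluster_name":"example.com" } }'
)


def clean_artifacts(text: str) -> str:
    """Replace broken quote entities with '"' and delete stray '= ;' fragments."""
    pattern = r'(&q;=\s*uot;)|=\s*;\s*'
    # Group 1 matched -> broken quote entity; otherwise it was a '= ;' fragment.
    return re.sub(pattern, lambda m: '"' if m.group(1) else '', text)


cleaned = clean_artifacts(sample)
data = json.loads(cleaned)
print(data["vpn_detail"]["tokens"])
```

The callback form of re.sub is used because the two alternatives need different replacements: a quote for the first, nothing for the second.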

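Notes:

Your input is a byte string, so it has to be converted to a Unicode string first; that is why MyStr.decode("utf8") is required.
html2text.html2text(MyStr.decode("utf8")) strips the HTML from the string and leaves the JSON text, apart from a few stray artifacts.
re.sub(r'(&q;=\s*uot;)|=\s*;\s*', lambda x: '"' if x.group(1) else '', MyStrTxt) removes every occurrence of = + zero or more whitespace characters + ; + any trailing whitespace, and replaces &q;= + zero or more whitespace characters + uot; with a real ".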
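The `=3D` sequences and the `=\r\n` line endings in MyStr suggest that the body is quoted-printable encoded, which would explain why entities such as `&quot;` arrive split in two. As an alternative to the approach in this answer (not the answer's own method), the sketch below assumes that encoding and undoes it with the standard library's quopri module before touching the HTML. The `raw` bytes are a hypothetical, shortened stand-in for the real message, `extract_json` is an illustrative name, and the iso-8859-1 default is taken from the meta tag in the message.

```python
import html
import json
import quopri
import re

# Hypothetical, shortened quoted-printable HTML body (not the original message).
raw = (
    b'<html><body dir=3D"ltr"><div>{<br>\r\n'
    b'&quot;vpn_detail&quot;:{<br>\r\n'
    b'&nbsp;&quot;username&quot;:&quot;kushpate&q=\r\n'
    b'uot;,<br>\r\n'
    b'&nbsp;&quot;tokens&quot;:[&quot;85188605&quot;,&nbsp=\r\n'
    b';&quot;00422786&quot;],<br>\r\n'
    b'&quot;cluster_name&quot;:&quot;vpn-=\r\n'
    b'cluster.example.com&quot;<br>\r\n'
    b'}}</div></body></html>\r\n'
)


def extract_json(qp_html: bytes, charset: str = "iso-8859-1") -> dict:
    # 1. Undo quoted-printable: '=3D' becomes '=', '=\r\n' soft breaks are joined.
    decoded = quopri.decodestring(qp_html).decode(charset)
    # 2. Drop tags first, then turn entities (&quot;, &nbsp;) into characters.
    text = html.unescape(re.sub(r"<[^>]+>", "", decoded))
    # 3. &nbsp; becomes U+00A0, which JSON does not accept as whitespace.
    text = text.replace("\xa0", " ")
    return json.loads(text)


result = extract_json(raw)
print(result["vpn_detail"]["cluster_name"])
```

Because the soft line breaks are removed before parsing, values that were wrapped mid-string (such as the cluster name here) come back intact. The tag-stripping regex is deliberately naive and would mangle JSON values that themselves contain angle brackets, so treat this as a starting point rather than a general HTML-to-text converter.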
