Extracting a specific part from a text page (HTML, Python)

Published 2024-04-24 03:59:46


I have the following:

html_source = """{"linkparam":"CDAQ46598omxw=","linkmetadata":{"weblinkmetadata":{"url":"/service_ajax","sendPost":true}},"formfield":{"action":"CAUaMVVnd2t2Z1htRGl3OXAtS0FVaUY0QWFBQkNRLjhtZmduZEgzWXI4OG1maDFJMjRiV0gwATgAShUxMDIwMTQzMTg0NzMxMTE4NzMxNzBaGFVDQjBkMEpMbjFXY0dZY3d3Wjg3ZDJMQXAA","clientActions":[{"formaction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"11 status"}},"simpleText":"11"},"formstatus":"FORM"}}]}}
    #below  part i want to  extract from page including curly braces
    {"linkparam":"CDAQ46597omxw=","linkmetadata":{"weblinkmetadata":{"url":"/service_ajax","sendPost":true}},"formfield":{"action":"CAUaMVVnd2t2Z1htRGl3OXAtS0FVaUY0QWFBQkNRLjhtZmduZEgzWXI4OG1maDFJMjRiV0gwATgAShUxMDIwMTQzMTg0NzMxMTE4NzMxNzBaGFVDQjBkMEpMbjFXY0dZY3d3Wjg3ZDJMQXAA","clientActions":[{"formaction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"11 status"}},"simpleText":"11"},"formstatus":"FORM"}}]}}
    #above  part i want to  extract from page including curly braces
    {"linkparam":"CDAQ46448omxw=","linkmetadata":{"weblinkmetadata":{"url":"/service_ajax","sendPost":true}},"formfield":{"action":"BQkNRLjhtZmduZEgzWXI4OG1maDFJMjRiV0gwATgAShUxMDIwMTQzMTg0NzMxMTE4NzMxNzBaGFVDQjBkMEpMbjFXY0dZY3d3Wjg3ZDJMQXAA","clientActions":[{"formaction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"11 status"}},"simpleText":"11"},"formstatus":"FORM"}}]}}"""


import re

m = re.search(r"\{(.*?)\}", html_source)

I want to extract this part from the page string:

{"linkparam":"CDAQ46597omxw=","linkmetadata":{"weblinkmetadata":{"url":"/service_ajax","sendPost":true}},"formfield":{"action":"CAUaMVVnd2t2Z1htRGl3OXAtS0FVaUY0QWFBQkNRLjhtZmduZEgzWXI4OG1maDFJMjRiV0gwATgAShUxMDIwMTQzMTg0NzMxMTE4NzMxNzBaGFVDQjBkMEpMbjFXY0dZY3d3Wjg3ZDJMQXAA","clientActions":[{"formaction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"11 status"}},"simpleText":"11"},"formstatus":"FORM"}}]}}

1 answer
User
#1 · Posted 2024-04-24 03:59:46

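Your data looks like a list of JSON items separated by comments (the lines starting with "#").

So you can replace each comment with "," and wrap the data in "[" and "]" to create a JSON list: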

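import re

# Replace each comment (from "#" to the end of its line) with a comma
html_source = re.sub(r'#.*?\n', ',', html_source)
# Wrap the comma-separated objects in brackets to form a JSON array
html_source = '[' + html_source + ']'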

Then you can use the json library to parse this list and extract the second item:

import json
import pprint

data = json.loads(html_source)
# Index 1 is the second item (lists are zero-indexed)
pprint.pprint(data[1])

You get:

{'formfield': {'action': 'CAUaMVVnd2t2Z1htRGl3OXAtS0FVaUY0QWFBQkNRLjhtZmduZEgzWXI4OG1maDFJMjRiV0gwATgAShUxMDIwMTQzMTg0NzMxMTE4NzMxNzBaGFVDQjBkMEpMbjFXY0dZY3d3Wjg3ZDJMQXAA',
               'clientActions': [{'formaction': {'formstatus': 'FORM',
                                                 'voteCount': {'accessibility': {'accessibilityData': {'label': '11 '
                                                                                                                'status'}},
                                                               'simpleText': '11'}}}]},
 'linkmetadata': {'weblinkmetadata': {'sendPost': True,
                                      'url': '/service_ajax'}},
 'linkparam': 'CDAQ46597omxw='}
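
For reference, the comment-replacement approach above can be put together as one self-contained sketch. The function name `split_commented_json` and the shortened `sample` string are hypothetical stand-ins for the real `html_source`. As an assumption beyond the original answer, the pattern is anchored to the start of a line, so a "#" inside a JSON string value is left alone.

```python
import json
import re

def split_commented_json(text):
    """Parse JSON objects separated by '#' comment lines into a list."""
    # Replace each whole comment line with a comma; anchoring at the start
    # of the line leaves any '#' inside a JSON string value untouched.
    joined = re.sub(r'^[ \t]*#.*$', ',', text, flags=re.MULTILINE)
    # Wrap the comma-separated objects in brackets to form a JSON array.
    return json.loads('[' + joined + ']')

# Hypothetical, shortened sample standing in for html_source.
sample = '''{"linkparam": "first", "note": "a#b"}
    # below part I want to extract
    {"linkparam": "second", "formfield": {"action": "X"}}
    # above part I want to extract
    {"linkparam": "third"}'''

items = split_commented_json(sample)
second = items[1]  # index 1 is the second object
print(json.dumps(second))
```

If the compact text form (curly braces included) is needed rather than a dict, `json.dumps(second, separators=(',', ':'))` serializes the item again without extra spaces.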

If your data has no comment lines…

you can do:

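# insert ', ' as a delimiter before every object
html_source = html_source.replace('{"linkparam"', ', {"linkparam"')
# drop the leading ', ' (assumes the string starts with '{"linkparam"')
html_source = html_source[2:]
# wrap the comma-separated objects in brackets to form a JSON array
html_source = '[' + html_source + ']'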
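
A different technique, which does not depend on every object starting with "linkparam", is to let the standard library's `json.JSONDecoder.raw_decode` read one object at a time from the concatenated text. The sketch below is only an illustration: the helper name `iter_json_objects` and the `sample` string are hypothetical, and it assumes the text contains nothing but JSON values and whitespace.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value found back to back in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: three objects with no delimiter between them.
sample = '{"linkparam": "first"}{"linkparam": "second", "k": {"s": "}"}}\n {"linkparam": "third"}'

items = list(iter_json_objects(sample))
second = items[1]  # index 1 is the second object
print(second["linkparam"])
```

Because a real JSON parser decides where each object ends, nested braces and a "}" inside a string value are handled correctly, which a pattern such as `\{(.*?)\}` cannot do.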
