How to find JSON objects without actually parsing the JSON (regex)

Posted 2024-04-19 09:05:07


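I am filtering a website's data and looking for keywords. The site uses one long JSON body, and I only need to parse everything that comes before the base64-encoded images. I can't parse the JSON object the regular way, because the structure changes often and the data is sometimes cut off.

Here is a snippet of what I am parsing: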

<script id="__APP_DATA" type="application/json">{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]

As you can see, I am trying to strip out everything except objects like this one:

{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}

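Since the JSON is very long, parsing the whole thing does not seem sensible. Is there a way to find strings like this without parsing the JSON object? Ideally I would like everything in an array. Would a regular expression work?

The ID is 5 digits long, the code is 32 characters long, and then there is a title.

Thanks a lot in advance.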


2 Answers

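The following steps through the string using str.find(). When it finds both the start and the end of a target string, it extracts that substring as a dictionary. If it finds only the start but no end, it assumes the string is broken or interrupted and breaks out of the loop, since there is nothing more to do.

I am using the ast module to convert the string into a dictionary. This isn't strictly needed to answer the question, but I think it makes the end result more useful.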

import ast

testdata = '{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]'

# Create a list to hold the dictionary objects
itemlist = []

# Create variable to keep track of our position in the string
strMarker = 0

#Neverending Loooooooooooooooooooooooooooooooop
while True:

    # Find occurrence of the beginning of a target string
    strStart = testdata.find('{"id":',strMarker)
    if not strStart == -1:
        
        # If we've found the start, now look for the end marker of the string,
        # starting from the location we identified as the beginning of that string
        strEnd = testdata.find('}', strStart)
        
        # If it does not exist, this suggests it might be an interrupted string
        # so we don't do anything further with it, just allow the loop to break
        if not strEnd == -1:

            # Save this marker as it will be used as the starting point
            # for the next search cycle.
            strMarker = strEnd

            # Extract the substring based on the start and end positions, +1 to capture
            # the final '}'; as this string is nicely formatted as a dictionary object
            # already, we are using ast.literal_eval() to turn it into an actual usable
            # dictionary object
            itemlist.append(ast.literal_eval(testdata[strStart:strEnd+1]))

            # We're happy to keep searching so jump to the next loop
            continue

    # If nothing happened to trigger a jump to the next loop, break out of the
    # while loop
    break

# Print out the first entry in the list as a demo
print(str(itemlist[0]))
print(str(itemlist[0]["title"]))

The output of this code should be a nicely formatted dict, followed by the title of the first entry:

{'id': 54572, 'code': '0ef69e1d334c4d8c9ffbd088843bf2dd', 'title': 'Binance Will List GYEN'}
Binance Will List GYEN
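
One limitation of searching for the first '}' is that a title containing a closing brace would be cut short. A minimal sketch of an alternative, assuming each article object is itself valid JSON: json.JSONDecoder.raw_decode() decodes a single value starting at a given index and reports where that value ended, so only the small objects are decoded and never the whole document. The function name extract_articles and the sample string are made up for illustration.

```python
import json

def extract_articles(text, prefix='{"id":'):
    """Collect every JSON object in text that starts with prefix."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while True:
        start = text.find(prefix, pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and returns (value, index just past the value).
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Most likely an object that was cut off: skip past this
            # prefix and keep looking.
            pos = start + len(prefix)
            continue
        items.append(obj)
        pos = end
    return items

# Hypothetical sample: one title contains braces, the last object is truncated.
sample = (
    '{"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd",'
    '"title":"Binance Will List GYEN"},'
    '{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660",'
    '"title":"Braces {like these} in a title"},'
    '{"id":54394,"code":"a176d4cf'
)

articles = extract_articles(sample)
print(len(articles))
print(articles[1]["title"])
```

With this sample the truncated third object is skipped and the two complete ones are returned as ordinary dicts.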

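A regular expression should work here. Try matching with the following regex. When I tried it at https://regexr.com/, it matched the required sections. regexr can also help you understand the regex in case you are not familiar with it.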

(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})

Here is a small sample Python script that finds all the sections:

import re

# Raw string, so the backslashes reach the regex engine unchanged
pattern = r'(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})'
string_to_parse='...'
sections = re.findall(pattern, string_to_parse, re.DOTALL)
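
re.findall() returns the matched sections as plain strings. If dictionaries are more convenient, each short match can be handed to json.loads() on its own, which also decodes escapes such as \u0026. The sketch below is one possible variant, not the only way to do it. It assumes the code is always 32 lowercase hex characters and it lets titles contain backslash-escaped quotes; the names ARTICLE_RE and find_articles and the sample string are made up for illustration.

```python
import json
import re

# Assumptions: id is exactly five digits and code is 32 lowercase hex
# characters. The title part accepts any character except an unescaped
# double quote, so \" inside a title does not end the match early.
ARTICLE_RE = re.compile(
    r'\{"id":\d{5},"code":"[0-9a-f]{32}","title":"(?:[^"\\]|\\.)*"\}'
)

def find_articles(text):
    """Return every matching article object as a dict."""
    return [json.loads(m.group(0)) for m in ARTICLE_RE.finditer(text)]

# Hypothetical sample cut from the middle of a larger body.
sample = (
    '"articles":[{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660",'
    '"title":"BTG, DEXE \\u0026 SHIB Enabled on Binance Isolated Margin"},'
    '{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510",'
    '"title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]'
)

articles = find_articles(sample)
print([a["id"] for a in articles])
print(articles[0]["title"])
```

Only the matched substrings are decoded, so a body that is cut off partway through still yields every complete article before the break.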
