Python re: named group with greedy matching

Posted 2024-05-14 09:38:05


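For various reasons, I need to use Python's re module to extract fields from an XML document.

Here is a sample of the string I will apply the regular expression to: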

payload2 = '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'

Some of the fields shown above, such as "clientIP", may not always be present.

The regular expression I came up with is:

^{pr2}$

Output:

{'path': '\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db', 'client_ip': None, 'event_code': '0x80'}

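However, when I put {1} instead of {0, 1} after (?P<client_ip>[\S\s]+?)"), it works and client_ip is captured. That approach fails when clientIP is absent, though.

Any ideas on how to make the regex work in both cases, whether or not the field is present?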


2 answers

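My suggestion:

Stop writing one big single-line regular expression.

Breaking the code into one small pattern per field is simple, and the result is both more readable and easier to get right.

My version of the code: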

payloads = [
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>',
    '<CheckEventRequest><EventList count="1"><Event event="0x80" path="\\c2_emcvnx.ntaplion.prv\\CHECK$\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db" flag="0x2" protocol="0" server="C2_EMCVNX" share="demoshare1" clientIP="172.26.64.233" serverIP="172.26.64.225" timeStamp="0x536C47EF0003C836" userSid="S-1-5-21-665520413-3518186362-2792099713-500" ownerSid="S-1-5-21-665520413-3518186362-2792099713-500" fileSize="0x11000" desiredAccess="0x0" createDispo="0x0" ntStatus="0x0" relativePath="\\C2_EMCVNX\\demoshare1\\Engineering\\Benchmarking\\Thumbs.db"/></EventList></CheckEventRequest>'
]


import re


def scrape_xml(payload):
    # Dotted-quad IPv4 address inside the clientIP attribute.
    ipv4 = r'clientIP="(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'

    # One small pattern per field instead of a single large regex.
    pat3 = {
        'event_code': r'event="(0[xX][0-9a-fA-F]+?)"',
        'path': r'path="(.*?)"',
        'client_ip': ipv4,
    }

    # Search for each field independently; a missing field yields None.
    matches = {name: re.search(regex, payload) for name, regex in pat3.items()}

    # Print every attribute that was found, then a blank separator line.
    for match in matches.values():
        if match is not None:
            print(match.group(0))
    print()


for p in payloads:
    scrape_xml(p)

Output (Python 3.7+, where dicts keep insertion order):

event="0x80"
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"

event="0x80"
path="\c2_emcvnx.ntaplion.prv\CHECK$\demoshare1\Engineering\Benchmarking\Thumbs.db"
clientIP="172.26.64.233"
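
If you want the dictionary shape shown in the question rather than printed attributes, the same per-field idea can be sketched as follows. This is only an illustration: the names FIELD_PATTERNS and scrape_fields are made up for this example, the sample payload is an abbreviated stand-in for the real one, and the clientIP pattern is deliberately loosened to "anything up to the closing quote".

```python
import re

# One compiled pattern per field; each captures only the attribute value.
# \b keeps "path=" from matching inside a longer name such as "xpath=".
FIELD_PATTERNS = {
    'event_code': re.compile(r'\bevent="(0[xX][0-9a-fA-F]+)"'),
    'path': re.compile(r'\bpath="([^"]*)"'),
    'client_ip': re.compile(r'\bclientIP="([^"]*)"'),
}


def scrape_fields(payload):
    """Return a dict of field values; absent fields map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(payload)
        result[name] = match.group(1) if match else None
    return result


# Abbreviated, hypothetical payloads (raw strings, so backslashes are literal).
without_ip = r'<Event event="0x80" path="\\c2_emcvnx\demoshare1\Thumbs.db" serverIP="172.26.64.225"/>'
with_ip = r'<Event event="0x80" path="\\c2_emcvnx\demoshare1\Thumbs.db" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'

print(scrape_fields(without_ip))
print(scrape_fields(with_ip))
```

Because every field is searched for on its own, a missing attribute never affects the others, and the attributes may appear in any order.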

First, I have to give you the standard warning against parsing XML with regular expressions, but suppose you are set on doing it this way anyway.
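
For comparison, the parser-based route that the warning points to could look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree; the function name extract_event_fields and the shortened sample payload are assumptions made for illustration, not something taken from the question.

```python
import xml.etree.ElementTree as ET


def extract_event_fields(payload):
    """Parse the XML and return one dict per <Event> element."""
    root = ET.fromstring(payload)
    events = []
    for event in root.iter('Event'):
        events.append({
            'event_code': event.get('event'),
            'path': event.get('path'),
            # Element.get() returns None when the attribute is absent.
            'client_ip': event.get('clientIP'),
        })
    return events


# Abbreviated, hypothetical payload without a clientIP attribute.
sample = (
    r'<CheckEventRequest><EventList count="1">'
    r'<Event event="0x80" path="\\c2_emcvnx\demoshare1\Thumbs.db" serverIP="172.26.64.225"/>'
    r'</EventList></CheckEventRequest>'
)

print(extract_event_fields(sample))
```

With a parser, an optional attribute needs no special handling at all: a missing clientIP simply comes back as None.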

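You probably don't want to use [\S\s], because it can match anything at all, including running straight past the quotes. You made it non-greedy to prevent that, but there is a better solution: simply make it not match quotes, using [^"]. Also note that you can replace {0,1} with ?.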
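
Putting those two points together, a single pattern with an optional clientIP group might look like the sketch below. The question's original regex is not visible here, so this is not a corrected version of it. It assumes the attributes appear in the order event, path, clientIP inside one tag, and the names EVENT_RE and extract_fields are hypothetical. The detail that matters is that the lazy skip sits inside the optional group: the engine tries the whole group first and only falls back to zero repetitions if clientIP cannot be found before the tag ends.

```python
import re

EVENT_RE = re.compile(
    r'event="(?P<event_code>[^"]*)"\s+'
    r'path="(?P<path>[^"]*)"'
    # Optional group: lazily skip other attributes (never past '>'),
    # then capture clientIP. The trailing ? makes the whole group optional.
    r'(?:[^>]*?\sclientIP="(?P<client_ip>[^"]*)")?'
)


def extract_fields(payload):
    """Return the named groups as a dict, or None if nothing matches."""
    match = EVENT_RE.search(payload)
    return match.groupdict() if match else None


# Abbreviated, hypothetical payloads (raw strings, so backslashes are literal).
with_ip = r'<Event event="0x80" path="\\c2_emcvnx\demoshare1\Thumbs.db" flag="0x2" clientIP="172.26.64.233" serverIP="172.26.64.225"/>'
without_ip = r'<Event event="0x80" path="\\c2_emcvnx\demoshare1\Thumbs.db" flag="0x2" serverIP="172.26.64.225"/>'

print(extract_fields(with_ip))
print(extract_fields(without_ip))
```

If the lazy skip were placed outside the optional group instead, it would be satisfied after consuming as little as possible, the optional group would fail at that position and match zero times, and client_ip would come back as None even when the attribute is present.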
