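Regular expression for an XML document

3 answers

User

#1 · edited 2024-06-06 13:48:27

This builds on your code and Edward's. That said, I don't recommend parsing XML directly with regular expressions.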

n = '4'
# Raw strings keep \d and \s as regex escapes rather than Python string escapes
reg = r'<[Rr]epresentation.*?[Bb]andwidth="([' + n + r'-9]\d{6}|\d{8})[\d]*"[\s\S]*?</[Rr]epresentation>'
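To see what that pattern selects, here is a minimal sketch that runs it with re.finditer. The three-element sample string and the names xml, matches and matched_ids are made up for illustration. Each Representation sits on its own line, which matters: the .*? before bandwidth does not cross newlines, so the pattern relies on the opening tag and its bandwidth attribute being on the same line.

```python
import re

n = '4'  # smallest allowed leading digit of a 7-digit bandwidth
reg = (r'<[Rr]epresentation.*?[Bb]andwidth="([' + n + r'-9]\d{6}|\d{8})[\d]*"'
       r'[\s\S]*?</[Rr]epresentation>')

# Hypothetical sample, one Representation per line
xml = '''<AdaptationSet>
<Representation id="2" bandwidth="2000000"><SegmentTemplate /></Representation>
<Representation id="3" bandwidth="4000000"><SegmentTemplate /></Representation>
<Representation id="4" bandwidth="12000000"><SegmentTemplate /></Representation>
</AdaptationSet>'''

# Collect the full text of every matching element
matches = [m.group(0) for m in re.finditer(reg, xml)]

# Pull the id out of each match just to show which elements were kept
matched_ids = [re.search(r'id="(\d+)"', m).group(1) for m in matches]
print(matched_ids)
```

If several elements shared one line, the lazy .*? could run from one element's opening tag to a later element's bandwidth, which is one more reason to prefer a parser.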

Here is an example using SimplifiedDoc:

from simplified_scrapy import SimplifiedDoc
html = '''Your xml'''
doc = SimplifiedDoc(html)
n = '4'
# Raw strings keep \d as a regex escape rather than a Python string escape
Representations = doc.selects('Representation|representation').containsReg(r'([' + n + r'-9]\d{6}|\d{8})[\d]*', attr='bandwidth')
print(Representations)

Result:

[{'id': '3', 'mimeType': 'video/mp4', 'codecs': 'avc1.4d401f', 'width': '768', 'height': '432', 'frameRate': '24', 'sar': '1:1', 'startWithSAP': '1', 'bandwidth': '4000000', 'tag': 'Representation', 'html': '\n        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />\n    '}]

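User

#2 · edited 2024-06-06 13:48:27

Try this more robust regex.

Input:
range, a digit 1-9

Output:
bw[0] contains the entire element, from opening tag to closing tag
bw[2] contains the bandwidth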

>>> import re
>>>
>>> range = "2"
>>>
>>> regx = r"(?s)(<[Rr]epresentation(?=\s)(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\s)[Bb]andwidth\s*=\s*(?:(['\"])\s*0*([" + \
...        range + \
...        r"-9]\d{6}|[1-9]\d{7,17})\s*\2))(?=(\s+(?:\".*?\"|'.*?'|[^>]*?)+>))\4(?<!/>).*?</[Rr]epresentation\s*>)"
>>>
>>> txt = """
...  <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
...      <Representation id="1"
...         mimeType="video/mp4"
...         codecs="avc1.4d401f"
...         width="512"
...         height="288"
...         frameRate="24"
...         sar="1:1"
...         startWithSAP="1"
...         bandwidth="1000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
...       </Representation>
...       <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
...       </Representation>
...       <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
...         <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
...       </Representation>
...     </AdaptationSet>
... """
>>>
>>> bands = re.findall( regx, txt )
>>> for bw in bands:
...     print ( bw[2] + " : " )
...     print ( bw[0] )
...     print ( "" )
...
2000000 :
<Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>

4000000 :
<Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>

>>>
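The numeric part of that pattern, 0*([range-9]\d{6}|[1-9]\d{7,17}), is what performs the "at least N million" comparison as text: either exactly seven digits with a leading digit of at least range, or any number with 8 to 18 digits. Here is a minimal sketch that isolates just that piece so it can be checked on its own. The helper name at_least_millions and the test values are hypothetical and not part of the answer above.

```python
import re

def at_least_millions(first_digit):
    """Build a pattern for integers >= first_digit * 1,000,000 (first_digit is '1'..'9')."""
    # 0* skips leading zeros; then either a 7-digit number whose first digit
    # is in [first_digit-9], or a number with 8 to 18 digits.
    return re.compile(r"0*(?:[" + first_digit + r"-9]\d{6}|[1-9]\d{7,17})")

pat = at_least_millions("2")

# fullmatch requires the whole string to fit the pattern
results = {s: bool(pat.fullmatch(s))
           for s in ["999999", "1000000", "2000000", "10000000", "02000000"]}
for value, ok in results.items():
    print(value, ok)
```

This only works for thresholds of the form d000000. Any other threshold needs a different digit pattern, which is a fair argument for comparing integers instead.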

User

#3 · edited 2024-06-06 13:48:27

Update

I recommend the ElementTree version at the bottom. But here is a regex version, as requested:

import re

txt = """
 <AdaptationSet segmentAlignment="true" maxWidth="1280" maxHeight="720" maxFrameRate="24" par="16:9">
     <Representation id="1" 
        mimeType="video/mp4" 
        codecs="avc1.4d401f" 
        width="512" 
        height="288" 
        frameRate="24" 
        sar="1:1" 
        startWithSAP="1" 
        bandwidth="1000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="2" mimeType="video/mp4" codecs="avc1.4d401f" width="512" height="288" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="2000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_512_640K_video_$Number$.mp4" startNumber="1" initialization="BBB_512_640K_video_init.mp4" />
      </Representation>
      <Representation id="3" mimeType="video/mp4" codecs="avc1.4d401f" width="768" height="432" frameRate="24" sar="1:1" startWithSAP="1" bandwidth="4000000">
        <SegmentTemplate timescale="12288" duration="61440" media="BBB_768_1440K_video_$Number$.mp4" startNumber="1" initialization="BBB_768_1440K_video_init.mp4" />
      </Representation>
    </AdaptationSet>
"""

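min_bandwidth = 2000000  # threshold; named so it does not shadow the built-in input()

# Step 1: split the document into one string per Representation element
reps = re.findall(r'<\s*representation(?:\s*\w+="[^"]*")*\s*>.*?<\/\s*representation\s*>',
    txt, flags=re.IGNORECASE | re.DOTALL)

# Step 2: read each element's bandwidth as an int and keep those above the threshold
for rep in reps:
    bandwidth = int(re.search(r'bandwidth="([^"]*)"', rep, flags=re.IGNORECASE).group(1))
    if bandwidth > min_bandwidth:
        print(rep)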

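I think this is easier to do in two steps:

1. Split the document into Representation elements, one chunk each. The regex above does this, but you could probably replace the attribute-matching part (the non-capturing group (?:\s*\w+="[^"]*")* together with the \s*> after it) with something simpler like [^>]*?>, since all you need is the whole Representation element and its children. To break the whole regex down:
- <\s* - matches < followed by zero or more whitespace characters
- representation - matches the tag name literally. The IGNORECASE flag makes sure it matches any capitalization
- (?:\s*\w+="[^"]*")* - matches zero or more attributes of the form blah_blah="value123", including the whitespace before each one. The (?: makes it a non-capturing group, so it is not available afterwards through Python's group() method. It exists only so that it can be repeated, that is, (?:...)* for zero or more attributes. Again, since the attributes don't need to be matched here, this could be simplified to something like [^>]*?>, but it worked for me
- \s*> - optional whitespace followed by >
- .*? - the content of the element (including newlines, thanks to the DOTALL flag), matched non-greedily so that we stop at the first closing tag we meet and don't run on to a later one
- <\/\s*representation\s*> - the closing tag, with optional whitespace
Once we have each Representation element, we can extract its bandwidth into a plain Python integer for easy comparison with the input.
2. Filter on the bandwidth value.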
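The two steps, with the simplified [^>]*?> opening-tag pattern suggested in step 1, might look roughly like this. It is a sketch rather than a drop-in replacement: the function name representations_above, the constant names, and the shortened sample are invented for illustration.

```python
import re

# Step 1 pattern: opening tag (any attributes), lazy body, first closing tag
CHUNK_RE = re.compile(r'<\s*representation[^>]*?>.*?</\s*representation\s*>',
                      re.IGNORECASE | re.DOTALL)
BANDWIDTH_RE = re.compile(r'bandwidth="(\d+)"', re.IGNORECASE)

def representations_above(xml_text, threshold):
    """Return the text of every Representation whose bandwidth exceeds threshold."""
    kept = []
    for chunk in CHUNK_RE.findall(xml_text):               # step 1: chunk
        match = BANDWIDTH_RE.search(chunk)
        if match and int(match.group(1)) > threshold:      # step 2: filter as int
            kept.append(chunk)
    return kept

# Hypothetical sample; the first element spreads its attributes over two lines
sample = '''<AdaptationSet>
  <Representation id="1"
      bandwidth="1000000">
    <SegmentTemplate media="a.mp4" />
  </Representation>
  <Representation id="2" bandwidth="2000000"><SegmentTemplate media="b.mp4" /></Representation>
  <Representation id="3" bandwidth="4000000"><SegmentTemplate media="c.mp4" /></Representation>
</AdaptationSet>'''

for chunk in representations_above(sample, 2000000):
    print(chunk)
```

Because [^>] also matches newlines, attributes spread over several lines are handled without extra flags. A self-closing Representation tag would still confuse the pattern, since it would be read as an opening tag.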

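I think extracting the bandwidth as an integer and comparing it with the input is easier than doing the integer comparison inside the regex.

Also note that the code does not handle the case where there is no bandwidth attribute, or more than one. With none, re.search returns None and the .group(1) call raises an AttributeError; with more than one, the first is used silently. There may be other fragile aspects too.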
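One way to guard against those two cases is to count the bandwidth attributes in the opening tag before converting. This is only a sketch: read_bandwidth is a hypothetical helper, and cutting the opening tag at the first > is a simplification that assumes no attribute value contains a > character.

```python
import re

# \b keeps names such as maxbandwidth from matching
BANDWIDTH_RE = re.compile(r'\bbandwidth="(\d+)"', re.IGNORECASE)

def read_bandwidth(rep):
    """Return the bandwidth of one Representation chunk, or None if it is
    missing or appears more than once in the opening tag."""
    end = rep.find(">")
    # Look only at the opening tag so attributes of child elements are ignored
    opening_tag = rep if end == -1 else rep[:end + 1]
    values = BANDWIDTH_RE.findall(opening_tag)
    if len(values) != 1:
        return None
    return int(values[0])

print(read_bandwidth('<Representation id="3" bandwidth="4000000"><X /></Representation>'))
print(read_bandwidth('<Representation id="4"><X bandwidth="5" /></Representation>'))
```

The caller can then skip or report chunks for which the helper returns None instead of crashing.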

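Here is the version that uses ElementTree. The reason this is usually better is that you don't have to rely on your own ability to handle the details of every possible combination of XML syntax. Using a library means its authors have already thought through all of that, and all you need to match are a few small pieces such as element and attribute names, so the code is much less likely to break. But maybe this is a homework question or something.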

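import xml.etree.ElementTree as ET

min_bandwidth = 4000  # threshold; named so it does not shadow the built-in input()
tree = ET.parse('content.xml')
root = tree.getroot()
# findall('Representation') searches only the direct children of root
nodes = [n for n in root.findall('Representation') if int(n.attrib['bandwidth']) >= min_bandwidth]
print(nodes)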
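For trying this without a content.xml file, here is a self-contained sketch that parses an inline string instead. The SAMPLE document and the function name select_representations are made up for illustration. It uses root.iter() rather than findall() so that Representation elements are found at any depth, and it assumes the document has no XML namespace, as in the sample used in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample document
SAMPLE = """
<AdaptationSet segmentAlignment="true">
  <Representation id="1" bandwidth="1000000"><SegmentTemplate media="a.mp4" /></Representation>
  <Representation id="2" bandwidth="2000000"><SegmentTemplate media="b.mp4" /></Representation>
  <Representation id="3" bandwidth="4000000"><SegmentTemplate media="c.mp4" /></Representation>
</AdaptationSet>
"""

def select_representations(xml_text, min_bandwidth):
    """Return Representation elements whose bandwidth is >= min_bandwidth."""
    root = ET.fromstring(xml_text.strip())
    selected = []
    for rep in root.iter("Representation"):   # walks the whole tree
        try:
            bandwidth = int(rep.get("bandwidth", ""))
        except ValueError:
            continue                           # missing or non-numeric bandwidth
        if bandwidth >= min_bandwidth:
            selected.append(rep)
    return selected

for rep in select_representations(SAMPLE, 2000000):
    print(rep.get("id"), rep.get("bandwidth"))
    # tostring() serializes the element together with its children
    print(ET.tostring(rep, encoding="unicode").strip())
```

Printing ET.tostring(rep, ...) gives the element text back, whereas printing the list of nodes directly only shows Element objects.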
