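<p>For a very simple task that consists of analysing a string rather than parsing it (parsing = building a tree representation of the text), you can do the following:</p>
<p>The text:</p>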
<pre><code>ss = '''
Humpty Dumpty sat on a wall
<div class="class1">
Stock Number:
Z2079
<br>
**VIN:
2T2HK31UX9C110701**
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
Humpty Dumpty had a great fall
<ul cat="zoo">
Stock Number:
ARDEN3125
<br>
**VIN:
SHAKAMOSK-230478-UBUN**
</br>
Model Code:
101
<img class="imgcert" src="/images/Magana_cpo.jpg">
</ul>
All the king's horses and all the king's men
<artifice>
<baradino>
Stock Number:
DERT5178
<br>
**VIN:
Pandaia-67-Moro**
<br>
Model Code:
1234
<img class="imgcert" src="/images/Pertuis_cpo.jpg">
</baradino>
what what what who what
<somerset who="maugham">
Nothing to declare
</somerset>
</artifice>
Couldn't put Humpty Dumpty again
<ending rtf="simi">
Stock Number:
ZZZ789
<br>
**VIN:
0000012554-ENDENDEND**
<br>
Model Code:
QS78-9
<img class="imgcert" src="/images/Sunny_cpo.jpg">
</ending>
qsdjgqsjkdhfqjkdhgfjkqshgdfkjqsdjfkh'''
</code></pre>
<p>The code:</p>
^{pr2}$
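<p>A minimal sketch of what such code could look like, assembled from the pattern fragments quoted in the explanation further down. The placement of the negative lookahead right after the opening tag, the name <code>VIN_REGEX</code> and the helper <code>find_vins</code> are assumptions made for illustration, not necessarily the exact code of the answer.</p>

```python
import re

# Assumed reconstruction: the lookahead-free regex quoted later in this
# answer, with the negative lookahead inserted right after the opening tag.
VIN_REGEX = re.compile(
    r'<([^ >]+) ?([^>]*)>'                     # opening tag: name and attributes
    r'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\1>)'   # skip a tag holding another attribute-less tag ahead of a <br>
    r'.*?\*\*VIN:(.+?)\*\*.+?</\1>',           # the VIN, then the matching closing tag
    re.DOTALL)


def find_vins(text):
    """Return a (tag, attributes, vin) tuple for each element holding a VIN."""
    return [(m.group(1), m.group(2), m.group(3).strip())
            for m in VIN_REGEX.finditer(text)]
```

<p>Under these assumptions, <code>for el in find_vins(ss): print(el)</code> lists one tuple per element that directly contains a VIN.</p>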
<p>The result:</p>
<pre><code>('div' , 'class="class1"' , '2T2HK31UX9C110701' )
('ul' , 'cat="zoo"' , 'SHAKAMOSK-230478-UBUN' )
('baradino' , '' , 'Pandaia-67-Moro' )
('ending' , 'rtf="simi"' , '0000012554-ENDENDEND' )
</code></pre>
<p><code>re.DOTALL</code> is required so that the dot can match newlines (by default, the dot in a regex pattern matches every character except a newline).</p>
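<p>The effect of the flag can be checked on a tiny string. This is only an illustration; the sample text and variable names are made up for the purpose.</p>

```python
import re

text = 'VIN:\n2T2HK31UX9C110701'

# Default behaviour: the dot refuses '\n', so 'VIN:.+' finds nothing.
without_dotall = re.search(r'VIN:.+', text)

# With re.DOTALL the dot also matches '\n', so the value is reached.
with_dotall = re.search(r'VIN:.+', text, re.DOTALL)

print(without_dotall)               # None
print(repr(with_dotall.group()))    # 'VIN:\n2T2HK31UX9C110701'
```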
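<p><code>\\1</code> (which would be written <code>\1</code> in a raw string) specifies that, at this position in the string being examined, there must be the same text as the one captured by the first group, that is, by <code>([^ >]+)</code>.</p>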
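<p>A short sketch of how such a backreference behaves. The pattern <code>tag_pair</code> is a simplified, hypothetical one, not the regex of this answer.</p>

```python
import re

# '\\1' in a normal string and r'\1' in a raw string are the same two characters.
assert '\\1' == r'\1'

# The closing tag must repeat whatever name group 1 captured.
tag_pair = re.compile(r'<([^ >]+)>(.*?)</\1>', re.DOTALL)

same_name = tag_pair.search('<b>bold</b>')
other_name = tag_pair.search('<b>bold</i>')

print(same_name.groups())   # ('b', 'bold')
print(other_name)           # None: '</i>' does not repeat 'b'
```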
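<p><code>'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'</code> is the part that forbids finding any tag other than <code><br></code> before the first <code><br></code> tag encountered between the opening tag and the closing tag of an HTML element.<br/>
This part is needed to catch the closest tag preceding the <code><br></code> of the VIN part.<br/>
If this part is absent, the regex</p>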
<pre><code>import re

# Raw strings keep the regex escapes intact; '\\1' is written \1 here.
regx = re.compile(r'<([^ >]+) ?([^>]*)>'
                  r'.*?\*\*VIN:(.+?)\*\*.+?</\1>',
                  re.DOTALL)
</code></pre>
<p>captures the following results:</p>
<pre><code>('div' , 'class="class1"' , '2T2HK31UX9C110701' )
('ul' , 'cat="zoo"' , 'SHAKAMOSK-230478-UBUN' )
('artifice' , '' , 'Pandaia-67-Moro' )
('ending' , 'rtf="simi"' , '0000012554-ENDENDEND' )
</code></pre>
<p>The difference is that <code>'artifice'</code> is captured instead of <code>'baradino'</code>.</p>
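<p>The difference can be reproduced on a trimmed nested sample. This is a sketch only: the sample string, the helper <code>tag_names</code> and the way the three pattern pieces are concatenated (lookahead right after the opening tag) are assumptions.</p>

```python
import re

OPEN_TAG = r'<([^ >]+) ?([^>]*)>'
LOOKAHEAD = r'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\1>)'
VIN_PART = r'.*?\*\*VIN:(.+?)\*\*.+?</\1>'

with_lookahead = re.compile(OPEN_TAG + LOOKAHEAD + VIN_PART, re.DOTALL)
without_lookahead = re.compile(OPEN_TAG + VIN_PART, re.DOTALL)

# Hypothetical sample: an outer element wrapping the one that holds the VIN.
nested = '''<artifice>
<baradino>
<br>
**VIN:
Pandaia-67-Moro**
<br>
</baradino>
</artifice>'''


def tag_names(regex, text):
    """Return the tag name captured by each match of regex in text."""
    return [m.group(1) for m in regex.finditer(text)]


print(tag_names(with_lookahead, nested))      # ['baradino']
print(tag_names(without_lookahead, nested))   # ['artifice']
```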