Using Python regular expressions to replace <ul...> and <li...> tags with <ul> and <li>

0 votes

2 answers

1230 views

Asked 2025-04-17 22:04

Hello, I want to use Python regular expressions to remove all attributes from the <ul> and <li> tags. Here is my source string:

peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>
    <li id="ul0002-0004" num="0000">0.1 to 0.2 mg of cyproterone acetate,</li>peanut butter3
</ul>

The output I want is:

peanut butter1
<ul>peanut butter2
    <li>2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li>0.020 mg of ethinylestradiol;</li>
    <li>0.25 to 0.30 mg of drospirenone and</li>
    <li>0.1 to 0.2 mg of cyproterone acetate,</li>peanut butter3
</ul>

Tags: regular expressions, string replacement, data cleaning, HTML tag handling

2 Answers

Try this:

    >>> import re
    >>> xs = '<li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>'
    >>> p = r'(<li|<ul|</ul)[^>]*(>)(.*)'
    >>> match = re.search(p, xs)
    >>> ''.join([match.group(1), match.group(2), match.group(3)])
    '<li>2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>'
    >>> xs = '<ul id="ul0002" list-style="none">'
    >>> match = re.search(p, xs)
    >>> ''.join([match.group(1), match.group(2), match.group(3)])
    '<ul>'
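
One caveat with the session above: `re.search` returns `None` for a line that contains no tag (such as `peanut butter1`), so calling `match.group` on it raises an error. Below is a minimal sketch that wraps the same pattern in a helper and leaves such lines alone. The name `strip_first_tag` is hypothetical and not part of the original answer, and the helper only rewrites the first tag it finds on each line.

```python
import re

# Same pattern as in the session above.
PATTERN = r'(<li|<ul|</ul)[^>]*(>)(.*)'

def strip_first_tag(line):
    """Hypothetical helper: drop the attributes of the first matching tag in `line`."""
    match = re.search(PATTERN, line)
    if match is None:
        # No <li, <ul or </ul on this line: return it unchanged.
        return line
    # Keep the text before the match and anything after it (e.g. the newline,
    # which `.` does not match).
    rebuilt = ''.join(match.group(1, 2, 3))
    return line[:match.start()] + rebuilt + line[match.end():]

for line in ['peanut butter1\n', '<ul id="ul0002" list-style="none">peanut butter2\n']:
    print(strip_first_tag(line), end='')
```

Because only the first tag per line is handled, a line that holds several tags still needs a substitution-based approach.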

Answered 2025-04-17 by Python大师


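    import re
    # Strip the attributes from every <ul ...> and <li ...> opening tag, line by line.
    with open('sample.html') as f:
        for line in f:
            # Each line already ends with a newline, so print must not add another.
            print(re.sub(r'<(ul|li)[^>]*>', r'<\1>', line, flags=re.I), end='')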

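The code above removes the attributes from all <ul> and <li> tags, whether they appear one per line or several per line. Because re.I is used, the search is case-insensitive, so tags such as <UL ...> are also found and stripped of their attributes. Text outside the tags is not affected.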

With your (revised) sample HTML, the code above produces:
```
peanut butter1
<ul>peanut butter2
    <li>2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li>0.020 mg of ethinylestradiol;</li>
    <li>0.25 to 0.30 mg of drospirenone and</li>
    <li>0.1 to 0.2 mg of cyproterone acetate,</li>peanut butter3
</ul>
```
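
The case-insensitive matching and the several-tags-per-line behaviour described earlier can be checked with a short sketch. It also shows one pitfall: the pattern as written matches any tag whose name merely starts with `ul` or `li`, such as `<link>`. The `\b` word boundary in `STRICT` is my addition, not part of the original answer, and the sample strings are made up for illustration.

```python
import re

# Pattern from the answer, and a variant with a word boundary after the tag name.
LOOSE = re.compile(r'<(ul|li)[^>]*>', flags=re.I)
STRICT = re.compile(r'<(ul|li)\b[^>]*>', flags=re.I)

# Several tags on one line, in mixed case; closing tags are left alone.
line = '<UL class="a"><li id="x">one</li><LI id="y">two</li></UL>'
print(LOOSE.sub(r'<\1>', line))   # <UL><li>one</li><LI>two</li></UL>

# A <link> tag starts with "li", so the loose pattern rewrites it.
head = '<link rel="stylesheet" href="s.css">'
print(LOOSE.sub(r'<\1>', head))   # <li>
print(STRICT.sub(r'<\1>', head))  # <link rel="stylesheet" href="s.css">
```

If the input only ever contains list markup, as in the question, the two patterns behave the same.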
Processing the whole file at once

If the data is not too large, you can process it all in one go instead of line by line:
```
import re

# Read the whole file, then strip the attributes in a single pass.
with open('sample.html') as f:
    html = f.read()
html = re.sub(r'<(ul|li)[^>]*>', r'<\1>', html, flags=re.I)
print(html)
```
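
One practical difference between the two approaches, sketched below with a made-up input: `[^>]*` also matches newlines, so substituting over the whole text handles an opening tag whose attributes wrap onto a second line, while the line-by-line loop never sees such a tag in one piece. The sample string here is an assumption for illustration and does not come from the question.

```python
import re

PATTERN = re.compile(r'<(ul|li)[^>]*>', flags=re.I)

# Hypothetical input: the <ul> tag's attributes are split across two lines.
text = '<ul id="ul0002"\n    list-style="none">\n<li id="a">item</li>\n</ul>\n'

# Whole-text substitution: the wrapped <ul ...> tag is matched as one unit.
whole = PATTERN.sub(r'<\1>', text)

# Line-by-line substitution: neither half of the wrapped tag matches on its own.
by_line = ''.join(PATTERN.sub(r'<\1>', line)
                  for line in text.splitlines(keepends=True))

print(whole)
print(by_line)
```

With the sample from the question, where every tag sits on a single line, both approaches give the same result.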

Answered 2025-04-17 by Python大师
