Replace the SRC of all IMG elements using a parser

Asked 2025-04-15 15:07

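I am looking for a way to replace the SRC attribute of every IMG tag without using regular expressions. (I would like to use whatever out-of-the-box HTML parser comes with a default Python install.) I need to change the source to: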

<img src="cid:imagename">

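I want every src attribute to point to the cid of an attachment in an HTML email, so I also need to reduce the source to just the file name, without the path or the extension.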

html parsing, attachment handling, img tag, src attribute, cid

2 Answers

Here is a pyparsing approach to your problem. You will need to write your own code to transform the http src attribute.

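from urllib.request import urlopen  # Python 3 replacement for urllib2

from pyparsing import makeHTMLTags

# makeHTMLTags returns (opening tag, closing tag) expressions; only the opening tag is needed
imgtag = makeHTMLTags("img")[0]

page = urlopen("http://www.yahoo.com")
html = page.read().decode("utf-8", errors="replace")
page.close()

# print(html)

def modifySrcRef(tokens):
    # Rebuild the <img> tag from its parsed attributes
    ret = "<img"
    for k, i in tokens.items():
        # skip pyparsing's bookkeeping names ("tag" shows up in newer releases)
        if k in ("startImg", "empty", "tag"): continue
        if k.lower() == "src":
            # or do whatever with this
            i = i.upper()
        ret += ' %s="%s"' % (k, i)
    return ret + " />"

imgtag.setParseAction(modifySrcRef)

print(imgtag.transformString(html))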

The tags are converted to:

<img src="HTTP://L.YIMG.COM/A/I/WW/BETA/Y3.GIF" title="Yahoo" height="44" width="232" alt="Yahoo!" />
<a href="r/xy"><img src="HTTP://L.YIMG.COM/A/I/WW/TBL/ALLYS.GIF" height="20" width="138" alt="All Yahoo! Services" border="0" /></a>
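The parse action above only upper-cases the src value as a placeholder. As a rough sketch of the transformation the question actually asks for (file name only, no path, no extension, prefixed with cid:), something like the following could be dropped in instead. The helper name src_to_cid and the example URLs are made up for illustration, and it assumes the Content-ID of each attachment is set to the bare file name.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def src_to_cid(src):
    """Map an image URL or relative path to a 'cid:' reference (hypothetical helper)."""
    # urlsplit separates out any query string or fragment; URL paths always use '/'
    path = unquote(urlsplit(src).path)
    # .stem is the last path component without its final extension
    return "cid:" + PurePosixPath(path).stem


print(src_to_cid("http://example.com/static/img/logo.png?v=3"))  # cid:logo
print(src_to_cid("images/photo.final.jpg"))                      # cid:photo.final
```

Inside modifySrcRef, the line that upper-cases the value would then become i = src_to_cid(i). Note that only the last extension is stripped, so photo.final.jpg maps to cid:photo.final.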

Answered 2025-04-15 by Python大师


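The Python standard library does include an HTML parser, but it is not very convenient to use, and it has been deprecated since Python 2.6 (this presumably refers to the htmllib module, which was removed in Python 3). BeautifulSoup makes this kind of task much simpler: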

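# BeautifulSoup 4 (pip install beautifulsoup4); the older BeautifulSoup 3 import,
# "from BeautifulSoup import BeautifulSoup", only works on Python 2.
from bs4 import BeautifulSoup
from os.path import basename, splitext

soup = BeautifulSoup(my_html_string, "html.parser")
for img in soup.find_all('img', src=True):  # skip <img> tags that have no src
    img['src'] = 'cid:' + splitext(basename(img['src']))[0]
my_html_string = str(soup)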
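For anyone who wants to stay with what ships with Python, as the question asks: Python 3's standard library includes html.parser.HTMLParser, and a minimal sketch along these lines may be enough. The names ImgSrcRewriter, cid_for and rewrite_img_src are invented for this example, and it is not a full HTML serializer: everything except img tags is copied through as the parser reports it, and constructs it has no handler for (processing instructions, for example) are dropped.

```python
from html import escape
from html.parser import HTMLParser
from os.path import basename, splitext
from urllib.parse import urlsplit


def cid_for(src):
    """Turn 'http://host/path/logo.png' into 'cid:logo' (hypothetical helper)."""
    filename = basename(urlsplit(src).path)
    return "cid:" + splitext(filename)[0]


class ImgSrcRewriter(HTMLParser):
    """Re-emits the document, changing only the src value of <img> tags."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_start(self, tag, attrs):
        if tag != "img":
            # Any other start tag is copied through verbatim
            self.out.append(self.get_starttag_text())
            return
        parts = ["<img"]
        for name, value in attrs:
            if value is None:  # valueless attribute, e.g. <img ismap>
                parts.append(" " + name)
                continue
            if name == "src":
                value = cid_for(value)
            # the parser unescapes attribute values, so escape them again
            parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        parts.append(">")
        self.out.append("".join(parts))

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_img_src(html_text):
    parser = ImgSrcRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(rewrite_img_src('<p><img src="images/logo.png" alt="Logo"></p>'))
```

The last line prints <p><img src="cid:logo" alt="Logo"></p>. For messy real-world markup BeautifulSoup is likely to be more forgiving; this sketch mainly shows that no regular expressions and no third-party package are strictly required.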

Answered 2025-04-15 by Python大师
