How do I perform HTML encoding/decoding in Python/Django?

171 votes

15 answers

241149 views

Asked 2025-04-11 17:59

I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change it to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

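I want this string to be treated as HTML so that the browser renders it as an image instead of displaying it as text.

The string is in this format because I am using a web-scraping tool called BeautifulSoup. It "scans" a web page, extracts certain content from it, and returns the string in this format.

I have found how to do this in C#, but not in Python. Can anyone help?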

15 Answers

For HTML encoding, there is cgi.escape in the standard library (Python 2; it was removed in Python 3.8 in favor of html.escape):

>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.
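The decoding snippet above targets Python 2 (`htmlentitydefs`, `unichr`). A rough Python 3 port of the same regex idea might look like the sketch below. The function name `unescape_named` and the shortened sample string are made up for illustration, and the sketch only handles named entities, so numeric references such as `&#39;` are left untouched.

```python
import re
from html.entities import name2codepoint  # Python 3 location of htmlentitydefs

# One alternation over every known entity name, e.g. "quot|amp|lt|gt|..."
_ENTITY_RE = re.compile(r"&(%s);" % "|".join(map(re.escape, name2codepoint)))


def unescape_named(s):
    """Replace named entities such as &lt; with the character they stand for."""
    return _ENTITY_RE.sub(lambda m: chr(name2codepoint[m.group(1)]), s)


# Shortened, hypothetical stand-in for the string in the question
encoded = "&lt;img alt=&quot;&quot; width=&quot;300&quot; /&gt;"
print(unescape_named(encoded))
```

Because the substitution is a single left-to-right pass, a doubly escaped `&amp;lt;` comes back as `&lt;` rather than being decoded twice.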

Answered 2025-04-11 by Python大师


172

Using the standard library:

HTML escaping

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML unescaping

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))
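One difference worth knowing when moving between the two `escape` functions: `cgi.escape` left quotation marks alone unless `quote` was set, while `html.escape` escapes both `"` and `'` by default. The sketch below (Python 3 only, with a made-up sample string) shows both modes and a round trip through `html.unescape`.

```python
from html import escape, unescape

# Hypothetical sample containing every character escape() can touch
raw = '<a href="x">Tom & Jerry\'s</a>'

full = escape(raw)                  # quote=True is the default: " and ' are escaped too
partial = escape(raw, quote=False)  # only &, < and > are escaped

print(full)
print(partial)

# unescape() reverses either form back to the original text
print(unescape(full) == raw)
```

If the escaped text ends up inside an HTML attribute value, the default `quote=True` is the safer choice.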

Answered 2025-04-11 by Python大师


142

For the Django use case, there are two answers. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;')
                     .replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

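To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems: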

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
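The ordering point can be checked with a small experiment. The two helper functions below are hypothetical, trimmed-down variants of `html_decode` that differ only in when `&amp;` is replaced. The input is the literal text `&lt;` after one round of escaping, so a correct decoder should return `&lt;`, not `<`.

```python
def decode_amp_last(s):
    """Replace &amp; last, as html_decode above does."""
    for char, entity in (("<", "&lt;"), (">", "&gt;"), ("&", "&amp;")):
        s = s.replace(entity, char)
    return s


def decode_amp_first(s):
    """Replace &amp; first: the freshly produced '&' is then decoded again."""
    for char, entity in (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;")):
        s = s.replace(entity, char)
    return s


escaped_once = "&amp;lt;"  # the text "&lt;" after a single escape pass

print(decode_amp_last(escaped_once))   # one level of escaping removed
print(decode_amp_first(escaped_once))  # over-decoded by one level
```

Replacing `&amp;` first turns `&amp;lt;` into `&lt;`, which the next replacement then turns into `<`, so the string is decoded twice.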

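This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is better to stick with the standard library: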

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0 to 3.3 (HTMLParser.unescape is deprecated since 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
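As a quick check, here is a sketch that applies `html.unescape` (Python 3.4+) to the string from the question. The variable names `encoded` and `decoded` are chosen for illustration, and the line continuations of the original literal are replaced by implicit string concatenation.

```python
from html import unescape

# The HTML-encoded string from the question, joined into a single line
encoded = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)

decoded = unescape(encoded)
print(decoded)

# Unlike a fixed replacement table, unescape() also handles numeric and
# other named character references
print(unescape("&#39; &#x27; &eacute;"))
```

The result is ordinary markup again, so it can be handed to whatever renders the page.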

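As a suggestion: it may make more sense to store the HTML unescaped in your database. If possible, it is worth looking into getting unescaped results back from BeautifulSoup, which would avoid this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping, you just tell the templating engine not to escape your string. To do that, use one of these options in your template: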

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

Answered 2025-04-11 by Python大师

