Decode HTML entities in a Python string?

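User

#1 · Edited 2024-04-20 06:19:58

Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived documentation). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3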

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

User

#2 · Edited 2024-04-20 06:19:58

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI: html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was eventually removed in Python 3.9.
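As a rough illustration of what html.unescape() covers, the sketch below decodes a named, a decimal, and a hexadecimal character reference for the same character. The sample strings and variable names are made up for this example.

```python
import html

# Three spellings of the pound sign: named, decimal, and hexadecimal references.
samples = {
    "named": "&pound;682m",
    "decimal": "&#163;682m",
    "hex": "&#xA3;682m",
}

# html.unescape() converts every character reference in the string to Unicode.
decoded = {kind: html.unescape(text) for kind, text in samples.items()}
for kind, text in decoded.items():
    print(kind, text)

# html.escape() escapes &, < and > (and quotes by default);
# html.unescape() turns the escaped markup back into the original text.
round_trip = html.unescape(html.escape("<p>Tom & Jerry</p>"))
print(round_trip)
```

All three spellings decode to the same string, so the function can be applied without knowing which form the source used.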

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7, it is in the HTMLParser module
For Python 3, it is in the html.parser module

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
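If the same code has to run on both old and new interpreters without six, one possible approach is to try the newest import first and fall back. This is only a sketch: decode_entities is a hypothetical helper name, not part of any library.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3: method on the parser class
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def decode_entities(text):
    """Hypothetical wrapper: decode HTML entities with whichever unescape was found."""
    return unescape(text)


print(decode_entities("&pound;682m"))
```

Trying html.unescape first matters because the HTMLParser method no longer exists on recent Python 3 releases.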

User

#3 · Edited 2024-04-20 06:19:58

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m
