使用python编辑html，但是lxml将好的html实体转换为奇怪的编码

from BeautifulSoup import BeautifulSoup, Comment soup = BeautifulSoup(str(desc[desc_type])) comments = soup.findAll(text=lambda text:isinstance(text, Comment)) [comment.extract() for comment in comments] print soup

3条回答

网友

1楼 · 编辑于 2024-06-16 14:19:09

我假设应该是引号。字节值为146的str对象是一个引号：

In [46]: print(chr(146).decode('cp1252'))
’

所以，你可以这样做：

import lxml.html.clean as clean
import re

html = "<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>"

html=re.sub('&#(\d+);',lambda m: chr(int(m.group(1))).decode('cp1252'),html)
print(html)
# <div><!-- word style><bleep><omgz 1,000 tags><--><p>It’s a spicy meatball!</div>
print(type(html))
# <type 'unicode'>
print(clean.clean_html(html))
# <div><p>It’s a spicy meatball!</p></div>

或者

doc=lh.fromstring(html)
clean.clean(doc)

请注意，引号具有unicode码位值8217。也就是说，ord(chr(146).decode('cp1252'))等于8217，所以lh.tostring返回：

print(lh.tostring(doc))
# <div><p>It&#8217;s a spicy meatball!</p></div>

你可以在cp1252中重新编码如下：

print(repr(lh.tostring(doc,encoding='cp1252')))
# '<div><p>It\x92s a spicy meatball!</p></div>'

我不知道怎么哄lxml回来

'<div><p>It&#146;s a spicy meatball!</p></div>'

但是，要匹配美化组代码的输出。很明显，这可以用regex来实现（与我前面所做的相反），但我不知道这是必要的还是可取的，因为lxml应该已经返回了其他应用程序可以理解的html。

result=re.sub('&#(\d+);',lambda m: '&#{n};'.format(
    n=ord(unichr(int(m.group(1))).encode('cp1252'))),
            lh.tostring(doc))
print(result)
# <div><p>It&#146;s a spicy meatball!</p></div>

网友

2楼 · 编辑于 2024-06-16 14:19:09

您也可以将utf8字符串转换为带xml字符的ascii

result = result.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

网友

3楼 · 编辑于 2024-06-16 14:19:09

有几件事-如果你知道的话-将导致最简单/最好的解决方案：

clean_html()返回提供给它的相同类型：如果给它一个字符串，它将返回一个字符串，但是如果给它一个元素或元素树，它将分别返回一个元素或元素树
您可以通过为lxml.html.tostring()方法或树的write()方法提供编码选项来控制元素或元素树的序列化方式（顺便说一下，xml也是如此）。例如，您可以使用encoding='utf-8'来实现这一点。
任何可以在该编码中编码的内容，都将作为编码字符串输出，任何不能作为实体“转义”的内容。使用encoding="ascii"将强制任何非ascii字符按您所希望的方式使用“nice”实体。

总而言之，这意味着：首先将字符串解析为一个元素（如果您愿意，可以是树），清理它，然后根据需要序列化它：

html = lxml.html.fromstring("<div><!-- word style><bleep><omgz 1,000 tags><--><p>It&#146;s a spicy meatball!</div>")
html = clean_html(html)
result = lxml.html.tostring(html, encoding="ascii")

（还有一个稍微脏一点的技巧是在unicode字符串的encode()方法上使用errors参数：尝试用s.encode('ascii', 'xmlcharrefreplace')对包含“特殊”字符的unicode字符串进行编码，然后看看它能做什么…）

相关问题更多 >

编程相关推荐

热门问题

热门文章