A central way to filter invalid Unicode characters in lxml?

Asked 2025-04-18 03:30

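It is well known that some characters are not allowed in XML documents. I am aware of a few solutions for filtering them out (such as [1] and [2]).

To follow the DRY (don't repeat yourself) principle, I would like to apply one of these solutions in a single central place. Right now I have to sanitize every piece of potentially unsafe text myself before handing it to lxml. Is there a way to do this centrally, for example by subclassing an lxml filter class, catching some exception, or setting a configuration switch?

Edit: to make the question clearer, here is some sample code: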

from lxml import etree

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += u'\u0002'

Running this code produces the following output:

<root>&#65535;&#55296;</root>

Traceback (most recent call last):
  File "[…]", line 9, in <module>
    root.text += u'\u0002'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44956)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1395, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26485)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

As you can see, the \x02 byte raises an exception, but lxml happily accepts the other two out-of-range characters. The real trouble is that

s = "<root>&#65535;&#55296;</root>"
root = etree.fromstring(s)

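also raises an exception: lxml cannot parse the output it has just serialized. I find this behavior a bit unsettling, especially because it means lxml produces invalid XML documents.

It turns out this may be a Python 2 vs. Python 3 issue. Under Python 3.4, the code above already fails at the second assignment: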

Traceback (most recent call last):
  File "[…]", line 5, in <module>
    root.text += u'\ud800'
  File "lxml.etree.pyx", line 953, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:44971)
  File "apihelpers.pxi", line 677, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:20273)
  File "apihelpers.pxi", line 1387, in lxml.etree._utf8 (src/lxml/lxml.etree.c:26380)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 1: surrogates not allowed

The only remaining problem is the \uffff character, which lxml still happily accepts.

1 Answer

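Filter out the invalid characters before passing the string to lxml, for example with lawlesst's code snippet for cleaning invalid characters from XML.

I tried it with your code and it works well, but note that you need to add the imports for re and sys to the snippet!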

from lxml import etree
from cleaner import invalid_xml_remove

root = etree.Element("root")
root.text = u'\uffff'
root.text += u'\ud800' 

print(etree.tostring(root))

root.text += invalid_xml_remove(u'\u0002')
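
For readers who do not have the snippet at hand, here is a minimal sketch of what a cleaner module could look like. It is not lawlesst's code: it is a stand-in that reuses the name invalid_xml_remove, targets Python 3 (so it only needs re), and drops every character outside the XML 1.0 "Char" production. The helper set_clean_text is hypothetical as well; it is one simple way to get a single central place for the filtering, assuming the element exposes a .text attribute as lxml elements do.

```python
import re

# Everything NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This covers control characters, lone surrogates, and U+FFFE / U+FFFF.
_INVALID_XML_CHARS = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def invalid_xml_remove(text, replacement=u''):
    """Return text with all XML-1.0-invalid characters replaced.

    Hypothetical stand-in for the function of the same name in the
    answer; by default the offending characters are simply dropped.
    """
    return _INVALID_XML_CHARS.sub(replacement, text)


def set_clean_text(element, text):
    """Assign sanitized text to element.text (hypothetical helper).

    Routing every assignment through this one function keeps the
    filtering in a single place instead of at each call site.
    """
    element.text = invalid_xml_remove(text)
    return element
```

With this in place, a call such as set_clean_text(root, u'\uffff\ud800\u0002 some text') should hand lxml only characters it can both serialize and parse back. Passing replacement=u'\ufffd' instead would keep a visible marker where characters were removed.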

Answered 2025-04-18 by Python大师
