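Why does ElementTree raise a ParseError?

3 answers

User

Answer 1 · edited 2024-06-07 02:48:31

As @John Machin suggested, the files in question do contain dubious numeric entities, although the error message seems to point at the wrong place in the text. Perhaps the streaming nature of the parser and its buffering make it hard to report an accurate position.

In fact, all of these entities appear in the text: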

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])

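Most of them are not allowed: of this set, only &#x09;, &#x0A; and &#x0D; are legal in XML 1.0. The parser appears to be quite strict, so you will need to find another, more lenient parser or preprocess the XML.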
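One way to do the preprocessing mentioned above is to drop the offending character references before parsing. The following is a minimal Python 3 sketch, not a definitive implementation: the function name `strip_invalid_char_refs` and the `replacement` parameter are made up for illustration, and it assumes the whole document fits in memory as a string. It also rewrites references that appear inside comments or CDATA sections, where they are plain text, so treat it as a blunt tool.

```python
import re
import xml.etree.ElementTree as ET

# Decimal (&#12;) or hexadecimal (&#x0C;) numeric character references.
_CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def _is_xml_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def strip_invalid_char_refs(xml_text, replacement=""):
    """Replace references to code points that XML 1.0 forbids."""
    def fix(match):
        dec, hexa = match.groups()
        cp = int(dec) if dec is not None else int(hexa, 16)
        # Keep legal references untouched, swap out the illegal ones.
        return match.group(0) if _is_xml_char(cp) else replacement
    return _CHAR_REF.sub(fix, xml_text)


cleaned = strip_invalid_char_refs("<a>x&#1;y&#x0A;z&#xFFFF;</a>")
root = ET.fromstring(cleaned)  # parses once the bad references are gone
print(repr(cleaned), repr(root.text))
```

Passing something like `replacement="?"` instead of the empty string keeps a visible marker where data was removed.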

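User

Answer 2 · edited 2024-06-07 02:48:31

I'm not sure whether this answers your question, but if you want to handle the ParseError raised by ElementTree with an exception handler, you can do this: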

except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))

Source: http://effbot.org/zone/elementtree-13-intro.htm
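The handler above is only a fragment. A complete version might look like the Python 3 sketch below, which wraps an `iterparse` loop in `try`/`except` and remembers what was parsed before the failure. The helper name `tags_until_error` and the sample document are assumptions for illustration, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET


def tags_until_error(stream):
    """Run iterparse over a binary stream.

    Returns (tags, error): the tags of all elements whose end event was
    delivered, and the ParseError that stopped parsing (or None).
    """
    tags = []
    try:
        for _event, elem in ET.iterparse(stream):
            tags.append(elem.tag)
    except ET.ParseError as exc:
        return tags, exc
    return tags, None


# Hypothetical sample: the second child holds an illegal character reference.
sample = io.BytesIO(b"<r><ok/><bad>&#1;</bad></r>")
tags, err = tags_until_error(sample)
if err is not None:
    print("catastrophic failure:", err)
    print("last successful:", tags[-1] if tags else None)
    print("position (line, column):", err.position)
```

`ParseError.position` gives the line and column as a tuple, which is easier to work with than parsing the message text.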

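User

Answer 3 · edited 2024-06-07 02:48:31

Here are some ideas:

(0) Explain "a file" and "occasionally": do you really mean that it sometimes works and sometimes fails with the same file?

Do the following for each file that fails:

(1) Find out what is in the file at the position it complains about: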

text = open("the_file.xml", "rb").read()
err_col = 52459  # the column reported in the error message
print(repr(text[err_col-50:err_col+100]))  # should include the error text
print(repr(text[:50]))  # show the XML declaration

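(2) Run the file through a web-based XML validation service, e.g. http://www.validome.org/xml/ or http://validator.aborla.net/

Edit your question to show your findings.

Update: Here is a minimal XML file that illustrates your problem: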
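Step (1) can also be done without copying the column number by hand. The Python 3 sketch below reads the location from the exception's `position` attribute and slices the surrounding text. The helper name `error_context` is hypothetical, and the offset arithmetic assumes the text uses plain "\n" line endings and is small enough to hold in memory; for non-ASCII input the reported column may not line up exactly with string indices.

```python
import xml.etree.ElementTree as ET


def error_context(xml_text, before=50, after=100):
    """Return (message, snippet) for the first parse error, else None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        line, col = exc.position  # line counts from 1, column from 0
        lines = xml_text.split("\n")
        # Characters in all earlier lines (plus their newlines), then the column.
        offset = sum(len(l) + 1 for l in lines[:line - 1]) + col
        start = max(0, offset - before)
        return str(exc), xml_text[start:offset + after]
    return None


result = error_context("<a>&#1;</a>")
if result is not None:
    message, snippet = result
    print(message)
    print(repr(snippet))  # should include the offending text
```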

[badcharref.xml]
<a>&#1;</a>

[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>

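Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.

... or the numeric character reference is syntactically invalid, e.g. not terminated by a ;, written as &#not-a-digit, etc.

Update 2: I was wrong; the number in the ElementTree error message counts Unicode code points, not bytes. See the code below and the snippets of output from running it on the two failing files.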

# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough. 

BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
    or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend
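The script above is written for Python 2. For readers on Python 3, a rough equivalent could look like the sketch below. It is an adaptation under assumptions rather than the author's code: the function names are invented, and instead of a BYTE_OFFSETS switch it looks at the input type, reporting byte offsets for `bytes` and code-point offsets for `str`.

```python
import re

_PATTERN = r"&#([0-9]+);|&#x([0-9a-fA-F]+);"
_RX_STR = re.compile(_PATTERN)
_RX_BYTES = re.compile(_PATTERN.encode("ascii"))


def is_xml_char(cp):
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def find_invalid_char_refs(data):
    """Return [(offset, reference)] for references XML 1.0 forbids.

    Offsets are byte offsets for bytes input (useful for seeking in the
    raw file) and code-point offsets for str input.
    """
    is_bytes = isinstance(data, bytes)
    rx = _RX_BYTES if is_bytes else _RX_STR
    hits = []
    for m in rx.finditer(data):
        dec, hexa = m.groups()
        cp = int(dec) if dec is not None else int(hexa, 16)
        if not is_xml_char(cp):
            ref = m.group(0)
            hits.append((m.start(), ref.decode("ascii") if is_bytes else ref))
    return hits


text = "<a>\u00e9&#1;</a>"  # one non-ASCII character before the bad reference
print(find_invalid_char_refs(text))                  # code-point offset
print(find_invalid_char_refs(text.encode("utf-8")))  # byte offset
```

The two calls report different offsets for the same reference because the accented character takes one code point but two bytes in UTF-8, which is the distinction Update 2 describes.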

Output:

comments.xml
6615405 &#x10;
10205764 &#x00;
10213901 &#x00;
10213936 &#x00;
10214123 &#x00;
13292514 &#x03;
...
155656543 &#x1B;
155656564 &#x1B;
157344876 &#x10;
157722583 &#x10;

posts.xml
7607143 &#x1F;
12982273 &#x1B;
12982282 &#x1B;
12982292 &#x1B;
12982302 &#x1B;
12982310 &#x1B;
16085949 &#x1C;
16085955 &#x1C;
...
36303479 &#x12;
36303494 &#xFFFF; <<=== whoops
38942863 &#x10;
...
785292911 &#x08;
801282472 &#x13;
848911592 &#x0B;
