Python ElementTree decoding HTML entities

Question

I wrote a simple script that converts XML data into comma-separated format. Here is a sample of the XML source file:

<?xml version="1.0" encoding="utf-8"?>
<users>
<row Id="-1" Reputation="1" CreationDate="2010-08-10T15:50:26.953" DisplayName="Community" LastAccessDate="2010-08-10T15:50:26.953" Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person.&lt;/p&gt;&#xA;&#xA;&lt;p&gt;I'm a background process that helps keep this site clean!&lt;/p&gt;&#xA;&#xA;&lt;p&gt;I do things like&lt;/p&gt;&#xA;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;Randomly poke old unanswered questions every hour so they get some attention&lt;/li&gt;&#xA;&lt;li&gt;Own community questions and answers so nobody gets unnecessary reputation from them&lt;/li&gt;&#xA;&lt;li&gt;Own downvotes on spam/evil posts that get permanently deleted&lt;/li&gt;&#xA;&lt;li&gt;Own suggested edits from anonymous users&lt;/li&gt;&#xA;&lt;li&gt;&lt;a href=&quot;http://meta.stackexchange.com/a/92006&quot;&gt;Remove abandoned questions&lt;/a&gt;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;" Views="0" UpVotes="3732" DownVotes="2275" AccountId="-1" />
</users>

gist

The relevant parser code is as follows:

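# xml.etree.cElementTree was deprecated in Python 3.3 and removed in 3.9;
# xml.etree.ElementTree picks up the C accelerator automatically.
import xml.etree.ElementTree as cetree

def get_data_c(fn, columns):
    res = ''
    cols = columns.split(',')

    # Header line: the requested column names, comma-separated.
    for c in cols:
        res = res + c + ','

    res = res[:-1] + '\n'
    yield res

    # One output line per <row> element; missing attributes become empty fields.
    for event, elem in cetree.iterparse(fn):
        res = ''
        if elem.tag == "row":
            for c in cols:
                if c in elem.attrib:
                    res = res + elem.attrib[c] + ','
                else:
                    res = res + ','
            res = res[:-1] + '\n'
            yield res
            elem.clear()  # free the element once its line has been yielded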

gist: a link to the full script.

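My problem is that when I read the value of the AboutMe attribute, cElementTree automatically decodes the escaped HTML inside it (&lt; becomes <, &#xA; becomes a newline, and so on). Ideally I would like to keep the HTML exactly as it is escaped in the source file and just wrap it in quotes in the output file. Instead I get the decoded string, as can be seen in this gist. How can I tell cElementTree to keep the attribute's original value rather than turning it into HTML?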
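To make the behavior concrete, here is a minimal sketch that reproduces it. The one-row document in `xml_src` is a hypothetical, shortened stand-in for the real data, not the actual file. The decoding is what an XML parser is expected to do with entity and character references in attribute values, so the escaped form is not available from `attrib`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modeled on the <row> element in the question.
xml_src = '<users><row Id="-1" AboutMe="&lt;p&gt;Hi&lt;/p&gt;&#xA;" /></users>'

root = ET.fromstring(xml_src)
about = root.find("row").attrib["AboutMe"]

# The parser has already resolved &lt;, &gt; and &#xA; by this point.
print(repr(about))  # '<p>Hi</p>\n'
```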

Edit 2014-09-01 12:49 PST: Based on Tomalak's answer below, I used the following to get the result I wanted:

import html

def escape_str(html_str):
    # Re-escape &, <, > and quotes, then restore newlines as &#xA;.
    s = html.escape(html_str)
    return s.replace('\n', '&#xA;')

I basically wrapped the call that fetches the attribute value in the escape function above, like this:

res = res + '"' + escape_str(elem.attrib[c]) + '",'
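Putting the pieces together, the approach might look like the sketch below. It is an illustration under stated assumptions rather than the full script: `rows_to_csv` and the in-memory `sample` are hypothetical names introduced here, and unlike the original loop it quotes every field, including missing ones.

```python
import html
import io
import xml.etree.ElementTree as ET

def escape_str(html_str):
    # Re-escape the decoded value, then restore newlines as &#xA;.
    s = html.escape(html_str)
    return s.replace('\n', '&#xA;')

def rows_to_csv(source, columns):
    """Yield a header line, then one quoted, re-escaped line per <row>."""
    cols = columns.split(',')
    yield ','.join(cols) + '\n'
    for _event, elem in ET.iterparse(source):
        if elem.tag == 'row':
            fields = ['"' + escape_str(elem.attrib.get(c, '')) + '"' for c in cols]
            yield ','.join(fields) + '\n'
            elem.clear()

# Hypothetical sample standing in for the real XML file.
sample = b'<users><row Id="-1" AboutMe="&lt;p&gt;Hi&lt;/p&gt;&#xA;" /></users>'
lines = list(rows_to_csv(io.BytesIO(sample), 'Id,AboutMe'))
print(lines[1], end='')  # "-1","&lt;p&gt;Hi&lt;/p&gt;&#xA;"
```

One caveat: this re-escapes the decoded value rather than preserving the source bytes. With its default `quote=True`, `html.escape` also turns `'` into `&#x27;`, so a literal apostrophe such as the one in "I'm" comes out differently from the source file. Passing `quote=False` avoids that but leaves `"` unescaped, which would clash with the surrounding quotes.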



1 Answer
