Validating XML with lxml using schemaLocation

Published 2024-04-25 09:55:48

I am trying to validate the following XML with lxml:

<?xml version='1.0' encoding='UTF-8'?>
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:csip="https://DILCIS.eu/XML/METS/CSIPExtensionMETS"
  xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd https://DILCIS.eu/XML/METS/CSIPExtensionMETS https://earkcsip.dilcis.eu/schema/DILCISExtensionMETS.xsd">
  <mets:metsHdr>
    <mets:agent ROLE="ARCHIVIST" TYPE="ORGANIZATION">
      <mets:name>foo</mets:name>
      <mets:note csip:NOTETYPE="this is incorrect">bar</mets:note>
    </mets:agent>
  </mets:metsHdr>
  <mets:structMap>
    <mets:div/>
  </mets:structMap>
</mets:mets>

I adapted the script from here (adding a few small CLI improvements and Python 3 fixes):

import sys

from lxml import etree

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XS = '{http://www.w3.org/2001/XMLSchema}'


SCHEMA_TEMPLATE = b"""<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns="http://dummy.libxml2.validator"
targetNamespace="http://dummy.libxml2.validator"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
version="1.0"
elementFormDefault="qualified"
attributeFormDefault="unqualified">
</xs:schema>"""


def validate_XML(xml):
    """Validate the XML file at the given path, following all xsi:schemaLocation hints.
    :param xml: path to the XML file.
    :type xml: str
    :raises lxml.etree.DocumentInvalid: if the document fails validation.
    """
    tree = etree.parse(xml)
    schema_tree = etree.XML(SCHEMA_TEMPLATE)
    # Find all unique instances of 'xsi:schemaLocation="<namespace> <path-to-schema.xsd> ..."'
    schema_locations = set(tree.xpath("//*/@xsi:schemaLocation", namespaces={'xsi': XSI}))
    for schema_location in schema_locations:
        # Split the attribute value into alternating namespace and schema
        # location tokens; strip() removes leading and trailing whitespace.
        namespaces_locations = schema_location.strip().split()
        # Import every namespace/schema location pair that was found
        for namespace, location in zip(*[iter(namespaces_locations)] * 2):
            xs_import = etree.Element(XS + "import")
            xs_import.attrib['namespace'] = namespace
            xs_import.attrib['schemaLocation'] = location
            schema_tree.append(xs_import)
    # Construct the schema
    schema = etree.XMLSchema(schema_tree)
    # Validate!
    schema.assertValid(tree)
    print('Success!')


if __name__ == '__main__':
    validate_XML(sys.argv[1])
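
The `zip(*[iter(namespaces_locations)] * 2)` line in the script is fairly dense, so here is a minimal standalone sketch of the same pairing idiom. The helper name `pair_up` is made up for illustration and is not part of the original script; the input string is the `xsi:schemaLocation` value from the XML above.

```python
def pair_up(tokens):
    """Group a flat token list into consecutive (namespace, location) pairs."""
    it = iter(tokens)
    # zip() reads from the same iterator twice per tuple, so the tokens are
    # consumed two at a time: (t0, t1), (t2, t3), ...
    return list(zip(it, it))


schema_location = (
    "http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd "
    "https://DILCIS.eu/XML/METS/CSIPExtensionMETS "
    "https://earkcsip.dilcis.eu/schema/DILCISExtensionMETS.xsd"
)

pairs = pair_up(schema_location.strip().split())
for namespace, location in pairs:
    print(namespace, "->", location)
```

Note that `zip` silently drops a trailing token that has no partner, so a malformed `xsi:schemaLocation` with an odd number of entries would not raise an error here.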

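Now I would expect validation to fail, reporting that NOTETYPE contains an invalid value (only the value SOFTWARE VERSION is valid), but validation completes without any errors.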

Validating the same file in other tools, such as an XML editor, produces the expected error:

Value 'this is incorrect' is not facet-valid with respect to enumeration '[SOFTWARE VERSION]'. It must be a value from the enumeration.
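
For comparison, here is a small sketch of how lxml itself reports an enumeration violation when the schema is loaded directly. It uses a toy schema written only for this example (it is not the METS or CSIP schema), and it calls `schema.validate()` plus `schema.error_log` instead of `assertValid()`, so that every reported problem can be listed rather than a single exception being raised.

```python
from lxml import etree

# Hypothetical toy schema for illustration only: one element carrying an
# attribute that is restricted to a single enumerated value.
TOY_XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="NOTETYPE">
            <xs:simpleType>
              <xs:restriction base="xs:string">
                <xs:enumeration value="SOFTWARE VERSION"/>
              </xs:restriction>
            </xs:simpleType>
          </xs:attribute>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.XML(TOY_XSD))
doc = etree.XML(b'<note NOTETYPE="this is incorrect">bar</note>')

# validate() returns a bool instead of raising; details go to error_log.
is_valid = schema.validate(doc)
messages = [f"line {entry.line}: {entry.message}" for entry in schema.error_log]

print("valid:", is_valid)
for message in messages:
    print(message)
```

In the script above, the same two calls could replace `schema.assertValid(tree)` to print whatever the validator logged for the real document.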

The generated schema:

<xs:schema xmlns="http://dummy.libxml2.validator" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" targetNamespace="http://dummy.libxml2.validator" version="1.0" elementFormDefault="qualified" attributeFormDefault="unqualified">
    <xs:import namespace="http://www.loc.gov/METS/" schemaLocation="http://www.loc.gov/standards/mets/mets.xsd"/>
    <xs:import namespace="https://DILCIS.eu/XML/METS/CSIPExtensionMETS" schemaLocation="https://earkcsip.dilcis.eu/schema/DILCISExtensionMETS.xsd"/>
</xs:schema>
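
One thing that may be worth ruling out is whether both imported schemas were actually loaded. The sketch below uses only the standard library to list each `xs:import` in the generated schema together with the URL scheme of its location. It rests on an assumption the post does not confirm: that a schema the validator cannot fetch (for example over a protocol its loader does not support) might be skipped rather than causing an error. The function name `list_imports` is made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

XS_NS = "http://www.w3.org/2001/XMLSchema"

# The generated wrapper schema from the post, copied verbatim.
GENERATED_SCHEMA = """<xs:schema xmlns="http://dummy.libxml2.validator" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" targetNamespace="http://dummy.libxml2.validator" version="1.0" elementFormDefault="qualified" attributeFormDefault="unqualified">
    <xs:import namespace="http://www.loc.gov/METS/" schemaLocation="http://www.loc.gov/standards/mets/mets.xsd"/>
    <xs:import namespace="https://DILCIS.eu/XML/METS/CSIPExtensionMETS" schemaLocation="https://earkcsip.dilcis.eu/schema/DILCISExtensionMETS.xsd"/>
</xs:schema>"""


def list_imports(schema_text):
    """Return (namespace, location, url_scheme) for every xs:import element."""
    root = ET.fromstring(schema_text)
    result = []
    for imp in root.iter(f"{{{XS_NS}}}import"):
        location = imp.get("schemaLocation")
        result.append((imp.get("namespace"), location, urlparse(location).scheme))
    return result


imports = list_imports(GENERATED_SCHEMA)
for namespace, location, scheme in imports:
    print(f"[{scheme}] {namespace} -> {location}")
```

If one of the locations turns out to be unreachable for the validator, a possible follow-up test is to download that XSD by hand and point the `xs:import` at the local copy.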
