lxml garbles an XML file when parsing

Posted 2024-05-23 08:06:36


I have collected an XML file containing text in several languages (including Arabic). When I open it in Oxygen, emacs, TextEdit and so on, it displays exactly as it should.

xml = etree.parse(data_file) 

When I print it to the screen, the characters are garbled:

string = etree.tostring(xml)

This results in the following error:

Traceback (most recent call last):
  File "ideo-analysis.py", line 53, in <module>
    main()
  File "ideo-analysis.py", line 40, in main
    xml = etree.parse(data_file, parser)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81117)
  File "src/lxml/parser.pxi", line 1811, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:117848)
  File "src/lxml/parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:118195)
  File "src/lxml/parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:117107)
  File "src/lxml/parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:111653)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "data/ideo.xml", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
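
A note on the garbled printing described above: one possible cause, which the post does not confirm, is that `etree.tostring()` serializes to ASCII bytes by default and writes every non-ASCII character as a numeric character reference. The sketch below uses a made-up one-element document rather than the real file. The `try`/`except` import falls back to the standard library's ElementTree (an assumption made for portability), which behaves the same way for these two calls.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback (assumption): the stdlib ElementTree handles these calls the same way.
    import xml.etree.ElementTree as etree

# Hypothetical miniature document standing in for data/ideo.xml.
root = etree.fromstring("<title>كتاب</title>".encode("utf-8"))

# Default serialization: ASCII bytes, non-ASCII text becomes &#NNNN; references.
default_bytes = etree.tostring(root)

# Asking for text instead of bytes keeps the characters readable.
as_text = etree.tostring(root, encoding="unicode")

print(default_bytes)
print(as_text)
```

If this is what is happening, the parsed data is intact and only the serialized form looks different.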

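I have spent 8 hours researching this and still cannot tell where the problem lies. I suspect an encoding issue, but I have tried checking the encoding and it appears to be UTF-8 (unless I am missing something).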

from lxml import etree

# Note: this parser is created but not passed to etree.parse() below.
xmlp = etree.XMLParser(encoding="utf-16")

data_file = 'data/ideo.xml'
xml = etree.parse(data_file)
string = etree.tostring(xml)
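
Since the post suspects an encoding problem, one quick check is to look at the first bytes of the file. The helper below is a hypothetical sketch (`sniff_encoding` is not part of lxml): it looks for a byte-order mark and then tries a strict UTF-8 decode. It is a best-effort guess, not a full encoding detector.

```python
import codecs


def sniff_encoding(raw: bytes) -> str:
    """Guess an encoding from a byte-order mark, then from a strict UTF-8 decode."""
    # UTF-32 is tested before UTF-16 because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


# In-memory stand-in; for the real file, read it with open("data/ideo.xml", "rb").
sample = "<dc:title>كتاب</dc:title>".encode("utf-8")
print(sample[:4], sniff_encoding(sample))
```

If the file really is UTF-8, forcing `encoding="utf-16"` on the parser would make it misread the bytes. That might explain the "Start tag expected" error, though this is an inference from the traceback rather than something the post states.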

Any help would be greatly appreciated. I am stumped.

Here is a sample of the XML:

<srw:records xmlns:srw="http://www.loc.gov/zing/srw/" xmlns:ns7="http://gallica.bnf.fr/namespaces/gallica/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:onix_dc="http://bibnum.bnf.fr/NS/onix_dc/" xmlns:onix="http://www.editeur.org/onix/2.1/reference/" xmlns:dc="http://purl.org/dc/elements/1.1/">
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            <srw:recordPacking>xml</srw:recordPacking>
            <srw:recordData>
                <oai_dc:dc>
                    <dc:contributor>Kamāl al-Dīn, ʿUṯmān. Correcteur</dc:contributor>
                    <dc:creator>Kamāl al-Dīn, ʿUṯmān. Auteur du texte</dc:creator>
                    <dc:description>[Raʾs al-ḥikmaẗ. 1891]</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:description>Appartient à l’ensemble documentaire : BbLevt0</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:format>32 pages ; 13 × 19 cm</dc:format>
                    <dc:format>Nombre total de vues : 41</dc:format>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k9105978p</dc:identifier>
                    <dc:language>ara</dc:language>
                    <dc:language>arabe</dc:language>
                    <dc:relation>Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb44588848p</dc:relation>
                    <dc:relation>Notice dans un autre catalogue : http://alkindi.ideo-cairo.org/manifestation/8045</dc:relation>
                    <dc:rights>domaine public</dc:rights>
                    <dc:rights>public domain</dc:rights>
                    <dc:source>Institut dominicain d'études orientales, 9-759-31</dc:source>
                    <dc:title>كتاب رأس الحكمة : ويليه أمثال سيدنا علي كرم الله وجهه (الطبعة الأولى) / تأليف المعتصم بحبل الله المتين عثمان كمال الدين</dc:title>
                    <dc:title>Kitāb Raʾs al-ḥikmaẗ : wa-yalīhi Amṯāl sayyidinā ʿAlī karama Allâh waǧhahu (الطبعة الأولى) / Taʾlīf al-muʿtaṣim bi-ḥabli Allâh al-matīn ʿUṯmān Kamāl al-Dīn</dc:title>
                    <dc:type>text</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                </oai_dc:dc>
            </srw:recordData>
            <srw:recordIdentifier>ark</srw:recordIdentifier>
            <srw:recordPosition>0</srw:recordPosition>
            <srw:extraRecordData>
                <epubFile/>
                <infoSupModifiable/>
                <link>https://gallica.bnf.fr/ark:/12148/bpt6k9105978p </link>
                <nqamoyen>0.0</nqamoyen>
                <thumbnail>https://gallica.bnf.fr/ark:/12148/bpt6k9105978p.thumbnail</thumbnail>
                <typedoc>monographies</typedoc>
            </srw:extraRecordData>
        </srw:record>
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            <srw:recordPacking>xml</srw:recordPacking>
            <srw:recordData>
                <oai_dc:dc>
                    <dc:creator>ʿAṭār, Ḥasan ibn Muḥammad al- (1766?-1835). Auteur du texte</dc:creator>
                    <dc:description>[Ḥāšiyaẗ ʿalá Šarḥ Zakariyā al-Anṣārī ʿalá matn Īsāġūǧī fī al-manṭiq. 1820]</dc:description>
                    <dc:description>Comprend : Zakariyā al-Anṣārī, Zakariyā ibn Muḥammad, 1423?-1520?</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:description>Appartient à l’ensemble documentaire : BbLevt0</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:format>98 pages ; 19 × 27 cm</dc:format>
                    <dc:format>Nombre total de vues : 107</dc:format>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k9106147p</dc:identifier>
                    <dc:language>ara</dc:language>
                    <dc:language>arabe</dc:language>
                    <dc:relation>Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb445888251</dc:relation>
                    <dc:relation>Notice dans un autre catalogue : http://alkindi.ideo-cairo.org/manifestation/71624</dc:relation>
                    <dc:rights>domaine public</dc:rights>
                    <dc:rights>public domain</dc:rights>
                    <dc:source>Institut dominicain d'études orientales, 9-614-66</dc:source>
                    <dc:title>هذه حاشية العالم العلامة والحبر البحر الفهامة وحيد عصره وفريد دهره الفاضل الشيخ حسن العطار على شرح شيخ الإسلام زكريا الأنصاري على متن ايساغوجي في المنطق (...) : وبهامشها الشرح المذكور</dc:title>
                    <dc:title>Hâḏihī Ḥāšiyaẗ al-ʿālim al-ʿallāmaẗ wa-al-ḥabr al-baḥr al-fahhāmaẗ waḥīd ʿaṣrihi wa-farīd dahrihi al-fāḍil al-šayḫ Ḥasan al-ʿAṭṭār ʿalá Šarḥ šayḫ al-islām Zakariyā al-Anṣārī ʿalá matn Īsāġūǧī fī al-manṭiq (...) : wa-bi-hāmišihi al-Šarḥ al-maḏkūr</dc:title>
                    <dc:type>text</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                </oai_dc:dc>
            </srw:recordData>
            <srw:recordIdentifier>ark</srw:recordIdentifier>
            <srw:recordPosition>1</srw:recordPosition>
            <srw:extraRecordData>
                <epubFile/>
                <infoSupModifiable/>
                <link>https://gallica.bnf.fr/ark:/12148/bpt6k9106147p </link>
                <nqamoyen>0.0</nqamoyen>
                <thumbnail>https://gallica.bnf.fr/ark:/12148/bpt6k9106147p.thumbnail</thumbnail>
                <typedoc>monographies</typedoc>
            </srw:extraRecordData>
        </srw:record>
        <srw:record>
            <srw:recordSchema>http://www.openarchives.org/OAI/2.0/OAIdc.xsd</srw:recordSchema>
            <srw:recordPacking>xml</srw:recordPacking>
            <srw:recordData>
                <oai_dc:dc>
                    <dc:contributor>Muwayliḥī, Muḥammad Ibrāhīm al- (1858-1930). Préfacier</dc:contributor>
                    <dc:creator>Bišrī, Salīm ibn Abī Faraǧ al- (1867-1917). Auteur du texte</dc:creator>
                    <dc:creator>Šawqī, Aḥmad (1868-1932). Auteur du texte</dc:creator>
                    <dc:description>[Waḍaḥ al-nahǧ. 1910]</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:description>Appartient à l’ensemble documentaire : BbLevt0</dc:description>
                    <dc:description>Numérisé par le partenaire</dc:description>
                    <dc:format>8, 123 pages ; 16 × 25 cm</dc:format>
                    <dc:format>Nombre total de vues : 147</dc:format>
                    <dc:identifier>https://gallica.bnf.fr/ark:/12148/bpt6k91061253</dc:identifier>
                    <dc:language>ara</dc:language>
                    <dc:language>arabe</dc:language>
                    <dc:relation>Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb44588791j</dc:relation>
                    <dc:relation>Notice dans un autre catalogue : http://alkindi.ideo-cairo.org/manifestation/12260</dc:relation>
                    <dc:rights>domaine public</dc:rights>
                    <dc:rights>public domain</dc:rights>
                    <dc:source>Institut dominicain d'études orientales, 9-309-19</dc:source>
                    <dc:title>نهج البردة نظم أحمد شوقي وعليه وضح النهج (الطبعة الأولى) / شرح مولانا الأستاذ الأكبر شيخ الجامع الأزهر الشيخ سليم البشري</dc:title>
                    <dc:title>Nahǧ al-burdaẗ naẓm Aḥmad Šawqī wa-ʿalayhi Waḍaḥ al-nahǧ (الطبعة الأولى) / šarḥ mawlānā al-ustāḏ al-akbar šayẖ al-Ǧāmiʿ al-Azhar al-Šayẖ Salīm al-Bišrī</dc:title>
                    <dc:type>text</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                    <dc:type>monographie imprimée</dc:type>
                </oai_dc:dc>
            </srw:recordData>
            <srw:recordIdentifier>ark</srw:recordIdentifier>
            <srw:recordPosition>2</srw:recordPosition>
            <srw:extraRecordData>
                <epubFile/>
                <infoSupModifiable/>
                <link>https://gallica.bnf.fr/ark:/12148/bpt6k91061253 </link>
                <nqamoyen>0.0</nqamoyen>
                <thumbnail>https://gallica.bnf.fr/ark:/12148/bpt6k91061253.thumbnail</thumbnail>
                <typedoc>monographies</typedoc>
            </srw:extraRecordData>
        </srw:record>
</srw:records>

Here is a uuencoded sample:

begin 644 ideo2.xml
M/'-R=SIR96-O<F1S('AM;&YS.G-R=STB:'1T<#HO+W=W=RYL;V,N9V]V+WII
M;F<O<W)W+R(@>&UL;G,Z;G,W/2)H='1P.B\O9V%L;&EC82YB;F8N9G(O;F%M
M97-P86-E<R]G86QL:6-A+R(@>&UL;G,Z;V%I7V1C/2)H='1P.B\O=W=W+F]P
M96YA<F-H:79E<RYO<F<O3T%)+S(N,"]O86E?9&,O(B!X;6QN<SIO;FEX7V1C
M/2)H='1P.B\O8FEB;G5M+F)N9BYF<B].4R]O;FEX7V1C+R(@>&UL;G,Z;VYI
M>#TB:'1T<#HO+W=W=RYE9&ET975R+F]R9R]O;FEX+S(N,2]R969E<F5N8V4O
M(B!X;6QN<SID8STB:'1T<#HO+W!U<FPN;W)G+V1C+V5L96UE;G1S+S$N,2\B
M/@H@("`@("`@(#QS<G<Z<F5C;W)D/@H@("`@("`@("`@("`\<W)W.G)E8V]R
M9%-C:&5M83YH='1P.B\O=W=W+F]P96YA<F-H:79E<RYO<F<O3T%)+S(N,"]/
M04ED8RYX<V0\+W-R=SIR96-O<F138VAE;6$^"B`@("`@("`@("`@(#QS<G<Z
M<F5C;W)D4&%C:VEN9SYX;6P\+W-R=SIR96-O<F1086-K:6YG/@H@("`@("`@
M("`@("`\<W)W.G)E8V]R9$1A=&$^"B`@("`@("`@("`@("`@("`\;V%I7V1C
M.F1C/@H@("`@("`@("`@("`@("`@("`@(#QD8SIC;VYT<FEB=71O<CY+86W$
M@6P@86PM1,2K;BP@RK]5X;FO;<2!;BX@0V]R<F5C=&5U<CPO9&,Z8V]N=')I
M8G5T;W(^"B`@("`@("`@("`@("`@("`@("`@/&1C.F-R96%T;W(^2V%MQ(%L
M(&%L+43$JVXL(,J_5>&YKVW$@6XN($%U=&5U<B!D=2!T97AT93PO9&,Z8W)E
M871O<CX*("`@("`@("`@("`@("`@("`@("`\9&,Z9&5S8W)I<'1I;VX^6U)A
MRKYS(&%L+>&XI6EK;6'ANI<N(#$X.3%=/"]D8SID97-C<FEP=&EO;CX*("`@
M("`@("`@("`@("`@("`@("`\9&,Z9&5S8W)I<'1I;VX^3G5MPZER:7/#J2!P
M87(@;&4@<&%R=&5N86ER93PO9&,Z9&5S8W)I<'1I;VX^"B`@("`@("`@("`@
M("`@("`@("`@/&1C.F1E<V-R:7!T:6]N/D%P<&%R=&EE;G0@PZ`@;.*`F65N
M<V5M8FQE(&1O8W5M96YT86ER92`Z($)B3&5V=#`\+V1C.F1E<V-R:7!T:6]N
M/@H@("`@("`@("`@("`@("`@("`@(#QD8SID97-C<FEP=&EO;CY.=6W#J7)I
M<\.I('!A<B!L92!P87)T96YA:7)E/"]D8SID97-C<FEP=&EO;CX*("`@("`@
M("`@("`@("`@("`@("`\9&,Z9F]R;6%T/C,R('!A9V5S(#L@,3,@PY<@,3D@
M8VT\+V1C.F9O<FUA=#X*("`@("`@("`@("`@("`@("`@("`\9&,Z9F]R;6%T
M/DYO;6)R92!T;W1A;"!D92!V=65S(#H@-#$\+V1C.F9O<FUA=#X*("`@("`@
M("`@("`@("`@("`@("`\9&,Z:61E;G1I9FEE<CYH='1P<SHO+V=A;&QI8V$N
M8FYF+F9R+V%R:SHO,3(Q-#@O8G!T-FLY,3`U.3<X<#PO9&,Z:61E;G1I9FEE
M<CX*("`@("`@("`@("`@("`@("`@("`\9&,Z;&%N9W5A9V4^87)A/"]D8SIL
M86YG=6%G93X*("`@("`@("`@("`@("`@("`@("`\9&,Z<FEG:'1S/G!U8FQI
M8R!D;VUA:6X\+V1C.G)I9VAT<SX*("`@("`@("`@("`@("`@("`@("`\9&,Z
M<V]U<F-E/DEN<W1I='5T(&1O;6EN:6-A:6X@9"?#J71U9&5S(&]R:65N=&%L
M97,L(#DM-S4Y+3,Q/"]D8SIS;W5R8V4^"B`@("`@("`@("`@("`@("`@("`@
M/&1C.G1I=&QE/MF#V*K8I]BH(-BQV*/8LR#8I]F$V*W9@]F%V*D@.B#9B-F*
MV839BMF'(-BCV878J]BGV80@V+/9BMBOV8;8IR#8N=F$V8H@V8/8L=F%(-BG
MV839A-F'(-F(V*S9A]F'("C8I]F$V+?8J-BYV*D@V*?9A-BCV8C9A-F)*2`O
M(-BJV*/9A-F*V8$@V*?9A-F%V+G8JMBUV84@V*C8K=BHV80@V*?9A-F$V8<@
MV*?9A-F%V*K9BMF&(-BYV*O9A=BGV88@V8/9A=BGV80@V*?9A-BOV8K9ACPO
M9&,Z=&ET;&4^"B`@("`@("`@("`@("`@("`@("`@/&1C.G1I=&QE/DMI=,2!
M8B!28<J^<R!A;"WAN*5I:VUAX;J7(#H@=V$M>6%LQ*MH:2!!;>&YK\2!;"!S
M87EY:61I;L2!(,J_06S$JR!K87)A;6$@06QLPZ)H('=AQZ=H86AU("C8I]F$
MV+?8J-BYV*D@V*?9A-BCV8C9A-F)*2`O(%1ARKYLQ*MF(&%L+6UURK]T8>&Y
MHVEM(&)I+>&XI6%B;&D@06QLPZ)H(&%L+6UA=,2K;B#*OU7AN:]MQ(%N($MA
M;<2!;"!A;"U$Q*MN/"]D8SIT:71L93X*("`@("`@("`@("`@("`@("`@("`\
M9&,Z='EP93YT97AT/"]D8SIT>7!E/@H@("`@("`@("`@("`@("`@("`@(#QD
M8SIT>7!E/FUO;F]G<F%P:&EE(&EM<')I;<.I93PO9&,Z='EP93X*("`@("`@
M("`@("`@("`@("`@("`\9&,Z='EP93YM;VYO9W)A<&AI92!I;7!R:6W#J64\
M+V1C.G1Y<&4^"B`@("`@("`@("`@("`@("`\+V]A:5]D8SID8SX*("`@("`@
M("`@("`@/"]S<G<Z<F5C;W)D1&%T83X*("`@("`@("`\+W-R=SIR96-O<F0^
0"CPO<W)W.G)E8V]R9',^"@``
`
end
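
To reproduce the problem with the exact bytes, the uuencoded block above can be turned back into a file. The following is a minimal sketch that uses only `binascii.a2b_uu` from the standard library. The function name `uudecode` is hypothetical, and it assumes the block is available as one string with its `begin` and `end` lines intact.

```python
import binascii


def uudecode(text: str) -> bytes:
    """Decode the lines between 'begin ...' and 'end' of a uuencoded block."""
    out = bytearray()
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # Skip everything up to and including the 'begin <mode> <name>' line.
            in_body = line.startswith("begin ")
            continue
        if line.strip() == "end":
            break
        if line:
            # Each body line carries its own length byte; a lone backtick decodes to b"".
            out += binascii.a2b_uu(line.encode("ascii"))
    return bytes(out)


# Round trip on a tiny made-up payload (not the sample from the post).
payload = "<a>é</a>\n".encode("utf-8")
block = "begin 644 t.xml\n" + binascii.b2a_uu(payload).decode("ascii") + "`\nend\n"
print(uudecode(block) == payload)
```

The decoded bytes can then be written to disk in binary mode, or inspected directly, before handing them to the parser.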
