lxml: ignoring any tags nested inside a specific tag

Posted 2024-04-24 16:43:26


I am trying to extract a few specific fields from a huge XML file. For example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
    <dblp>

<article mdate="2009-09-24" key="journals/jasis/GianoliM09">
<author>Ernesto Gianoli</author>
<author>Marco A. Molina-Montenegro</author>
<title>Insights into the relationship between the <i>h</i>-index and self-citations.</title>
<pages>1283-1285</pages>
<year>2009</year>
<volume>60</volume>
<journal>JASIST</journal>
<number>6</number>
<ee>http://dx.doi.org/10.1002/asi.21042</ee>
<url>db/journals/jasis/jasis60.html#GianoliM09</url>
</article>


<article mdate="2014-09-18" key="journals/iacr/ShiCSL11" publtype="informal publication">
<author>Elaine Shi</author>
<author>T.-H. Hubert Chan</author>
<author>Emil Stefanov</author>
<author>Mingfei Li</author>
<title>blivious RAM with O((log N)<sup>3</sup>) Worst-Case Cost.</title>
<pages>407</pages>
<year>2011</year>
<volume>2011</volume>
<journal>IACR Cryptology ePrint Archive</journal>
<ee>http://eprint.iacr.org/2011/407</ee>
<url>db/journals/iacr/iacr2011.html#ShiCSL11</url>
</article>

<phdthesis mdate="2016-05-04" key="phd/it/Popescu2008">
<author>Razvan Andrei Popescu</author>
<title>Aggregation and adaptation of web services: a semi-automated methodology for the aggregation and adaption of web services.</title>
<year>2008</year>
<school>University of Pisa</school>
<pages>1-206</pages>
<isbn>978-3-8364-6280-8</isbn>
<ee>http://d-nb.info/991165179</ee>
</phdthesis><phdthesis mdate="2007-04-26" key="phd/Tsangaris92">
<author>Manolis M. Tsangaris</author>
<title>Principles of Static Clustering for Object Oriented Databases</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</phdthesis>

<phdthesis mdate="2005-11-30" key="phd/Heuer2002">
<author>Andreas Heuer 0002</author>
<title>Web-Pr&auml;senz-Management im Unternehmen</title>
<year>2002</year>
<school>Univ. Trier, FB 4, Informatik</school>
<ee>http://ubt.opus.hbz-nrw.de/volltexte/2004/144/</ee>
</phdthesis>

<mastersthesis mdate="2002-01-03" key="phd/Schulte92">
<author>Christian Schulte</author>
<title>Entwurf und Implementierung eines &uuml;bersetzenden Systems f&uuml;r das intuitionistische logische Programmieren auf der Warren Abstract Machine.</title>
<year>1991</year>
<school>Universit&auml;t Karlsruhe, Institut f&uuml;r Logik, Komplexit&auml;t und Deduktionssysteme</school>
</mastersthesis>

<phdthesis mdate="2002-01-03" key="phd/Hellerstein95">
<author>Joseph M. Hellerstein</author>
<title>Optimization and Execution Techniques for Queries With Expensive Methods</title>
<year>1995</year>
<school>Univ. of Wisconsin-Madison</school>
</phdthesis>

</dblp>

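I am using the code from here to parse the file and extract the fields I am interested in. The problem arises when I try to extract the titles of the first and second records, because of the <i>h</i> and <sup>3</sup> tags. My code seems to treat them as new events rather than as part of the <title> tag, and the result looks like this: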

title Insights into the relationship between the
blivious RAM with O((log N)

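Basically, I only get the title text up to the point where the parser encounters a new tag.

The problem is that I do not know how many such cases there are (for example, how many different tags can appear), otherwise I could try to remove them manually. Is there a way to handle cases like this?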


1 answer

You need to understand the lxml data model for element content (in particular the tail property). It is explained very well here: http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.

For this element,

<title>Insights into the relationship between the <i>h</i>-index and self-citations.</title>

the content of the text property is "Insights into the relationship between the " (up to, but not including, the <i> tag).

The "h" bit is the text of the <i> child element, and "-index and self-citations." is the tail of that same child element.
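The split between text and tail can be checked directly on the title from the question. This is a minimal sketch: the variable names are made up for illustration, and the fallback import is only there so the snippet still runs where lxml is not installed (the standard library's ElementTree exposes the same text and tail attributes).

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib, which has the same text/tail model
    import xml.etree.ElementTree as etree

# The first <title> from the question, parsed on its own.
title = etree.fromstring(
    "<title>Insights into the relationship between the "
    "<i>h</i>-index and self-citations.</title>"
)
italic = title[0]  # the <i> child element

print(repr(title.text))   # text before the first child
print(repr(italic.text))  # text inside <i>
print(repr(italic.tail))  # text after </i>, up to the next tag

# Joining the three pieces by hand gives the complete title.
full_title = title.text + italic.text + italic.tail
print(full_title)
```

Reading only title.text is what produces the truncated titles shown in the question.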


To get all of the text content of a title, you can use itertext(). Example:

from lxml import etree

tree = etree.parse("dblp.xml")  # The XML in the question
titles = tree.xpath("//title")

for title in titles:
    # itertext() yields the element's text plus the text and tail of every descendant
    print(''.join(title.itertext()))

Output:

Insights into the relationship between the h-index and self-citations.
blivious RAM with O((log N)3) Worst-Case Cost.
Aggregation and adaptation of web services: a semi-automated methodology for the aggregation and adaption of web services.
Principles of Static Clustering for Object Oriented Databases
Web-Präsenz-Management im Unternehmen
Entwurf und Implementierung eines übersetzenden Systems für das intuitionistische logische Programmieren auf der Warren Abstract Machine.
Optimization and Execution Techniques for Queries With Expensive Methods
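Since the question mentions a huge file and an event-based parser, the same itertext() idea can also be applied while streaming. The following is only a sketch under assumptions: the parsing code linked from the question is not shown, so iter_titles, RECORD_TAGS and SAMPLE are hypothetical names, and the sample is a shortened, entity-free excerpt of the XML above. The point is to build the title on the "end" event of <title>, by which time its children and their tails have been parsed.

```python
import io

try:
    from lxml import etree
except ImportError:  # assumption: stdlib fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical: record-level tags seen in the sample; extend for the real file.
RECORD_TAGS = {"article", "phdthesis", "mastersthesis"}

# Shortened excerpt of the XML in the question (no DTD, no entities).
SAMPLE = b"""<dblp>
<article key="journals/jasis/GianoliM09">
<title>Insights into the relationship between the <i>h</i>-index and self-citations.</title>
<year>2009</year>
</article>
<article key="journals/iacr/ShiCSL11">
<title>blivious RAM with O((log N)<sup>3</sup>) Worst-Case Cost.</title>
<year>2011</year>
</article>
</dblp>"""


def iter_titles(source):
    """Yield the full text of each <title>, including text inside nested tags."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "title":
            # Children such as <i> or <sup> are complete here, tails included.
            yield "".join(elem.itertext())
        elif elem.tag in RECORD_TAGS:
            # Free the finished record; do not clear <i>/<sup> themselves,
            # or their text would be gone before the title is assembled.
            elem.clear()


for full_title in iter_titles(io.BytesIO(SAMPLE)):
    print(full_title)
```

Because nothing in the function depends on the names of the nested tags, it does not matter how many different inline tags appear inside the titles.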
