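XPath: selecting the text of the current node and of the next node based on the current node's attribute
First, this question is a follow-up to one I asked earlier. I am posting again because the person who answered it suggested I do so; he felt my question was not stated clearly. Here is my second attempt:
I want to extract some information from this web page. To make this clearer, here is part of the page source: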
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class='distribution'>(SCI)</span></p>
<span class='normaltext'>
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [<span class='Helpcourse'
onMouseover="showtip(this,event,'24 Lectures')"
onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
onMouseover="showtip(this,event,'12 Tutorials')"
onMouseout="hidetip()">12T</span>]<br>
<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>
<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br>
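From the sample block above, I want to extract the following information:
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5
I want to get this information for every course on the page (note that some courses may also list a "Corequisite", and some may have no prerequisites or exclusions at all).
I have been trying to write a suitable XPath expression for this, but I cannot seem to get it right.
So far, with help from Dimitre Novatchev, I have been able to use the following expression: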
sites = hxs.select("(//p[@class='titlestyle'])[2]/text()[1] | (//span[@class='title2'])[2]/text() | \
(//span[@class='title2'])[2]/following-sibling::a[1]/text() | (//span[@class='title2'])[3]/text() | \
(//span[@class='title2'])[3]/following-sibling::a[1]/text()")
However, this expression returns the information for only the first course on the page:
[{"desc": "ANT101H5 Introduction to Biological Anthropology and Archaeology \n "},
{"desc": "Exclusion: "},
{"desc": "ANT100Y5"},
{"desc": "Prerequisite: "},
{"desc": "ANT102H5"}]
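To be clear, this output is correct; it just covers only the first course. I need the same information for every course listed on the page.
I am close, but I cannot work out the last step.
I would really appreciate any help. Thanks in advance.
2 Answers
Try using [position() mod <offset> = <base>] instead of [<int>].
Here, offset is the distance between successive nodes you are interested in. This distance may differ between @class='titlestyle' and @class='title2'.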
sites = hxs.select("(//p[@class='titlestyle'])[position() mod <offset to next to match> = 2]/text()[1] | (//span[@class='title2'])[position() mod <offset to next to match> = 2]/text() | \
(//span[@class='title2'])[position() mod <offset to next to match> = 2]/following-sibling::a[1]/text() | (//span[@class='title2'])[position() mod <offset to next to match> = 3]/text() | \
(//span[@class='title2'])[position() mod <offset to next to match> = 3]/following-sibling::a[1]/text()")
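To see which positions such a predicate keeps, here is a minimal Python sketch that imitates [position() mod <offset> = <base>] on a plain list. The helper name select_by_mod and the sample list are hypothetical and are not part of Scrapy or of any XPath library; the list merely stands in for the node-set returned by the parenthesised expression.

```python
def select_by_mod(nodes, offset, base):
    """Imitate the XPath predicate [position() mod offset = base] on a list."""
    # XPath positions are 1-based, so enumeration starts at 1.
    return [node for pos, node in enumerate(nodes, start=1) if pos % offset == base]


# Hypothetical stand-in for the nodes matched by (//p[@class='titlestyle']).
titles = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]

# Keeps positions 2, 5 and 8.
print(select_by_mod(titles, offset=3, base=2))

# The remainder is always smaller than the offset, so base must be
# smaller than offset or nothing is selected.
print(select_by_mod(titles, offset=2, base=2))
```

A fixed index such as [2] matches exactly one position, which is why the original expression returns a single course; the mod form repeats that match every offset positions.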
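Edit: as requested.
Run each individual XPath on its own, without restricting its position. This is a manual lookup to determine the final values to use in the XPath.
The following returns all nodes that match the XPath (this is the first one):
sites = hxs.select("(//p[@class='titlestyle'])/text()[1]")
sites will contain some of the classes you want, and some that you do not.
You have already established that the second node in this result is the first one you want. Now count the distance from it to the next node you want in sites. That is what we can call <offset to next to match>.
Repeat these steps for each of the remaining XPath searches.
Think of hxs.select("") as a filter that walks the XML and returns everything that matches your XPath.
Here is an example: http://zvon.org/xxl/XPathTutorial/Output/example22.html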
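The manual counting step can be checked with a small sketch like the one below. It assumes you have already written down the 1-based positions of the nodes you want; the helper name common_gap and the position lists are hypothetical. A single offset only exists when the gaps are all equal, which may not hold if courses list different numbers of requirements.

```python
def common_gap(wanted_positions):
    """Return the constant distance between consecutive positions, or None if it varies."""
    gaps = {b - a for a, b in zip(wanted_positions, wanted_positions[1:])}
    return gaps.pop() if len(gaps) == 1 else None


# Hypothetical positions noted while inspecting an unrestricted result list.
evenly_spaced = [2, 5, 8, 11]
unevenly_spaced = [2, 4, 7, 8]

print(common_gap(evenly_spaced))    # 3
print(common_gap(unevenly_spaced))  # None
```

If the helper returns None, a fixed <offset to next to match> cannot describe the wanted nodes, and a per-course approach is likely the better fit.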
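A single XPath expression that selects all the course data is rather complicated, so I decided to take a different approach, one that can be used to generate that single XPath expression (if it is really needed):
This simple XSLT transformation: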
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>
 <xsl:template match="p[@class='titlestyle']">
  <xsl:text>
===================
</xsl:text>
  <xsl:value-of select="text()[1]"/>
 </xsl:template>
 <xsl:template match=
  "span/span[@class='title2'][not(position() >1)]">
  <xsl:text>
</xsl:text>
  <xsl:value-of select="."/>
  <xsl:value-of select="following-sibling::a[1]"/>
  <xsl:if test="not(following-sibling::a)">
   <xsl:value-of select="following-sibling::text()[1]"/>
  </xsl:if>
  <xsl:text>
</xsl:text>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>
when applied to this page: http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html (tidied up into a well-formed XML document), produces the wanted result:
===================
Anthropology
===================
ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
===================
ANT102H5 Introduction to Sociocultural and Linguistic Anthropology
Exclusion: ANT100Y5
===================
ANT200Y5 World Archaeology and Prehistory
Prerequisite: 101H5
===================
ANT203Y5 Biological Anthropology
Prerequisite: 101H5
===================
ANT204Y5 Sociocultural Anthropology
Prerequisite: 101H5
===================
ANT205H5 Introduction to Forensic Anthropology
Prerequisite: 101H5
===================
ANT206Y5 Culture and Communication: Introduction to Linguistic Anthropology
Exclusion: ANT206H5
===================
ANT241Y5 Aboriginal Peoples of North America
===================
ANT299Y5 Research Opportunity Program
===================
ANT304H5 Anthropology and Aboriginal Peoples
Exclusion: ANT304Y5
===================
ANT306H5 Forensic Anthropology Field School
Prerequisite: ANT205H5
===================
ANT308H5 Case Studies in Archaeological Botany and Zoology
Prerequisite: ANT200Y5
===================
ANT309H5 Southeast Asian Archaeology
Prerequisite: ANT200Y5
===================
ANT310H5 Complex Societies
Prerequisite: ANT200Y5
===================
ANT312H5 Archaeological Analysis
Prerequisite: ANT200Y5
===================
ANT313H5 China, Korea and Japan in Prehistory
Prerequisite: ANT200Y5
===================
ANT314H5 Archaeological Theory
Exclusion: ANT411H5
===================
ANT316H5 South Asian Archaeology
Prerequisite: ANT200Y5
===================
ANT317H5 Archaeology of Eastern North America
Prerequisite: ANT200Y5
===================
ANT318H5 Archaeological Fieldwork
Prerequisite: ANT200Y5
===================
ANT320H5 Archaeological Approaches to Technology
Prerequisite: ANT200Y5
===================
ANT322H5 Anthropology of Youth Culture
Exclusion: ANT204Y5
===================
ANT327H5 Agricultural Origins: The Second Revolution
Prerequisite: ANT200Y5
===================
ANT331H5 The Biology of Human Sexuality
Exclusion: ANT330H5
===================
ANT332H5 Human Origins
Exclusion: ANT332Y5
===================
ANT333H5 Human Origins II
Exclusion: ANT332Y5
===================
ANT334H5 Human Osteology
Exclusion: ANT334Y5
===================
ANT335H5 Anthropology of Gender
Exclusion: ANT331Y5
===================
ANT336H5 Molecular Anthropology
Prerequisite: ANT203Y5
===================
ANT338H5 Laboratory Methods in Biological Anthropology
Prerequisite: ANT203Y5
===================
ANT339Y5 Human Adaptation through Biological and Cultural Means
Prerequisite: ANT203Y5
===================
ANT340H5 Osteological Theory
Exclusion: ANT334Y5
===================
ANT350H5 Globalization and the Changing World of Work
Prerequisite: ANT204Y5
===================
ANT351H5 Money, Markets, Gifts: Topics in Economic Anthropology
Prerequisite: ANT204Y5
===================
ANT352H5 Power, Authority, and Legitimacy: Topics in Political Anthropology
Prerequisite: ANT204Y5
===================
ANT358H5 Ethnographic Methods
Prerequisite: ANT204Y5
===================
ANT360H5 Anthropology of Religion
Exclusion: ANT209Y5
===================
ANT361H5 Anthropology of Sub-Saharan Africa
Exclusion: ANT212Y5
===================
ANT362H5 Language in Culture and Society
Prerequisite: ANT204Y5
===================
ANT363H5 Magic, Witchcraft and Science
Prerequisite: ANT360H5
===================
ANT364H5 Lab in Social Interaction
Prerequisite: ANT206H5
===================
ANT365H5 Semiotic Anthropology
Prerequisite: ANT204Y5
===================
ANT368H5 World Religions and Ecology
Exclusion: RLG311H5
===================
ANT369H5 Religious Violence and Nonviolence
Exclusion: RLG317H5
===================
ANT397H5 Independent Study
Prerequisite: Permission of Faculty Advisor
===================
ANT398Y5 Independent Reading
Prerequisite: Permission of Faculty Advisor
===================
ANT399Y5 Research Opportunity Program
Prerequisite: P.I.
===================
ANT401H5 Vocal and Visual Communication
Prerequisite: ANT102H5
===================
ANT414H5 People and Plants in Prehistory
Prerequisite: ANT200Y5
===================
ANT415H5 Faunal Archaeo-Osteology
Exclusion: ANT415Y5
===================
ANT416H5 Advanced Archaeological Analysis
Prerequisite: ANT312H5
===================
ANT418H5 Advanced Archaeological Fieldwork
Prerequisite: ANT318H5
===================
ANT430H5 Special Problems in Biological Anthropology and Archaeology
Prerequisite: P.I
===================
ANT430Y5 Special Problems in Biological Anthropology and Archaeology
Prerequisite: P.I.
===================
ANT431Y5 Special Problems in Sociocultural or Linguistic Anthropology
Prerequisite: P.I.
===================
ANT431H5 Special Problems in Sociocultural or Linguistic Anthropology
Prerequisite: P.I.
===================
ANT432H5 Special Seminar in Anthropology
Prerequisite: P.I.
===================
ANT433H5 Genes, Language, Artifact and Mind
Prerequisite: ANT200Y5
===================
ANT434H5 Palaeopathology
Prerequisite: ANT334Y5
===================
ANT438H5 The Development of Thought in Biological Anthropology
Prerequisite: ANT203Y5
===================
ANT439Y5 Advanced Forensic Anthropology
Prerequisite: ANT205H5
===================
ANT441H5 Advanced Bioarchaeology
Prerequisite: ANT334H5
===================
ANT457H5 Anthropology and the Environment
Prerequisite: ANT102H5
===================
ANT458H5 Anthropology of Crime, Law and Order
Exclusion: ANT204Y5
===================
ANT459H5 The Ethnography of Speaking
Prerequisite: ANT206Y5
===================
ANT460H5 Theory in Sociocultural Anthropology
Prerequisite: ANT204Y5
===================
ANT461H5 Emergent Topics in Socio-Cultural & Linguistic Anthropology
Prerequisite: ANT204Y5
===================
ANT498H5 Advanced Independent Study
Prerequisite: P.I.
===================
ANT499Y5 Advanced Independent Research
Prerequisite: P.I.
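For readers working in Python rather than XSLT, the same per-course idea can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under assumptions: the SAMPLE fragment is a hypothetical, already tidied piece of markup modelled on the snippet in the question, and the function name extract_courses is made up for illustration. Unlike the stylesheet above, it collects every title2 label in a course rather than only the first, and it pairs each label with the element immediately following it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the snippet in the question.
SAMPLE = """<body>
<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
<span class="distribution">(SCI)</span></p>
<span class="normaltext">Course description.<br/>
<span class="title2">Exclusion: </span><a href="#">ANT100Y5</a><br/>
<span class="title2">Prerequisite: </span><a href="#">ANT102H5</a><br/>
</span>
<p class="titlestyle">ANT399Y5 Research Opportunity Program</p>
<span class="normaltext">Course description.<br/>
<span class="title2">Prerequisite: </span>P.I.<br/>
</span>
</body>"""


def extract_courses(root):
    """Return one dict per course: its title and a list of (label, value) pairs."""
    courses = []
    for node in root:  # direct children, in document order
        cls = node.get("class")
        if node.tag == "p" and cls == "titlestyle":
            # node.text is the text before the first child element, like text()[1].
            courses.append({"title": (node.text or "").strip(), "requirements": []})
        elif node.tag == "span" and cls == "normaltext" and courses:
            children = list(node)
            for i, child in enumerate(children):
                if child.tag != "span" or child.get("class") != "title2":
                    continue
                label = (child.text or "").strip()
                nxt = children[i + 1] if i + 1 < len(children) else None
                if nxt is not None and nxt.tag == "a":
                    value = (nxt.text or "").strip()    # linked course code
                else:
                    value = (child.tail or "").strip()  # plain text such as "P.I."
                courses[-1]["requirements"].append((label, value))
    return courses


for course in extract_courses(ET.fromstring(SAMPLE)):
    print(course["title"])
    for label, value in course["requirements"]:
        print(label, value)
```

Because each title paragraph starts a new record, courses with no requirements, or with several, are handled without counting positions. Real pages would first need to be tidied into well-formed XML, as noted above, since ElementTree does not parse arbitrary HTML.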