XPath: Select current and next node's text by current node attribute

2 votes

3 answers

2794 views

Asked 2025-04-16 13:02

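First of all, apologies if this has been asked before; I couldn't find a similar question on StackOverflow or anywhere else. Here is my problem:

I am using scrapy to pull some information from this web page. For clarity, here is the part of the page source that I care about: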

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                        <span class='distribution'>(SCI)</span></p> 

<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
            onMouseover="showtip(this,event,'24 Lectures')"
            onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
            onMouseover="showtip(this,event,'12 Tutorials')"
            onMouseout="hidetip()">12T</span>]<br> 

<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>

<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 
</span><br/><br/<br/>

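Almost all of the code on this page looks like the block above.

From all of this, I need to extract:

ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5

The problem is that Exclusion: is inside a <span> tag, while ANT100Y5 is in the <a> tag that follows it.

I can't seem to grab both of them from this source. Currently I have code that attempts (and fails) to grab ANT100Y5; it looks like this: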

hxs = HtmlXPathSelector(response)
    sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")

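I'd appreciate any help, even of the "you're blind not to see the other SO question that answers this perfectly" kind (in which case I'll vote to close this myself). I'm truly at my wit's end.

Thanks in advance.

Edit: full original code after the changes suggested by @Dimitre

I am using the following code: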

class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/*/p/text()[1] | \
                              (//span[@class='title2'])[1]/text() | \
                              (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
                              (//span[@class='title2'])[2]/text() | \
                              (//span[@class='title2'])[2]/following-sibling::a[1]/text()")

            for site in sites:
                    item = RegcalItem()
                    item['title'] = site.select("a/text()").extract()
                    item['link'] = site.select("a/@href").extract()
                    item['desc'] = site.select("text()").extract()
                    items.append(item)
            return items

            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

This code gives me the following result:

[{"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []}]

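This is not the output I need. What am I doing wrong? Keep in mind that I am running this script against the web page mentioned above.

3 Answers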

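Selecting the three nodes you mention is not difficult, using techniques such as Flack's. What is difficult is (a) selecting them without also selecting other things you don't want, and (b) making the selection robust enough that it still works when the input is slightly different. We have to assume you don't know exactly what is in the input; if you did, you wouldn't need to write XPath expressions to find out.

You've told us three things you want to grab, but what are your criteria for choosing these three? Why not select other things? How much do you know about what you're looking for?

You've framed this as an XPath problem, but I would tackle it differently. I would start by transforming the input you've shown into something with better structure, using XSLT. In particular, I would try to wrap all the sibling elements that aren't in a <p> element into a <p> element, treating each group of consecutive elements ending in a <br> as a paragraph. That's not too difficult using the <xsl:for-each-group group-ending-with> construct in XSLT 2.0.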
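
As a rough illustration of the grouping idea described above, here is a minimal sketch in Python rather than XSLT 2.0 (a swapped-in technique, using only the standard library's xml.etree.ElementTree). It assumes the input has already been tidied into well-formed XML; the SAMPLE fragment and the helper name paragraphs_by_br are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed fragment modelled on the page source.
SAMPLE = """<span class="normaltext">Course description.
  <br/>
  <span class="title2">Exclusion: </span><a href="#">ANT100Y5</a>
  <br/>
  <span class="title2">Prerequisite: </span><a href="#">ANT102H5</a>
  <br/>
</span>"""


def paragraphs_by_br(container):
    """Return the text of each run of sibling nodes that ends at a <br>."""
    runs, current = [], [container.text or ""]
    for child in container:
        if child.tag == "br":
            # A <br> closes the current run; its tail text starts the next one.
            runs.append(current)
            current = [child.tail or ""]
        else:
            current.append("".join(child.itertext()))
            current.append(child.tail or "")
    runs.append(current)  # whatever follows the last <br>, if anything
    # Collapse whitespace (much like normalize-space()) and drop empty runs.
    texts = [" ".join("".join(run).split()) for run in runs]
    return [text for text in texts if text]


print(paragraphs_by_br(ET.fromstring(SAMPLE)))
```

Each string in the result corresponds to one <br>-terminated run, so a label such as Exclusion: ends up in the same string as the course code that follows it.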

Answered 2025-04-16 by Python大师

1. ANT101H5 Introduction to Biological Anthropology and Archaeology

p[@class='titlestyle']/text()

2. Exclusion: ANT100Y5

concat(
    span/span[@class='title2'][1],
    span/span[@class='title2'][1]/following-sibling::a[1]
    )

3. Prerequisite: ANT102H5

concat(
    span/span[@class='title2'][2],
    span/span[@class='title2'][2]/following-sibling::a[1]
    )
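
For readers who want to try the same pairing outside an XPath engine, here is a hedged sketch using only Python's standard library. ElementTree supports just a subset of XPath (no concat() and no following-sibling axis), so the sibling walk is done by hand. The SAMPLE document and the helper name label_with_next_link are assumptions made for this example, and the input is assumed to be well-formed already.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed stand-in for one course entry on the page.
SAMPLE = """<body>
  <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
    <span class="distribution">(SCI)</span>
  </p>
  <span class="normaltext">Course description.
    <br/>
    <span class="title2">Exclusion: </span>
    <a href="#">ANT100Y5</a>
    <br/>
    <span class="title2">Prerequisite: </span>
    <a href="#">ANT102H5</a>
    <br/>
  </span>
</body>"""


def label_with_next_link(parent):
    """Join each title2 label with the text of the first <a> sibling after it."""
    results = []
    children = list(parent)
    for i, child in enumerate(children):
        if child.tag == "span" and child.get("class") == "title2":
            # Hand-written equivalent of following-sibling::a[1].
            link = next((c for c in children[i + 1:] if c.tag == "a"), None)
            label = "".join(child.itertext())
            target = "".join(link.itertext()) if link is not None else ""
            results.append(label + target)
    return results


root = ET.fromstring(SAMPLE)
# The text before the first child element of <p>, with whitespace collapsed.
title = " ".join(root.find("p[@class='titlestyle']").text.split())
pairs = label_with_next_link(root.find("span[@class='normaltext']"))
print(title)
print(pairs)
```

Like the XPath version, this takes the first <a> that follows the label, so a label without its own link would pick up the next label's link.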

Answered 2025-04-16 by Python大师

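My answer is very similar to @Flack's:

Given this XML document (I corrected the provided one by closing the many unclosed <br> tags and wrapping everything in a single top element):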

<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
        <span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>, 
        <span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
    <br/>
    <br/>
    <br/>
</body>

this XPath expression:

normalize-space(/*/p/text()[1])

when evaluated, produces the wanted string (the surrounding quotes are not part of the result; I added them to show the exact string produced):

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

This XPath expression:

concat((//span[@class='title2'])[1],
            (//span[@class='title2'])[1]
                   /following-sibling::a[1]
            )

when evaluated, produces the following wanted result:

"Exclusion: ANT100Y5"

This XPath expression:

concat((//span[@class='title2'])[2],
            (//span[@class='title2'])[2]
                   /following-sibling::a[1]
            )

when evaluated, produces the following wanted result:

"Prerequisite: ANT102H5"

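Note: In this particular case the // abbreviation is not necessary, and in fact it should be avoided whenever possible, because it makes evaluation slow and in many cases causes a traversal of the complete (sub)tree. I use '//' on purpose, because the provided XML fragment doesn't show the structure of the complete XML document. It also demonstrates how to index the results of // correctly (note the surrounding parentheses), which helps prevent a very common mistake.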
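
To make the indexing point concrete, here is a small sketch with Python's standard-library ElementTree. It is only an approximation: ElementTree implements a subset of XPath and does not accept the parenthesised form, so plain list indexing on the result of findall stands in for (//span)[1]. The SAMPLE document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document with <span> elements under different parents.
SAMPLE = """<body>
  <p><span class="distribution">(SCI)</span></p>
  <span class="normaltext">
    <span class="title2">Exclusion: </span>
    <span class="title2">Prerequisite: </span>
  </span>
</body>"""

root = ET.fromstring(SAMPLE)

# Like //span[1]: every <span> that is the first <span> child of its own parent.
first_per_parent = [s.get("class") for s in root.findall(".//span[1]")]

# Like (//span)[1]: gather all <span> elements in document order, then take one.
first_overall = root.findall(".//span")[0].get("class")

print(first_per_parent)
print(first_overall)
```

On this sample the first form matches three spans (one per parent) while the second yields exactly one, which is why the parentheses matter when a single node is wanted.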

Update: The OP has requested a single XPath expression that selects all the required text nodes. Here it is:

/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()

When applied to the same XML document as above, the concatenation of the selected text nodes is exactly what is required:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

This result can be confirmed by running the following XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the same XML document specified earlier, the wanted, correct result is produced:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

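Finally: the following single XPath expression selects exactly the wanted text nodes in the HTML page at the provided link (after tidying it into well-formed XML):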

  (//p[@class='titlestyle'])[2]/text()[1]
|
  (//span[@class='title2'])[2]/text()
|
  (//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
  (//span[@class='title2'])[3]/text()
|
  (//span[@class='title2'])[3]/following-sibling::a[1]/text()

Answered 2025-04-16 by Python大师
