XPath: getting multi-line text

0 votes

2 answers

2073 views

Asked 2025-04-17 23:44

I have this HTML:

<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E

RERA ID: 12087

Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>

I want to get all of the text inside the td tag.

What I have tried:

normalize-space(td/text())

But I only get the last line.

How can I get all of the lines?

Tags: xpath, html, text extraction, multi-line content

2 Answers

normalize-space(//td/text()) works for me.

Demo (using xmllint):

$ xmllint input.xml --xpath "normalize-space(//td/text())"
REGEN REAL ESTATE, Dubai – U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial – Buying, Selling, R

Here input.xml is a file containing the XML you provided.
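Note that the xmllint output above stops at "R" and leaves out "...Read more...": //td/text() selects only the text nodes that are direct children of td, and normalize-space() converts just the first node of that set to a string. The sketch below illustrates the same distinction using only the Python standard library. It is an approximation, not XPath itself: ElementTree has no normalize-space(), so the helper `normalize_space` is a hypothetical stand-in built on str.split(), which collapses any Unicode whitespace rather than only the four characters XPath treats as whitespace.

```python
import xml.etree.ElementTree as ET

# The fragment from the question; it happens to be well-formed XML.
FRAGMENT = """<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E

RERA ID: 12087

Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>"""


def normalize_space(text):
    """Hypothetical helper: roughly what XPath normalize-space() does to a string."""
    return " ".join(text.split())


td = ET.fromstring(FRAGMENT)

# td.text is the text before the first child element,
# i.e. the first text() node under td.
first_text_node = normalize_space(td.text)

# itertext() walks every descendant text node, which mirrors
# the XPath string value of the td element itself.
whole_string_value = normalize_space("".join(td.itertext()))

print(first_text_node)
print(whole_string_value)
```

The first line printed ends at "R", like the xmllint output; the second also includes the link text, which is what normalize-space(//td) would give.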

Answered 2025-04-17 by Python大师


You can use u"".join(selector.xpath('.//td//text()').extract()) or u"".join(selector.css('td ::text').extract()) to extract the text content.
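Joining the text nodes with an empty string keeps the newlines that were inside them, so the result can still be split back into lines if that is what you need. A minimal sketch of that post-processing step, using only the standard library: `list(td.itertext())` is an assumed stand-in for the list that `.extract()` returns, not Scrapy's actual output, and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the question; it happens to be well-formed XML.
FRAGMENT = """<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E

RERA ID: 12087

Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>"""

td = ET.fromstring(FRAGMENT)

# Stand-in for selector.xpath('.//td//text()').extract():
# a list with one string per descendant text node.
text_nodes = list(td.itertext())

# Joining with "" keeps the original newlines inside the text nodes.
joined = "".join(text_nodes)

# Split on those newlines and drop blank lines to get one entry per line.
lines = [line.strip() for line in joined.splitlines() if line.strip()]

for line in lines:
    print(line)
```

With this input the list has four entries, one per non-empty line of the cell, and the link text ends up appended to the last one.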

I almost forgot the simplest way: if you want all of the text content of one specific node, you can call normalize-space() on it directly:

paul@wheezy:~$ ipython
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from scrapy.selector import Selector

In [2]: selector = Selector(text="""<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E
   ...: 
   ...: RERA ID: 12087
   ...: 
   ...: Specialist Licensed Property Brokers &amp; Consultants
   ...: Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>""", type="html")

In [3]: selector.xpath("normalize-space(.//td)")
Out[3]: [<Selector xpath='normalize-space(.//td)' data=u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID'>]

In [4]: selector.xpath("normalize-space(.//td)").extract()
Out[4]: [u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']

In [5]: [td.xpath("normalize-space(.)").extract() for td in selector.css("td")]
Out[5]: [[u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']]

In [7]:

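Keep in mind that normalize-space() only considers the first node of the node-set you pass it, so it generally does what you want as long as you are sure the argument matches only the one node you are after.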
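To make the first-node behaviour concrete, here is a small standard-library sketch with a made-up two-cell row (the markup and the `normalize_space` helper are assumptions for illustration, and the whitespace collapse only approximates the XPath function). It contrasts converting only the first match with applying the normalization per node, as the list comprehension in In [5] does.

```python
import xml.etree.ElementTree as ET

# Made-up two-cell row, only to show the first-node behaviour.
ROW = """<tr>
  <td>first
      cell</td>
  <td>second
      cell</td>
</tr>"""


def normalize_space(text):
    """Hypothetical helper: roughly what XPath normalize-space() does to a string."""
    return " ".join(text.split())


row = ET.fromstring(ROW)
cells = row.findall(".//td")

# What normalize-space(.//td) effectively does: only the first match is used.
first_only = normalize_space("".join(cells[0].itertext()))

# Applying the normalization to each node separately keeps every cell.
per_cell = [normalize_space("".join(td.itertext())) for td in cells]

print(first_only)
print(per_cell)
```

The second cell is silently dropped in the first case, which is why looping over the matched nodes is the safer pattern when more than one node can match.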

Answered 2025-04-17 by Python大师
