XPath: getting text that spans multiple lines

Posted 2024-05-15 08:56:20


I have this HTML:

<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E

RERA ID: 12087

Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>

I want to get all of the text inside the td.

What have I tried?

^{pr2}$

But I only get the last line.

What should I do to get all of the lines?


2 answers

You can use u"".join(selector.xpath('.//td//text()').extract()) or u"".join(selector.css('td ::text').extract()).
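To make the "join every text node" idea concrete without needing Scrapy installed, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy Selector. It assumes the fragment is well-formed XML (it happens to be here), and `all_text` is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# The cell from the question, copied as-is.
FRAGMENT = """<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E

RERA ID: 12087

Specialist Licensed Property Brokers &amp; Consultants
Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>"""


def all_text(element):
    """Concatenate every text node under `element`, in document order.

    Roughly what u"".join(sel.xpath('.//td//text()').extract()) does.
    """
    return "".join(element.itertext())


td = ET.fromstring(FRAGMENT)
text = all_text(td)

# Keep only the non-blank lines to show that none of them was lost.
lines = [line for line in text.splitlines() if line.strip()]
for line in lines:
    print(line)
```

The joined string keeps the original line breaks, so splitting on newlines gives back each line of the cell, including the link text at the end.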

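I almost forgot the simplest way: if you want all of the text content of one specific node, you can call normalize-space() directly on it. Note that it also strips leading and trailing whitespace and collapses internal runs of whitespace (including newlines) into single spaces.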

paul@wheezy:~$ ipython
Python 2.7.3 (default, Jan  2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information.

IPython 0.13.1   An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from scrapy.selector import Selector

In [2]: selector = Selector(text="""<td width="70%">REGEN REAL ESTATE, Dubai – U.A.E
   ...: 
   ...: RERA ID: 12087
   ...: 
   ...: Specialist Licensed Property Brokers &amp; Consultants
   ...: Residential / Commercial – Buying, Selling, R <a href="http://www.justproperty.com/company_view/index/3963">...Read more...</a></td>""", type="html")

In [3]: selector.xpath("normalize-space(.//td)")
Out[3]: [<Selector xpath='normalize-space(.//td)' data=u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID'>]

In [4]: selector.xpath("normalize-space(.//td)").extract()
Out[4]: [u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']

In [5]: [td.xpath("normalize-space(.)").extract() for td in selector.css("td")]
Out[5]: [[u'REGEN REAL ESTATE, Dubai \u2013 U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial \u2013 Buying, Selling, R ...Read more...']]

In [7]: 

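Keep in mind that normalize-space() only considers the first node of the node-set passed as its argument, so it generally does what you want as long as you are sure the argument matches the single node you are after.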
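The first-node behaviour is easy to trip over when a row has several cells. The sketch below only emulates XPath's normalize-space() and string-value rules with the standard library; the two-cell row and the helper names `normalize_space` and `string_value` are made up for illustration, and `str.split()` treats a few more characters as whitespace than XPath does.

```python
import xml.etree.ElementTree as ET

# Hypothetical row with two cells, each spanning two lines.
ROW = "<tr><td>first\n  cell</td><td>second\n  cell</td></tr>"


def normalize_space(s):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(s.split())


def string_value(element):
    """XPath string-value of an element: all descendant text, concatenated."""
    return "".join(element.itertext())


row = ET.fromstring(ROW)
cells = row.findall(".//td")

# normalize-space(.//td) converts the node-set to a string first,
# which means only the first td in document order is used.
first_only = normalize_space(string_value(cells[0]))

# Calling it once per node (like In [5] above) covers every cell.
per_cell = [normalize_space(string_value(td)) for td in cells]

print(first_only)
print(per_cell)
```

So when more than one td can match, looping over the nodes and applying normalize-space(.) to each is the safer pattern.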

normalize-space(//td/text()) works for me.

Demo (using xmllint):

$ xmllint input.xml --xpath "normalize-space(//td/text())"
REGEN REAL ESTATE, Dubai – U.A.E RERA ID: 12087 Specialist Licensed Property Brokers & Consultants Residential / Commercial – Buying, Selling, R

where input.xml contains the XML you provided.
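One thing to watch with this variant: //td/text() selects only the text nodes that are direct children of the td, and normalize-space() then uses just the first of them, which is why the output above stops at "R" and leaves out the "...Read more..." link text. A rough standard-library illustration follows; the fragment is a shortened, hypothetical version of the cell with the href replaced by "#", and the whitespace handling is an approximation of normalize-space().

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical version of the cell from the question.
CELL = '<td>Residential / Commercial\nBuying, Selling, R <a href="#">...Read more...</a></td>'

td = ET.fromstring(CELL)

# Like normalize-space(//td/text()): only the first direct text node of td.
first_text_node = " ".join((td.text or "").split())

# Like normalize-space(//td): the whole string-value, link text included.
whole_element = " ".join("".join(td.itertext()).split())

print(first_text_node)
print(whole_element)
```

If the link text should be part of the result, passing the element itself (normalize-space(//td)) rather than its text() children is probably what you want.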
