Scrapy: extract the HTML inside a div without the wrapping parent tag

3 votes

2 answers

3958 views

Asked 2025-04-17 20:15

I'm using Scrapy to crawl a website.

I want to extract the contents of one specific div.

<div class="short-description">
{some mess with text, <br>, other html tags, etc}
</div>

loader.add_xpath('short_description', "//div[@class='short-description']/div")

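This code gets me the content I want, but the result includes the wrapping HTML tag (<div class="short-description">...</div>).

How can I get rid of this outer HTML tag?

Note: selectors such as text() and node() don't work for me, because my div contains <br>, <p>, other divs and so on, plus whitespace, and I need to keep all of that.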

data extraction · HTML parsing · web scraping · scrapy · tag handling · div element

2 Answers

The code below builds a selector from the response, selects the text nodes that sit directly inside the short-description div, and prints each one in turn:

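from scrapy.selector import HtmlXPathSelector  # legacy Scrapy selector API

hxs = HtmlXPathSelector(response)
# Select the div's direct text nodes and print each one (Python 2 print statement)
for text in hxs.select("//div[@class='short-description']/text()").extract():
    print text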

Answered 2025-04-17 by Python大师


Try using node() together with Join():

loader.get_xpath('//div[@class="short-description"]/node()', Join())

The result then looks like this:

>>> from scrapy.contrib.loader import XPathItemLoader
>>> from scrapy.contrib.loader.processor import Join
>>> from scrapy.http import HtmlResponse
>>>
>>> body = """
...     <html>
...         <div class="short-description">
...             {some mess with text, <br>, other html tags, etc}
...             <div>
...                 <p>{some mess with text, <br>, other html tags, etc}</p>
...             </div>
...             <p>{some mess with text, <br>, other html tags, etc}</p>
...         </div>
...     </html>
... """
>>> response = HtmlResponse(url='http://example.com/', body=body)
>>>
>>> loader = XPathItemLoader(response=response)
>>>
>>> print loader.get_xpath('//div[@class="short-description"]/node()', Join())

            {some mess with text,  <br> , other html tags, etc}
             <div>
                <p>{some mess with text, <br>, other html tags, etc}</p>
            </div>
             <p>{some mess with text, <br>, other html tags, etc}</p>
>>>
>>> loader.get_xpath('//div[@class="short-description"]/node()', Join())
u'\n            {some mess with text,  <br> , other html tags, etc}\n
   <div>\n         <p>{some mess with text, <br>, other html tags, etc}</p>\n
   </div> \n     <p>{some mess with text, <br>, other html tags, etc}</p> \n'

Answered 2025-04-17 by Python大师
