Scrapy: matching the HTML output of results (skipping the first match)

Posted 2024-04-25 22:33:14


I have existing Scrapy code, but I'm having trouble formulating a NEXT_PAGE_SELECTOR that picks out the element through Scrapy's CSS selectors:

def parse(self, response):
    '''
    Get the first page of results.
    '''
    # Class selector (leading dot), assuming each result carries class "b_algo".
    SET_SELECTOR = '.b_algo'
    for bresult in response.css(SET_SELECTOR):
        NAME_SELECTOR = 'h2 a ::text'
        yield {
            'name': bresult.css(NAME_SELECTOR).extract_first(),
        }

    # Get the further pages of results.
    #<<NEXT_PAGE_SELECTOR here>>

The HTML I'm trying to match is:

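<ul class="sb_pagF" aria-label="More pages with results">
    <li>
        <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
        </a>
    </li>
</ul>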

To match this, I came up with the following:

NEXT_PAGE_SELECTOR = '.sb_pagF li a ::attr(href)'

Does this look right for grabbing the href?

Thanks!


2 answers

Yes, this is correct:

$ scrapy shell
In [1]: foo = """<ul class="sb_pagF" aria-label="More pages with results">
<li>
          <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
          </a>
</li>
</ul>"""
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=foo)
In [4]: sel.css('.sb_pagF li a ::attr(href)').extract()
Out[4]: [u'/search?q=site%3asite.com&first=11&FORM=PORE']
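
Two things are worth noticing in that output: the `&amp;` entities from the markup come back as plain `&`, and the href is relative, so it has to be resolved against the page URL before it can be requested. Here is a rough standard-library sketch of both steps. The base URL is a made-up placeholder, not something from the question, and inside a spider you would normally let `response.urljoin()` do the joining for you.

```python
from html import unescape
from urllib.parse import parse_qs, urljoin, urlsplit

# The href exactly as it appears in the markup, entities included.
raw_href = "/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE"

# An HTML parser decodes entities, which is why the selector returns "&".
href = unescape(raw_href)

# Hypothetical page URL the href was scraped from (assumption for illustration).
base_url = "https://www.example.com/search?q=site%3asite.com"

# Resolve the relative href against the page URL.
next_url = urljoin(base_url, href)

# Pull the paging offset out of the query string.
query = parse_qs(urlsplit(next_url).query)

print(next_url)
print(query["first"])
```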

You can always test selectors in the Scrapy shell by pointing it at a local HTML file:

$ cat index.html
<ul class="sb_pagF" aria-label="More pages with results">
    <li>
        <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
        </a>
    </li>
</ul>
$ scrapy shell file://$PWD/index.html
In [1]: response.css(".sb_pagF li a ::attr(href)").extract_first()
Out[1]: u'/search?q=site%3asite.com&first=11&FORM=PORE'
