Handling AJAX continuation response data in the shell and spider

2 answers

User

#1 · Edited 2024-05-23 21:10:25

Once you have the HTML content, you can initialize a Selector object from it and use XPath selectors:

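import json

from scrapy import Request
from scrapy.selector import Selector

# Inside a spider callback, where `response` is the AJAX (JSON) response.
# response.text replaces the deprecated response.body_as_unicode().
response_json = json.loads(response.text)
html = response_json['content_html']
sel = Selector(text=html)
for url in sel.xpath('//@href').extract():
    # hrefs may be relative, so resolve them against the response URL
    yield Request(response.urljoin(url), callback=self.somecallbackfunction)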

User

#2 · Edited 2024-05-23 21:10:25

Here is the Scrapy selectors documentation: http://doc.scrapy.org/en/1.1/topics/selectors.html

I ran into the same problem and handled it with selectors. You can construct a Selector from either a response or a string, and then call xpath on it.

You can also use try...except to tell which kind of response you received (HTML or JSON):

import json

import scrapy
from scrapy.selector import Selector

def parse(self, response):
    # AJAX continuation responses are JSON; the first page is plain HTML.
    try:
        data = json.loads(response.text)
    except ValueError:
        data = None

    # Entries: use the 'content_html' fragment when the response is JSON.
    try:
        sel = Selector(text=data['content_html'].strip())
    except (TypeError, KeyError, AttributeError):
        sel = Selector(response=response)

    entries = sel.xpath('//li[contains(@class,"feed-item-container")]')
    for entry in entries:
        title = entry.xpath('.//h3/a/text()').extract_first()
        if title is None:
            continue  # skip entries without a title link
        item = YoutubeItem()  # project item class, defined elsewhere
        item['title'] = title
        yield item

    # "Load more" button: use 'load_more_widget_html' when the response is JSON.
    try:
        sel = Selector(text=data['load_more_widget_html'])
    except (TypeError, KeyError):
        sel = Selector(response=response)

    href = sel.xpath(
        '//button[contains(@class,"load-more-button")]'
        '/@data-uix-load-more-href').extract_first()
    if href:
        yield scrapy.Request("https://www.youtube.com" + href, callback=self.parse)
    else:
        self.log('Crawl completed.')
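
The JSON-or-HTML check above can also be pulled out into a small helper so it is written only once. This is a sketch using only the standard library; the function name html_fragment is hypothetical and not part of Scrapy.

```python
import json


def html_fragment(body, key):
    """Return the HTML stored under `key` if `body` is a JSON object
    containing that key as a string; otherwise return `body` unchanged."""
    try:
        data = json.loads(body)
    except ValueError:
        # Not JSON at all, so treat the body as plain HTML.
        return body
    if isinstance(data, dict) and isinstance(data.get(key), str):
        return data[key]
    # Valid JSON, but without the expected HTML fragment.
    return body


# JSON body: the embedded fragment is returned.
print(html_fragment('{"content_html": "<p>x</p>"}', "content_html"))
# HTML body: returned as-is.
print(html_fragment("<html><body></body></html>", "content_html"))
```

Inside a callback, the result can be passed straight to Selector(text=...), for example Selector(text=html_fragment(response.text, 'content_html')). Note that when the key is missing, the helper falls back to the raw body rather than raising an error.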
