Passing a Selenium HTML string to Scrapy to add URLs to the list to be scraped

2 votes
1 answer
1491 views
Asked on 2025-04-18 06:24

I'm new to Python, Scrapy and Selenium, so any help is greatly appreciated.

I would like to turn the HTML page source retrieved with Selenium into a Scrapy response object. The main purpose of this is to add the URLs found in the Selenium page source to the list of URLs that Scrapy will parse.
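To make this concrete, here is a rough sketch of the kind of conversion I am imagining (just my guess; I am assuming HtmlResponse from scrapy.http is the right class for wrapping raw HTML, and the URL is the same placeholder used in my code below):

from scrapy.http import HtmlResponse
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.abcdef.com/page/12345")

# Wrap the Selenium page source in a Scrapy response object so the
# usual Scrapy selectors can be run against it
selenium_response = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding="utf-8",
)
links = selenium_response.xpath("//a/@href").extract()
driver.close()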

Thanks again for any help.

One side question: does anyone know how to view the list of URLs that Scrapy has found and crawled?

Thanks!

*******EDIT*******

Here is an example of what I am trying to do. I am stuck on Part 5.

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        URL = response.url

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # fall back to the selector built from the Scrapy response,
        # in case the element is already present
        hxs = sel

        # Part 2: if not, get the page via the Selenium webdriver
        if a_len == 0:
            driver = webdriver.Firefox()
            driver.get(response.url)

            # Part 3: interact with the page until the element appears
            while a_len == 0:
                print("ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE")

                # scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")

                # get the page source and check if the last page is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)

            driver.close()

        # Part 4: extract the URLs from the Selenium page source
        # (note: '//a/@href', not 'a/@href', to match links anywhere
        # in the document)
        all_URLS = hxs.xpath('//a/@href').extract()

        # Part 5: add all_URLS to the URLs Scrapy is going to scrape

1 Answer

1

Just yield Request instances from this method and provide a callback:

from scrapy import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...

        all_URLS = hxs.xpath('//a/@href').extract()

        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
        pass
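One caveat worth noting: the hrefs extracted from the raw Selenium HTML may be relative, and since the selector was built from a bare text string it carries no base URL of its own. Resolving each href against the page URL before yielding should be safer; a sketch using urljoin from the standard library:

from urllib.parse import urljoin

from scrapy import Request

for url in all_URLS:
    # Resolve relative hrefs against the page that was crawled
    yield Request(urljoin(response.url, url), callback=self.parse_page)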
