Data loss when crawling with Scrapy + Selenium + PhantomJS

import scrapy
from bs4 import BeautifulSoup
from selenium import webdriver
# GovSpiderItem comes from the asker's own project; its import is not shown in the original post.


class GovSpider(scrapy.Spider):
    name = 'gov'

    url = "http://www.sse.com.cn/assortment/stock/list/share/"

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
    }

    driver = webdriver.PhantomJS('/Users/luozhongjin/ScrapyDemo/ScrapyDemo/phantomjs')
    driver.implicitly_wait(15)

    def start_requests(self):
        yield scrapy.Request(url=self.url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        self.driver.set_window_size(1124, 850)
        i = 1
        while True:
            soup = BeautifulSoup(self.driver.page_source, 'lxml')
            trs = soup.findAll("tr")
            for tr in trs:
                try:
                    tds = tr.findAll("td")
                    print(tds)
                    item = GovSpiderItem()
                    item["name"] = tds[1].string
                    print("ok")
                    yield item
                except:
                    pass
            try:
                next_page = self.driver.find_element_by_class_name("glyphicon-menu-right").click()
                i = i + 1
                if i >= 55:
                    break
            except:
                break

1 answer

User

#1 · Posted 2024-04-23 15:21:11

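jQuery.active is the number of AJAX requests currently in flight, so the driver waits until the AJAX requests have completed. Parsing the response and rendering the data, however, takes some additional time: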

ajax complete -> render the data -> html source updated

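If the driver grabs the page source before rendering has finished, it will miss some data. I would instead wait on a condition that checks an element's value. Here the data is sorted in ascending order, so every value on the next page must be greater than the current maximum: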

return {} < parseInt(document.getElementsByTagName("td")[0].children[0].text);
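The timing problem described above can be reproduced without a browser. The following is a minimal sketch, not the answerer's code: `SimulatedPage`, `wait_for_next_page`, and the sample ids are hypothetical stand-ins, and the render delay is an arbitrary assumption. It shows that "no active AJAX requests" can be true while the table still holds the previous page, whereas waiting for the first id to exceed the current maximum only returns once the new rows are visible.

```python
import time


class SimulatedPage:
    """Hypothetical stand-in for the browser: the AJAX call has already
    completed, but the new rows only appear after `render_delay` seconds."""

    def __init__(self, old_ids, new_ids, render_delay):
        self._old_ids = old_ids
        self._new_ids = new_ids
        self._rendered_at = time.monotonic() + render_delay

    def active_requests(self):
        # What jQuery.active would report once the request has finished.
        return 0

    def stock_ids(self):
        # What parsing the page source would yield at this moment.
        if time.monotonic() >= self._rendered_at:
            return list(self._new_ids)
        return list(self._old_ids)


def wait_for_next_page(page, current_max, timeout=5.0, poll=0.05):
    """Poll until the first visible id exceeds current_max, then return the ids."""
    deadline = time.monotonic() + timeout
    while True:
        ids = page.stock_ids()
        if ids and ids[0] > current_max:
            return ids
        if time.monotonic() >= deadline:
            raise TimeoutError("next page was not rendered in time")
        time.sleep(poll)


if __name__ == "__main__":
    # Made-up sample ids; page 2 is rendered 0.3 s after the request finishes.
    page = SimulatedPage([600000, 600004], [600006, 600007], render_delay=0.3)
    # The request is done, yet the rows read here are still those of page 1.
    print(page.active_requests(), page.stock_ids())
    # Waiting on the value condition returns the rows of page 2.
    print(wait_for_next_page(page, current_max=600004))
```

The value check relies on the ascending order noted above; on a table without a monotonic column, a different signal would be needed.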

Another possible cause of the data loss is that driver.implicitly_wait(15) may not take effect here. As the documentation says:

An implicit wait tells WebDriver to poll the DOM for a certain amount of time when trying to find any element (or elements) not immediately available. The default setting is 0. Once set, the implicit wait is set for the life of the WebDriver object.

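Here you feed driver.page_source to BeautifulSoup instead of calling one of the driver's find_element methods, so the implicit wait is never triggered and page 1 may be skipped. I would use another condition to check for this: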

return document.getElementsByTagName("td").length > 0;
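To make the difference concrete: reading page_source never goes through an element lookup, so any waiting has to be done explicitly before the source is read. Below is a minimal sketch under stated assumptions. `page_source_when_ready` and `TABLE_RENDERED_JS` are hypothetical names, and `driver` is assumed to be any object exposing `execute_script()` and `page_source`, as a Selenium WebDriver does. The polling loop is a plain-Python stand-in for the role that `WebDriverWait(...).until(...)` plays in the test code.

```python
import time

# The readiness check quoted above: true once at least one <td> exists.
TABLE_RENDERED_JS = 'return document.getElementsByTagName("td").length > 0;'


def page_source_when_ready(driver, script=TABLE_RENDERED_JS, timeout=10.0, poll=0.5):
    """Return driver.page_source only after `script` evaluates to a truthy value.

    Raises TimeoutError if the condition is still false after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while not driver.execute_script(script):
        if time.monotonic() >= deadline:
            raise TimeoutError("page not rendered after %.1f s" % timeout)
        time.sleep(poll)
    return driver.page_source
```

In the spider, the result of `page_source_when_ready(self.driver)` would be passed to BeautifulSoup in place of reading `self.driver.page_source` directly.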

Test code:

import scrapy
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class GovSpider(scrapy.Spider):
    name = 'gov'

    url = "http://www.sse.com.cn/assortment/stock/list/share/"

    driver = webdriver.Chrome()
    driver.set_window_size(1124, 850)

    def start_requests(self):
        yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        i = 1
        current_max = 0

        self.driver.get(response.url)
        WebDriverWait(self.driver, 10).until(
            lambda driver: self.driver.execute_script('return document.getElementsByTagName("td").length > 0;'))

        while True:
            soup = BeautifulSoup(self.driver.page_source, 'lxml')
            trs = soup.findAll("tr")
            for tr in trs:
                try:
                    tds = tr.findAll("td")
                    stock_id = int(tds[0].string)
                    current_max = max(current_max, stock_id)
                    yield {
                        'page num': i,
                        'stock id': tds[0].string
                    }
                except (IndexError, TypeError, ValueError):
                    # Skip header rows and rows without a numeric stock id.
                    pass
            try:
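                # "class name" is the locator string behind By.CLASS_NAME;
                # the find_element_by_* helpers were removed in Selenium 4.
                self.driver.find_element("class name", "glyphicon-menu-right").click()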

                js_condition_tpl = 'return {} < parseInt(document.getElementsByTagName("td")[0].children[0].text);'
                WebDriverWait(self.driver, 10).until(
                    lambda driver: self.driver.execute_script(js_condition_tpl.format(current_max)))

                i = i + 1
                if i >= 55:
                    break
            except Exception:
                # No next-page button, or the wait timed out: stop paging.
                break

PS: If you only need the data itself, the page has an xls download link, which is a more reliable and easier way to get the data.
