Python Scrapy: scraping dynamic information

1 vote
1 answer
782 views
Asked 2025-04-18 08:51

I am trying to scrape information from this site: http://www.qchp.org.qa/en/Pages/searchpractitioners.aspx. I want to do the following:

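  • Select "Dentist" from the dropdown at the top of the page
  • Click Search
  • Note that the information at the bottom of the page changes dynamically; this is done with JavaScript
  • Clicking a practitioner's name hyperlink opens a popup window
  • I want to save each practitioner's information to a JSON or CSV file
  • I also want the information from the other pages linked at the bottom of the page, since those pages change what gets saved.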

I am very new to Scrapy and have only just started looking at Selenium, because I read that Selenium is needed to scrape dynamic content.

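So I am using Selenium inside a Scrapy application. I'm not sure this is the right approach, and I have no idea what the best way to implement it is. This is the code I have so far, and it fails with this error in sch_spider.py: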

line 21, in DmozSpider
    all_options = element.find_elements_by_tag_name("option")
NameError: name 'element' is not defined

sch_spider.py

from scrapy.spider import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from scrapytutorial.items import SchItem
from selenium.webdriver.support.ui import Select

class DmozSpider(Spider):
    name = "sch"

    driver = webdriver.Firefox()
    driver.get("http://www.qchp.org.qa/en/Pages/searchpractitioners.aspx")
    select = Select(driver.find_element_by_name('ctl00$m$g_28bc0e11_4b8f_421f_84b7_d671de504bc3$ctl00$drp_practitionerType'))
    all_options = element.find_elements_by_tag_name("option")

    for option in all_options:
        if option.get_attribute("value") == "4":  #Dentist
            option.click()
        ends
        break

    driver.find_element_by_name("ctl00$m$g_28bc0e11_4b8f_421f_84b7_d671de504bc3$ctl00$Searchbtn").click()


    def parse(self, response):

        all_docs = element.find_elements_by_tag_name("td")
        for name in all_docs:
            name.click()
            alert = driver.switch_to_alert()
            sel = Selector(response)
            ma = sel.xpath('//table')
            items = []
            for site in ma:
                item = SchItem()
                item['name'] = site.xpath("//span[@id='PractitionerDetails1_lbl_Name']/text()").extract()
                item['profession'] = site.xpath("//span[@id='PractitionerDetails1_lbl_Profession']/text()").extract()
                item['scope_of_practise'] = site.xpath("//span[@id='PractitionerDetails1_lbl_sop']/text()").extract()
                item['instituition'] = site.xpath("//span[@id='PractitionerDetails1_lbl_institution']/text()").extract()
                item['license'] = site.xpath("//span[@id='PractitionerDetails1_lbl_LicenceNo']/text()").extract()
                item['license_expiry_date'] = site.xpath("//span[@id='PractitionerDetails1_lbl_LicenceExpiry']/text()").extract()
                item['qualification'] = site.xpath("//span[@id='PractitionerDetails1_lbl_Qualification']/text()").extract()

                items.append(item)
            return items

items.py

from scrapy.item import Item, Field

class SchItem(Item):

    name = Field()
    profession = Field()
    scope_of_practise = Field()
    instituition = Field()
    license = Field()
    license_expiry_date = Field()
    qualification = Field()

1 Answer

0

Shouldn't you change element.find_elements to select.find_element in the code below?

  select = Select(driver.find_element_by_name('ctl00$m$g_28bc0e11_4b8f_421f_84b7_d671de504bc3$ctl00$drp_practitionerType'))
  all_options = element.find_elements_by_tag_name("option")
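For context, the NameError appears because the name element is never assigned anywhere: the line before it binds select, not element. Statements placed directly in a class body also run as soon as the class is defined, so the error shows up when the module is loaded, before parse is ever called. A minimal, Selenium-free sketch of that behaviour follows; BrokenSpider and the string stand-in are hypothetical and are not the asker's real objects.

```python
def define_broken_class():
    """Define a class whose body reads a name that was never assigned.

    Returns the NameError message, or None if no error was raised.
    """
    try:
        class BrokenSpider:  # hypothetical stand-in for the spider class
            # The class body runs immediately, at definition time.
            select = "stand-in for Select(...)"  # binds 'select'
            all_options = element.upper()        # 'element' was never bound
    except NameError as exc:
        return str(exc)
    return None


message = define_broken_class()
print(message)
```

If this reading is right, moving the browser setup out of the class body (for example into a method) would also keep Firefox from launching merely because the module was imported.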

Or rather, shouldn't you use select.options?
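A rough sketch of the select.options route, under a few assumptions: the value "4" means dentist (taken from the comment in the question), driver is an already-started WebDriver, and the two element names are the long ctl00$... strings from the question. The function names find_option_by_value and search_dentists are made up for illustration. It uses driver.find_element(By.NAME, ...) rather than find_element_by_name, since the latter was removed in recent Selenium releases. As far as I know, the Select wrapper has no find_elements method of its own, so options (or select_by_value) is the safer of the two suggestions.

```python
def find_option_by_value(options, wanted_value):
    """Return the first option whose 'value' attribute equals wanted_value, else None."""
    for option in options:
        if option.get_attribute("value") == wanted_value:
            return option
    return None


def search_dentists(driver, dropdown_name, button_name, dentist_value="4"):
    """Choose the dentist entry in the dropdown, then click the search button.

    Assumes `driver` is a running Selenium WebDriver already on the search page.
    """
    # Imported lazily so the helper above stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    select = Select(driver.find_element(By.NAME, dropdown_name))
    # select.options is the list of <option> elements inside the <select>.
    if find_option_by_value(select.options, dentist_value) is None:
        raise ValueError(f"no option with value {dentist_value!r}")
    select.select_by_value(dentist_value)
    driver.find_element(By.NAME, button_name).click()
```

Calling search_dentists from a method instead of the class body means select and driver are ordinary local names, which avoids the undefined element problem altogether.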
