How to handle a Scrapy FormRequest form with greyed-out dropdowns

Asked 2025-04-17 10:57

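I'm trying to scrape some car information from Gasbuddy.com, but I'm running into problems with my Scrapy code.

Here is what I have so far; please tell me what I'm doing wrong: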

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.http import Request
from scrapy.http import FormRequest

class gasBuddy(BaseSpider):
name = "gasBuddy"
allowed_domains = ["http://www.gasbuddy.com"]
start_urls = [
    "http://www.gasbuddy.com/Trip_Calculator.aspx",
]

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    #for years in hxs.select('//select[@id="ddlYear"]/option/text()'):
        #print years
    FormRequest(url="http://www.gasbuddy.com/Trip_Calculator.aspx",
                formdata={'Year': '%s'%("2011")},
                callback=self.make('2011'))


def make (years, self, response):
    #this is where we loop through all of the car makes and send the response to modle
    hxs = HtmlXPathSelector(response)
    for makes in hxs.select('//select[@id="ddlMake"]/option/text()').extract()
        FormRequest(url="http://www.gasbuddy.com/Trip_Calculator.aspx",
                formdata={'Year': '%s', 'Make': '%s'%(years, makes)},
                callback=self.model(years, makes))


def model (years, makes, self, response):
    #this is where we loop through all of the car modles and get all of the data assoceated with it.
    hxs = HtmlXPathSelector(response)
    for models in hxs.select('//select[@id="ddlModel"]/option/text()')
        FormRequest(url="http://www.gasbuddy.com/Trip_Calculator.aspx",
                formdata={'Year': '%s', 'Make': '%s', 'Model': '%s'%(years, makes, models)},
                callback=self.model(years, makes))

        print hxs.select('//td[@id="tdCityMpg"]/text()')

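The basic idea is to select one form field, issue a form request, and use a callback to continue the loop until the last field is reached, then start reading the information for each car. But I keep getting several errors... one of them says that gasbuddy has no attribute 'encoding' (which I don't understand at all). I'm also not sure whether arguments can be passed to a callback function.

Any help would be greatly appreciated.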


1 Answer

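This answer only covers how to pass extra arguments to a callback; it does not solve the dynamic form problem on your site.

To pass extra arguments to a callback, you can use functools.partial from the Python standard library.

Here is a simplified example that does not involve Scrapy: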

import functools


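def func(self, response):
    # Plain callback: receives only what the caller passes in.
    print(self, response)

def func_with_param(self, response, param):
    # Callback with an extra argument, pre-filled via functools.partial.
    print(self, response, param)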

def caller(callback):
    callback('self', 'response')

caller(func)
caller(functools.partial(func_with_param, param='param'))

So you should define the make and model functions like this (self is always the first parameter):

def make (self, response, years):
    ...

def model (self, response, years, makes):
    ...

And pass the callback argument like this:

import functools
...

def parse(self, response):
    ...
    return FormRequest(url="http://www.gasbuddy.com/Trip_Calculator.aspx",
                       formdata={'Year': '%s'%("2011")},
                       callback=functools.partial(self.make, years='2011'))
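
To make the difference from the question's `callback=self.make('2011')` concrete, here is a small sketch that runs without Scrapy. `DemoSpider` and the string `'fake-response'` are stand-ins invented for illustration, not part of Scrapy; the point is only that writing `self.make('2011')` calls the method on the spot, while `functools.partial` produces a callable that can be invoked later with the response.

```python
import functools


class DemoSpider:
    """Stand-in for a spider; hypothetical, not a real Scrapy class."""

    def make(self, response, years):
        return "make got response=%r years=%r" % (response, years)


spider = DemoSpider()

# Writing spider.make('2011') does not hand over a callback: Python runs
# make() immediately, and the call fails because an argument is missing.
try:
    spider.make('2011')
    immediate_call_failed = False
except TypeError:
    immediate_call_failed = True

# functools.partial returns a new callable with years already filled in;
# whoever holds it can call it later with just the response.
callback = functools.partial(spider.make, years='2011')
result = callback('fake-response')

print(immediate_call_failed)
print(result)
```

In a real spider the framework, not your code, is what eventually calls the callback with the response.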

In Scrapy, another way to pass arguments to a callback is the meta argument of FormRequest.

For example:

def parse(self, response):
    ...
    return FormRequest(url="http://www.gasbuddy.com/Trip_Calculator.aspx",
                       formdata={'Year': '%s'%("2011")},
                       meta={'years':'2011'},
                       callback=self.make)

def make (self, response):
    years = response.meta['years']
    ...

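The same approach works for model.

Another problem in your code is that the FormRequest objects are created but never used. You should either return them (as in my parse example) or yield them inside the loop: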

for something in hxs.select(...).extract():
    yield FormRequest(...)
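
As a rough illustration of how the meta approach and yield fit together across parse, make and model, the following sketch simulates the flow without Scrapy. `StubRequest`, `StubResponse` and `run` are hypothetical stand-ins written for this example (a real crawl is driven by Scrapy itself), and the option texts 'Audi' and 'BMW' are made-up sample data.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StubRequest:
    """Hypothetical stand-in for a request: a callback plus a meta dict."""
    callback: object
    formdata: dict
    meta: dict = field(default_factory=dict)


@dataclass
class StubResponse:
    """Hypothetical stand-in for a response; options fakes the <option> texts."""
    options: list
    meta: dict


def parse(response):
    # One request for the year; the year travels along in meta.
    yield StubRequest(callback=make, formdata={'Year': '2011'},
                      meta={'years': '2011'})


def make(response):
    years = response.meta['years']
    for makes in response.options:
        # yield hands each request back to the caller; a request that is
        # built but neither yielded nor returned is simply discarded.
        yield StubRequest(callback=model,
                          formdata={'Year': years, 'Make': makes},
                          meta={'years': years, 'makes': makes})


def model(response):
    # Last step: emit a result instead of another request.
    yield (response.meta['years'], response.meta['makes'])


def run(start_callback, fake_options):
    """Tiny driver mimicking a crawl loop: follow every yielded request."""
    results = []
    queue = deque(start_callback(StubResponse(options=[], meta={})))
    while queue:
        item = queue.popleft()
        if isinstance(item, StubRequest):
            response = StubResponse(options=fake_options, meta=item.meta)
            queue.extend(item.callback(response))
        else:
            results.append(item)
    return results


results = run(parse, ['Audi', 'BMW'])
print(results)
```

Each callback reads what it needs from response.meta and copies it into the meta of the requests it yields, so the values survive from one step to the next.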

Answered 2025-04-17 by Python大师
