I'm using CrawlSpider with Rule/LinkExtractor. I want to set start_urls dynamically from external files with the code below, but it doesn't work.
Here are the error message and my code. This may be a simple/silly question, but I'd really appreciate a hint to solve my problem. Thanks in advance.
Traceback (most recent call last):
  File "/home/ec2-user/venv/lib64/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/ec2-user/venv/lib64/python3.7/site-packages/scrapy/crawler.py", line 88, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable
class StorelistSpider(CrawlSpider):
    name = "crawler"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # Target Category
        with open('CategoryList.txt') as f1:
            for q1 in f1:
                targetCategory = q1
        # Target Prefecture
        with open('prefectureList.txt') as f2:
            for q2 in f2:
                prefectureName = q2
                start_urls = ("https://example.com/" + q2 + "/")

        # rules to follow links:
        rules = (
            # follow area link first, then category link next, check list pages and go to the details
            Rule(LinkExtractor(
                allow=r"/\w+/A\d{4}/$",
                restrict_xpaths="//*[@id='js-leftnavi-area-scroll']",
                unique=True,)),
            Rule(LinkExtractor(
                allow=r"/\w+/A\d{4}/rstLst/" + "{}".format(targetCategory) + r"/$",
                restrict_xpaths="//*[@id='js-leftnavi-genre-balloon']",
                unique=True,)),
            Rule(LinkExtractor(
                allow=r"/\w+/A\d{4}/rstLst/" + "{}".format(targetCategory) + r"/\d*/$",
                restrict_xpaths="//*[@id='container']/div[15]/div[4]/div/div[7]/div/ul",
                unique=True,)),
            Rule(LinkExtractor(
                allow=r"/\w+/A\d{4}/A\d{6}/\d+/$",
                restrict_xpaths="//*[@id='container']/div[15]/div[4]/div/div[6]",
                unique=True,
            ), callback="page_parse"),
        )

    def page_parse(self, response):
        yield Page.from_response(response)
The start_requests method must return an iterable of requests. See https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests. Also, rules must be declared as a class attribute. See https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider-example. So your example should look something like this:
I haven't tested it, but this should work. (Edit: you also need to update the rules to remove the targetCategory variable. You can either write those rules statically, or build them in the __init__ method.)