My spider code is:
from scrapy.item import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TryItem(Item):
    url = Field()

class BbcSpiderSpider(CrawlSpider):
    name = "bbc_spider"
    allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/sport/0/tennis']
    rules = (
        Rule(LinkExtractor(allow=[r'.*sport/0/tennis/\d{8}']),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Don't name the variable "Item" -- that shadows the Item base class.
        item = TryItem()
        item['url'] = response.url
        yield item
With this spider I'm trying to collect the URLs of all the tennis articles. I export to CSV with:
scrapy crawl bbc_spider -o bbc.csv -t csv
The output I want looks like:
http://www.bbc.com/sport/0/tennis/34322294
http://www.bbc.com/sport/0/tennis/14322295
...
http://www.bbc.com/sport/0/tennis/12345678
However, the spider also returns URLs that don't match the pattern, for example:
http://www.bbc.com/sport/0/tennis/29604652?print=true
http://www.bbc.com/sport/0/tennis/34252190?comments_page=11&filter=none&initial_page_size=10&sortBy=Created&sortOrder=Descending
Any suggestions? Thanks.
Answer: keep the crawler from following the unwanted URLs by forcing the URL to end right after the 8 digits:
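A minimal sketch of that idea: `LinkExtractor` applies each `allow` entry as a regex search against the full URL, so adding a `$` anchor after `\d{8}` rejects anything with a query string or extra characters after the article id. The standalone `re` check below mirrors what the extractor's filter does (the URLs are the examples from the question):

```python
import re

# Anchored pattern: the URL must end immediately after the 8-digit id.
pattern = re.compile(r'sport/0/tennis/\d{8}$')

urls = [
    'http://www.bbc.com/sport/0/tennis/34322294',             # wanted
    'http://www.bbc.com/sport/0/tennis/29604652?print=true',  # unwanted
]
matches = [bool(pattern.search(u)) for u in urls]
print(matches)  # → [True, False]
```

In the spider itself, that means changing the rule to:

```python
rules = (
    Rule(LinkExtractor(allow=[r'sport/0/tennis/\d{8}$']),
         callback='parse_item', follow=True),
)
```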