Recursive crawling with Scrapy
I'm trying to build a recursive crawler that extracts content from a site (say, web.com). The site's links follow a specific structure, for example:
http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21
http://web.com/location/profile/98765432?qid=1403366850.3991&source=location&rank=1
As you can see, only the numeric parts of the links change. I need to crawl every link that matches this structure and extract itemX, itemY, and itemZ.
I converted the link structure into a regular expression, written as '\d+?qid=\d+.\d+&source=location&rank=\d+'. Then I wrote the code below with Python/Scrapy, but when I run the spider, nothing is extracted:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from web.items import webItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
import re
import urllib

class web_RecursiveSpider(CrawlSpider):
    name = "web_RecursiveSpider"
    allowed_domains = ["web.com"]
    start_urls = ["http://web.com/location/profile"]

    rules = (
        Rule(SgmlLinkExtractor(allow=('\d+?qid=\d+.\d+&source=location&rank=\d+',)),
             callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []
        for site in sites:
            item = webItem()
            item["itemX"] = site.select("//span[@itemprop='X']/text()").extract()
            item["itemY"] = site.select("//span[@itemprop='Y']/text()").extract()
            item["itemZ"] = site.select("//span[@itemprop='Z']/text()").extract()
            items.append(item)
        return items
1 Answer
You need to escape the ? character in your regular expression:
'\d+\?qid=\d+.\d+&source=location&rank=\d+'
^
Example:
>>> import re
>>> url = "http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21"
>>> print re.search('\d+?qid=\d+.\d+&source=location&rank=\d+', url)
None
>>> print re.search('\d+\?qid=\d+.\d+&source=location&rank=\d+', url)
<_sre.SRE_Match object at 0x10be538b8>
Note that you also need to escape the dot, although that doesn't break the example you provided:
'\d+\?qid=\d+\.\d+&source=location&rank=\d+'
^
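As a quick sanity check, here is a minimal, standard-library-only sketch that runs the fully escaped pattern against both sample profile links from the question (both ? and . are literal characters in these URLs, so both must be escaped):

```python
import re

# Fully escaped pattern: '?' and '.' are regex metacharacters, so they
# must be written as '\?' and '\.' to match the literal URL characters.
pattern = r'\d+\?qid=\d+\.\d+&source=location&rank=\d+'

urls = [
    "http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21",
    "http://web.com/location/profile/98765432?qid=1403366850.3991&source=location&rank=1",
]

for url in urls:
    match = re.search(pattern, url)
    print(url, "->", "matched" if match else "no match")
```

Both links match, while a URL without the query string (e.g. the start URL http://web.com/location/profile) does not, which is exactly what the SgmlLinkExtractor allow rule needs.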