<p>下面是一个使用<a href="https://scrapy.readthedocs.org/" rel="nofollow">Scrapy</a>的解决方案。看看<a href="https://scrapy.readthedocs.org/en/0.16/intro/overview.html" rel="nofollow">overview</a>,你就会明白,它是为这类任务设计的工具:</p>
<ul>
<li>它很快(基于twisted)</li>
<li>易于使用和理解</li>
<li>基于xpath的内置提取机制(但是也可以使用<code>bs</code>或{<cd2>})</li>
<li>内置支持将提取的项目管道化到数据库、xml、json等等</li>
<li>还有更多的功能</li>
</ul>
<hr/>
<p>这是一个工作蜘蛛,它可以提取你所要求的一切(在我那台相当旧的笔记本电脑上工作了15分钟):</p>
<pre><code>import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class BillBoardItem(Item):
date = Field()
song = Field()
artist = Field()
BASE_URL = "http://www.billboard.com/charts/%s/hot-100"
class BillBoardSpider(BaseSpider):
name = "billboard_spider"
allowed_domains = ["billboard.com"]
def __init__(self):
date = datetime.date(year=1958, month=8, day=9)
self.start_urls = []
while True:
if date.year >= 2013:
break
self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
date += datetime.timedelta(days=7)
def parse(self, response):
hxs = HtmlXPathSelector(response)
date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]
songs = hxs.select('//div[@class="listing chart_listing"]/article')
for song in songs:
item = BillBoardItem()
item['date'] = date
try:
item['song'] = song.select('.//header/h1/text()').extract()[0]
item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
except:
continue
yield item
</code></pre>
<p>将其保存到<code>billboard.py</code>并通过<code>scrapy runspider billboard.py -o output.json</code>运行。然后,在<code>output.json</code>中,您将看到:</p>
^{pr2}$
<p>另外,请看一下<a href="https://github.com/kennethreitz/grequests" rel="nofollow">grequests</a>作为一个替代工具。在</p>
<p>希望有帮助。在</p>