How can I scrape more efficiently with urllib2?

Newbie here. I wrote a simple script using urllib2 to scrape Billboard.com for the most popular song and artist of every week from 1958 to 2013. The problem is that it is very slow: it takes several hours to finish.

I'd like to know where the bottleneck is, and whether there is a way to scrape more efficiently with urllib2, or whether I need a more sophisticated tool.

import re
import urllib2
array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
while date != '2013-07-13':
    response = urllib2.urlopen(url)
    htmlText = response.read()
    date = re.findall('\d\d\d\d-\d\d-\d\d',url)[0]
    song = re.findall('<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]
    artist = re.findall('/artist.*</a>', htmlText)[1]
    artist = re.findall('>.*<', artist)[0]
    artist = artist[1:-1]
    nextWeek = re.findall('href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array

2 Answers

Here is a solution using Scrapy. Take a look at the overview and you'll see that it is a tool designed for exactly this kind of task:

  • It's fast (built on Twisted)
  • Easy to use and understand
  • Built-in extraction mechanism based on XPath (though BeautifulSoup or other parsers can be used as well)
  • Built-in support for piping extracted items into a database, XML, JSON, and so on
  • And many more features

Here is a working spider that extracts everything you asked for (it took 15 minutes on my rather old laptop):

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        # build one start URL per weekly chart, from 1958-08-09 through
        # 2013-07-13 (the date range the question asks about)
        date = datetime.date(year=1958, month=8, day=9)
        end_date = datetime.date(year=2013, month=7, day=13)

        self.start_urls = []
        while date <= end_date:
            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # the chart date displayed at the top of the page
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        # one <article> per charted song
        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except IndexError:
                # skip entries whose title or artist markup doesn't match
                continue

            yield item

Save this as billboard.py and run it with scrapy runspider billboard.py -o output.json. Then output.json will contain one JSON object per chart entry, with the date, song, and artist fields.

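Purely to illustrate the shape of that file (the values below are placeholders, not actual chart data), the JSON export is a list of objects along these lines:

[
    {"date": "<chart date>", "song": "<song title>", "artist": "<artist name>"},
    {"date": "<chart date>", "song": "<song title>", "artist": "<artist name>"},
    ...
]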

Also, take a look at grequests as an alternative tool.
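
As a rough sketch of how that could look (the three dates below are just a hypothetical sample of the weekly chart URLs, and parsing is left out): requests are issued concurrently on top of gevent, and map() returns the responses in request order, with None for any that failed.

import grequests

# a hypothetical sample of weekly chart URLs; in practice, generate one per week
urls = ['http://www.billboard.com/charts/%s/hot-100' % d
        for d in ('1958-08-09', '1958-08-16', '1958-08-23')]

# issue the requests concurrently; size caps the number of simultaneous connections
responses = grequests.map((grequests.get(u) for u in urls), size=10)

for r in responses:
    if r is not None:
        print r.status_code, len(r.content), r.url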

Hope this helps.

Your bottleneck is almost certainly fetching the data from the website. Every network request has latency, and while you wait on one, nothing else happens. You should consider splitting the requests across multiple threads so that several can be in flight at once. Basically, performance here is I/O-bound, not CPU-bound.
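
To make that point concrete before the fuller threaded example below, here is a compact sketch using multiprocessing.dummy (a thread pool behind the multiprocessing API) together with plain urllib2; the date list is again just a hypothetical sample of the weekly chart URLs.

from multiprocessing.dummy import Pool  # thread pool, not separate processes
import urllib2

# a hypothetical sample of weekly chart URLs
urls = ['http://www.billboard.com/charts/%s/hot-100' % d
        for d in ('1958-08-09', '1958-08-16', '1958-08-23')]

def fetch(url):
    # each call blocks on network I/O, so running several at once overlaps the waiting
    return urllib2.urlopen(url).read()

pool = Pool(10)                  # 10 concurrent downloads
pages = pool.map(fetch, urls)    # HTML bodies, in the same order as urls
pool.close()
pool.join()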

Here is a simple solution so you can see how a crawler generally works. In the long run something like Scrapy is probably best, but I find it always helps to start with something simple and explicit.

import threading
import Queue
import time
import datetime
import urllib2
import re

class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue

        # let's use threading events to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop

            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url) # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html using lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]

            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)

if __name__ == '__main__':
    # how many threads do you want?  more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta*week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta*week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()
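
One thing this example does not do is keep the scraped rows (it only prints them). A simple way to collect results from the worker threads is a second thread-safe queue that the main thread drains after joining the workers. The standalone snippet below illustrates that pattern; it is not part of the answer's code and uses made-up sample rows.

import threading
import Queue

results = Queue.Queue()     # thread-safe container for scraped rows

def worker(rows):
    # stand-in for Crawler.run(): put each scraped row on the queue instead of printing it
    for row in rows:
        results.put(row)

sample = [('1958-08-09', 'some song', 'some artist')]
threads = [threading.Thread(target=worker, args=(sample,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# drain the queue once all workers have finished
collected = []
while not results.empty():
    collected.append(results.get())
print collected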
