How can I speed up a web scraper in Python?

1 vote

2 answers

1719 views

Asked 2025-04-18 00:51

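I'm using urllib.urlopen() and BeautifulSoup to scrape web pages. I'm not satisfied with the fetching speed, though, and I wonder whether urllib loads anything besides the HTML when it retrieves a page. I couldn't find anything in the documentation about whether it reads or checks larger data (images, Flash, and so on) by default.

So, if urllib does load things like images, Flash, and JavaScript, how can I avoid making GET requests for those kinds of data?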

performance web-scraping web-crawler urllib beautifulsoup request

2 answers

Use threads! It's very simple. Here is an example. You can adjust the number of connections to suit your needs.

# Python 3
import queue
import threading
import urllib.request

urls = [
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
    'http://www.google.com',
    'http://www.amazon.com',
    'http://www.ebay.com',
]

# Queue of (url, filename) jobs shared by all worker threads.
jobs = queue.Queue()
for x, url in enumerate(urls):
    # Drop the scheme so the filename contains no slashes.
    filename = "datafile%s-%s" % (x, url.replace('http://', ''))
    jobs.put((url, filename))

# Number of worker threads, i.e. simultaneous downloads.
num_connections = 10

class WorkerThread(threading.Thread):
    def __init__(self, jobs):
        threading.Thread.__init__(self)
        self.jobs = jobs

    def run(self):
        while True:
            try:
                url, filename = self.jobs.get_nowait()
            except queue.Empty:
                return  # no work left, so let this thread finish

            # Download the page and save it to the local file.
            urllib.request.urlretrieve(url, filename)

# Start the threads
threads = []
for _ in range(num_connections):
    t = WorkerThread(jobs)
    t.start()
    threads.append(t)

# Wait for all threads to finish
for thread in threads:
    thread.join()

Answered 2025-04-18 by Python大师


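Try the requests library: it manages HTTP connection pooling (when you reuse a Session), which can make scraping faster.

It is also much better than urllib at handling other things, such as cookies and authentication, and it works well with BeautifulSoup.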
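To illustrate the idea, here is a minimal sketch that reuses one requests Session for several pages and parses each with BeautifulSoup. The helper names (make_session, extract_title, fetch_titles), the pool size of 10, and the 10-second timeout are my own assumptions for the example, not anything requests prescribes, and it assumes the requests and beautifulsoup4 packages are installed.

```python
import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter


def make_session(pool_size=10):
    """Create a Session whose connection pool holds up to pool_size connections.

    pool_size is an illustrative value; tune it to your workload.
    """
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def extract_title(html):
    """Return the text of the page's <title>, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_titles(session, urls, timeout=10):
    """Fetch each URL through the shared session and collect the page titles."""
    titles = {}
    for url in urls:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        titles[url] = extract_title(response.text)
    return titles


# Example usage (performs real network requests):
#     with make_session() as session:
#         print(fetch_titles(session, ["http://www.google.com"]))
```

Because every request goes through the same Session, connections to a host can be kept alive and reused instead of being reopened for each page. As with urllib, each call retrieves only the one URL you ask for; images, scripts, and Flash referenced by the page are not downloaded unless you request them yourself.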

Answered 2025-04-18 by Python大师
