Multi-threaded crawler Python library with proxy support?

1 vote

2 answers

9321 views

Asked 2025-04-15 15:24

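Besides urllib, does anyone know of a more efficient tool for downloading URLs quickly, with multiple threads, that also works through an HTTP proxy? I know of a few, such as Twisted, Scrapy and libcurl, but I don't know them well enough to choose between them, and I'm not sure whether they can use a proxy. Can anyone recommend the one that best fits my needs? Thanks!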

Tags: HTTP proxy, web crawler, crawler framework, multithreading, crawler, data download tool

2 Answers

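Proxy servers usually filter websites by category, meaning they decide whether to allow access based on how a site is classified. YouTube, for example, is categorized as audio/video streaming, so it is blocked in some places, especially schools.
If you want to get around such a proxy, you can fetch the data from the target site and republish it on a real website of your own, for example a registered .com domain.
When you create and register that site, you can classify it under whatever category you like.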

Answered 2025-04-15 by Python大师

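This is easy to do in Python.

urlopen() works transparently with proxy servers that do not require authentication. On Unix or Windows, set the http_proxy, ftp_proxy or gopher_proxy environment variable to the URL of your proxy server before starting the Python interpreter.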
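As a rough illustration of the two ways a proxy can reach urllib, here is a minimal sketch for Python 3. The proxy address is hypothetical, and the helper names `proxies_from_environment` and `build_proxy_opener` are made up for this example. Setting `os.environ` inside the process is done only to keep the snippet self-contained; normally the variable would be exported in the shell before Python starts.

```python
import os
import urllib.request

# Hypothetical proxy address, used only for illustration.
PROXY_URL = "http://proxy.example.com:3128"


def proxies_from_environment(proxy_url):
    """Set http_proxy for this process and return the proxies urllib detects."""
    os.environ["http_proxy"] = proxy_url
    return urllib.request.getproxies()


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


detected = proxies_from_environment(PROXY_URL)
opener = build_proxy_opener(PROXY_URL)
# opener.open("http://example.com/") would now be routed through the proxy.
```

The explicit opener is probably the more convenient choice when different parts of a program need different proxies, since it does not depend on process-wide environment variables.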

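# -*- coding: utf-8 -*-
# Python 3 version of the crawler (needs: pip install beautifulsoup4).
# Usage: python crawler.py <host> <root> <charset>

import sys
from http.client import HTTPException
from queue import Queue, Empty
from threading import Thread, Lock
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

visited = set()
visited_lock = Lock()
queue = Queue()


def get_parser(host, root, charset):
    # Only URLs under this prefix are followed.
    prefix = 'http://%s%s' % (host, root)

    def parse():
        while True:
            try:
                # Wait a few seconds before giving up, so an idle worker does
                # not exit the instant the queue is momentarily empty.
                url = queue.get(timeout=5)
            except Empty:
                return
            try:
                # urlopen() picks up http_proxy / https_proxy from the environment.
                content = urlopen(url, timeout=30).read().decode(charset)
            except (OSError, HTTPException, UnicodeDecodeError, ValueError):
                # Skip pages that cannot be fetched or decoded.
                continue
            for link in BeautifulSoup(content, 'html.parser').find_all('a'):
                href = link.get('href')
                if not href:
                    continue
                try:
                    # Resolve relative links against the current page.
                    href = urljoin(url, href)
                except ValueError:
                    continue
                if not href.startswith(prefix):
                    continue
                # Check and mark under a lock so each URL is queued only once.
                with visited_lock:
                    if href in visited:
                        continue
                    visited.add(href)
                queue.put(href)
                print(href)

    return parse


if __name__ == '__main__':
    host, root, charset = sys.argv[1:]
    parser = get_parser(host, root, charset)
    start = 'http://%s%s' % (host, root)
    visited.add(start)
    queue.put(start)
    workers = []
    for _ in range(5):
        worker = Thread(target=parser)
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()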
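If the goal is only to download a known list of URLs in parallel through a proxy, rather than to crawl links, a thread pool from the standard library may be enough. The following is a sketch for Python 3 under a few assumptions: the names `make_fetcher` and `download_all` are invented for this example, the proxy URL is whatever you pass in, and sharing one opener between threads is assumed to be acceptable for simple GET requests.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def make_fetcher(proxy_url=None, timeout=30):
    """Build a fetch(url) function, optionally routed through proxy_url."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}
        ))
    opener = urllib.request.build_opener(*handlers)

    def fetch(url):
        # Return the raw response body as bytes.
        with opener.open(url, timeout=timeout) as response:
            return response.read()

    return fetch


def download_all(urls, fetch, max_workers=5):
    """Run fetch(url) for every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going: one failed download should not stop the rest.
                results[url] = exc
    return results
```

A call such as `download_all(url_list, make_fetcher("http://proxy.example.com:3128"))` would then fetch every URL in `url_list` through that (hypothetical) proxy with five worker threads. Passing the fetch function in as a parameter keeps the threading logic independent of the HTTP client, so it could be swapped for another library later.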

Answered 2025-04-15 by Python大师
