Multithreaded web scraping with urlretrieve on a cookie-enabled site

2 votes

3 answers

3930 views

Asked 2025-04-16 18:15

I'm trying to write my first Python script, and after a lot of searching I think I'm nearly done. However, I need some help with the last part.

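I need to write a script that logs in to a cookie-enabled website, scrapes a set of links, and then starts several processes to download the files. The program currently runs single-threaded, so I know the code works. But when I tried to create a pool of download workers, I ran into trouble.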

#manager.py
import Fetch # the module name where worker lives
import multiprocessing  # FetchReports below calls multiprocessing.Pool

def FetchReports(links,Username,Password,VendorID):
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,links)
    pool.close()
    pool.join()


#worker.py
import mechanize
import atexit

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    Login(User,Password)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    atexit.register(Logout)

def DownloadJob(link):
    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)
    return True

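In this version the code fails because the cookies are not passed to the download workers, so urlretrieve cannot use them. No problem: I used mechanize's cookiejar class to save the cookies and pass them to the workers.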

#worker.py
import mechanize
import atexit

from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()

    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)  # note I pass the opener to Login so it can catch the cookies.

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

    atexit.register(Logout)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    file = open(DataPath+'\\'+filename, "wb")
    file.write(opener.open(mechanize.urljoin(SiteBase, link)).read())
    file.close()

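But this failed as well, because the opener (I think that's the culprit) wants to pass the binary file back to the manager for processing, and I get an "unable to pickle object" error that refers to the web page file it was trying to read.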
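
Results and exceptions from pool workers are sent back to the parent process with pickle, so anything that holds an open file or socket cannot make the trip. The following is a minimal sketch for checking what can cross that boundary; `is_picklable` and `summarize_download` are hypothetical helpers, not part of mechanize or of the code above.

```python
import os
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized for transfer between processes."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


def summarize_download(path, payload):
    """Hypothetical helper: describe a finished download using plain values only."""
    return {"path": path, "bytes": len(payload)}


if __name__ == "__main__":
    # An open file handle, similar in spirit to an HTTP response object.
    with open(os.devnull, "rb") as handle:
        print("file handle picklable:", is_picklable(handle))
    # A plain dict of strings and ints, which is safe to return from a worker.
    summary = summarize_download("report.csv", b"a,b\n")
    print("summary picklable:", is_picklable(summary))
```

Returning a small summary like this from the worker, rather than the response itself, keeps the data sent back to the manager picklable.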

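The obvious solution is to read the cookies out of the cookie jar and add them to the request headers by hand when making the urlretrieve request, but I'd like to avoid that, which is why I'm asking for advice.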
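
For reference, the manual approach mentioned above might look roughly like the sketch below. It uses the standard library's `http.cookiejar` and `urllib.request` as stand-ins for mechanize, and `make_cookie`, `cookie_header`, and `build_request` are hypothetical helper names. It also ignores domain and path matching, which a real cookie processor handles for you.

```python
from http.cookiejar import Cookie, CookieJar
from urllib.request import Request


def make_cookie(name, value, domain="example.com"):
    """Build a session cookie by hand (illustrative values only)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookie_header(jar):
    """Flatten every cookie in the jar into one Cookie header value.

    The result is a plain string, so it can be passed to worker processes.
    Cookies are sorted by name only to make the output deterministic.
    """
    cookies = sorted(jar, key=lambda c: c.name)
    return "; ".join("%s=%s" % (c.name, c.value) for c in cookies)


def build_request(url, header_value):
    """Create a request that carries the cookies explicitly."""
    request = Request(url)
    request.add_header("Cookie", header_value)
    return request


if __name__ == "__main__":
    jar = CookieJar()
    jar.set_cookie(make_cookie("session", "abc"))
    jar.set_cookie(make_cookie("vendor", "42"))
    request = build_request("http://example.com/report", cookie_header(jar))
    print(request.get_header("Cookie"))
```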

Tags: multithreading, process management, web scraping, mechanize, request headers, cookies, urlretrieve, download manager

3 Answers

To enable the cookie session in the first code example, add the following code to the DownloadJob function:

cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)

You can then fetch the URL like this:

mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Answered 2025-04-16 by Python大师


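Building a multithreaded web scraper is actually quite hard. I'm sure you can pull it off, but why not use a tool that already exists?

I strongly recommend taking a look at Scrapy (http://scrapy.org/).

Scrapy is a very flexible open-source web scraping framework that handles most of what you need. With Scrapy, running multiple concurrent fetches is a matter of configuration rather than a programming headache (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). It also supports cookies, proxies, HTTP authentication, and more.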
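
As a rough idea of what "configuration rather than programming" means here, a project's settings module might contain something like the fragment below. The values are illustrative assumptions, and the setting names are the ones used by recent Scrapy releases, which differ from the older per-spider setting the link above refers to.

```python
# settings.py (sketch; the numbers are illustrative, not recommendations)
CONCURRENT_REQUESTS = 4             # upper bound on requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # the same kind of bound, applied per target domain
COOKIES_ENABLED = True              # keep the cookie middleware on so a login session persists
```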

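For me, rewriting my scraper in Scrapy took about 4 hours. So ask yourself: do you really want to solve the threading problems yourself, or would you rather build on other people's work and focus on the scraping instead of the threading?

By the way, are you using mechanize right now? Note this item in the mechanize FAQ (http://wwwsearch.sourceforge.net/mechanize/faq.html):

"Is it thread-safe?"

No. As far as I know, you can use mechanize in threaded code, but it provides no synchronization: you have to provide that yourself.

If you really want to keep using mechanize, start reading up on how to provide synchronization (for example http://effbot.org/zone/thread-synchronization.htm and http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm).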
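
As a minimal sketch of what "providing synchronization yourself" can look like, the code below wraps a fetch callable in a `threading.Lock` so that only one thread uses it at a time. `SharedFetcher` and `fetch_all` are hypothetical names, and the fetch callable stands in for whatever non-thread-safe object (such as a shared browser or opener) you need to protect.

```python
import threading


class SharedFetcher:
    """Serialize access to a fetch callable that is not thread-safe."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self.calls = 0

    def get(self, url):
        # Only one thread at a time may run the wrapped callable.
        with self._lock:
            self.calls += 1
            return self._fetch(url)


def fetch_all(fetcher, urls):
    """Fetch every URL on its own thread and return results in input order."""
    results = [None] * len(urls)

    def work(index, url):
        # Each thread writes to its own slot, so no extra lock is needed here.
        results[index] = fetcher.get(url)

    threads = [
        threading.Thread(target=work, args=(i, url))
        for i, url in enumerate(urls)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # A stand-in fetch function; a real one would perform the HTTP request.
    fetcher = SharedFetcher(lambda url: url.upper())
    print(fetch_all(fetcher, ["a", "b", "c"]))
```

Note that a single lock around the whole fetch removes any parallelism in the downloads themselves, which is one reason separate processes with separate sessions can be the simpler route.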

Answered 2025-04-16 by Python大师


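After most of a day's work, I found that the problem wasn't really with Mechanize; it was mostly errors in my own code. After a lot of tweaking and some grumbling, I finally got the code working.

To help anyone who, like me, finds this by searching in the future, here is the updated code: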

#manager.py [unchanged from original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
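    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    # Install the opener so the module-level urlretrieve call below sends these cookies.
    mechanize.install_opener(opener)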

    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

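Because I'm only downloading links from a list, mechanize's lack of thread safety doesn't seem to be a problem [note: I have only run this process three times, so problems may show up with more testing]. The multiprocessing module and its worker pool do all the heavy lifting. Saving the cookies to a file was important for me because the web server I download from needs to give each thread its own session ID, but others using this code may not need that. I noticed that it seems to "forget" some variables between the initializer call and the run call, so the cookiejar may not work properly.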
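
The per-process cookie file idea can be sketched with the standard library alone. The code below uses `http.cookiejar.LWPCookieJar` as a stand-in for `mechanize.LWPCookieJar`, and `cookie_path`, `save_session`, `load_session`, and `make_session_cookie` are hypothetical helper names. It also builds the path with `os.path.join` rather than a hard-coded backslash.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar
from multiprocessing import current_process


def cookie_path(data_dir):
    """One cookie file per process, named after the current process."""
    return os.path.join(data_dir, current_process().name + "-cookies.txt")


def save_session(jar, data_dir):
    """Write the jar to this process's file, keeping session cookies too."""
    path = cookie_path(data_dir)
    jar.save(path, ignore_discard=True, ignore_expires=True)
    return path


def load_session(data_dir):
    """Rebuild a jar from this process's cookie file."""
    jar = LWPCookieJar()
    jar.load(cookie_path(data_dir), ignore_discard=True, ignore_expires=True)
    return jar


def make_session_cookie(name, value):
    """Illustrative session cookie; a real one would come from the login response."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="example.com", domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        jar = LWPCookieJar()
        jar.set_cookie(make_session_cookie("session", "abc"))
        print(save_session(jar, data_dir))
        print([c.name for c in load_session(data_dir)])
```

Without `ignore_discard=True`, session cookies (those marked to be discarded when the session ends) are skipped on save and load, which matters for a login session like the one described above.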

Answered 2025-04-16 by Python大师

