Multiprocessing script written on OSX won't run on Windows

1 vote

2 answers

516 views

Asked 2025-04-18 12:08

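I wrote a program that runs on both OSX and Linux. It uses selenium to scrape data from some web pages, processes the data, and saves it. To make it more efficient, I added a multiprocessing pool and a manager. I create a pool, and for each item in a list it calls the scrape class, which starts a phantomjs instance and scrapes. Because I'm using a multiprocessing pool and want to pass data between the worker processes, I read that multiprocessing.Manager was the way to go. If I write manager = Manager() and info = manager.dict([]), that creates a dict that every worker can access, and everything works fine.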

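My problem is that the client wants to run this on a Windows machine (I wrote it on OSX). I assumed I could just install python and selenium and launch it. But I got some errors, so I wrote if __name__ == '__main__': at the top of main.py and indented all the code under it. The problem is that when class scrape(): is outside the if statement, it can't access the global info, because info is declared outside its scope. If I put class scrape(): inside if __name__ == '__main__':, I get an attribute error saying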

AttributeError: 'module' object has no attribute 'scrape'

If I then move manager = Manager() and info = manager.dict([]) back outside if __name__ == '__main__':, I get an error on Windows telling me to make sure I use if __name__ == '__main__':. It seems that whatever I do on this project, I can't win.

Code layout...

# Imports...
import datetime
from multiprocessing import Pool
from multiprocessing import Manager

manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__(self, item):
        ...  # scrape the item's URL and store the result in the global info
    # def ...

def scrape_items():
    # This contains code which creates a pool and then calls
    # pool.map(do_scrape, s), where s is a list of items
    ...

def save_scrape():
    ...

def update_price():
    ...

def main():
    ...

main()

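Basically, scrape_items is called by main, and scrape_items then uses pool.map(do_scrape, s), which calls the do_scrape class and passes it the items in the list one by one. do_scrape scrapes the web page at the item's URL from "s" and saves the information in the global info, which is the multiprocessing.Manager dict. The code above doesn't show any if __name__ == '__main__': statement; it's only a rough outline of how it works in my OSX setup, where it runs and does its job. If anyone has any advice, I'd be very grateful. Thanks.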

Tags: cross-platform, process management, debugging, multiprocessing, web scraping, selenium, AttributeError, inter-thread communication

2 Answers

Find the entry point of your program and make sure only that part goes inside your if statement. For example:

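# Imports...
import datetime
from multiprocessing import Pool
from multiprocessing import Manager

# Note: as the question reports, creating the Manager at import time
# like this still raises the "use if __name__ == '__main__'" error on Windows.
manager = Manager()
info = manager.dict([])
date = str(datetime.date.today())

class do_scrape():
    def __init__(self, item):
        ...  # scrape the item's URL and store the result in the global info
    # def ...

def scrape_items():
    # This contains code which creates a pool and then calls
    # pool.map(do_scrape, s), where s is a list of items
    ...

def save_scrape():
    ...

def update_price():
    ...

def main():
    ...

if __name__ == "__main__":
    main()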

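Put simply, the code inside the if runs only when you execute this file directly. If the file is imported from another file, all of its attributes still get defined, so you can access them without actually running the module's main logic.

For more, see: What does if __name__ == "__main__": do?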
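To make the guard's behaviour concrete, here is a minimal sketch. The file name guard_demo.py and the functions build_greeting and main are hypothetical, chosen only for illustration. It matters for multiprocessing because on Windows each worker starts as a fresh interpreter that imports the main script again under a different module name (`__mp_main__` in current Python 3), so anything outside the guard runs again in every worker.

```python
# guard_demo.py (hypothetical file name)


def build_greeting(name):
    """Defining a function has no side effects, so importing this is safe."""
    return "Hello, " + name


def main():
    # Work that should happen only when the file is run as a script.
    print(build_greeting("world"))


if __name__ == "__main__":
    # True for `python guard_demo.py`.
    # False for `import guard_demo`, where __name__ is "guard_demo".
    main()
```

Importing guard_demo from another file defines build_greeting and main without printing anything; running it directly calls main().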

Answered 2025-04-18 by Python大师


It would help to see your code, but it sounds like you just need to pass the shared dict explicitly to the scrape function, like this:

import multiprocessing
from functools import partial

def scrape(info, item):
    # Use info in here, e.g. store the scraped result for this item
    info[item] = None  # placeholder for the real scraped data

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    info = manager.dict()
    pool = multiprocessing.Pool()
    func = partial(scrape, info)  # use a partial to make it easy to pass the dict to pool.map
    items = [1, 2, 3, 4, 5]  # This would be your actual data
    results = pool.map(func, items)
    # pool.apply_async(scrape, [info, "abc"])  # In case you're not using map...
    pool.close()
    pool.join()

Note that you shouldn't put all of your code inside the if __name__ == "__main__": guard, only the code that actually creates processes, such as creating the Manager and the Pool.

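Any method you want to run in a child process must be declared at the top level of the module, because it has to be importable from __main__ in the child. When you put scrape inside the if __name__ ... guard, it can no longer be imported from the __main__ module, which is why you saw AttributeError: 'module' object has no attribute 'scrape'.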
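The top-level requirement follows from how the worker function reaches the pool: it is pickled, and pickle stores a function as a module name plus a qualified name rather than as code. The sketch below illustrates this by analogy, under stated assumptions. A nested function stands in for any function the receiving side cannot look up by name, and make_nested, nested_scrape, and can_pickle are hypothetical helpers, not part of the asker's program.

```python
import pickle


def scrape(item):
    # Defined at module top level, so it can be found by name after import.
    return item


def make_nested():
    def nested_scrape(item):
        # Exists only inside make_nested(); no importable name leads to it.
        return item
    return nested_scrape


def can_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("top-level function picklable:", can_pickle(scrape))
    print("nested function picklable:", can_pickle(make_nested()))
```

A function defined under the guard fails for a related reason: the child re-imports the main script, the guard body does not run there, and the name the pickle refers to is never defined.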

Edit:

Taking your example:

import datetime
import multiprocessing
from functools import partial

date = str(datetime.date.today())

#class do_scrape():
#    def __init__():
#    def...
def do_scrape(info, s):
    # do stuff
    # Also note that do_scrape should probably be a function, not a class
    info[s] = None  # placeholder: store the real scraped data for s here

def scrape_items():
    # scrape_items is called by main(), which is protected by an `if __name__ ...` guard
    # so this is ok.
    manager = multiprocessing.Manager()
    info = manager.dict([])
    pool = multiprocessing.Pool()
    func = partial(do_scrape, info)
    s = [1, 2, 3, 4, 5]  # Substitute with the real s
    results = pool.map(func, s)
    pool.close()
    pool.join()

def save_scrape():
    pass  # body omitted in the original outline

def update_price():
    pass  # body omitted in the original outline

def main():
    scrape_items()

if __name__ == "__main__":
    # Note that you can declare manager and info here, instead of in scrape_items, if you wanted
    #manager = multiprocessing.Manager()
    #info = manager.dict([])
    main()

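One more important point: the first argument to map should be a function rather than a class. The documentation says as much, describing Pool.map as a parallel equivalent of the built-in map function.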
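As a small illustration of that equivalence, the sketch below applies the same top-level function with the built-in map and with Pool.map and compares the results. The names square and parallel_squares and the pool size of 2 are assumptions made for the example.

```python
import multiprocessing


def square(x):
    # Top-level function: worker processes can import it by name.
    return x * x


def parallel_squares(numbers):
    # The pool is created inside a function, so nothing is spawned at import time.
    with multiprocessing.Pool(processes=2) as pool:
        return pool.map(square, numbers)


if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]
    serial = list(map(square, numbers))   # built-in map returns an iterator in Python 3
    parallel = parallel_squares(numbers)  # Pool.map returns a list, in input order
    print(serial == parallel)
```

Keeping the pool creation behind the guard is what lets the same file run on Windows as well as on OSX and Linux.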

Answered 2025-04-18 by Python大师
