Python architecture question

main:
    parse initial url
    call function level1 (data1)

function level1 (data1)
    parse the url, for data1
    use the required xpath to get the dom elements
    call the next function
    call level2 (data2)

function level2 (data2)
    parse the url, for data2
    use the required xpath to get the dom elements
    call the next function
    call level3 (data3)

function level3 (data3)
    parse the url, for data3
    use the required xpath to get the dom elements
    call the next function
    call level4 (data4)

function level4 (data4)
    parse the url, for data4
    use the required xpath to get the dom elements
    at the final function..
    --all the data output, and eventually returned to the server
    --at this point the data has elements from each function...

3 answers

User

#1 · edited 2024-05-13 14:31:30

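Take a look at the multiprocessing module. It lets you set up a work queue and a pool of workers, so that as you parse a page you can spawn tasks that are carried out by separate processes.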
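As a rough sketch of that idea, the version below walks the four levels one batch at a time and hands each batch to a worker pool. `parse_level`, the task tuples, and the `http://example.com` URL are all made up for illustration; a real `parse_level` would fetch the page and apply that level's XPath instead of inventing two child URLs.

```python
from itertools import chain
from multiprocessing import Pool

MAX_LEVEL = 4  # level1 .. level4, as in the question


def parse_level(task):
    """Hypothetical stand-in for one levelN function.

    A real version would fetch the URL and run that level's XPath.
    Here two child URLs are invented per page so the sketch runs offline.
    """
    level, url, collected = task
    collected = collected + [f"data{level} from {url}"]
    if level == MAX_LEVEL:
        # Final level: the record now holds one element from every level.
        return [("done", url, collected)]
    return [(level + 1, f"{url}/{i}", collected) for i in range(2)]


def crawl(start_url, map_fn=map):
    """Process one level at a time; map_fn decides where the work runs."""
    frontier = [(1, start_url, [])]
    finished = []
    while frontier:
        children = list(chain.from_iterable(map_fn(parse_level, frontier)))
        finished += [c[2] for c in children if c[0] == "done"]
        frontier = [c for c in children if c[0] != "done"]
    return finished


def crawl_parallel(start_url, workers=4):
    """Same crawl, with each level's pages parsed by a pool of processes."""
    with Pool(workers) as pool:
        return crawl(start_url, map_fn=pool.map)
```

Calling `crawl("http://example.com")` runs everything in one process, which is handy for debugging; `crawl_parallel("http://example.com")` does the same work through the pool and should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly.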

User

#2 · edited 2024-05-13 14:31:30

This sounds like a use case for MapReduce on Hadoop.

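Hadoop Map/Reduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner. In your case it would be a smaller cluster.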

A Map/Reduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
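To make the map and reduce roles concrete, here is a tiny single-process imitation of the pattern in plain Python. It is not Hadoop and uses no Hadoop API; the `mapper`, `reducer`, and `map_reduce` names and the sample URLs are assumptions chosen only to show the shape of a job (here, counting pages per host).

```python
from collections import defaultdict
from urllib.parse import urlparse


def mapper(url):
    """Emit (key, value) pairs for one input record: here (host, 1)."""
    yield urlparse(url).netloc, 1


def reducer(key, values):
    """Combine every value that was emitted under the same key."""
    return key, sum(values)


def map_reduce(records, mapper, reducer):
    """Run map, group by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Each mapper call only sees its own record, which is what
        # lets a real framework spread these calls across machines.
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


urls = ["http://a.example/1", "http://b.example/1", "http://a.example/2"]
counts = map_reduce(urls, mapper, reducer)
```

Because the mapper never looks at more than one record, the loop over `records` is the part a cluster would run in parallel.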

You mentioned that

i've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way to allow each client/function to be run directly from the master.

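As I understand it, you want one machine (box) to act as the master, with client boxes that run the other functions. For example, you could run the main() function on the master and parse the initial URL there. The nice part is that the tasks for each of these URLs can be run in parallel on different machines, since they appear to be independent of each other.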

Since level4 depends on level3, level3 depends on level2, and so on, you can pipe the output of each level into the next instead of calling one from inside the other.
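One way to picture the piping, as a minimal local sketch with generators: every level consumes a stream of records and emits a stream for the next level, and no level calls another directly. `make_level`, `pipeline`, the fan-out of two links per page, and the example URL are illustrative assumptions, not code from the question.

```python
from functools import reduce


def make_level(n, fanout=2):
    """Build a hypothetical levelN stage: records in, records out."""
    def stage(records):
        for url, collected in records:
            data = f"data{n} from {url}"  # placeholder for the XPath result
            # Emit one record per (invented) link found on this page.
            for i in range(fanout):
                yield f"{url}/{i}", collected + [data]
    stage.__name__ = f"level{n}"
    return stage


def pipeline(source, stages):
    """Feed the output stream of each stage into the next one."""
    return reduce(lambda stream, stage: stage(stream), stages, source)


stages = [make_level(n) for n in (1, 2, 3, 4)]
results = list(pipeline([("http://example.com", [])], stages))
```

Each record in `results` carries one element from every level, which matches the "data has elements from each function" requirement. Since the stages only communicate through their input and output streams, any one of them could later be moved to a separate process or machine.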

Here are the tutorials I would recommend for learning how to do this:

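The Hadoop tutorial is a simple introduction to and overview of what MapReduce is and how it works.
Michael Noll's tutorial covers how to use Hadoop from Python (the basic concepts of the Mapper and the Reducer).
Finally, a tutorial for a framework called Dumbo, from Last.fm, which automates and builds on Michael Noll's basic example for use in a production system.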

Hope this helps.

User

#3 · edited 2024-05-13 14:31:30

Check out the scrapy package. It lets you easily create "client applications" (a.k.a. crawlers or spiders) that go deep into a website.

brool and {a3} both have good suggestions for the distributed part of the project.
