Converting a graph traversal to multiprocessing in Python
I have been working on a graph traversal algorithm over a simple network, and I would like to run it using multiprocessing, since scaling it up to the whole network is going to require a lot of I/O. The simple version runs fairly quickly:
# GH is the target graph; view[user] yields stored rows whose .value dict
# holds that user's 'following' and 'followers' lists
already_seen = {}
already_seen_get = already_seen.get

GH_add_node = GH.add_node
GH_add_edge = GH.add_edge
GH_has_node = GH.has_node
GH_has_edge = GH.has_edge

def graph_user(user, depth=0):
    logger.debug("Searching for %s", user)
    logger.debug("At depth %d", depth)
    users_to_read = followers = following = []

    if already_seen_get(user):
        logger.debug("Already seen %s", user)
        return None

    result = [x.value for x in list(view[user])]

    if result:
        result = result[0]
        following = result['following']
        followers = result['followers']
        users_to_read = set().union(following, followers)

    if not GH_has_node(user):
        logger.debug("Adding %s to graph", user)
        GH_add_node(user)

    for follower in users_to_read:
        if not GH_has_node(follower):
            GH_add_node(follower)
            logger.debug("Adding %s to graph", follower)
        if depth < max_depth:
            graph_user(follower, depth + 1)
        if GH_has_edge(follower, user):
            GH[follower][user]['weight'] += 1
        else:
            GH_add_edge(user, follower, weight=1)
And it is actually significantly faster than my multiprocessing version:
from multiprocessing import Pool, Queue
from queue import Empty  # multiprocessing.Queue.get raises queue.Empty on timeout

to_write = Queue()
to_read = Queue()
to_edge = Queue()
already_seen = Queue()  # note: never actually consumed below

def fetch_user():
    seen = {}
    read_get = to_read.get
    read_put = to_read.put
    write_put = to_write.put
    edge_put = to_edge.put
    seen_get = seen.get

    while True:
        try:
            logging.debug("Begging for a user")
            user = read_get(timeout=1)
            if seen_get(user):
                continue
            logging.debug("Adding %s", user)
            seen[user] = True
            result = [x.value for x in list(view[user])]
            write_put(user, timeout=1)
            if result:
                result = result.pop()
                logging.debug("Got user %s and result %s", user, result)
                following = result['following']
                followers = result['followers']
                users_to_read = list(set().union(following, followers))
                for x in users_to_read:
                    edge_put((user, x, {'weight': 1}))
                for y in users_to_read:
                    if not seen_get(y):
                        read_put(y, timeout=1)
        except Empty:
            logging.debug("Fetches complete")
            return

def write_node():
    users = []
    users_app = users.append
    write_get = to_write.get

    while True:
        try:
            user = write_get(timeout=1)
            logging.debug("Writing user %s", user)
            users_app(user)
        except Empty:
            logging.debug("Users complete")
            return users

def write_edge():
    edges = []
    edges_app = edges.append
    edge_get = to_edge.get

    while True:
        try:
            edge = edge_get(timeout=1)
            logging.debug("Writing edge %s", edge)
            edges_app(edge)
        except Empty:
            logging.debug("Edges complete")
            return edges

if __name__ == '__main__':
    pool = Pool(processes=1)
    to_read.put(me)
    pool.apply_async(fetch_user)
    users = pool.apply_async(write_node)
    edges = pool.apply_async(write_edge)
    # the edge tuples carry an attribute dict, so add_edges_from
    # rather than add_weighted_edges_from
    GH.add_edges_from(edges.get())
    GH.add_nodes_from(users.get())
    pool.close()
    pool.join()
I cannot work out why the single-process version is so much faster. In theory, the multiprocessing version should be able to read and write simultaneously. I suspect there is lock contention on the queues and that this is the cause of the slowdown, but I have no hard evidence. As I scale the number of fetch_user processes it does seem to run faster, but then I run into problems keeping the data synchronized. Some questions I have:
- Is this really a good fit for multiprocessing? I originally wanted to use it to fetch data from the database in parallel.
- How do I avoid resource contention when reading from and writing to the same queue?
- Am I missing some obvious design flaw?
- How can I share a lookup table between the readers so I don't keep fetching the same user twice? (The closest I have come is the Manager sketch right after this list.)
- As I scale up the number of fetch processes, the writers eventually lock up. It looks like the write queue is not being written to while the read queue fills up. Is there a better way to handle this situation than timeouts and exception handling?
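On the lookup-table question, the closest I have come is a multiprocessing.Manager dict, roughly like the sketch below (fetch_user_shared is just an illustrative rework of fetch_user above), but I am not sure it actually removes the contention:

from multiprocessing import Manager
from queue import Empty

manager = Manager()
shared_seen = manager.dict()  # one proxy visible to every fetch worker

def fetch_user_shared():
    while True:
        try:
            user = to_read.get(timeout=1)
        except Empty:
            return
        if user in shared_seen:
            continue
        # check-then-set is two separate round-trips to the manager process,
        # so two workers can still race past the check; at worst a user is
        # fetched twice
        shared_seen[user] = True
        # rest of the loop is the same as fetch_user above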
1 Answer
Queues in Python are synchronized. This means that only one thread at a time can read from or write to a queue, and that serialization will bottleneck your application.
A better solution is to distribute the processing with a hash function and assign work to the threads with a simple modulo operation. So, for example, if you have 4 threads you could set up 4 queues:
thread_queues = []
for i in range(4):
    thread_queues.append(Queue())

for user in user_list:
    user_hash = hash(user.user_id)  # hash here is just a shortcut to some standard hash utility
    thread_id = user_hash % 4
    thread_queues[thread_id].put(user)

# From here on, your pool of threads accesses thread_queues, but each thread
# ONLY accesses the one queue matching the numeric id it was given.
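On the consuming side, each worker then drains only its own queue, so no two workers ever contend for the same lock. A minimal sketch, assuming the thread_queues above and a placeholder handle_user for your fetch/write logic:

from queue import Empty

def worker(worker_id):
    my_queue = thread_queues[worker_id]  # the only queue this worker touches
    while True:
        try:
            user = my_queue.get(timeout=1)
        except Empty:
            return  # this shard is drained
        handle_user(user)  # placeholder for the actual fetch/write work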
Most hash functions will distribute your data evenly. I normally use UMAC, but you could also try the hash function from Python's string implementation.
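One caveat: Python 3 randomizes the built-in hash() for strings per interpreter launch, so if shard assignments ever need to be stable across runs, or computed in independently started processes, a digest-based function is safer. A sketch using hashlib (shard_id and the shard count of 4 are just illustrative):

import hashlib

def shard_id(user_id, shards=4):
    # md5 is used for bucketing only, not security; its digest is identical
    # in every process and on every run, unlike the built-in hash()
    return int(hashlib.md5(str(user_id).encode('utf-8')).hexdigest(), 16) % shards

# usage in the distribution loop above:
# thread_queues[shard_id(user.user_id)].put(user)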
Another improvement would be to avoid the use of queues and use a non-synchronized object, such as a list.
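For example, each worker can accumulate results in a private list and hand the list back through the pool result, so nothing is shared while the work runs. A rough sketch, where fetch_one and user_list stand in for your own fetch logic and seed data:

from multiprocessing import Pool

def fetch_batch(user_batch):
    rows = []  # plain local list: no locks, no proxies
    for user in user_batch:
        rows.append(fetch_one(user))  # fetch_one is a placeholder
    return rows

if __name__ == '__main__':
    pool = Pool(4)
    batches = [user_list[i::4] for i in range(4)]  # round-robin split
    results = pool.map(fetch_batch, batches)
    merged = [row for batch in results for row in batch]  # combine in the parent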