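I am writing a Python script to extract frequently occurring connections between labeled nodes in large graph datasets (>100k graphs at a time, each graph with 10 to 20 nodes). Is there a better way to do this in a reasonable amount of time?
Currently, my solution is to create an adjacency matrix for each graph and extract the connections from it: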
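For anyone who wants to reproduce the timing, here is a minimal setup sketch for the globals the code relies on. The sizes, the random seed, and the 30% edge density are assumptions chosen purely for illustration; the real dataset is not shown here.

```python
import numpy as np

# Hypothetical synthetic data: the real dataset is not part of this post.
rng = np.random.default_rng(0)

graph = 1000          # number of graphs (the real dataset has >100k)
node = 15             # nodes per graph (the real graphs have 10 to 20)
idconn = 2 * node     # number of connection ids, as defined in the post
sel_conn = 10         # how many of the most frequent connections to keep

# One binary adjacency matrix per graph, shape (graph, node, node).
mat = (rng.random((graph, node, node)) < 0.3).astype(float)
```

Raising `graph` towards 100000 should give a rough idea of how the runtime scales.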
import numpy as np

# Build the idconn x graph matrix.
# mat = np.zeros([graph, node, node]) is the adjacency matrix of the dataset.
# idconn = 2*node is the maximum number of possible connections between
# nodes (this is mandatory).
# sel_conn = 10 for my example.
def arr_bidim(mat):
    arr_bd = np.zeros([idconn, graph])
    for i in range(0, graph):
        for x in range(0, node):
            for j in range(0, node):
                # Rows 0..node-1: entries of the first adjacency row (x == 0).
                if arr_bd[j, i] == 0 and x == 0:
                    arr_bd[j, i] = mat[i, x, j]
                # Rows node..2*node-1: diagonal entries; x == 0 is forced to 0.
                if arr_bd[node + x, i] == 0 and x == j:
                    if x == 0:
                        arr_bd[node + x, i] = 0
                    else:
                        arr_bd[node + x, i] = mat[i, x, j]
    return arr_bd

# Create the array with the most frequent connections.
# Columns of arr_f: connection id, frequency, best partner id, co-occurrence count.
def frq(arr_bd):
    arr_f = np.zeros([idconn, 4])
    for x in range(0, idconn):  # finds the most frequent connection
        for i in range(0, graph):
            arr_f[x, 1] += arr_bd[x, i]
        arr_f[x, 0] = x
        # A connection present in every graph is not informative; reset it.
        if arr_f[x, 1] == graph:
            arr_f[x, 1] = 0
    # Sort by frequency, descending, and keep the top sel_conn rows.
    arr_f = np.flipud(arr_f[arr_f[:, 1].argsort(kind='quicksort')])
    arr_f = np.delete(arr_f, slice(sel_conn, idconn), axis=0)
    return arr_f

# "Cluster" the co-occurring connections.
def find_cluster():
    arr_bd = arr_bidim(mat)
    arr_f = frq(arr_bd)
    temp = np.zeros([idconn])
    for t in range(0, sel_conn):
        i = int(arr_f[t, 0])
        for x in range(0, idconn):
            temp[x] = 0
            if x != i:
                # Count the graphs in which connections i and x both occur.
                for y in range(0, graph):
                    if arr_bd[i, y] == 1 and arr_bd[x, y] == 1:
                        temp[x] += 1
                # Keep the partner with the highest co-occurrence count.
                if arr_f[t, 3] < temp[x]:
                    arr_f[t, 3] = temp[x]
                    arr_f[t, 2] = x
    return arr_f

arr_f = find_cluster()
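
One direction that may help is replacing the nested Python loops with NumPy array operations. The sketch below is an assumption about how the three steps above could be vectorized, not a measured improvement. The function names `encode_connections`, `top_connections`, and `best_partners` are hypothetical, and ties between equally frequent connections may be ordered differently than in the quicksort-based version.

```python
import numpy as np

def encode_connections(mat):
    """Vectorized counterpart of arr_bidim: returns a (2*node, graph) array."""
    n_graph, n_node, _ = mat.shape
    arr_bd = np.zeros((2 * n_node, n_graph))
    # Rows 0..node-1: first adjacency row of every graph.
    arr_bd[:n_node] = mat[:, 0, :].T
    # Rows node+1..2*node-1: diagonal entries for x >= 1 (row `node` stays 0).
    diag = np.diagonal(mat, axis1=1, axis2=2)  # shape (graph, node)
    arr_bd[n_node + 1:] = diag[:, 1:].T
    return arr_bd

def top_connections(arr_bd, sel_conn):
    """Ids and frequencies of the sel_conn most frequent connections."""
    n_graph = arr_bd.shape[1]
    counts = arr_bd.sum(axis=1)
    counts[counts == n_graph] = 0  # ignore connections present in every graph
    order = np.argsort(counts, kind="stable")[::-1][:sel_conn]
    return order, counts[order]

def best_partners(arr_bd, top_ids):
    """For each top connection, the other connection it co-occurs with most."""
    present = (arr_bd == 1).astype(np.int64)
    # co[t, x] = number of graphs containing both top_ids[t] and x.
    co = present[top_ids] @ present.T
    rows = np.arange(len(top_ids))
    co[rows, top_ids] = 0  # a connection is not its own partner
    partner = co.argmax(axis=1)
    return partner, co[rows, partner]
```

With the globals from the post, the equivalent of `find_cluster()` would be `arr_bd = encode_connections(mat)`, then `top_ids, freq = top_connections(arr_bd, sel_conn)`, then `partner, co_count = best_partners(arr_bd, top_ids)`. The matrix product computes all pairwise co-occurrence counts at once, which is what the innermost loop over `graph` does one pair at a time.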
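This takes about 1 minute 20 seconds. I would like to know whether it can be optimized somehow, or whether another algorithm could produce similar results in more general cases (i.e. larger datasets, or more than two connections detected as a "pattern").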
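
For the "more than two connections" case, the problem resembles frequent itemset mining, where each graph is a transaction and each connection id is an item. As a rough sketch only, the brute-force enumeration below counts how many graphs contain every connection in each candidate group of size `k`. The function name `frequent_sets` and the parameters `candidate_ids` and `min_support` are hypothetical, and it assumes an `idconn x graph` 0/1 array like the one returned by `arr_bidim`.

```python
from itertools import combinations

import numpy as np

def frequent_sets(arr_bd, candidate_ids, k, min_support):
    """Map each k-tuple of connection ids to the number of graphs that
    contain all of them, keeping tuples with at least min_support graphs."""
    present = arr_bd == 1  # boolean (connections, graphs) array
    result = {}
    for combo in combinations(candidate_ids, k):
        # A graph supports the tuple only if every connection is present.
        all_present = np.logical_and.reduce(present[list(combo)], axis=0)
        support = int(all_present.sum())
        if support >= min_support:
            result[combo] = support
    return result
```

The number of tuples grows combinatorially with `k`, so this is only practical when `candidate_ids` is small, for example the `sel_conn` most frequent connections. A pruning scheme such as Apriori, which extends only groups that are already frequent, would be the usual next step.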