Optimizing a graph mining / pattern recognition approach

Posted 2024-04-19 12:02:59


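I am writing a Python script to extract frequently occurring connections between labelled nodes in a large dataset of graphs (more than 100k graphs at a time, each with 10 to 20 nodes). Is there a better way to do this in a reasonable amount of time?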

Currently, my solution is to create an adjacency matrix for each graph and extract the connections from it.
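To make the data layout concrete, here is a minimal sketch of how the stacked adjacency matrices could be built. The input format `edge_lists` (one list of `(source, target)` node-index pairs per graph) and the helper name `build_adjacency` are assumptions for illustration only; the real dataset may be stored differently.

```python
import numpy as np

def build_adjacency(edge_lists, node):
    """Stack one node x node adjacency matrix per graph.

    edge_lists is a hypothetical input format: one list of
    (source, target) node-index pairs per graph.
    """
    mat = np.zeros((len(edge_lists), node, node))
    for g, edges in enumerate(edge_lists):
        for src, dst in edges:
            mat[g, src, dst] = 1  # mark a directed connection src -> dst
    return mat

# Two toy graphs with 3 labelled nodes each
mat = build_adjacency([[(0, 1), (1, 2)], [(0, 2)]], node=3)
print(mat.shape)  # (2, 3, 3)
```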

import numpy as np

# Create the idconn x graph matrix.
# mat = np.zeros([graph, node, node]) is the adjacency matrix of the dataset
# idconn = 2*node is the maximum number of possible connections between
# nodes (this is mandatory)
# sel_conn = 10 for my example
def arr_bidim(mat):
    arr_bd = np.zeros([idconn, graph])
    for i in range(0, graph):
        for x in range(0, node):
            for j in range(0, node):
                if arr_bd[(j,i)] == 0 and x == 0:
                    arr_bd[(j,i)] = mat[(i,x,j)] 
                if arr_bd[(node+x,i)] == 0 and x == j:
                    if x == 0:
                        arr_bd[(node+x,i)] = 0
                    else:
                        arr_bd[(node+x,i)] = mat[(i,x,j)] 
    return arr_bd

# Create the array with the most frequent connections

def frq(arr_bd):
    arr_f = np.zeros([idconn, 4])
    for x in range(0, idconn): #finds the most frequent connection
        for i in range(0, graph):
            arr_f[(x,1)] += arr_bd[(x,i)]
        arr_f[(x,0)] = x
        if arr_f[(x, 1)] == graph:
            arr_f[(x, 1)] = 0
    arr_f = np.flipud(arr_f[arr_f[:,1].argsort(kind='quicksort')]) 
    arr_f = np.delete(arr_f, slice(sel_conn, idconn), axis = 0)
    return arr_f

# "Cluster" the co-occurring connections

def find_cluster():
    arr_bd = arr_bidim(mat)
    arr_f = frq(arr_bd)
    temp = np.zeros([idconn])
    for t in range(0, sel_conn):
        i = int(arr_f[(t,0)]) 
        for x in range(0, idconn): 
            temp[x] = 0
            if x != i:
                for y in range(0, graph):
                    if (arr_bd[(i,y)] == 1) & (arr_bd[(x,y)] == 1):
                        temp[x] += 1
                    if (arr_f[(t,3)] < temp[x]):
                        arr_f[(t,3)] = temp[x]
                        arr_f[(t,2)] = x
    return arr_f

arr_f = find_cluster()
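For comparison, the same three steps can probably be written with NumPy array operations instead of nested Python loops. The following is only a sketch under assumptions: the `_vec` function names are made up here, `mat` and `sel_conn` are passed as arguments rather than read from globals, and the random `mat` at the end is placeholder data, not the real dataset. The co-occurrence counts come from a single matrix product of the 0/1 presence matrix with its transpose.

```python
import numpy as np

def arr_bidim_vec(mat):
    """Vectorized counterpart of arr_bidim: returns a (2*node, graph) array."""
    n_graph, n_node, _ = mat.shape
    arr_bd = np.zeros((2 * n_node, n_graph))
    # Rows 0..node-1: row 0 of every adjacency matrix
    arr_bd[:n_node] = mat[:, 0, :].T
    # Rows node+1..2*node-1: diagonal entries mat[i, x, x] for x >= 1
    # (row `node` stays 0, as in the loop version)
    diag = np.diagonal(mat, axis1=1, axis2=2)  # shape (graph, node)
    arr_bd[n_node + 1:] = diag[:, 1:].T
    return arr_bd

def frq_vec(arr_bd, sel_conn):
    """Top sel_conn connections by count; columns 2 and 3 are filled in later."""
    n_graph = arr_bd.shape[1]
    counts = arr_bd.sum(axis=1)
    counts[counts == n_graph] = 0  # ignore connections present in every graph
    order = np.argsort(counts)[::-1][:sel_conn]
    arr_f = np.zeros((len(order), 4))
    arr_f[:, 0] = order
    arr_f[:, 1] = counts[order]
    return arr_f

def find_cluster_vec(mat, sel_conn):
    """For each top connection, find the connection it co-occurs with most."""
    arr_bd = arr_bidim_vec(mat)
    arr_f = frq_vec(arr_bd, sel_conn)
    present = (arr_bd == 1).astype(np.int64)
    top = arr_f[:, 0].astype(int)
    # co[t, x] = number of graphs containing both connection top[t] and x
    co = present[top] @ present.T
    co[np.arange(len(top)), top] = 0  # a connection is not paired with itself
    arr_f[:, 2] = co.argmax(axis=1)
    arr_f[:, 3] = co.max(axis=1)
    return arr_f

# Placeholder data: 1000 random graphs with 10 nodes each
rng = np.random.default_rng(0)
demo_mat = (rng.random((1000, 10, 10)) < 0.3).astype(float)
demo_result = find_cluster_vec(demo_mat, sel_conn=10)
print(demo_result.shape)  # (10, 4)
```

On ties, `argmax` picks the lowest connection index, which matches the strict `<` comparison in the loop version.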

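Running find_cluster() as written takes about 1 minute 20 seconds. I would like to know whether this can be optimized in some way, or whether there are other algorithms that produce similar results in the general case (i.e., larger datasets, or "patterns" made of more than two connections).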
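To illustrate what a "pattern" of more than two connections could look like, here is a brute-force sketch that counts how many graphs contain every connection in each k-sized subset of some candidate connections. This is a simplified form of frequent itemset mining, not the code above; the function name `frequent_sets`, the `min_support` threshold, and the toy `arr_bd` are assumptions made for the example.

```python
from itertools import combinations

import numpy as np

def frequent_sets(arr_bd, candidates, k, min_support):
    """Return {k-tuple of connection ids: number of graphs containing all of them}.

    arr_bd is a (connection, graph) 0/1 matrix; candidates is an iterable of
    row indices to combine (for example the sel_conn most frequent ones).
    """
    present = arr_bd == 1
    result = {}
    for combo in combinations(sorted(candidates), k):
        # A graph supports the combo only if every connection in it is present
        support = int(present[list(combo)].all(axis=0).sum())
        if support >= min_support:
            result[combo] = support
    return result

# Toy matrix: 3 connections (rows) across 4 graphs (columns)
toy_bd = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1]])
pairs = frequent_sets(toy_bd, candidates=[0, 1, 2], k=2, min_support=2)
triples = frequent_sets(toy_bd, candidates=[0, 1, 2], k=3, min_support=1)
print(pairs)    # {(0, 1): 2, (0, 2): 2}
print(triples)  # {(0, 1, 2): 1}
```

The number of subsets grows quickly with k and with the number of candidates, so this would only be practical when the candidates are restricted to a small set of frequent connections.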

