Is there a faster way in Python to compare every item of a matrix with every other item?

Posted 2024-06-12 13:31:26


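I know there are other questions similar to this one, but none of them covers what I am looking for. I am looking for a relatively fast way to compute the similarity of every item in a matrix to every other item. I am testing an NLP technique to measure how effective it is at gauging document similarity.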

So I have a matrix with rows like this (I currently store it as a dictionary, but I could convert it to a matrix):

A = [a1, a2, …, am]
B = [b1, b2, …, bm]
...
N = [n1, n2, …, nm]
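A minimal sketch of that dictionary-to-matrix conversion, assuming every row is a dense list of the same length m. The helper name `dict_to_matrix` and the sample data are hypothetical and only illustrate the shape of the result:

```python
import numpy as np

def dict_to_matrix(documents):
    """Stack {category: [row, ...]} into one 2-D array plus a parallel label array.

    Assumption: every row is a dense sequence of numbers with the same length.
    """
    rows, labels = [], []
    for category, vectors in documents.items():
        for vec in vectors:
            rows.append(vec)
            labels.append(category)   # remember which category each row came from
    return np.asarray(rows, dtype=float), np.asarray(labels)

# Hypothetical sample data: two categories, three documents, m = 2
X, labels = dict_to_matrix({"A": [[1, 2], [3, 4]], "B": [[5, 6]]})
```

Keeping the labels in a separate array means the category structure survives even though the rows now live in one matrix.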

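My algorithm iterates over each category. Within a category, I iterate over the elements in that category. Then, for each element, I iterate over every other element in the same category and find the average similarity. For each element, I also iterate over every element outside the category and find the average similarity. I keep averaging the per-element averages, which gives me a kind of "meta-average" of how similar inCategory documents are compared with outCategory documents. Here is my code: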

import numpy as np
from gensim import matutils


class SimilarityTesting:
    def __init__(self, documents, space, documentdictionary, tfidf=None, hes=False, printer=False):
        self.documents = documents                    # {category label: [document, ...]}
        self.space = space                            # model the documents are projected into
        self.documentDictionary = documentdictionary  # maps str(document) to a document ID
        self.tfidf = tfidf
        self.hes = hes                                # use Hellinger distance instead of cosine
        self.printer = printer                        # verbose output
        self.pairwiseDictionary = {}                  # cache of similarities already computed

        #self.testingExecute()

    def testingExecute(self):
        withinCategorySum = 0
        outsideCategorySum = 0
        i = 0

        for categoryLabel, categoryDocuments in self.documents.items():
            categoryAverage, setAverage = self.categoryComparison(categoryLabel, categoryDocuments)

            withinCategorySum += categoryAverage
            outsideCategorySum += setAverage
            i += 1

        withinAverage = withinCategorySum / i
        outsideAverage = outsideCategorySum / i
        ratio = withinAverage / outsideAverage

        print("The average similarity of documents within their category is %s" % withinAverage)
        print("The average similarity of documents not within their category is %s" % outsideAverage)
        print("The Ratio of Difference is %s" % ratio)
        print()
        return withinAverage, outsideAverage, ratio

    def categoryComparison(self, categoryLabel, categoryDocuments):
        if self.printer: print("CATEGORY ", categoryLabel)
        categorySum = 0
        nonCategorySum = 0
        i = 0

        # Every document that belongs to some other category, flattened into one list
        nonCategory = [x for x in self.documents.values() if x != categoryDocuments]
        flatNoncategory = [val for sublist in nonCategory for val in sublist]

        #I believe this is the best place to do parallelization
        for element in categoryDocuments:
            categorySim = self.itemPairCompare(element, categoryDocuments)
            categorySum += categorySim

            nonCategorySim = self.itemPairCompare(element, flatNoncategory)
            nonCategorySum += nonCategorySim

            i += 1
        categoryAverage = categorySum / i
        noncategoryAverage = nonCategorySum / i

        if self.printer:
            print()
            print("AVERAGE SIMILARITY OF CATEGORY ITEM-IN-CATEGORY SIMILARITY ", categoryAverage)
            print("AVERAGE SIMILARITY OF CATEGORY'S ITEM-NOT-IN-CATEGORY SIMILARITY ", noncategoryAverage)
            print()
            print()

        return categoryAverage, noncategoryAverage

    def itemPairCompare(self, item, listDocuments):
        #print("ITEM WITHIN CATEGORY")

        total = 0
        i = 0

        for value in listDocuments:
            if item != value:
                itemID = self.documentDictionary[str(item)]
                valueID = self.documentDictionary[str(value)]

                pair1ID = itemID + valueID
                pair2ID = valueID + itemID

                # Reuse the cached similarity if this pair was already seen in either order;
                # only compute (and cache) it otherwise.
                if pair1ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair1ID]
                elif pair2ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair2ID]
                else:
                    if self.tfidf:
                        vec1 = self.space[self.tfidf[item]]
                        vec2 = self.space[self.tfidf[value]]
                        sim = matutils.cossim(vec1, vec2)
                    elif self.hes:
                        vec1 = self.space[item]
                        vec2 = self.space[value]
                        dense1 = matutils.sparse2full(vec1, self.space.num_topics)
                        dense2 = matutils.sparse2full(vec2, self.space.num_topics)
                        hes = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
                        sim = 1 - hes
                    else:
                        sim = matutils.cossim(self.space[item], self.space[value])
                    self.pairwiseDictionary[pair1ID] = sim
                #print(sim)

                total += sim
                i += 1

        # Guard against an empty comparison list (e.g. a category with a single document)
        average = total / i if i else 0.0
        if self.printer: print("ITEM'S AVERAGE SIMILARITY TO ITEMS IN CATEGORY IS ", average)
        return average

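This code sits downstream of a larger pipeline that involves tokenization and converting documents to a bag of words, so I cannot provide a real example matrix to run it on. The code I have provided is meant more to show where I currently am. Basically, I want to know whether I can create a matrix of vectors and then use NumPy to compare each vector with every other vector faster than by iterating.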
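One way this might look, as a sketch under stated assumptions rather than a drop-in replacement: if the documents can be stacked into a dense array `X` with one row per document, plus a parallel `labels` array (both names hypothetical), a single matrix product gives every pairwise cosine similarity, and boolean masks give the within-category and outside-category averages. The sketch assumes at least two categories and at least two documents per category, and it excludes only a document's comparison with itself, whereas the loop version skips any document equal to the current one.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the n x n matrix of cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid dividing by zero for all-zero rows
    unit = X / norms
    return unit @ unit.T

def within_outside_averages(X, labels):
    """Average per-document similarities per category, then across categories."""
    sims = cosine_similarity_matrix(X)
    labels = np.asarray(labels)
    n = len(labels)

    same = labels[:, None] == labels[None, :]
    other = ~same                            # pairs from different categories
    same = same & ~np.eye(n, dtype=bool)     # drop each document's comparison with itself

    # Per-document mean similarity to its own category and to everything else
    within_doc = (sims * same).sum(axis=1) / same.sum(axis=1)
    outside_doc = (sims * other).sum(axis=1) / other.sum(axis=1)

    # Mean per category first, then mean of those category means
    cats = np.unique(labels)
    within = np.mean([within_doc[labels == c].mean() for c in cats])
    outside = np.mean([outside_doc[labels == c].mean() for c in cats])
    return within, outside
```

With gensim sparse vectors, `matutils.corpus2dense` returns a terms-by-documents array, so its transpose would be the `X` assumed here. The full similarity matrix needs memory proportional to the square of the number of documents, which may be the limiting factor for large collections.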

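I save some time by building a dictionary of element pairs that have already been seen, which avoids repeated computation. However, this still takes a very long time to run (I believe it is O(NM^2)), and I am trying to figure out ways to optimize or parallelize it. Is this something NumPy is good at? I have also read that multi-core processing in Python is somewhat difficult. Is this the kind of problem that multiprocessing is meant to solve? Does anyone have suggestions on how to do this in a better way? Thanks.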
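For the Hellinger option (`hes`) in the code above, the pair dictionary could in principle be replaced by computing the whole similarity matrix once. The following is only a sketch: it assumes the topic distributions have already been converted to a dense array `P` with one row per document (a hypothetical name), and it uses the same formula as the loop, `1 - sqrt(0.5 * sum((sqrt(p) - sqrt(q))**2))`.

```python
import numpy as np

def hellinger_similarity_matrix(P):
    """Return 1 - Hellinger distance for every pair of rows of P.

    Assumption: P is a dense (n_docs, n_topics) array of non-negative values.
    """
    root = np.sqrt(np.asarray(P, dtype=float))
    # Squared Euclidean distance between the square-rooted rows,
    # expanded as |a|^2 + |b|^2 - 2 a.b so one matrix product covers all pairs
    gram = root @ root.T
    sq_norms = np.diag(gram)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    np.maximum(sq_dist, 0.0, out=sq_dist)    # clip tiny negatives caused by rounding
    return 1.0 - np.sqrt(0.5 * sq_dist)
```

Because the result is symmetric, each pair is effectively computed once, which is the saving the dictionary cache was aiming for.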

