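I know there are other questions similar to this one, but none of them covers what I'm looking for. I'm looking for a relatively fast way to compute the similarity of every item in a matrix to every other item. I'm testing an NLP technique to see how effective it is at measuring document similarity.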
So I have a matrix with rows like this (I currently store it as a dictionary, but I can convert it to a matrix):
A=[a1,a2,…,am]
B=[b1,b2,…,bm]
...
N=[n1,n2,…,nm]
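For illustration, converting such a dictionary into a matrix could look roughly like the sketch below. The toy `rows` dictionary and the helper name `dict_to_matrix` are hypothetical, and it assumes every row has the same length m.

```python
import numpy as np

# Hypothetical toy data: each key names a document, each value holds its m features.
rows = {
    "A": [1.0, 0.0, 2.0],
    "B": [0.0, 1.0, 1.0],
    "N": [3.0, 1.0, 0.0],
}

def dict_to_matrix(rows):
    """Stack the dictionary values into an (n_documents, m) array.

    Returns the row names as well, so row i of the matrix can be traced
    back to the dictionary key it came from.
    """
    names = sorted(rows)  # fix a reproducible row order
    matrix = np.array([rows[name] for name in names], dtype=float)
    return names, matrix

names, matrix = dict_to_matrix(rows)
print(names)
print(matrix.shape)
```

Keeping the list of names next to the matrix means any similarity computed on row indices can still be mapped back to the original documents.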
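My algorithm iterates over each category. Within a category, I iterate over the elements of that category. Then, for each element, I iterate over every element in the same category and find the average similarity. For each element, I also iterate over every element outside the category and find the average similarity. I keep averaging each element's averages, which gives me a kind of "meta-average" of how similar in-category documents are compared with out-of-category documents. Here is my code: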
import numpy as np
import gensim
from gensim import matutils


class SimilarityTesting:
    def __init__(self, documents, space, documentdictionary, tfidf=None, hes=False, printer=False):
        self.documents = documents                      # {category label: list of documents}
        self.space = space                              # model mapping a document to a sparse vector
        self.documentDictionary = documentdictionary    # {str(document): document ID}
        self.tfidf = tfidf                              # optional tf-idf model applied before `space`
        self.hes = hes                                  # if True, use 1 - Hellinger distance
        self.printer = printer                          # if True, print intermediate averages
        self.pairwiseDictionary = {}                    # cache of similarities already computed
        #self.testingExecute()

    def testingExecute(self):
        withinCategorySum = 0
        outsideCategorySum = 0
        i = 0
        for categoryLabel, categoryDocuments in self.documents.items():
            categoryAverage, setAverage = self.categoryComparison(categoryLabel, categoryDocuments)
            withinCategorySum += categoryAverage
            outsideCategorySum += setAverage
            i += 1
        withinAverage = withinCategorySum / i
        outsideAverage = outsideCategorySum / i
        ratio = withinAverage / outsideAverage
        print("The average similarity of documents within their category is %s" % withinAverage)
        print("The average similarity of documents not within their category is %s" % outsideAverage)
        print("The Ratio of Difference is %s" % ratio)
        print()
        return withinAverage, outsideAverage, ratio

    def categoryComparison(self, categoryLabel, categoryDocuments):
        if self.printer:
            print("CATEGORY %s" % categoryLabel)
        categorySum = 0
        nonCategorySum = 0
        i = 0
        # Every document that belongs to some other category, flattened into one list.
        nonCategory = [x for x in self.documents.values() if x != categoryDocuments]
        flatNoncategory = [val for sublist in nonCategory for val in sublist]
        #I believe this is the best place to do parallelization
        for element in categoryDocuments:
            categorySim = self.itemPairCompare(element, categoryDocuments)
            categorySum += categorySim
            nonCategorySim = self.itemPairCompare(element, flatNoncategory)
            nonCategorySum += nonCategorySim
            i += 1
        categoryAverage = categorySum / i
        noncategoryAverage = nonCategorySum / i
        if self.printer:
            print()
            print("AVERAGE SIMILARITY OF CATEGORY ITEM-IN-CATEGORY SIMILARITY %s" % categoryAverage)
            print("AVERAGE SIMILARITY OF CATEGORY'S ITEM-NOT-IN-CATEGORY SIMILARITY %s" % noncategoryAverage)
            print()
            print()
        return categoryAverage, noncategoryAverage

    def itemPairCompare(self, item, listDocuments):
        # Average similarity of `item` to every different document in listDocuments.
        #print "ITEM WITHIN CATEGORY"
        total = 0  # renamed from `sum` to avoid shadowing the built-in
        i = 0
        for value in listDocuments:
            if item != value:
                itemID = self.documentDictionary[str(item)]
                valueID = self.documentDictionary[str(value)]
                pair1ID = itemID + valueID
                pair2ID = valueID + itemID
                # Reuse the cached value if this pair was already compared in either order;
                # otherwise compute the similarity and cache it.
                if pair1ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair1ID]
                elif pair2ID in self.pairwiseDictionary:
                    sim = self.pairwiseDictionary[pair2ID]
                elif self.tfidf:
                    vec1_tfidf = self.tfidf[item]
                    vec1 = self.space[vec1_tfidf]
                    vec2_tfidf = self.tfidf[value]
                    vec2 = self.space[vec2_tfidf]
                    sim = matutils.cossim(vec1, vec2)
                    self.pairwiseDictionary[pair1ID] = sim
                elif self.hes:
                    vec1 = self.space[item]
                    vec2 = self.space[value]
                    dense1 = gensim.matutils.sparse2full(vec1, self.space.num_topics)
                    dense2 = gensim.matutils.sparse2full(vec2, self.space.num_topics)
                    # Hellinger distance between the two topic distributions
                    hes = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
                    sim = 1 - hes
                    self.pairwiseDictionary[pair1ID] = sim
                else:
                    sim = matutils.cossim(self.space[item], self.space[value])
                    self.pairwiseDictionary[pair1ID] = sim
                #print sim
                total += sim
                i += 1
        average = total / i
        if self.printer:
            print("ITEM'S AVERAGE SIMILARITY TO ITEMS IN CATEGORY IS %s" % average)
        return average
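This code sits downstream of a larger pipeline that involves tokenization and converting documents to bags of words, so I can't provide a real example matrix to run it on. The code is here mostly to show where I currently am. Basically, I'd like to know whether I can create a matrix of vectors and then use numpy to compare each vector with every other vector faster than by iterating.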
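A minimal sketch of that idea, assuming the document vectors have already been stacked into a dense array with one row per document: normalizing every row to unit length turns a single matrix product into the full table of pairwise cosine similarities. The function name `cosine_similarity_matrix` and the toy array `X` are made up for illustration and are not part of the pipeline above.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Return the (n, n) matrix of cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows stay zero instead of dividing by 0
    unit = X / norms         # every non-zero row now has length 1
    return unit @ unit.T     # entry (i, j) is the cosine of rows i and j

# Hypothetical toy vectors: three documents, two features each.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
S = cosine_similarity_matrix(X)
print(S.round(3))
```

The result is symmetric, so each pair is effectively computed once by the matrix product and no pair cache is needed. For a very large number of documents the (n, n) result may not fit in memory, in which case the product could be computed in row blocks.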
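I save some time by keeping a dictionary of element pairs I have already seen, which avoids recomputing them. However, this still takes a long time to run (I believe it's O(NM^2)), and I'm trying to find ways to optimize or parallelize it. Is this something numpy is good at? I've also read that multicore processing in Python is somewhat difficult. Is this the kind of problem multiprocessing is meant to solve? Does anyone have suggestions for doing this in a better way? Thanks.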
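One possible direction, sketched under assumptions rather than tested on the real pipeline: if all pairwise similarities are available as an (n, n) array `S` and each document has a category label, boolean masks can replace the nested loops used for the within-category and outside-category averages. The function name `within_outside_averages` is hypothetical. The sketch assumes every category holds at least two documents and that there are at least two categories, and it excludes only a document's comparison with itself, whereas the loop version skips any pair of equal documents.

```python
import numpy as np

def within_outside_averages(S, labels):
    """Average in-category and out-of-category similarity from a similarity matrix.

    S      : (n, n) array, S[i, j] is the similarity of documents i and j
    labels : length-n sequence, labels[i] is the category of document i
    Averages per document first, then per category, then over categories.
    """
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]    # True where i and j share a category
    not_self = ~np.eye(len(labels), dtype=bool)  # False on the diagonal
    inside_mask = same & not_self
    outside_mask = ~same

    # Per-document mean similarity to same-category and other-category documents.
    inside_doc = (S * inside_mask).sum(axis=1) / inside_mask.sum(axis=1)
    outside_doc = (S * outside_mask).sum(axis=1) / outside_mask.sum(axis=1)

    # Mean over the documents of each category, then over the categories.
    categories = np.unique(labels)
    within = np.mean([inside_doc[labels == c].mean() for c in categories])
    outside = np.mean([outside_doc[labels == c].mean() for c in categories])
    return within, outside, within / outside

# Hypothetical similarities for four documents in two categories.
S = np.array([[1.0, 0.8, 0.2, 0.4],
              [0.8, 1.0, 0.1, 0.3],
              [0.2, 0.1, 1.0, 0.6],
              [0.4, 0.3, 0.6, 1.0]])
within, outside, ratio = within_outside_averages(S, [0, 0, 1, 1])
print(within, outside, ratio)
```

With this layout the expensive part is building `S` once; the averaging itself is a handful of array operations, so parallelizing it with multiprocessing may turn out to be unnecessary.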