Returning the documents most similar to a query document using cosine similarity in Python

    def getvectorKeywordIndex(self, documentList):
        """Map each keyword to the position of its element within the document vectors."""
        # Join all documents into a single string, then split it into words
        vocabularyString = " ".join(documentList)
        vocabularylist = vocabularyString.split()  # split() also drops empty tokens
        vocabularylist = list(set(vocabularylist))
        print('vocabularylist', vocabularylist)
        vectorIndex = {}
        offset = 0
        # Associate a position with each keyword; the position is the
        # dimension of the vector used to represent this word
        for word in vocabularylist:
            vectorIndex[word] = offset
            offset += 1
        print(vectorIndex)
        return vectorIndex, vocabularylist  # (keyword: position), vocabularylist

1 Answer

User

#1 · Posted 2024-04-19 00:22:17

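You should also pass vectorIndex to makeVector and use it to look up the index of each term in the documents and in the query. Ignore any term that does not appear in vectorIndex.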
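A minimal sketch of that idea, not the asker's actual code: the names `build_vector_index`, `make_vector`, `cosine` and `most_similar` are hypothetical, and plain Python lists stand in for NumPy arrays to keep the example dependency-free. It assumes whitespace tokenization and raw term counts.

```python
import math


def build_vector_index(documents):
    """Map every distinct word in the corpus to a vector position."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    return {word: position for position, word in enumerate(vocabulary)}


def make_vector(text, vector_index):
    """Build a term-count vector; terms missing from vector_index are ignored."""
    vector = [0] * len(vector_index)
    for word in text.split():
        position = vector_index.get(word)
        if position is not None:
            vector[position] += 1
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, documents):
    """Return (index, score) of the document closest to the query."""
    vector_index = build_vector_index(documents)
    query_vector = make_vector(query, vector_index)
    scores = [cosine(query_vector, make_vector(doc, vector_index))
              for doc in documents]
    best = max(range(len(documents)), key=scores.__getitem__)
    return best, scores[best]


documents = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
best_index, best_score = most_similar("cat on a mat", documents)
print(best_index, best_score)
```

The query word "a" is not in the index, so it is skipped rather than raising a KeyError or growing the vector.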

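Note that when working with documents you should really use sparse matrices rather than NumPy arrays, otherwise you will quickly run out of memory.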
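To illustrate the memory point, here is a rough sketch that stores term counts in a `scipy.sparse` CSR matrix, so only the non-zero entries are kept. The helpers `build_sparse_matrix` and `cosine_scores` are hypothetical names, and the whitespace tokenization is an assumption made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_sparse_matrix(documents, vector_index):
    """Build a documents x vocabulary term-count matrix in CSR format."""
    rows, cols = [], []
    for row, doc in enumerate(documents):
        for word in doc.split():
            col = vector_index.get(word)
            if col is not None:  # skip terms outside the vocabulary
                rows.append(row)
                cols.append(col)
    values = np.ones(len(rows), dtype=np.float64)
    rows = np.array(rows, dtype=np.int64)
    cols = np.array(cols, dtype=np.int64)
    shape = (len(documents), len(vector_index))
    # Duplicate (row, col) pairs are summed, which yields term counts.
    return csr_matrix((values, (rows, cols)), shape=shape)


def cosine_scores(doc_matrix, query_matrix):
    """Cosine similarity of one query row against every document row."""
    dots = doc_matrix.dot(query_matrix.T).toarray().ravel()
    doc_norms = np.sqrt(
        np.asarray(doc_matrix.multiply(doc_matrix).sum(axis=1)).ravel())
    query_norm = np.sqrt(query_matrix.multiply(query_matrix).sum())
    denom = doc_norms * query_norm
    # Rows with a zero norm get a score of 0 instead of a division error.
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)


documents = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
vocabulary = sorted({word for doc in documents for word in doc.split()})
vector_index = {word: i for i, word in enumerate(vocabulary)}

doc_matrix = build_sparse_matrix(documents, vector_index)
query_matrix = build_sparse_matrix(["cat on a mat"], vector_index)
scores = cosine_scores(doc_matrix, query_matrix)
best = int(np.argmax(scores))
print(best, scores[best])
```

A dense array needs one cell per (document, vocabulary word) pair, whereas the CSR matrix grows only with the number of distinct words that actually occur in each document.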

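(Alternatively, consider using the vectorizer class in scikit-learn, which handles all of this for you, uses scipy.sparse matrices, and computes tf-idf values. Disclaimer: I wrote parts of that class.)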
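The link to the specific class did not survive, so the sketch below assumes `TfidfVectorizer` is the kind of class meant. It pairs the vectorizer with scikit-learn's `cosine_similarity` to rank documents against a query; the sample documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["the cat sat on the mat", "dogs chase cats", "the dog sat"]
query = "cat on a mat"

# fit_transform learns the vocabulary and returns a sparse tf-idf matrix.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

# transform reuses the learned vocabulary; unseen query terms are dropped.
query_matrix = vectorizer.transform([query])

# One similarity score per document, in the original document order.
scores = cosine_similarity(query_matrix, doc_matrix).ravel()
best = int(scores.argmax())
print(documents[best], scores[best])
```

With the default settings, tokenization and tf-idf weighting differ from a hand-rolled term-count vector, so the scores will not match a manual implementation exactly.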
