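I am computing similarity. First, I use the RAKE library to extract keywords from crawled job postings. Then I put each job's keywords into a separate array and combine all of those arrays into documentArray.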
documentArray = [['Anger command,Assertiveness,Approachability,Adaptability,Authenticity,Aggressiveness,Analytical thinking,Molecular Biology,Molecular Biology,Molecular Biology,molecular biology,molecular biology,Master,English,Molecular Biology,,Islamabad,Islamabad District,Islamabad Capital Territory,Pakistan,,Rawalpindi,Rawalpindi,Punjab,Pakistan'], ['competitive compensation,assay design,positive attitude,regular basis,motivate others,meetings related,improve state,travel on,phd degree,meeting abstracts,benefits package,daily basis,scientific papers,application notes']]
queryStr = "In Vitro,Biochemistry,PCR,Western Blotting,Neuroscience,Molecular Biology,Cell biology,Immunohistochemistry,Microscopy,Animal Models,Presentations,Immunoprecipitation,Cell biology,Master's Degree,Bachelor's Degree,,,,,"
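For context, gensim's corpora.Dictionary expects each document to be a list of tokens, whereas the data above stores every job as one long comma-separated string. The following is a minimal sketch of one way to turn such strings into token lists, applying the same treatment to the documents and the query. The helper name split_keywords and the shortened sample strings are hypothetical, chosen only for illustration; whether keyword phrases should stay whole or be split further into words is a separate decision.

```python
def split_keywords(keyword_string):
    """Split a comma-separated keyword string into clean, lowercase tokens."""
    tokens = []
    for part in keyword_string.split(","):
        token = part.strip().lower()
        if token:  # skip the empty entries produced by ",,"
            tokens.append(token)
    return tokens


# Shortened, hypothetical stand-ins for documentArray and queryStr
raw_documents = [
    "Assertiveness,Molecular Biology,molecular biology,,Islamabad",
    "assay design,positive attitude,phd degree",
]
raw_query = "PCR,Molecular Biology,Cell biology,,,"

# Documents and query go through the same tokenizer
texts = [split_keywords(doc) for doc in raw_documents]
query_tokens = split_keywords(raw_query)

print(texts[0])
print(query_tokens)
```

With token lists like these, a keyword such as "molecular biology" is the same token in the documents and in the query, so it can actually be matched.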
Then I wrote the following gensim code:
from gensim import corpora, models, similarities

class Gensim:
    def __init__(self):
        print("Init")

    def calculateGensimSimilarity(self, texts, query):
        # Map each token to an id and convert every document to bag-of-words
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]
        # Train two-topic LSI and LDA models on the same corpus
        lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
        lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
        # Cosine-similarity indexes over the documents in each topic space
        index_lsi = similarities.MatrixSimilarity(lsi[corpus])
        index_lda = similarities.MatrixSimilarity(lda[corpus])
        # Project the query into both topic spaces
        vec_bow = dictionary.doc2bow(query.lower().split())
        vec_lsi = lsi[vec_bow]
        vec_lda = lda[vec_bow]
        print("LSI Model")
        sims_lsi = index_lsi[vec_lsi]
        print(sims_lsi)
        print("LDA Model")
        sims_lda = index_lda[vec_lda]
        print(sims_lda)
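It prints an LSA score of 0 and an LDA score of over 90% match. Please let me know where I am going wrong and how I can change this to compute the correct cosine similarity.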
LSA Score[ 0. 0.]
LDA Score[ 0.94234258 0.9477495 ]
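One possible explanation for the LSA score of 0, offered as a hypothesis rather than a confirmed diagnosis: each document is a single long string, while the query is lowercased and split on whitespace, so the query may share no tokens with the dictionary and end up as an empty bag-of-words vector. The sketch below uses plain Python instead of gensim, with hypothetical helper names (tokenize, cosine_similarity) and shortened sample data, to show how cosine similarity on raw token counts behaves when both sides are tokenized the same way versus differently.

```python
import math
from collections import Counter


def tokenize(keyword_string):
    """Split on commas, trim whitespace, lowercase, and drop empty entries."""
    return [p.strip().lower() for p in keyword_string.split(",") if p.strip()]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw token counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


# Shortened, hypothetical stand-ins for one document and the query
doc = "Assertiveness,Molecular Biology,molecular biology,Master"
query = "PCR,Molecular Biology,Cell biology"

# Same tokenizer on both sides: "molecular biology" is a shared token
matched = cosine_similarity(tokenize(doc), tokenize(query))

# Mismatched tokenization, mirroring the question: the document is one
# long token, the query is lowercased and split on whitespace
mismatched = cosine_similarity([doc], query.lower().split())

print(matched)
print(mismatched)  # 0.0, because no token is shared
```

If this is indeed the cause, the LDA numbers would not be meaningful either, since LDA may still return a topic distribution for a query with no known words. Tokenizing documents and query consistently before building the dictionary would be the first thing to try.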