Is there a way to use a pre-trained doc2vec model to evaluate some document data?

Posted 2024-05-16 17:49:03


Recently I have been working on a project aimed at unsupervised clustering of a large text database. I started with bag-of-words, and several clustering algorithms gave me a decent result, but now I am trying to move to a doc2vec representation and it does not seem to work for me: I cannot load a ready-made model to work with, and training my own produces nothing useful.

I tried training my own model on 10k texts

model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)

(each about 20-50 words long), but the similarity scores gensim produces are much worse for my model. By worse I mean that completely identical or nearly identical texts get similarity scores comparable to texts that, as far as I can tell, have nothing in common. So I decided to try one of the pre-trained models from Is there pre-trained doc2vec model?, which should encode richer relations between words. Sorry for the long preamble, but the question is: how do I plug such a model in? Can anyone suggest how to use a gensim model loaded from https://github.com/jhlau/doc2vec to convert my own text dataset into vectors of the same length? My data is already preprocessed (stemmed, lowercased, punctuation and nltk.corpus stopwords removed), and I can supply it as a list, a DataFrame, or a file if needed. The coding question is: how do I pass my own data through a pre-trained model? Any help would be appreciated.
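For reference, the standard gensim training flow for a corpus like this looks roughly as follows; this is only a sketch, with texts standing in as a placeholder for the real preprocessed token lists:

import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# placeholder corpus: a list of token lists, one list per preprocessed document
texts = [["use", "medium", "paper", "use", "medium"], ["medium", "use", "paper", "diary"]]

# wrap each document with an integer tag so it can be looked up later
train_corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(texts)]

model = Doc2Vec(vector_size=500, min_count=2, epochs=100, workers=8)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)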

UPD: the output that bothers me:

Train Document (6134): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»

Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio television newspaper magazine film video etc wake radio alarm listen traffic report commuting get news watch sport soap opera watch tv use internet work home read book see movie use data collect journal basis analysis examining information using us gratification model discussed textbook us gratification article provided perhaps carrying small notebook day inputting material evening help stay organized smartphone use note app track medium need turn diary trust tell tell immediately paper whether actually kept one begin medium diary soon possible order give ample time complete journal write paper completed diary need write page paper use medium functional analysis theory say something best understood understanding used us gratification model provides framework individual use medium basis analysis especially category discussed posted dominick article apply concept medium usage expected le medium use cognitive social utility affiliation withdrawal must draw conclusion use analyzing habit within framework idea discussed text article concept must clearly included articulated paper common mistake student make assignment tell medium habit fail analyze habit within context us gratification model must include idea paper»

This looks fine, but look at another output:

Train Document (1185): «photography garry winogrand would like paper life work garry winogrand famous street photographer also influenced street photography aim towards thoughtful imaginative treatment detail referencescite research material academic essay university level»

Similar Document (3449, 0.6901006698608398): «tang dynasty write page essay tang dynasty essay discus buddhism tang dynasty name artifact tang dynasty discus them history put heading paragraph information tang dynasty discussed essay»

It turns out that the similarity score between the two practically identical texts above (the most similar pair the model can find) is almost the same as the score between two very different texts, which makes doing anything useful with the data problematic. To get the most similar documents I used gensim's similarity lookup.
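The exact snippet is not preserved above; a typical ranking call, continuing the training sketch from earlier and based on the gensim Doc2Vec tutorial, assuming gensim 4.x (where document vectors live under model.dv; older releases use model.docvecs), would look like:

# re-infer a vector for one training document and rank other documents against it
doc_id = 6134  # example: the query document shown above
inferred_vector = model.infer_vector(train_corpus[doc_id].words)
sims = model.dv.most_similar([inferred_vector], topn=5)

for tag, score in sims:
    # tags are the integer ids assigned when building train_corpus
    print(tag, score, " ".join(train_corpus[tag].words)[:100])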

Tags: model, text, use, article, analysis, paper, diary
1 Answer
Community member
#1 · Posted 2024-05-16 17:49:03

The models at https://github.com/jhlau/doc2vec are based on a custom fork of an older gensim version, so you would have to find and use that fork to make them loadable at all.

A model built from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, their effective meanings may differ. Also, to use another model to infer vectors on your data, you should make sure you preprocess/tokenize your texts the same way the model's training data was preprocessed/tokenized.
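To illustrate that point, inferring vectors with any loaded Doc2Vec model looks roughly like the sketch below; the file path is a placeholder, the preprocess function only stands in for whatever pipeline the model's training corpus actually used, and the jhlau models would additionally require their forked gensim just to load:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("enwiki_dbow/doc2vec.bin")  # placeholder path to a pre-trained model

def preprocess(text):
    # must mirror the tokenization/normalization used on the model's training data
    return text.lower().split()

vector = model.infer_vector(preprocess("some new document text"))
print(vector.shape)  # same dimensionality as the pre-trained model's document vectors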

So it is usually best to use a model you have trained yourself, and therefore fully understand, on domain-relevant data.

10k documents of 20-50 words each is a bit small compared with published Doc2Vec work, but it might work. Trying to get 500-dimensional vectors from a dataset that small is likely to be a problem. (With less data, fewer vector dimensions and more training epochs may be necessary.)
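As an illustration only (these are not prescriptive values), a configuration scaled down for a corpus of roughly this size might look more like:

from gensim.models.doc2vec import Doc2Vec
model = Doc2Vec(vector_size=100, min_count=2, epochs=100, workers=8)  # smaller vectors; keep many epochs for a small corpus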

If the results from your self-trained model are unsatisfactory, there may be other problems in your training and inference code (which is not shown in your question). It would also help to see more concrete examples/details of how your results fall short compared with a baseline such as the bag-of-words representation you mention. If you add those details to the question, other suggestions may become possible.
