Document similarity: efficiently comparing two documents

Asked 2025-04-15 20:23

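I have a loop that computes the similarity between two documents. It collects all the tokens in a document together with their scores and puts them into a dictionary. Then it compares the two dictionaries.

This is my current code. It works, but it is very slow: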

# Doc A
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0]))
doca = cursor1.fetchall()
#convert tuple to a dictionary
doca_dic = dict((row[0], row[1]) for row in doca)

#Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0]))
docb = cursor2.fetchall()
#convert tuple to a dictionary
docb_dic = dict((row[0], row[1]) for row in docb)

# loop through each token in doca and see if one matches in docb
for x in doca_dic:
    if docb_dic.has_key(x):
        #calculate the similarity by summing the products of the tf-idf_norm 
        similarity += doca_dic[x] * docb_dic[x]
print "similarity"
print similarity

I'm new to Python, so the code is a bit messy. I need to speed it up, and any help would be much appreciated.

Tags: data structures, algorithms, optimization, text comparison, lexical analysis, document similarity

3 Answers

A single SQL query is all it takes:

SELECT sum(index1.tfidf_norm*index2.tfidf_norm) FROM index index1, index index2 WHERE index1.token=index2.token AND index1.doc_id=? AND index2.doc_id=?

Just replace each '?' with the ID of one of the two documents.

Answered 2025-04-15 by Python大师
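
As a rough sketch of how that query might be run from Python, the snippet below uses the standard-library sqlite3 module with an in-memory database. The table name doc_index, the helper similarity() and the sample rows are assumptions made for illustration (the original table is called index, which is a reserved word in SQLite). Other database drivers may use a different placeholder style, such as %s.

```python
import sqlite3

def similarity(conn, doc_a, doc_b):
    """Sum the products of tfidf_norm for tokens shared by both documents."""
    row = conn.execute(
        "SELECT sum(i1.tfidf_norm * i2.tfidf_norm) "
        "FROM doc_index i1, doc_index i2 "
        "WHERE i1.token = i2.token AND i1.doc_id = ? AND i2.doc_id = ?",
        (doc_a, doc_b),
    ).fetchone()
    # sum() yields NULL (None in Python) when the documents share no tokens
    return row[0] if row[0] is not None else 0.0

# Hypothetical sample data, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_index (doc_id TEXT, token TEXT, tfidf_norm REAL)")
conn.executemany(
    "INSERT INTO doc_index VALUES (?, ?, ?)",
    [("a", "apple", 0.5), ("a", "word", 0.5),
     ("b", "apple", 0.25), ("b", "zulu", 0.75)],
)
print(similarity(conn, "a", "b"))
```

Passing the IDs as bound parameters, rather than pasting them into the SQL string, leaves quoting to the driver.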

Could we push some of the work onto the database?

With a join, you can get a basic result like this:

    Token     A.tfidf_norm   B.tfidf_norm
    -------------------------------------
    Apple        12.2           11.00
    ...
    Word         29.87          33.21
    Zealot        0.00          11.56
    Zulu         78.56           0.00

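Then you just scan through the result and do your calculation.

If you don't need to know that a token appears in one document but is missing from the other, you don't need an outer join, and the resulting list is simply the intersection of the two sets. The example I gave above automatically assigns "0" to tokens that are missing. It depends on what your "match" function needs.

Answered 2025-04-15 by Python大师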
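
To make the difference concrete, here is a small Python sketch (an illustration, not the answerer's code) that mimics both join styles on two in-memory dictionaries. It assumes Python 3; the function names inner_rows and outer_rows are hypothetical, and the sample scores are taken from the table above.

```python
def inner_rows(doc_a, doc_b):
    """Rows for tokens present in both documents (inner join / intersection)."""
    return [(t, doc_a[t], doc_b[t]) for t in sorted(doc_a.keys() & doc_b.keys())]

def outer_rows(doc_a, doc_b):
    """Rows for tokens present in either document; a missing score becomes 0.0."""
    return [(t, doc_a.get(t, 0.0), doc_b.get(t, 0.0))
            for t in sorted(doc_a.keys() | doc_b.keys())]

doc_a = {"apple": 12.2, "word": 29.87, "zulu": 78.56}
doc_b = {"apple": 11.0, "word": 33.21, "zealot": 11.56}

for row in outer_rows(doc_a, doc_b):
    print(row)
```

For a sum of products, any row containing a 0 contributes nothing, so the inner-join form is enough for that particular calculation.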

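A small Python point: adict.has_key(k) is obsolete in Python 2.x and gone entirely in Python 3.x. Use k in adict instead; it has been available since Python 2.2 and is faster because it avoids a method call.

A general practical tip: iterate over the shorter dictionary.

Putting the two together: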

if len(doca_dic) < len(docb_dic):
    short_dict, long_dict = doca_dic, docb_dic
else:
    short_dict, long_dict = docb_dic, doca_dic
similarity = 0
for x in short_dict:
    if x in long_dict:
        # calculate the similarity by summing the products of the tf-idf_norm
        similarity += short_dict[x] * long_dict[x]

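If you don't need the two dictionaries for anything else, you can build only the A dictionary and iterate directly over the (key, value) tuples returned by the B query. After docb = cursor2.fetchall(), replace all the code that follows with this: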

similarity = 0
for b_token, b_value in docb:
    if b_token in doca_dic:
        similarity += doca_dic[b_token] * b_value

An alternative to the code above: this does more work overall, but more of the iteration happens in C rather than in Python, so it may be faster.

similarity = sum(
    doca_dic[k] * docb_dic[k]
    for k in set(doca_dic) & set(docb_dic)
    )
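
As a quick sanity check, the sketch below (Python 3 syntax, with made-up sample scores) runs the loop-based and set-based variants on the same data and confirms that they agree up to floating-point rounding. The function names are hypothetical and only wrap the snippets shown above.

```python
import math

def similarity_loop(doca_dic, docb_dic):
    """Iterate over the shorter dictionary and probe the longer one."""
    if len(doca_dic) < len(docb_dic):
        short_dict, long_dict = doca_dic, docb_dic
    else:
        short_dict, long_dict = docb_dic, doca_dic
    total = 0.0
    for token, score in short_dict.items():
        if token in long_dict:
            total += score * long_dict[token]
    return total

def similarity_sets(doca_dic, docb_dic):
    """Intersect the key sets first, then sum the products."""
    return sum(doca_dic[k] * docb_dic[k] for k in set(doca_dic) & set(docb_dic))

# Made-up scores, for illustration only
doca_dic = {"apple": 0.5, "word": 0.25, "zulu": 0.75}
docb_dic = {"apple": 0.5, "word": 0.5, "zealot": 0.125}

assert math.isclose(similarity_loop(doca_dic, docb_dic),
                    similarity_sets(doca_dic, docb_dic))
```

Which variant is actually faster depends on the data, so it is worth timing both on real documents.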

Final version of the Python code:

# Doc A
# the query parameters should be a sequence, hence the one-element tuple (note the trailing comma)
cursor1.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[i][0],))
doca = cursor1.fetchall()
# Doc B
cursor2.execute("SELECT token, tfidf_norm FROM index WHERE doc_id = %s", (docid[j][0],))
docb = cursor2.fetchall()
if len(doca) < len(docb):
    short_doc, long_doc = doca, docb
else:
    short_doc, long_doc = docb, doca
long_dict = dict(long_doc) # yes, it should be that simple
similarity = 0
for key, value in short_doc:
    if key in long_dict:
        similarity += long_dict[key] * value

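Another practical tip: you haven't said which part is slow. Is it the dictionary processing or the SELECT queries? Put some time.time() calls in your script to measure.

Consider pushing ALL the work onto the database. The example below uses a hard-coded SQLite query, but the principle is the same.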
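
A minimal sketch of that kind of measurement, assuming Python 3: it uses time.perf_counter(), which is better suited to measuring intervals than time.time(). The timed() helper and the two stand-in workloads are hypothetical. In the real script, the two calls would wrap the SELECT queries and the dictionary loop.

```python
import time

def timed(label, func, *args):
    """Run func(*args), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s: %.6f s" % (label, elapsed))
    return result, elapsed

# Stand-ins for the two phases of the original script (assumptions, not real queries)
def fake_query():
    return [("token%d" % n, 0.001 * n) for n in range(1000)]

def dot(rows_a, rows_b):
    lookup = dict(rows_a)
    return sum(lookup[t] * s for t, s in rows_b if t in lookup)

rows, query_seconds = timed("query", fake_query)
score, dict_seconds = timed("dictionary work", dot, rows, rows)
```

Comparing the two printed times shows which phase is worth optimizing first.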

C:\junk\so>sqlite3
SQLite version 3.6.14
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table atable(docid text, token text, score float,
    primary key (docid, token));
sqlite> insert into atable values('a', 'apple', 12.2);
sqlite> insert into atable values('a', 'word', 29.67);
sqlite> insert into atable values('a', 'zulu', 78.56);
sqlite> insert into atable values('b', 'apple', 11.0);
sqlite> insert into atable values('b', 'word', 33.21);
sqlite> insert into atable values('b', 'zealot', 11.56);
sqlite> select sum(A.score * B.score) from atable A, atable B
    where A.token = B.token and A.docid = 'a' and B.docid = 'b';
1119.5407
sqlite>

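It is also worth checking whether the database table has appropriate indexes (for example, one on token by itself). The lack of a usable index makes SQL queries run very slowly.

To explain: an index on token may make your existing queries, the "do all the work in the database" query, or both run faster, depending on how the query optimizer in your database software behaves. Without a usable index, the database reads every row in your table, which is not good.

To create the index: create index atable_token_idx on atable(token);

To drop the index: drop index atable_token_idx;

(but do check the documentation for your own database)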
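
One way to see what the optimizer does with an index is to ask it. The sketch below is SQLite-specific and uses Python's sqlite3 module with the atable schema from the session above. The index_names() helper is hypothetical, and EXPLAIN QUERY PLAN is a SQLite command whose output format varies between versions; other databases have their own equivalents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table atable(docid text, token text, score float, "
             "primary key (docid, token))")
conn.execute("create index atable_token_idx on atable(token)")

def index_names(conn):
    """Names of all indexes SQLite currently holds for atable."""
    rows = conn.execute(
        "select name from sqlite_master "
        "where type = 'index' and tbl_name = 'atable'")
    return {name for (name,) in rows}

print(index_names(conn))

# Ask SQLite how it would execute the join; each row describes one step
plan = conn.execute(
    "explain query plan "
    "select sum(A.score * B.score) from atable A, atable B "
    "where A.token = B.token and A.docid = 'a' and B.docid = 'b'").fetchall()
for step in plan:
    print(step)
```

Because the table already has a primary key on (docid, token), the optimizer may choose that index rather than the one on token alone, so the plan is worth reading before and after any change.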

Answered 2025-04-15 by Python大师
