How to compute the similarity between two text documents?

Question

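I am looking at working on an NLP project, in any programming language (though Python would be my preference).

I want to take two documents and determine how similar they are.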

Answer 1

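This is an old question, but I found it can be solved easily with spaCy. Once the documents are loaded, the similarity method returns the cosine similarity between the document vectors.

First, install the package and download the model: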

pip install spacy
python -m spacy download en_core_web_sm

Then use it like so:

import spacy

nlp = spacy.load('en_core_web_sm')
doc1 = nlp('Hello hi there!')
doc2 = nlp('Hello hi there!')
doc3 = nlp('Hey whatsup?')

print(doc1.similarity(doc2))  # 0.999999954642
print(doc2.similarity(doc3))  # 0.699032527716
print(doc1.similarity(doc3))  # 0.699032527716
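For readers who want to see what a cosine similarity score is, here is a minimal sketch on plain Python lists. It is not spaCy's implementation; the function name cosine_similarity and the choice to return 0.0 for a zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_u * norm_v)

# Identical vectors score (approximately) 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The score depends only on the direction of the vectors, not their length, which is why documents of different sizes can still be compared.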

Answer 2

Same as @larsman's answer, but with some preprocessing added.

import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt') # if necessary...


stemmer = nltk.stem.porter.PorterStemmer()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)

def stem_tokens(tokens):
    return [stemmer.stem(item) for item in tokens]

def normalize(text):
    '''remove punctuation, lowercase, stem'''
    return stem_tokens(nltk.word_tokenize(text.lower().translate(remove_punctuation_map)))

vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')

def cosine_sim(text1, text2):
    tfidf = vectorizer.fit_transform([text1, text2])
    # .toarray() instead of the .A shorthand, which newer SciPy versions no longer provide
    return (tfidf * tfidf.T).toarray()[0, 1]


print(cosine_sim('a little bird', 'a little bird'))
print(cosine_sim('a little bird', 'a little bird chirps'))
print(cosine_sim('a little bird', 'a big dog barks'))
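To see the punctuation and lowercasing steps in isolation, here is a small sketch that needs only the standard library. It leaves out stemming and uses str.split() in place of nltk.word_tokenize, so it is a simplification of normalize above; the name simple_normalize is made up for this example.

```python
import string

# Map every ASCII punctuation code point to None so str.translate() deletes it.
remove_punctuation_map = {ord(ch): None for ch in string.punctuation}

def simple_normalize(text):
    """Lowercase, strip punctuation, split on whitespace (no stemming)."""
    return text.lower().translate(remove_punctuation_map).split()

print(simple_normalize("Hello, world! It's a little bird."))
# ['hello', 'world', 'its', 'a', 'little', 'bird']
```

Note that deleting the apostrophe turns "It's" into "its"; whether that is acceptable depends on the application.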

Answer 3

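The common way of doing this is to transform the documents into TF-IDF vectors and then compute the cosine similarity between them. Any textbook on information retrieval (IR) covers this. See especially Introduction to Information Retrieval, which is free and available online.

Computing Pairwise Similarities

TF-IDF (and similar text transformations) are implemented in the Python packages Gensim and scikit-learn. In the latter package, computing cosine similarities is straightforward: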
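To make the recipe concrete, here is a minimal from-scratch sketch using only the standard library. The helper names tfidf_vectors and cosine are hypothetical, tokenization is a plain lowercase split, and the smoothed IDF formula is one common weighting choice among several, so treat this as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; other weighting schemes exist).
        weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = ["a little bird", "a little bird chirps", "a big dog barks"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # high: three shared terms
print(cosine(vecs[0], vecs[2]))  # low: only "a" is shared
```

Because every vector is scaled to unit length, the cosine similarity reduces to a plain dot product, which is why a single matrix product is enough to get all pairwise similarities.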

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f).read() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

Or, if the documents are plain strings:

>>> corpus = ["I'd like an apple",
...           "An apple a day keeps the doctor away",
...           "Never compare an apple to an orange",
...           "I prefer scikit-learn to Orange",
...           "The scikit-learn docs are Orange and Blue"]
>>> vect = TfidfVectorizer(min_df=1, stop_words="english")
>>> tfidf = vect.fit_transform(corpus)
>>> pairwise_similarity = tfidf * tfidf.T

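Gensim may have more options for this kind of task, though.

See also this question.

[Disclaimer: I was involved in the scikit-learn TF-IDF implementation.]

Interpreting the Results

From above, pairwise_similarity is a SciPy sparse matrix that is square, with the number of rows and columns equal to the number of documents in the corpus.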

>>> pairwise_similarity
<5x5 sparse matrix of type '<class 'numpy.float64'>'
    with 17 stored elements in Compressed Sparse Row format>

You can convert the sparse matrix to a NumPy array with .toarray() (the .A shorthand also works in older SciPy versions):

>>> pairwise_similarity.toarray()
array([[1.        , 0.17668795, 0.27056873, 0.        , 0.        ],
       [0.17668795, 1.        , 0.15439436, 0.        , 0.        ],
       [0.27056873, 0.15439436, 1.        , 0.19635649, 0.16815247],
       [0.        , 0.        , 0.19635649, 1.        , 0.54499756],
       [0.        , 0.        , 0.16815247, 0.54499756, 1.        ]])

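Say we want to find the document most similar to the final document, "The scikit-learn docs are Orange and Blue". This document has index 4 in corpus. You can find the most similar document by taking the argmax of that row, but first you need to mask the 1's, which represent the similarity of each document to itself. You can do the masking with np.fill_diagonal() and take the argmax with np.nanargmax():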

>>> import numpy as np

>>> arr = pairwise_similarity.toarray()
>>> np.fill_diagonal(arr, np.nan)

>>> input_doc = "The scikit-learn docs are Orange and Blue"
>>> input_idx = corpus.index(input_doc)
>>> input_idx
4

>>> result_idx = np.nanargmax(arr[input_idx])
>>> corpus[result_idx]
'I prefer scikit-learn to Orange'
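The same idea extends from the single best match to a ranked list of matches. The sketch below uses plain Python lists with the similarity values copied from the array printed above, so it runs without NumPy; rank_similar is a hypothetical helper, not part of scikit-learn or NumPy.

```python
# Similarity values copied from the array printed above.
similarity = [
    [1.0, 0.17668795, 0.27056873, 0.0, 0.0],
    [0.17668795, 1.0, 0.15439436, 0.0, 0.0],
    [0.27056873, 0.15439436, 1.0, 0.19635649, 0.16815247],
    [0.0, 0.0, 0.19635649, 1.0, 0.54499756],
    [0.0, 0.0, 0.16815247, 0.54499756, 1.0],
]

def rank_similar(sim, idx, top_k=3):
    """Return up to top_k indices of other documents, most similar first."""
    others = [j for j in range(len(sim)) if j != idx]  # skip the document itself
    # list.sort() is stable, so ties keep their original index order.
    others.sort(key=lambda j: sim[idx][j], reverse=True)
    return others[:top_k]

print(rank_similar(similarity, 4))  # [3, 2, 0]
```

Skipping the document's own index plays the same role as masking the diagonal with NaN in the NumPy version.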

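Note: the purpose of using a sparse matrix is to save a substantial amount of space when the corpus and vocabulary are large. Instead of converting to a NumPy array, you could do: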

>>> n, _ = pairwise_similarity.shape
>>> pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
>>> pairwise_similarity[input_idx].argmax()
3
