<p>Besides spaCy, I would also suggest the <a href="https://en.wikipedia.org/wiki/Jaccard_index" rel="nofollow noreferrer">Jaccard similarity index</a> if all you are looking for is lexical overlap/similarity.</p>
<p>You will need to <a href="https://www.nltk.org/install.html" rel="nofollow noreferrer">install NLTK</a>.</p>
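<p>With a working Python environment, the usual way is via pip (see the install page linked above for platform-specific details):</p>
<pre><code>pip install nltk
</code></pre>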
<pre><code>from nltk.util import ngrams

def jaccard_similarity(str1, str2, n):
    # Character-level n-grams of each string (ngrams on a string
    # yields tuples of characters)
    str1_ngrams = set(ngrams(str1, n))
    str2_ngrams = set(ngrams(str2, n))
    intersection = len(str1_ngrams.intersection(str2_ngrams))
    union = len(str1_ngrams) + len(str2_ngrams) - intersection
    return intersection / union
</code></pre>
<p>In the function above, you can choose <code>n</code> (the &quot;n&quot; in n-gram) to be whatever you like. I usually use <code>n=2</code> for bigram Jaccard similarity, but it is up to you.</p>
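<p>To build some intuition for <code>n=2</code>, here is a quick dependency-free check on two near-identical words. The <code>char_bigrams</code> helper is just for this demo; for strings, <code>zip(s, s[1:])</code> produces the same character-pair tuples as <code>nltk.util.ngrams(s, 2)</code>:</p>
<pre><code># Character bigrams via zip; for strings this matches nltk.util.ngrams(s, 2)
def char_bigrams(s):
    return set(zip(s, s[1:]))

a = char_bigrams("god")   # {('g','o'), ('o','d')}
b = char_bigrams("good")  # {('g','o'), ('o','o'), ('o','d')}

intersection = len(a & b)
union = len(a) + len(b) - intersection
score = intersection / union
print(score)  # 2 shared bigrams out of 3 distinct -> 0.666...
</code></pre>
<p>So "god" and "good" come out highly similar at the bigram level, which is exactly the kind of surface-form overlap this index measures.</p>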
<p>Now, to apply this to your example, I would personally compute the bigram Jaccard similarity for every pair of words across the two lists and average those values (assuming you have the <code>jaccard_similarity</code> function defined above):</p>
<pre><code>>>> from itertools import product
>>> book1_topics = ["god", "bible", "book", "holy", "religion", "Christian"]
>>> book2_topics = ["god", "Christ", "idol", "Jesus"]
>>> pairs = list(product(book1_topics, book2_topics))
>>> similarities = [jaccard_similarity(str1, str2, 2) for str1, str2 in pairs]
>>> avg_similarity = sum(similarities) / len(similarities)
</code></pre>