Memory problem when using spacy_universal_sentence_encoder for similarity detection

Question

I am using spacy_universal_sentence_encoder (available here: https://spacy.io/universe/project/spacy-universal-sentence-encoder) to build a plagiarism-detection application.

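In my tests, this model has proven practical for plagiarism detection and more accurate than the other libraries I tried previously (see here and here for details).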

I tested it with a few "simple" sentences, for example:

import spacy_universal_sentence_encoder

# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
doc1 = nlp("Toto va à l'école avec son nouveau sac.")
doc2 = nlp("Toto came to school today with a new bag.")

# Similarity of two documents
print("Similarity of two texts : ", doc1, "<->", doc2, doc1.similarity(doc2))

I got the following output:

2024-03-28 18:28:21.505655: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-28 18:28:38.693139: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Similarity of two texts :  Toto va à l'école avec son nouveau sac. <-> Toto came to school today with a new bag. 0.8622567857770158

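The problem is that in my plagiarism-detection application I need to check a submission (possibly an entire document) for similarity against a set of documents. The content being compared is therefore far more than one or two sentences.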

def check_similarity(document_to_check_against: str, document_to_check: str) -> float:
    """Check similarity between two given texts using spacy."""
    nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
    doc1 = nlp(document_to_check_against)
    doc2 = nlp(document_to_check)
    similarity = doc1.similarity(doc2)
    return similarity

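Calling this function on full documents produces a warning: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 9059656832 exceeds 10% of free system memory, and the server then crashes. From what I have read, this is related to a batch size that I could reduce. Where should I reduce the batch size? Does "batch size" here refer to the amount of text in the document being processed? How can I solve this problem?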
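One direction I am considering, sketched here under assumptions rather than as a confirmed fix: if the allocation grows with the length of the text passed to the model in a single call, then splitting each document into smaller chunks, encoding the chunks one at a time, and comparing the averaged vectors should keep each call small. The names split_into_chunks, mean_vector, check_similarity_chunked, encode and max_chars are hypothetical helpers of mine, not part of spacy_universal_sentence_encoder, and an average of chunk vectors will not necessarily give the same score as doc1.similarity(doc2) on the whole texts.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def split_into_chunks(text: str, max_chars: int = 1000) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    A single word longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for word in text.split():
        added = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and size + added > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
            added = len(word)
        current.append(word)
        size += added
    if current:
        chunks.append(" ".join(current))
    return chunks


def mean_vector(text: str, encode: Callable[[str], Vector],
                max_chars: int = 1000) -> List[float]:
    """Encode one chunk at a time and return the element-wise mean vector."""
    total: Optional[List[float]] = None
    count = 0
    for chunk in split_into_chunks(text, max_chars):
        vec = encode(chunk)  # only one small chunk is passed to the model per call
        if total is None:
            total = [0.0] * len(vec)
        for i, value in enumerate(vec):
            total[i] += float(value)
        count += 1
    if total is None:
        raise ValueError("text contains no words to encode")
    return [value / count for value in total]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def check_similarity_chunked(document_to_check_against: str, document_to_check: str,
                             encode: Callable[[str], Vector],
                             max_chars: int = 1000) -> float:
    """Compare two long texts through the mean of their chunk embeddings."""
    return cosine(mean_vector(document_to_check_against, encode, max_chars),
                  mean_vector(document_to_check, encode, max_chars))
```

With the real model, I would load nlp once at module level instead of inside the function, and pass something like encode = lambda chunk: nlp(chunk).vector, assuming .vector returns the sentence embedding for this package. The value of max_chars is a guess that would need tuning against the memory available on the server.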

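Tags: memory management, text processing, machine learning, server crash, similarity detection, batch size, plagiarism detection, sentence embeddings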
