Multilingual BERT sentence vectors capture the language used more than the meaning?

import torch
from scipy import spatial
from transformers import AutoModel, AutoTokenizer

bert_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(bert_name)
MBERT = AutoModel.from_pretrained(bert_name)

# Some silly sentences
eng1 = 'A cat jumped from the trees and startled the tourists'
e = tokenizer.encode(eng1, add_special_tokens=True)
ans_eng1 = MBERT(torch.tensor([e]))

eng2 = 'A small snake whispered secrets to large cats'
e = tokenizer.encode(eng2, add_special_tokens=True)
ans_eng2 = MBERT(torch.tensor([e]))

eng3 = 'A tiger sprinted from the bushes and frightened the guests'
e = tokenizer.encode(eng3, add_special_tokens=True)
ans_eng3 = MBERT(torch.tensor([e]))

# Translated to Hebrew with Google Translate
heb1 = 'חתול קפץ מהעץ והבהיל את התיירים'
e = tokenizer.encode(heb1, add_special_tokens=True)
ans_heb1 = MBERT(torch.tensor([e]))

heb2 = 'נחש קטן לחש סודות לחתולים גדולים'
e = tokenizer.encode(heb2, add_special_tokens=True)
ans_heb2 = MBERT(torch.tensor([e]))

heb3 = 'נמר רץ מהשיחים והפחיד את האורחים'
e = tokenizer.encode(heb3, add_special_tokens=True)
ans_heb3 = MBERT(torch.tensor([e]))

# Compare sentence embeddings.
# Index [1] is the pooled output of shape (1, hidden_size); the extra [0]
# takes the single row so that cosine() receives 1-D vectors.
result = spatial.distance.cosine(ans_eng1[1][0].detach().numpy(), ans_heb1[1][0].detach().numpy())
print('Eng1-Heb1 - Translated sentences', result)
result = spatial.distance.cosine(ans_eng2[1][0].detach().numpy(), ans_heb2[1][0].detach().numpy())
print('Eng2-Heb2 - Translated sentences', result)
result = spatial.distance.cosine(ans_eng3[1][0].detach().numpy(), ans_heb3[1][0].detach().numpy())
print('Eng3-Heb3 - Translated sentences', result)

print("\n---\n")
result = spatial.distance.cosine(ans_heb1[1][0].detach().numpy(), ans_heb2[1][0].detach().numpy())
print('Heb1-Heb2 - Different sentences', result)
# Fixed: the posted code compared ans_eng1 with ans_eng2 on this line.
result = spatial.distance.cosine(ans_heb1[1][0].detach().numpy(), ans_heb3[1][0].detach().numpy())
print('Heb1-Heb3 - Similar sentences', result)

print("\n---\n")
result = spatial.distance.cosine(ans_eng1[1][0].detach().numpy(), ans_eng2[1][0].detach().numpy())
print('Eng1-Eng2 - Different sentences', result)
result = spatial.distance.cosine(ans_eng1[1][0].detach().numpy(), ans_eng3[1][0].detach().numpy())
print('Eng1-Eng3 - Similar sentences', result)

# Output as originally posted. The Heb1-Heb3 value was produced by the
# Eng1/Eng2 comparison (it is identical to the Eng1-Eng2 value), so rerunning
# the corrected code will print a different number on that line.
"""
Eng1-Heb1 - Translated sentences 0.2074061632156372
Eng2-Heb2 - Translated sentences 0.15557605028152466
Eng3-Heb3 - Translated sentences 0.275478720664978

---

Heb1-Heb2 - Different sentences 0.044616520404815674
Heb1-Heb3 - Similar sentences 0.027982771396636963

---

Eng1-Eng2 - Different sentences 0.027982771396636963
Eng1-Eng3 - Similar sentences 0.024596810340881348
"""

2 answers

User

#1 · Edited 2024-05-29 05:52:03

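The [CLS] token represents the input sequence in some way, but exactly how is hard to say. Language is certainly an important feature of a sentence, possibly more so than meaning. BERT is a pretrained model that tries to model features such as meaning, structure, and also language. If you want a model that helps you decide whether two sentences in different languages mean the same thing, I can think of two approaches: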

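Approach 1: Train a classifier (SVM, logistic regression, or even a neural network such as a CNN) on this task. Inputs: two [CLS] tokens. Output: same meaning or not the same meaning. As training data, use [CLS]-token pairs of sentences in different languages that either do or do not have the same meaning. You will need a lot of such sentence pairs to get meaningful results. Fortunately, you can generate them with Google Translate, or take a parallel text such as the Bible, which exists in many languages, and extract sentence pairs from it.
Approach 2: Fine-tune the BERT model on this task. As with the previous approach, you need a lot of training data. A sample input to the BERT model would look like this: A cat jumped from the trees and startled the tourists [SEP] חתול קפץ מהעץ והבהיל את התיירים
To classify whether the two sentences have the same meaning, add a classification layer on top of the [CLS] token and fine-tune the whole model on this task.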
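A minimal sketch of approach 1, assuming you already have one fixed-size vector per sentence. Random NumPy vectors stand in for real [CLS] outputs so the example runs without downloading a model. `pair_features`, `make_pairs`, and `train_logreg` are illustrative helpers, not library functions, and combining the two vectors as |u - v| and u * v is just one common choice, not something prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32  # stand-in size; real mBERT vectors have 768 dimensions


def pair_features(u, v):
    """Turn two batches of sentence vectors into one feature matrix per pair."""
    return np.concatenate([np.abs(u - v), u * v], axis=1)


def make_pairs(n):
    """Synthetic data: every other pair 'means the same' (v is a noisy copy of u)."""
    u = rng.normal(size=(n, DIM))
    same = np.arange(n) % 2 == 0
    v = np.where(same[:, None],
                 u + 0.1 * rng.normal(size=(n, DIM)),
                 rng.normal(size=(n, DIM)))
    return pair_features(u, v), same.astype(float)


def train_logreg(X, y, lr=0.1, steps=300):
    """Plain gradient-descent logistic regression (any classifier would do)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b


X_train, y_train = make_pairs(400)
X_test, y_test = make_pairs(200)
w, b = train_logreg(X_train, y_train)
pred = (X_test @ w + b) > 0
accuracy = np.mean(pred == (y_test == 1))
print(f"held-out accuracy on synthetic pairs: {accuracy:.2f}")
```

With real data, `u` and `v` would be the sentence vectors of the two languages, and the synthetic numbers here say nothing about how well this works on actual mBERT outputs.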

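Note: I have never worked with a multilingual BERT model; these are simply the approaches I would try for the task above. If you try them, I would be interested to know how they perform 😊.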

User

#2 · Edited 2024-05-29 05:52:03

What multilingual BERT does and why it works is not yet fully understood. Two recent papers (the first{a1}, the second{a2}) explore this to some extent.

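The papers suggest that the vectors tend to cluster by language (and even by language family), so classifying the language is very easy. The clustering is shown in a figure in the paper.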

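Because of this, you can subtract the language's mean vector from a representation and end up with a vector that is, to some extent, cross-lingual. Both papers show that such vectors can be used for cross-lingual sentence retrieval.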
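A rough sketch of the mean-subtraction idea, using synthetic vectors as an assumption: each "sentence vector" is a shared meaning component plus a large per-language offset. The helpers `center`, `cosine_matrix`, and `retrieval_accuracy` are hypothetical names, and here the language mean is estimated from the same 20 sentences, whereas in practice it would more likely come from a larger sample of each language.

```python
import numpy as np

rng = np.random.default_rng(0)
N, DIM = 20, 64  # 20 parallel sentences; stand-in dimensionality

# Synthetic stand-ins: shared "meaning" plus a large per-language offset.
meaning = rng.normal(size=(N, DIM))
eng = meaning + 3.0 * rng.normal(size=DIM) + 0.05 * rng.normal(size=(N, DIM))
heb = meaning + 3.0 * rng.normal(size=DIM) + 0.05 * rng.normal(size=(N, DIM))


def center(vectors):
    """Subtract the language's mean vector (estimated from these sentences)."""
    return vectors - vectors.mean(axis=0, keepdims=True)


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def retrieval_accuracy(src, tgt):
    """Fraction of source sentences whose nearest target is their translation."""
    nearest = cosine_matrix(src, tgt).argmax(axis=1)
    return float(np.mean(nearest == np.arange(len(src))))


acc_raw = retrieval_accuracy(eng, heb)
acc_centered = retrieval_accuracy(center(eng), center(heb))
print("retrieval accuracy, raw vectors:     ", acc_raw)
print("retrieval accuracy, centered vectors:", acc_centered)
```

Row i of `eng` and row i of `heb` are treated as translations of each other, so retrieval counts as correct when the nearest neighbour of row i is again row i.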

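Also, it seems that about a thousand parallel sentences (for example, in two languages) are enough to learn a projection between the two languages. Note that they did not use the [CLS] vector; instead, they pooled the vectors of the individual subwords.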
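The two ideas in this paragraph can be sketched as follows, under stated assumptions: pooling is taken to be a masked average over subword vectors, and the projection is fitted by ordinary least squares, which may differ from what the papers actually do. All data is synthetic, and `mean_pool` is an illustrative helper rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_pool(token_vectors, attention_mask):
    """Average subword vectors per sentence, ignoring padded positions.

    token_vectors: (batch, tokens, dim); attention_mask: (batch, tokens) of 0/1.
    """
    mask = attention_mask[:, :, None].astype(float)
    return (token_vectors * mask).sum(axis=1) / mask.sum(axis=1)


# Tiny check of the pooling: the padded third token must not affect the mean.
toy_tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = np.array([[1, 1, 0]])
pooled = mean_pool(toy_tokens, toy_mask)  # average of the first two tokens

# Synthetic "parallel corpus": 1000 sentence vectors per language, where the
# target side is an unknown linear map of the source side plus small noise.
N, DIM = 1000, 32
W_true = rng.normal(size=(DIM, DIM))
src = rng.normal(size=(N, DIM))
tgt = src @ W_true + 0.01 * rng.normal(size=(N, DIM))

# Least-squares fit of W so that src @ W is close to tgt.
W, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Apply the learned projection to unseen source vectors.
src_new = rng.normal(size=(5, DIM))
projected = src_new @ W
max_error = np.abs(projected - src_new @ W_true).max()
print("largest projection error on unseen vectors:", max_error)
```

With a real model, `token_vectors` would be the last hidden states and `attention_mask` the tokenizer's mask, and `src` and `tgt` would hold the pooled vectors of the parallel sentences.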
