Why do I get different embeddings when I perform batch encoding with the HuggingFace MT5 model?

Asked 2025-04-14 16:56

I am trying to encode some text using HuggingFace's mt5-base model. This is how I am using the model:

import torch
from transformers import MT5EncoderModel, AutoTokenizer

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

def get_t5_embeddings(texts):
    last_hidden_state = model(input_ids=tokenizer(texts, return_tensors="pt", padding=True).input_ids).last_hidden_state
    pooled_sentence = torch.max(last_hidden_state, dim=1)
    return pooled_sentence[0].detach().numpy()

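While running some experiments, I noticed that the same text had a low cosine similarity score with itself. Digging further, I found that the model returns very different embeddings when I split the encoding into batches. To verify this, I ran a small experiment that generates the embedding of Hello on its own and of a growing list of repeated Hellos, then compares it with the embedding of the first Hello in the list (the two should be identical).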

for i in range(1, 10):
    print(i, (get_t5_embeddings(["Hello"])[0] == get_t5_embeddings(["Hello"]*i)[0]).sum())

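This returns the number of values in the two embeddings that match. The results were as follows:

Every time I run this experiment, mismatches appear if the batch size is more than 768.

Why am I getting different embeddings, and how do I fix this?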


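2 Answers

You could try setting the model to evaluation mode by calling model.eval(). Models are usually initialized in training mode, in which they apply dropout (randomly zeroing some activations) and normalization. These behaviours have to be disabled in evaluation mode.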
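To make the effect of model.eval() concrete, here is a minimal sketch that uses a tiny stand-in network rather than mt5-base; `toy_model` and its layer sizes are made up for illustration. As far as I am aware, `from_pretrained` already returns the model in evaluation mode, so printing `model.training` on the real model is a quick way to check whether dropout is involved at all.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a real model: one linear layer followed by dropout.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

toy_model.train()  # training mode: dropout zeroes random activations on every call
train_a, train_b = toy_model(x), toy_model(x)

toy_model.eval()  # evaluation mode: dropout becomes a no-op
with torch.no_grad():
    eval_a, eval_b = toy_model(x), toy_model(x)

print("training flag after eval():", toy_model.training)
print("eval outputs identical:", torch.equal(eval_a, eval_b))
print("train outputs identical:", torch.equal(train_a, train_b))
```

If the evaluation-mode outputs are identical but the batched embeddings still differ, dropout is not the cause and the padding needs a closer look.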

Answered 2025-04-14 by Python大师


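TL;DR

The main problem is the line pooled_sentence = torch.max(last_hidden_state, dim=1). Think for a moment about what torch.max is doing and what exactly you are "pooling".

In Long

Padding length depends on the longest sentence in the batch, so the number of token positions per sentence varies from batch to batch. last_hidden_state therefore has a different size along the sequence dimension, and torch.max over dim=1 pools over a different set of positions, padding included.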


texts = ["hello world", "foo bar", "this is a foo bar sentence"]
last_hidden_state = model(input_ids=tokenizer(texts, return_tensors="pt", padding=True).input_ids).last_hidden_state

second_sentence_embeddings_1 = last_hidden_state[1]


texts = ["hello world", "foo bar"]
last_hidden_state = model(input_ids=tokenizer(texts, return_tensors="pt", padding=True).input_ids).last_hidden_state

second_sentence_embeddings_2 = last_hidden_state[1]

second_sentence_embeddings_1.shape, second_sentence_embeddings_2.shape

[out]:

(torch.Size([10, 768]), torch.Size([4, 768]))
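The shape difference matters because a plain max also looks at the padded positions. The sketch below uses made-up numbers instead of real encoder outputs (the hidden size of 3 and the values in `pad_state` are assumptions for illustration) to show how the pooled vector of the same sentence changes once padding is appended.

```python
import torch

# Hypothetical hidden states for ONE sentence with 2 real tokens and hidden size 3.
real_tokens = torch.tensor([[-0.5, 0.2, -1.0],
                            [-0.3, 0.1, -2.0]])

# Pretend values that the encoder emits at a padded position (arbitrary numbers).
pad_state = torch.tensor([[0.9, -0.4, 0.7]])

# Batch A needs no padding. In batch B a longer neighbour forces 2 padded positions.
in_short_batch = real_tokens
in_long_batch = torch.cat([real_tokens, pad_state, pad_state], dim=0)

# Naive max pooling over the sequence dimension (dim=0 here, since this is one sentence).
naive_short = torch.max(in_short_batch, dim=0).values
naive_long = torch.max(in_long_batch, dim=0).values

print(naive_short)  # tensor([-0.3000,  0.2000, -1.0000])
print(naive_long)   # tensor([0.9000, 0.2000, 0.7000])
print((naive_short == naive_long).sum().item(), "of", naive_short.numel(), "values match")
```

Only the dimensions where a real token happens to beat the padding value survive, which is the same kind of partial match the question reports.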

First, fix the differing outputs across batch sizes by using `pipeline`

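import torch
from transformers import pipeline

pipe = pipeline(task="feature-extraction", model="google/mt5-base", framework="pt")

# See https://github.com/huggingface/transformers/issues/20404
pipe.model = pipe.model.encoder

# return_tensors=True asks the pipeline for torch tensors instead of nested lists
hello_world = pipe(["hello world", "foo bar"], return_tensors=True)

batch_mode = pipe(["hello world", "foo bar", "this is a foo bar sentence"], return_tensors=True)

# "foo bar" should get the same token embeddings no matter how many texts are passed in
assert torch.equal(hello_world[1], batch_mode[1])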

But how do I get a single 768-dimensional vector per sentence?

TL;DR:

import torch
from transformers import MT5EncoderModel, AutoTokenizer
from transformers import FeatureExtractionPipeline


class LuigiThePlumber(FeatureExtractionPipeline):
    def postprocess(self, model_outputs):
        # If you just want it to return a torch.return_types.max, 
        # instead of plain tensor use 
        # `return torch.max(model_outputs.last_hidden_state, dim=1)`
        return torch.max(model_outputs.last_hidden_state, dim=1).values

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

pipe = LuigiThePlumber(task="feature-extraction", model=model, tokenizer=tokenizer, framework="pt")

# See https://github.com/huggingface/transformers/issues/20404
pipe.model = pipe.model.encoder 


out = pipe(["hello world", "foo bar", "this is a foo bar sentence"])

print(out[0].shape, out[1].shape, out[2].shape)

[out]:

torch.Size([1, 768]) torch.Size([1, 768]) torch.Size([1, 768])

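So why does my approach fail while `pipeline` works?

It is because in batched processing, sentences shorter than the longest one in the batch are padded to the maximum length, and an attention mask is computed to mark which positions are real tokens and which are padding. So when you pool the last hidden state, you have to take the attention mask into account.

The pipeline sidesteps this because it re-batches whatever you feed it using its own batch size, which defaults to 1, so no padding is needed.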
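To see what padding and the attention mask look like, here is a small sketch that pads hand-written token ids instead of calling the tokenizer. The ids and `PAD_ID = 0` are assumptions for illustration; the real tokenizer produces the same two tensors when called with padding=True.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed pad token id for this illustration

# Made-up token ids for three sentences of different lengths.
token_ids = [
    torch.tensor([11, 12, 1]),
    torch.tensor([21, 1]),
    torch.tensor([31, 32, 33, 34, 1]),
]

# Batched: every sentence is padded to the longest one in the batch.
input_ids = pad_sequence(token_ids, batch_first=True, padding_value=PAD_ID)
attention_mask = (input_ids != PAD_ID).long()  # 1 = real token, 0 = padding

# Batch size 1: the sentence is already the longest one, so nothing is padded.
single_ids = pad_sequence(token_ids[:1], batch_first=True, padding_value=PAD_ID)
single_mask = (single_ids != PAD_ID).long()

print(input_ids)
print(attention_mask)
print(single_mask)
```

With a batch size of 1 the mask is all ones, which is why processing sentences one at a time never runs into the padding problem.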

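But I really want a batch size greater than 1!

First, take a look at:

Then, when computing the max pooling, take the attention mask into account.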

import torch
from transformers import MT5EncoderModel, AutoTokenizer

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

texts = ["hello world", "foo bar", "this is a foo bar sentence"]

encoded_input = tokenizer(texts, padding=True, return_tensors='pt')
attention_mask = encoded_input['attention_mask']
# Pass the attention mask to the model as well, so real tokens do not attend to padding.
with torch.no_grad():
    model_output = model(input_ids=encoded_input.input_ids, attention_mask=attention_mask)


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Zero out padded positions, then divide by the number of real tokens.
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def max_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Overwrite padded positions with the most negative value so they can never win the max.
    is_padding = attention_mask.unsqueeze(-1).expand(token_embeddings.size()) == 0
    return torch.max(token_embeddings.masked_fill(is_padding, torch.finfo(token_embeddings.dtype).min), 1).values


mean_pooling(model_output, attention_mask).shape, max_pooling(model_output, attention_mask).shape

[out]:

(torch.Size([3, 768]), torch.Size([3, 768]))
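The pooling logic can be checked without downloading the model by feeding it a hand-written tensor. The sketch below is one possible way to write mask-aware pooling, not the only one; `fake_output` is a hypothetical stand-in for the model output, and the function names `masked_max_pooling` and `masked_mean_pooling` are made up for this example.

```python
from types import SimpleNamespace

import torch


def masked_max_pooling(model_output, attention_mask):
    hidden = model_output.last_hidden_state
    is_padding = attention_mask.unsqueeze(-1) == 0  # broadcasts over the hidden dimension
    # Padded positions get the most negative value, so they can never be the maximum.
    return hidden.masked_fill(is_padding, torch.finfo(hidden.dtype).min).max(dim=1).values


def masked_mean_pooling(model_output, attention_mask):
    hidden = model_output.last_hidden_state
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    # Sum only the real tokens and divide by how many there are.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)


# Hypothetical batch: 2 sentences, 3 positions, hidden size 2.
# The last position of the second sentence is padding and holds large junk values.
hidden = torch.tensor([
    [[1.0, -2.0], [3.0, -4.0], [5.0, -6.0]],
    [[-1.0, -1.0], [-3.0, -5.0], [9.0, 9.0]],
])
attention_mask = torch.tensor([[1, 1, 1], [1, 1, 0]])
fake_output = SimpleNamespace(last_hidden_state=hidden)

max_pooled = masked_max_pooling(fake_output, attention_mask)
mean_pooled = masked_mean_pooling(fake_output, attention_mask)

print(max_pooled)   # tensor([[ 5., -2.], [-1., -1.]])
print(mean_pooled)  # tensor([[ 3., -4.], [-2., -3.]])
```

The junk values at the padded position (9.0) do not show up in either result, which is the property the pooled sentence embeddings need in order to be independent of the batch composition.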

Q: Are the pooled embedding values the same across different batches?

A:

import torch
from transformers import MT5EncoderModel, AutoTokenizer

model = MT5EncoderModel.from_pretrained("google/mt5-base")
tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")

texts = ["hello world", "foo bar", "this is a foo bar sentence"]

encoded_input = tokenizer(texts, padding=True, return_tensors='pt')
attention_mask = encoded_input['attention_mask']
# Pass the attention mask to the model as well, so real tokens do not attend to padding.
with torch.no_grad():
    model_output = model(input_ids=encoded_input.input_ids, attention_mask=attention_mask)


def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


def max_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    # Overwrite padded positions with the most negative value so they can never win the max.
    is_padding = attention_mask.unsqueeze(-1).expand(token_embeddings.size()) == 0
    return torch.max(token_embeddings.masked_fill(is_padding, torch.finfo(token_embeddings.dtype).min), 1).values


x = max_pooling(model_output, attention_mask)

# Second batch: same first two sentences, without the long third one.
texts = ["hello world", "foo bar"]

encoded_input = tokenizer(texts, padding=True, return_tensors='pt')
attention_mask = encoded_input['attention_mask']
with torch.no_grad():
    model_output = model(input_ids=encoded_input.input_ids, attention_mask=attention_mask)

y = max_pooling(model_output, attention_mask)

# Let's check the "embeddings" of "hello world" in both batch sizes.
# Compare with a tolerance: batches of different shapes can differ by floating-point rounding.
assert torch.allclose(x[0], y[0], atol=1e-4)
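A note on how to compare embeddings: counting exact matches with `==`, as in the question, is very sensitive to floating-point rounding. The sketch below uses a random vector plus artificial noise (both are assumptions for illustration, not real model outputs) to show that two embeddings can be equal for all practical purposes while almost none of their values match exactly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical 768-dimensional embedding and a copy with tiny rounding-like noise.
embedding = torch.randn(768)
noisy = embedding + 1e-6 * torch.randn(768)

exact_matches = (embedding == noisy).sum().item()     # strict element-wise equality
close = torch.allclose(embedding, noisy, atol=1e-5)   # equality within a tolerance
cosine = F.cosine_similarity(embedding, noisy, dim=0).item()

print("exact matches:", exact_matches, "of", embedding.numel())
print("allclose:", close)
print("cosine similarity:", cosine)
```

If torch.allclose passes and the cosine similarity is essentially 1, the remaining differences are numerical noise rather than a padding bug.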

Answered 2025-04-14 by Python大师
