Python: BERT tokenizer fails to load

Posted 2024-05-23 17:05:46


I am working with the bert-base-multilingual-uncased model, but when I try to set TOKENIZER in my config, it raises an OSError.

Model config

import transformers

class config:
    DEVICE = "cuda:0"
    MAX_LEN = 256
    TRAIN_BATCH_SIZE = 8
    VALID_BATCH_SIZE = 4
    EPOCHS = 1

    # Relative path to the local folder that holds the pretrained model files
    BERT_PATH = {"bert-base-multilingual-uncased": "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"}
    MODEL_PATH = "workspace/data/jigsaw-multilingual/model.bin"

    # Load the tokenizer from the local folder above (this is the call that raises the OSError)
    TOKENIZER = transformers.BertTokenizer.from_pretrained(
        BERT_PATH["bert-base-multilingual-uncased"],
        do_lower_case=True)

Error

    ---------------------------------------------------------------------------
    OSError                                   Traceback (most recent call last)
    <ipython-input-33-83880b6b788e> in <module>
    ----> 1 class config:
          2 #     def __init__(self):
          3 
          4         DEVICE = "cuda:0"
          5         MAX_LEN = 256
    
    <ipython-input-33-83880b6b788e> in config()
         11         TOKENIZER = transformers.BertTokenizer.from_pretrained(
         12             BERT_PATH["bert-base-multilingual-uncased"],
    ---> 13             do_lower_case=True)
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
       1138 
       1139         """
    -> 1140         return cls._from_pretrained(*inputs, **kwargs)
       1141 
       1142     @classmethod
    
    /opt/conda/lib/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1244                     ", ".join(s3_models),
       1245                     pretrained_model_name_or_path,
    -> 1246                     list(cls.vocab_files_names.values()),
       1247                 )
       1248             )
    
    OSError: Model name 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was not found in tokenizers model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). We assumed 'workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased' was a path, a model identifier, or url to a directory containing vocabulary files named ['vocab.txt'] but couldn't find such vocabulary files at this path or url.

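I can interpret the error: it says that the vocab.txt file was not found at the given location, yet the file is present there.

These are the files available in the bert-base-multilingual-uncased folder: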

  • config.json
  • pytorch_model.bin
  • vocab.txt

I am new to working with bert, so I am not sure whether there is a different way to define the tokenizer.
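
The error message and the folder listing disagree. One possible explanation, which the post does not confirm, is that the relative path is being resolved against a working directory other than the one expected. A minimal diagnostic sketch using only the standard library (the helper name `describe_tokenizer_dir` is hypothetical, not part of transformers):

```python
import os

def describe_tokenizer_dir(path, vocab_name="vocab.txt"):
    """Report how a (possibly relative) tokenizer directory resolves on disk."""
    # Relative paths are resolved against the current working directory.
    abs_path = os.path.abspath(path)
    return {
        "cwd": os.getcwd(),
        "abs_path": abs_path,
        "dir_exists": os.path.isdir(abs_path),
        "vocab_exists": os.path.isfile(os.path.join(abs_path, vocab_name)),
    }

# Path copied from the config in the question.
info = describe_tokenizer_dir(
    "workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased"
)
for key, value in info.items():
    print(f"{key}: {value}")
```

If `dir_exists` or `vocab_exists` comes back False, the path as written does not point at the folder listed above from the notebook's working directory, and an absolute path may be worth trying.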


1 answer
User
#1 · Posted 2024-05-23 17:05:46

I think this should work:

from transformers import BertTokenizer
TOKENIZER = BertTokenizer.from_pretrained('bert-base-multilingual-uncased', do_lower_case=True)

It will download the tokenizer from huggingface.
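
The two approaches in this thread (a local folder versus the model name) can be combined. The sketch below is only an illustration under stated assumptions: `pick_tokenizer_source` and `load_tokenizer` are hypothetical helper names, and the fallback to the model name assumes network access is available. It prefers the local folder when vocab.txt is actually found there and otherwise falls back to the name used in the answer.

```python
import os

def pick_tokenizer_source(local_dir, hub_name="bert-base-multilingual-uncased"):
    """Return the local folder if it contains vocab.txt, else the model name."""
    local_dir = os.path.abspath(local_dir)
    if os.path.isfile(os.path.join(local_dir, "vocab.txt")):
        return local_dir
    return hub_name

def load_tokenizer(local_dir):
    # Imported lazily so the path check above works without transformers installed.
    from transformers import BertTokenizer
    return BertTokenizer.from_pretrained(
        pick_tokenizer_source(local_dir), do_lower_case=True
    )
```

With the config from the question, this could be called as `TOKENIZER = load_tokenizer("workspace/data/jigsaw-multilingual/input/bert-base-multilingual-uncased")`.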
