Token-level BERT embeddings with MXNet
BERT Embeddings
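BERT, published by Google, is a new way to obtain pre-trained language model word representations. Many NLP tasks benefit from BERT to reach state-of-the-art (SOTA) results.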
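The goal of this project is to obtain token embeddings from BERT's pre-trained model. This way, instead of building and fine-tuning an end-to-end NLP model, you can build your model by simply using the token embeddings.
This project is implemented with @MXNet. Special thanks to the @gluon-nlp team.
Install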
    pip install bert-embedding
    # If you want to run on a GPU machine, also install `mxnet-cu92` (the MXNet build for CUDA 9.2).
    pip install mxnet-cu92
Usage
    from bert_embedding import BertEmbedding

    # One sentence per line, so that split('\n') yields a list of sentences.
    bert_abstract = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
    Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
    As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
    BERT is conceptually simple and empirically powerful.
    It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%."""
    sentences = bert_abstract.split('\n')
    bert_embedding = BertEmbedding()
    result = bert_embedding(sentences)
To use a GPU, import mxnet and set the context:
    import mxnet as mx
    from bert_embedding import BertEmbedding

    ...

    ctx = mx.gpu(0)
    bert = BertEmbedding(ctx=ctx)
The result is a list of tuples, one per sentence, each containing (tokens, token embeddings).
For example:
    first_sentence = result[0]

    # Element 0 of the tuple holds the tokens of the sentence.
    first_sentence[0]
    # ['we', 'introduce', 'a', 'new', 'language', 'representation', 'model', 'called', 'bert', ',', 'which', 'stands', 'for', 'bidirectional', 'encoder', 'representations', 'from', 'transformers']
    len(first_sentence[0])
    # 18

    # Element 1 holds one embedding per token, in the same order.
    len(first_sentence[1])
    # 18
    first_sentence_embeddings = first_sentence[1]
    # Index 1 selects the embedding of the second token, 'introduce'.
    first_sentence_embeddings[1]
    # array([ 0.4805648 , 0.18369392, -0.28554988, ..., -0.01961522,
    #         1.0207764 , -0.67167974], dtype=float32)
    first_sentence_embeddings[1].shape
    # (768,)
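The (tokens, token embeddings) layout described above can be consumed with ordinary Python. The following is a minimal sketch and not part of bert-embedding: `mean_pool` is a hypothetical helper, and `mock_result` is a tiny made-up stand-in with 3-dimensional vectors (real vectors from bert_12_768_12 have 768 dimensions), so the snippet runs without downloading a model.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    count = len(vectors)
    if count == 0:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(float(v[i]) for v in vectors) / count for i in range(dim)]


# Hypothetical stand-in with the same layout as `result`:
# a list of (tokens, token_embeddings) tuples, one per sentence.
mock_result = [
    (["we", "introduce"], [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]),
    (["bert"], [[0.5, 0.5, 0.5]]),
]

sentence_vectors = []
for tokens, embeddings in mock_result:
    # Tokens and embeddings are aligned index by index.
    for token, vector in zip(tokens, embeddings):
        print(token, vector)
    # One simple way to get a sentence-level vector: average the token vectors.
    sentence_vectors.append(mean_pool(embeddings))

print(sentence_vectors)
```

Averaging is only one possible pooling choice; it is used here because it keeps the sketch short, not because the project prescribes it.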
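OOV
There are three ways to handle OOV (out-of-vocabulary) tokens: avg (the default), sum, and last. The method can be specified when encoding.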
    ...
    bert_embedding = BertEmbedding()
    bert_embedding(sentences, 'sum')
    ...
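To make the three option names concrete, here is a minimal sketch of how the vectors of word pieces could be combined under each strategy. It assumes that an out-of-vocabulary word is split into pieces whose continuation pieces start with `##` (the usual BERT WordPiece convention). `merge_word_pieces` is a hypothetical helper written for illustration, not the library's internal code.

```python
def merge_word_pieces(pieces, vectors, oov_way="avg"):
    """Merge '##' continuation pieces into words and combine their vectors.

    oov_way: 'avg' averages the piece vectors, 'sum' adds them,
    'last' keeps only the vector of the final piece.
    """
    if oov_way not in ("avg", "sum", "last"):
        raise ValueError("oov_way must be 'avg', 'sum' or 'last'")

    words, groups = [], []
    for piece, vector in zip(pieces, vectors):
        if piece.startswith("##") and words:
            words[-1] += piece[2:]          # continuation of the previous word
            groups[-1].append(list(vector))
        else:
            words.append(piece)             # start of a new word
            groups.append([list(vector)])

    merged = []
    for group in groups:
        if oov_way == "last":
            merged.append(group[-1])
            continue
        combined = [sum(column) for column in zip(*group)]
        if oov_way == "avg":
            combined = [value / len(group) for value in combined]
        merged.append(combined)
    return words, merged


# Made-up 2-dimensional vectors for the pieces of "playing now".
pieces = ["play", "##ing", "now"]
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for way in ("avg", "sum", "last"):
    print(way, merge_word_pieces(pieces, vectors, way))
```

In this toy input the merged word "playing" receives [2.0, 3.0] with avg, [4.0, 6.0] with sum, and [3.0, 4.0] with last, while the in-vocabulary word "now" keeps its vector in all three cases.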
Available pre-trained BERT models
Model | book_corpus_wiki_en_uncased | book_corpus_wiki_en_cased | wiki_multilingual | wiki_multilingual_cased | wiki_cn |
---|---|---|---|---|---|
bert_12_768_12 | ✓ | ✓ | ✓ | ✓ | ✓ |
bert_24_1024_16 | x | ✓ | x | x | x |
Example of using the large pre-trained BERT model from Google:
    from bert_embedding import BertEmbedding

    bert_embedding = BertEmbedding(model='bert_24_1024_16', dataset_name='book_corpus_wiki_en_cased')
Source: gluonnlp
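The model table above can also be encoded as a lookup, so that an unsupported model and dataset pair is caught before any download starts. This is a sketch only: `AVAILABLE` and `check_combination` are hypothetical names, and the data simply mirrors the table in this README rather than querying gluonnlp.

```python
# Datasets listed for each model, copied from the table in this README.
AVAILABLE = {
    "bert_12_768_12": {
        "book_corpus_wiki_en_uncased",
        "book_corpus_wiki_en_cased",
        "wiki_multilingual",
        "wiki_multilingual_cased",
        "wiki_cn",
    },
    "bert_24_1024_16": {"book_corpus_wiki_en_cased"},
}


def check_combination(model, dataset_name):
    """Return keyword arguments if the pair is listed, else raise ValueError."""
    if dataset_name not in AVAILABLE.get(model, set()):
        raise ValueError(f"{model} is not listed for dataset {dataset_name}")
    return {"model": model, "dataset_name": dataset_name}


kwargs = check_combination("bert_24_1024_16", "book_corpus_wiki_en_cased")
print(kwargs)
# The checked arguments could then be passed on as BertEmbedding(**kwargs).
```

Keeping the check separate from model construction means a typo in a dataset name fails immediately with a readable message.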