OPT Large Language Model (LLM) Architecture

Large language models follow the GPT approach: the basic building block is a decoder-only Transformer block, and many such blocks are stacked on top of each other.

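Varying the number of blocks, the number of attention heads, and the hidden dimension yields models with different parameter counts (this is the suffix after the model name, for example 6.7B).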

OPT is a large language model open-sourced by Facebook (now Meta).

This post uses OPT-6.7b as an example to walk through the network structure of the OPT models.

OPTConfig {
  "_name_or_path": "facebook/opt-6.7b",
  "_remove_final_layer_norm": false,
  "activation_dropout": 0.0,
  "activation_function": "relu",
  "architectures": ["OPTForCausalLM"],
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "do_layer_norm_before": true,
  "dropout": 0.1,
  "enable_bias": true,
  "eos_token_id": 2,
  "ffn_dim": 16384,
  "hidden_size": 4096,
  "init_std": 0.02,
  "layer_norm_elementwise_affine": true,
  "layerdrop": 0.0,
  "max_position_embeddings": 2048,
  "model_type": "opt",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 1,
  "prefix": "</s>",
  "torch_dtype": "float16",
  "transformers_version": "4.42.2",
  "use_cache": true,
  "vocab_size": 50272,
  "word_embed_proj_dim": 4096
}

The configuration above belongs to OPT-6.7b and lists the model's hyperparameters. The ones worth focusing on are:

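activation_function: relu

The activation function is ReLU.

vocab_size: 50272

Size of the vocabulary.

word_embed_proj_dim: 4096

Dimension of each token vector after the embedding layer.

ffn_dim: 16384

Hidden dimension of the fully connected layers in the MLP of each Transformer block.

hidden_size: 4096

Hidden dimension of the model (usually equal to word_embed_proj_dim).

num_attention_heads: 32

Number of attention heads.

num_hidden_layers: 32

Number of Transformer blocks.

torch_dtype: float16

Data type of the pretrained weights (LLM weights are usually stored in half precision, fp16, rather than single precision, fp32).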
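A few quantities that the rest of the walkthrough relies on follow directly from these values. The snippet below is a small sketch that derives them; the dictionary is copied by hand from the config above, and the variable names are illustrative rather than part of any library.

```python
# Values copied from the OPT-6.7b config shown above.
config = {
    "hidden_size": 4096,
    "ffn_dim": 16384,
    "num_attention_heads": 32,
    "num_hidden_layers": 32,
    "torch_dtype": "float16",
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# How much wider the MLP hidden layer is than the model's hidden size.
ffn_ratio = config["ffn_dim"] // config["hidden_size"]

# Storage cost of one parameter; only the two dtypes discussed here are listed.
bytes_per_param = {"float16": 2, "float32": 4}[config["torch_dtype"]]

print(head_dim, ffn_ratio, bytes_per_param)
```

With these numbers each head has dimension 128, the MLP expands the hidden size by a factor of 4, and every fp16 weight takes 2 bytes.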

In PyTorch, once the model has been instantiated, print(model) shows the network structure. The output is:

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 4096, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 4096)
      (final_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
      (layers): ModuleList(
        (0-31): 32 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (v_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (q_proj): Linear(in_features=4096, out_features=4096, bias=True)
            (out_proj): Linear(in_features=4096, out_features=4096, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=4096, out_features=16384, bias=True)
          (fc2): Linear(in_features=16384, out_features=4096, bias=True)
          (final_layer_norm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=4096, out_features=50272, bias=False)
)
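A printout with the same layout can be produced without downloading the full checkpoint by building a randomly initialised model from a small config. This is only a sketch: it assumes the transformers and torch packages are installed, and the tiny sizes below are hypothetical values picked so the model builds quickly on a CPU.

```python
from transformers import OPTConfig, OPTForCausalLM

# Hypothetical tiny sizes chosen only so the model builds instantly on CPU.
tiny_config = OPTConfig(
    vocab_size=100,
    hidden_size=64,
    word_embed_proj_dim=64,
    ffn_dim=256,
    num_hidden_layers=2,
    num_attention_heads=4,
    max_position_embeddings=32,
)

# Randomly initialised weights; nothing is downloaded.
model = OPTForCausalLM(tiny_config)
print(model)
```

The module names (embed_tokens, embed_positions, layers, lm_head) match the OPT-6.7b printout; only the shapes differ.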

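The model has two main parts: OPTModel and lm_head.

OPTModel is the body of the model, while lm_head projects the final hidden states onto the vocabulary to produce the logits for the next token.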
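The shapes in the printed structure are enough to check where the "6.7B" in the model name comes from. The following is a back-of-the-envelope sketch, not an official count: the helper functions are illustrative, and it assumes that lm_head shares its weight matrix with embed_tokens (weight tying), so lm_head contributes no extra parameters.

```python
hidden, ffn, layers = 4096, 16384, 32
vocab, positions = 50272, 2050  # rows of embed_tokens and embed_positions

def linear(in_features, out_features, bias=True):
    """Parameter count of one Linear layer."""
    return in_features * out_features + (out_features if bias else 0)

def layer_norm(dim):
    """Weight and bias of one LayerNorm."""
    return 2 * dim

attention = 4 * linear(hidden, hidden)           # q_proj, k_proj, v_proj, out_proj
mlp = linear(hidden, ffn) + linear(ffn, hidden)  # fc1, fc2
per_layer = attention + mlp + 2 * layer_norm(hidden)

embeddings = vocab * hidden + positions * hidden
# Assumption: lm_head shares its weight with embed_tokens, so it adds nothing.
total = layers * per_layer + embeddings + layer_norm(hidden)

print(f"{total:,} parameters, about {total / 1e9:.2f}B")
print(f"fp16 weights: about {total * 2 / 2**30:.1f} GiB")
```

Under that assumption the total is 6,658,473,984 parameters, and almost all of it sits in the 32 decoder layers rather than in the embeddings.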

OPTDecoderLayer is built from OPTAttention followed by an MLP (fc1 and fc2). Its structure is shown in the figure below:
[Figure: structure of OPTDecoderLayer]
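Since the figure is not reproduced here, the data flow through one decoder layer can be sketched in PyTorch. This is a simplified illustration, not the Hugging Face implementation: nn.MultiheadAttention is used as a stand-in for OPTAttention, dropout is omitted, the sizes are small hypothetical values, and the LayerNorm is applied before each sub-layer because the config sets do_layer_norm_before to true.

```python
import torch
from torch import nn

class DecoderLayerSketch(nn.Module):
    """Pre-LayerNorm decoder block: attention sub-layer, then MLP sub-layer."""

    def __init__(self, hidden_size=64, num_heads=4, ffn_dim=256):
        super().__init__()
        # Stand-in for OPTAttention (assumption: PyTorch's built-in module).
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.self_attn_layer_norm = nn.LayerNorm(hidden_size)
        self.fc1 = nn.Linear(hidden_size, ffn_dim)
        self.activation_fn = nn.ReLU()
        self.fc2 = nn.Linear(ffn_dim, hidden_size)
        self.final_layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        seq_len = x.size(1)
        # True above the diagonal: a position may not attend to later positions.
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )

        # Attention sub-layer with residual connection.
        residual = x
        h = self.self_attn_layer_norm(x)
        h, _ = self.self_attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = residual + h

        # MLP sub-layer (fc1 -> ReLU -> fc2) with residual connection.
        residual = x
        h = self.final_layer_norm(x)
        h = self.fc2(self.activation_fn(self.fc1(h)))
        return residual + h

layer = DecoderLayerSketch()
out = layer(torch.randn(2, 5, 64))  # (batch, sequence, hidden)
print(out.shape)
```

The output has the same shape as the input, which is what allows 32 such layers to be stacked one after another.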

The structure of OPTAttention is shown in the figure below:
[Figure: structure of OPTAttention]
It follows the standard Transformer decoder design.

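The KV cache mechanism commonly used in large language models appears in the figure as past_key_value.
When the softmax is computed, the data is first cast to fp32 to prevent overflow.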
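Both points can be illustrated with a short PyTorch sketch. It is a simplified single-head version written for this post, not the OPTAttention code: the sizes are hypothetical, the tuple named past_key_value only mimics the idea of the cache, and the softmax upcast uses the dtype argument of torch.nn.functional.softmax.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size, seq_len = 16, 6
q_proj = torch.nn.Linear(hidden_size, hidden_size)
k_proj = torch.nn.Linear(hidden_size, hidden_size)
v_proj = torch.nn.Linear(hidden_size, hidden_size)

def attend(q, k, v):
    """Scaled dot-product attention; softmax runs in fp32, then is cast back."""
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1, dtype=torch.float32).to(q.dtype)
    return weights @ v

x = torch.randn(seq_len, hidden_size)  # one sequence, single head for brevity

# Reference: recompute keys and values for the whole prefix at the last step.
full = attend(q_proj(x[-1:]), k_proj(x), v_proj(x))

# Cached: each step projects only the new token and appends to the cache.
past_key_value = None
for t in range(seq_len):
    token = x[t : t + 1]
    k_new, v_new = k_proj(token), v_proj(token)
    if past_key_value is None:
        past_key_value = (k_new, v_new)
    else:
        past_key_value = (
            torch.cat([past_key_value[0], k_new], dim=0),
            torch.cat([past_key_value[1], v_new], dim=0),
        )
    cached = attend(q_proj(token), *past_key_value)

# The cached result for the last token matches the full recomputation.
print(torch.allclose(full, cached, atol=1e-5))
# fp16 tops out at 65504, which is below exp(12), so exponentials
# overflow quickly in half precision.
print(torch.finfo(torch.float16).max < math.exp(12))
```

The cache trades memory for compute. As a rough estimate for OPT-6.7b, each token stores one key and one value vector of size 4096 in each of the 32 layers, which at 2 bytes per fp16 value is 2 x 32 x 4096 x 2 bytes = 512 KiB per token.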


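This article is reposted from: https://blog.csdn.net/qq_42047140/article/details/140752439
Copyright belongs to the original author, 黎明沐白. If there is any infringement, please contact us for removal.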
