

Qwen2-57B-A14B pretraining

Model config Qwen2MoeConfig {
  "_name_or_path": "/data/zengzijian/Qwen2-57B-A14B/",
  "architectures": [
    "Qwen2MoeForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "decoder_sparse_step": 1,
  "eos_token_id": 151643,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 131072,
  "max_window_layers": 28,
  "mlp_only_layers": [],
  "model_type": "qwen2_moe",
  "moe_intermediate_size": 2560,
  "norm_topk_prob": false,
  "num_attention_heads": 28,
  "num_experts": 64,
  "num_experts_per_tok": 8,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "output_router_logits": false,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "router_aux_loss_coef": 0.001,
  "shared_expert_intermediate_size": 20480,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 151936
}
[INFO|modeling_utils.py:3471] 2024-07-15 15:34:49,666 >> loading weights file /data/zengzijian/Qwen2-57B-A14B/model.safetensors.index.json
[INFO|modeling_utils.py:3614] 2024-07-15 15:34:49,670 >> Detected DeepSpeed ZeRO-3: activating zero.init() for this model
[INFO|configuration_utils.py:962] 2024-07-15 15:34:49,674 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151643
}
Token indices sequence length is longer than the specified maximum sequence length for this model (132576 > 131072). Running this sequence through the model will result in indexing errors
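The tokenizer warning on the last line (132576 > 131072) means at least one tokenized sample exceeds max_position_embeddings. A minimal sketch of one common fix is to split over-long token sequences into chunks during preprocessing (the helper name here is hypothetical, not part of any library):

```python
# Sketch: guard against sequences longer than the model's context window.
# max_len mirrors max_position_embeddings from the config above.
def chunk_token_ids(token_ids, max_len=131072):
    """Split an over-long list of token ids into chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

ids = list(range(132576))  # stand-in for the over-long tokenized document
chunks = chunk_token_ids(ids)
# 132576 tokens -> one full 131072-token chunk plus a 1504-token remainder
assert [len(c) for c in chunks] == [131072, 1504]
```

Whether to chunk, truncate, or drop such samples depends on the data pipeline; the point is that the warning should be handled before training, since those indices would overflow the position embeddings.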

Qwen2-57B-A14B is the Mixture of Experts (MoE) variant in Alibaba's open-source second-generation Qwen (Tongyi Qianwen) model family. Key details:

  1. Model scale:
  • Total parameters: 57B
  • Activated parameters at inference: 14B
  2. Release information:
  • Release date: June 7, 2024
  • Pretrained checkpoint size: 114 GB
  3. Architecture:
  • Transformer-based
  • SwiGLU activation function
  • Mixture of Experts (MoE)
  4. Benchmark results: Qwen2-57B-A14B scores strongly across evaluations, outperforming the leading open-source MoE models:
  • MMLU: 75.4
  • MMLU-Pro: 52.8
  • GPQA: 34.3
  • TheoremQA: 33.1
  • MT-Bench: 8.55
  • HumanEval (code): 79.9
  • GSM8K (math): 79.6
  • MATH: 49.1
  • C-Eval (Chinese): 80.5
  • AlignBench: 7.36
  5. Highlights:
  • Strong across natural language understanding, knowledge, code, math, and multilingual tasks
  • Carefully fine-tuned to improve code, math, reasoning, instruction following, and multilingual understanding
  • Aligned with human values: more helpful, honest, and safe
  6. Training methodology:
  • High-quality instruction and preference data collected with a variety of automated methods
  • Combines supervised fine-tuning, reward model training, and online DPO
  • Uses online model merging to reduce alignment cost

As a capable MoE model, Qwen2-57B-A14B achieves strong performance while keeping the activated parameter count small, opening new possibilities for deploying large-scale language models.
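The sparse activation described above (8 of 64 experts per token) can be sketched in a few lines. This is an illustrative simplification of top-k routing in a Qwen2-MoE-style block, not the actual implementation:

```python
import math
import random

# Illustrative top-k MoE routing: a gate scores NUM_EXPERTS experts per
# token, and only the TOP_K highest-scoring experts process that token.
NUM_EXPERTS, TOP_K = 64, 8

def route(logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # norm_topk_prob is false in the config above, so the selected
    # probabilities are used as-is rather than renormalized to sum to 1.
    return [(i, probs[i]) for i in ranked[:top_k]]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in gate output
chosen = route(logits)
assert len(chosen) == TOP_K
```

Because only 8 of the 64 expert MLPs run per token, per-token compute scales with the 14B activated parameters rather than the 57B total.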

DeepSpeed training

The log below is a detailed performance profile of the Qwen2-57B-A14B MoE model, produced by the DeepSpeed Flops Profiler. The key information:

  1. Model architecture:
  • Total parameters: 57B
  • Activated parameters per forward pass: 14B
  • MoE (Mixture of Experts) structure
  • 28 decoder layers
  • 64 experts per layer
  • 8 experts activated per token
  2. Parallelism setup:
  • World size: 16, i.e. 16 GPUs
  • Data parallel size: 16
  • Model parallel size: 1
  • Batch size per GPU: 16
  3. Compute performance:
  • Params per GPU: 433.66 K (as reported by the profiler; with ZeRO-3 the weights are partitioned, so this is far below the full model size)
  • Fwd MACs per GPU: 449.07 TMACs
  • Fwd FLOPS per GPU: 37.63 TFLOPS
  • Overall FLOPS: 31.01 TFLOPS
  • Samples per second: 2.95
  4. Latency:
  • Forward latency: 23.87 s
  • Backward latency: 61.66 s
  • Weight update (step) latency: 1.37 s
  • Total iteration latency: 86.9 s
  5. Key modules:
  • Embedding layer: 151936 x 3584
  • Self-attention: FlashAttention2
  • MLP layer: Qwen2MoeSparseMoeBlock
  • Hidden dimension of each expert MLP: 2560
  6. MoE structure:
  • Gating network: Linear layer (3584 -> 64)
  • 64 experts, 8 activated per token
  • Each expert is an MLP with gate_proj, up_proj, down_proj and a SiLU activation

By placing many expert networks in every layer, this MoE design combines a large total parameter count (57B) with a much smaller activated parameter count (14B). That lets a bigger model be trained and deployed under a limited compute budget, potentially yielding better quality.
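The headline numbers in the profiler summary below follow from simple arithmetic over the measured latencies; this sketch re-derives them from the logged values using the same formulas DeepSpeed prints:

```python
# Re-derive DeepSpeed's headline profiler numbers from the logged values.
fwd_flops_T = 898.15                  # fwd flops per GPU (TFLOPs), from the log
fwd_s, bwd_s, step_s = 23.87, 61.66, 1.37
iter_s = fwd_s + bwd_s + step_s       # iteration latency: 86.9 s

fwd_tflops = fwd_flops_T / fwd_s              # fwd FLOPS per GPU
bwd_tflops = 2 * fwd_flops_T / bwd_s          # bwd counted as ~2x fwd flops
overall_tflops = 3 * fwd_flops_T / iter_s     # fwd + bwd over the full iteration
samples_per_s = 16 * 16 / iter_s              # 16 GPUs x batch 16 per iteration

print(round(fwd_tflops, 2), round(bwd_tflops, 2),
      round(overall_tflops, 2), round(samples_per_s, 2))
# matches the log: 37.63, 29.13, 31.01 TFLOPS and 2.95 samples/second
```

Note that the step latency (1.37 s) contributes no flops but lengthens the iteration, which is why the overall 31.01 TFLOPS sits below the fwd+bwd figure of 31.5 TFLOPS.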

-------------------------- DeepSpeed Flops Profiler --------------------------
gpu009: Profile Summary at step 2:
gpu009: Notations:
gpu009: data parallel size (dp_size), model parallel size(mp_size),
gpu009: number of parameters (params), number of multiply-accumulate operations(MACs),
gpu009: number of floating-point operations (flops), floating-point operations per second (FLOPS),
gpu009: fwd latency (forward propagation latency), bwd latency (backward propagation latency),
gpu009: step (weights update latency), iter latency (sum of fwd, bwd and step latency)
gpu009:
gpu009: world size: 16
gpu009: data parallel size: 16
gpu009: model parallel size: 1
gpu009: batch size per GPU: 16
gpu009: params per GPU: 433.66 K
gpu009: params of model = params per GPU * mp_size: 433.66 K
gpu009: fwd MACs per GPU: 449.07 TMACs
gpu009: fwd flops per GPU: 898.15 T
gpu009: fwd flops of model = fwd flops per GPU * mp_size: 898.15 T
gpu009: fwd latency: 23.87 s
gpu009: fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 37.63 TFLOPS
gpu009: bwd latency: 61.66 s
gpu009: bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 29.13 TFLOPS
gpu009: fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 31.5 TFLOPS
gpu009: step latency: 1.37 s
gpu009: iter latency: 86.9 s
gpu009: FLOPS per GPU = 3 * fwd flops per GPU / iter latency: 31.01 TFLOPS
gpu009: samples/second: 2.95
gpu009:
gpu009: ----------------------------- Aggregated Profile per GPU -----------------------------
gpu009: Top 3 modules in terms of params, MACs or fwd latency at different model depths:
gpu009: depth 0:
gpu009: params - {'Qwen2MoeForCausalLM': '433.66 K'}
gpu009: MACs - {'Qwen2MoeForCausalLM': '449.07 TMACs'}
gpu009: fwd latency - {'Qwen2MoeForCausalLM': '23.73 s'}
gpu009: depth 1:
gpu009: params - {'Qwen2MoeModel': '433.66 K', 'Linear': '0'}
gpu009: MACs - {'Qwen2MoeModel': '431.22 TMACs', 'Linear': '17.84 TMACs'}
gpu009: fwd latency - {'Qwen2MoeModel': '23.59 s', 'Linear': '65.17 ms'}
gpu009: depth 2:
gpu009: params - {'ModuleList': '430.08 K', 'Qwen2MoeRMSNorm': '3.58 K', 'Embedding': '0'}
gpu009: MACs - {'ModuleList': '431.22 TMACs', 'Embedding': '0 MACs', 'Qwen2MoeRMSNorm': '0 MACs'}
gpu009: fwd latency - {'ModuleList': '23.47 s', 'Qwen2MoeRMSNorm': '1.58 ms', 'Embedding': '1.48 ms'}
gpu009: depth 3:
gpu009: params - {'Qwen2MoeDecoderLayer': '430.08 K'}
gpu009: MACs - {'Qwen2MoeDecoderLayer': '431.22 TMACs'}
gpu009: fwd latency - {'Qwen2MoeDecoderLayer': '23.47 s'}
gpu009: depth 4:
gpu009: params - {'Qwen2MoeRMSNorm': '200.7 K', 'Qwen2MoeFlashAttention2': '129.02 K', 'Qwen2MoeSparseMoeBlock': '100.35 K'}
gpu009: MACs - {'Qwen2MoeSparseMoeBlock': '404.28 TMACs', 'Qwen2MoeFlashAttention2': '26.94 TMACs', 'Qwen2MoeRMSNorm': '0 MACs'}
gpu009: fwd latency - {'Qwen2MoeSparseMoeBlock': '22.17 s', 'Qwen2MoeFlashAttention2': '779.38 ms', 'Qwen2MoeRMSNorm': '90.4 ms'}
gpu009: depth 5:
gpu009: params - {'Linear': '229.38 K', 'Qwen2MoeRotaryEmbedding': '0', 'ModuleList': '0'}
gpu009: MACs - {'ModuleList': '202.04 TMACs', 'Qwen2MoeMLP': '202.04 TMACs', 'Linear': '27.15 TMACs'}
gpu009: fwd latency - {'ModuleList': '16.49 s', 'Qwen2MoeMLP': '1.14 s', 'Linear': '224.08 ms'}
gpu009: depth 6:
gpu009: params - {'Qwen2MoeMLP': '0', 'Linear': '0', 'SiLU': '0'}
gpu009: MACs - {'Qwen2MoeMLP': '202.04 TMACs', 'Linear': '202.04 TMACs', 'SiLU': '0 MACs'}
gpu009: fwd latency - {'Qwen2MoeMLP': '16.49 s', 'Linear': '713.84 ms', 'SiLU': '25.52 ms'}
gpu009:
gpu009: ------------------------------ Detailed Profile per GPU ------------------------------
gpu009: Each module profile is listed after its name in the following order:
gpu009: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
gpu009:
gpu009: Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
gpu009: 2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
gpu009: 3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
gpu009:
gpu009: Qwen2MoeForCausalLM(
gpu009: 433.66 K = 100% Params, 449.07 TMACs = 100% MACs, 23.73 s = 100% latency, 37.86 TFLOPS
gpu009: (model): Qwen2MoeModel(
gpu009: 433.66 K = 100% Params, 431.22 TMACs = 96.03% MACs, 23.59 s = 99.43% latency, 36.56 TFLOPS
gpu009: (embed_tokens): Embedding(0 = 0% Params, 0 MACs = 0% MACs, 1.48 ms = 0.01% latency, 0 FLOPS, 151936, 3584)
gpu009: (layers): ModuleList(
gpu009: (0): Qwen2MoeDecoderLayer(
gpu009: 15.36 K = 3.54% Params, 15.4 TMACs = 3.43% MACs, 1.56 s = 6.58% latency, 19.72 TFLOPS
gpu009: (self_attn): Qwen2MoeFlashAttention2(
gpu009: 4.61 K = 1.06% Params, 962.07 GMACs = 0.21% MACs, 29.87 ms = 0.13% latency, 64.41 TFLOPS
gpu009: (q_proj): Linear(3.58 K = 0.83% Params, 420.91 GMACs = 0.09% MACs, 2.03 ms = 0.01% latency, 415.15 TFLOPS, in_features=3584, out_features=3584, bias=True)
gpu009: (k_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.21 ms = 0.01% latency, 99.37 TFLOPS, in_features=3584, out_features=512, bias=True)
gpu009: (v_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.19 ms = 0.01% latency, 100.92 TFLOPS, in_features=3584, out_features=512, bias=True)
gpu009: (o_proj): Linear(0 = 0% Params, 420.91 GMACs = 0.09% MACs, 1.61 ms = 0.01% latency, 522 TFLOPS, in_features=3584, out_features=3584, bias=False)
gpu009: (rotary_emb): Qwen2MoeRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 1.01 ms = 0% latency, 0 FLOPS)
gpu009: )
gpu009: (mlp): Qwen2MoeSparseMoeBlock(
gpu009: 3.58 K = 0.83% Params, 14.44 TMACs = 3.22% MACs, 1.47 s = 6.22% latency, 19.58 TFLOPS
gpu009: (gate): Linear(0 = 0% Params, 7.52 GMACs = 0% MACs, 1.14 ms = 0% latency, 13.15 TFLOPS, in_features=3584, out_features=64, bias=False)
gpu009: (experts): ModuleList(
gpu009: (0): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 97.55 GMACs = 0.02% MACs, 20.47 ms = 0.09% latency, 9.53 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 32.52 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 56.58 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 32.52 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 56.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 32.52 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 58.28 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.02 ms = 0% latency, 4.45 GFLOPS)
gpu009: )
gpu009: (1): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 113.57 GMACs = 0.03% MACs, 19.76 ms = 0.08% latency, 11.5 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 37.86 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 67.78 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 37.86 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 67.39 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 37.86 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 66.19 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 986.58 us = 0% latency, 5.35 GFLOPS)
gpu009: )
gpu009: (2): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 117.09 GMACs = 0.03% MACs, 20.56 ms = 0.09% latency, 11.39 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 39.03 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 69.57 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 39.03 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 68.38 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 39.03 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 69.6 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 996.11 us = 0% latency, 5.47 GFLOPS)
gpu009: )
gpu009: (3): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 124.58 GMACs = 0.03% MACs, 19.26 ms = 0.08% latency, 12.94 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 41.53 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 74.72 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 41.53 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 73.49 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 41.53 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 74.21 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 965.12 us = 0% latency, 6 GFLOPS)
gpu009: )
gpu009: (4): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 97.16 GMACs = 0.02% MACs, 20.14 ms = 0.08% latency, 9.65 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 32.39 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 57.43 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 32.39 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 57.6 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 32.39 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 57.38 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 969.89 us = 0% latency, 4.66 GFLOPS)
gpu009: )
gpu009: (5): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 119.35 GMACs = 0.03% MACs, 123.01 ms = 0.52% latency, 1.94 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 39.78 GMACs = 0.01% MACs, 1.16 ms = 0% latency, 68.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 39.78 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 70.66 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 39.78 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 71.29 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 967.74 us = 0% latency, 5.74 GFLOPS)
gpu009: )
gpu009: (6): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 106.63 GMACs = 0.02% MACs, 177.81 ms = 0.75% latency, 1.2 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 35.54 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 62.35 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 35.54 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 63.63 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 35.54 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 63.25 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 981.57 us = 0% latency, 5.05 GFLOPS)
gpu009: )
gpu009: (7): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 104.98 GMACs = 0.02% MACs, 85.97 ms = 0.36% latency, 2.44 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 61.93 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 62.02 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 62.59 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 977.99 us = 0% latency, 4.99 GFLOPS)
gpu009: )
gpu009: (8): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 96.28 GMACs = 0.02% MACs, 41.67 ms = 0.18% latency, 4.62 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 32.09 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 56.35 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 32.09 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 57.4 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 32.09 GMACs = 0.01% MACs, 1.56 ms = 0.01% latency, 41.03 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 973.7 us = 0% latency, 4.6 GFLOPS)
gpu009: )
gpu009: (9): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 108.72 GMACs = 0.02% MACs, 24.45 ms = 0.1% latency, 8.89 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 64.22 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 64.75 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 65.23 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 972.99 us = 0% latency, 5.2 GFLOPS)
gpu009: )
gpu009: (10): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 104.27 GMACs = 0.02% MACs, 21.43 ms = 0.09% latency, 9.73 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 61.52 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 61.86 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 61.52 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 965.6 us = 0% latency, 5.02 GFLOPS)
gpu009: )
gpu009: (11): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 107.73 GMACs = 0.02% MACs, 20.38 ms = 0.09% latency, 10.57 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 35.91 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 63.85 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 35.91 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 63.9 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 35.91 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 64.91 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 970.6 us = 0% latency, 5.16 GFLOPS)
gpu009: )
gpu009: (12): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 115.22 GMACs = 0.03% MACs, 11.8 ms = 0.05% latency, 19.52 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 38.41 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 68.36 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 38.41 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 69 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 38.41 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 69.08 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 980.85 us = 0% latency, 5.46 GFLOPS)
gpu009: )
gpu009: (13): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 107.9 GMACs = 0.02% MACs, 9.29 ms = 0.04% latency, 23.23 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 35.97 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 64.34 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 35.97 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 65.42 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 35.97 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 64.97 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 981.81 us = 0% latency, 5.11 GFLOPS)
gpu009: )
gpu009: (14): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 96.89 GMACs = 0.02% MACs, 9.83 ms = 0.04% latency, 19.72 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 32.3 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 57.7 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 32.3 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 58.17 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 32.3 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 58.24 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 987.29 us = 0% latency, 4.56 GFLOPS)
gpu009: )
gpu009: (15): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 116.6 GMACs = 0.03% MACs, 9.51 ms = 0.04% latency, 24.53 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 38.87 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 69.68 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 38.87 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 70.61 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 38.87 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 69.31 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 982.76 us = 0% latency, 5.52 GFLOPS)
gpu009: )
gpu009: (16): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 104.54 GMACs = 0.02% MACs, 9.83 ms = 0.04% latency, 21.27 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 34.85 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 62.37 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 34.85 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 63.23 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 34.85 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 61.31 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 985.15 us = 0% latency, 4.93 GFLOPS)
gpu009: )
gpu009: (17): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 100.08 GMACs = 0.02% MACs, 9.41 ms = 0.04% latency, 21.27 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 33.36 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 59.1 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 33.36 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 60.57 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 33.36 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 58.98 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 984.91 us = 0% latency, 4.73 GFLOPS)
gpu009: )
gpu009: (18): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 101.57 GMACs = 0.02% MACs, 10.06 ms = 0.04% latency, 20.18 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 33.86 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 59.9 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 33.86 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 60.07 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 33.86 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 58.79 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 976.32 us = 0% latency, 4.84 GFLOPS)
gpu009: )
gpu009: (19): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 109.55 GMACs = 0.02% MACs, 9.31 ms = 0.04% latency, 23.53 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 36.52 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 64.48 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 36.52 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 65.75 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 36.52 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 65.82 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 973.46 us = 0% latency, 5.23 GFLOPS)
gpu009: )
gpu009: (20): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 109.94 GMACs = 0.02% MACs, 10.15 ms = 0.04% latency, 21.67 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 36.65 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 65.43 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 36.65 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 65.47 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 36.65 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 65.16 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 969.89 us = 0% latency, 5.27 GFLOPS)
gpu009: )
gpu009: (21): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 114.45 GMACs = 0.03% MACs, 9.45 ms = 0.04% latency, 24.23 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 38.15 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 68.1 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 38.15 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 68.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 38.15 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 67.69 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 971.32 us = 0% latency, 5.48 GFLOPS)
gpu009: )
gpu009: (22): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 107.46 GMACs = 0.02% MACs, 9.67 ms = 0.04% latency, 22.22 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 35.82 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 63.44 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 35.82 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 64 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 35.82 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 64.03 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 972.27 us = 0% latency, 5.14 GFLOPS)
gpu009: )
gpu009: (23): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 104.32 GMACs = 0.02% MACs, 9.43 ms = 0.04% latency, 22.12 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 34.77 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 62.48 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 34.77 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 63.04 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 34.77 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 62.61 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 981.09 us = 0% latency, 4.94 GFLOPS)
gpu009: )
gpu009: (24): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 123.53 GMACs = 0.03% MACs, 9.72 ms = 0.04% latency, 25.41 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 74.4 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 75.14 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 73.62 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 969.41 us = 0% latency, 5.93 GFLOPS)
gpu009: )
gpu009: (25): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 108.72 GMACs = 0.02% MACs, 9.49 ms = 0.04% latency, 22.92 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 64.66 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 65.93 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 65.8 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 971.08 us = 0% latency, 5.21 GFLOPS)
gpu009: )
gpu009: (26): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 102.06 GMACs = 0.02% MACs, 10.03 ms = 0.04% latency, 20.35 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 34.02 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 60.54 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 34.02 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 61.65 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 34.02 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 59.7 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 993.01 us = 0% latency, 4.78 GFLOPS)
gpu009: )
gpu009: (27): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 92.48 GMACs = 0.02% MACs, 9.55 ms = 0.04% latency, 19.36 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 30.83 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 53.82 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 30.83 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 56.26 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 30.83 GMACs = 0.01% MACs, 1.16 ms = 0% latency, 53.29 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 988.72 us = 0% latency, 4.35 GFLOPS)
gpu009: )
gpu009: (28): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 108.83 GMACs = 0.02% MACs, 6.6 ms = 0.03% latency, 32.99 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 36.28 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 65.49 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 36.28 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 65.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 36.28 GMACs = 0.01% MACs, 1.08 ms = 0% latency, 67.19 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 956.06 us = 0% latency, 5.29 GFLOPS)
gpu009: )
gpu009: (29): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 117.97 GMACs = 0.03% MACs, 6.04 ms = 0.03% latency, 39.09 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 39.32 GMACs = 0.01% MACs, 1.08 ms = 0% latency, 73.16 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 39.32 GMACs = 0.01% MACs, 1.07 ms = 0% latency, 73.22 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 39.32 GMACs = 0.01% MACs, 1.06 ms = 0% latency, 74.48 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 936.75 us = 0% latency, 5.86 GFLOPS)
gpu009: )
gpu009: (30): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 123.86 GMACs = 0.03% MACs, 20.87 ms = 0.09% latency, 11.87 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 41.29 GMACs = 0.01% MACs, 1.07 ms = 0% latency, 76.93 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 41.29 GMACs = 0.01% MACs, 1.05 ms = 0% latency, 78.82 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 41.29 GMACs = 0.01% MACs, 1.08 ms = 0% latency, 76.19 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 912.43 us = 0% latency, 6.31 GFLOPS)
gpu009: )
gpu009: (31): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 119.02 GMACs = 0.03% MACs, 5.93 ms = 0.02% latency, 40.17 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 39.67 GMACs = 0.01% MACs, 1.07 ms = 0% latency, 73.92 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 39.67 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 76.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 39.67 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 77.45 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 906.47 us = 0% latency, 6.11 GFLOPS)
gpu009: )
gpu009: (32): Qwen2MoeMLP(
gpu009: 0 = 0% Params, 101.9 GMACs = 0.02% MACs, 6.17 ms = 0.03% latency, 33.04 TFLOPS
gpu009: (gate_proj): Linear(0 = 0% Params, 33.97 GMACs = 0.01% MACs, 1.05 ms = 0% latency, 64.8 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (up_proj): Linear(0 = 0% Params, 33.97 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 66.45 TFLOPS, in_features=3584, out_features=2560, bias=False)
gpu009: (down_proj): Linear(0 = 0% Params, 33.97 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 66.6 TFLOPS, in_features=2560, out_features=3584, bias=False)
gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 881.67 us = 0% latency, 5.37 GFLOPS)
  318. gpu009: )
  319. gpu009: (33): Qwen2MoeMLP(
  320. gpu009: 0 = 0% Params, 135.7 GMACs = 0.03% MACs, 19.29 ms = 0.08% latency, 14.07 TFLOPS
  321. gpu009: (gate_proj): Linear(0 = 0% Params, 45.23 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 88.53 TFLOPS, in_features=3584, out_features=2560, bias=False)
  322. gpu009: (up_proj): Linear(0 = 0% Params, 45.23 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 89.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  323. gpu009: (down_proj): Linear(0 = 0% Params, 45.23 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 89.74 TFLOPS, in_features=2560, out_features=3584, bias=False)
  324. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 906.23 us = 0% latency, 6.96 GFLOPS)
  325. gpu009: )
  326. gpu009: (34): Qwen2MoeMLP(
  327. gpu009: 0 = 0% Params, 96.45 GMACs = 0.02% MACs, 5.97 ms = 0.03% latency, 32.32 TFLOPS
  328. gpu009: (gate_proj): Linear(0 = 0% Params, 32.15 GMACs = 0.01% MACs, 995.16 us = 0% latency, 64.61 TFLOPS, in_features=3584, out_features=2560, bias=False)
  329. gpu009: (up_proj): Linear(0 = 0% Params, 32.15 GMACs = 0.01% MACs, 989.2 us = 0% latency, 65 TFLOPS, in_features=3584, out_features=2560, bias=False)
  330. gpu009: (down_proj): Linear(0 = 0% Params, 32.15 GMACs = 0.01% MACs, 978.23 us = 0% latency, 65.73 TFLOPS, in_features=2560, out_features=3584, bias=False)
  331. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 849.96 us = 0% latency, 5.28 GFLOPS)
  332. gpu009: )
  333. gpu009: (35): Qwen2MoeMLP(
  334. gpu009: 0 = 0% Params, 131.29 GMACs = 0.03% MACs, 5.62 ms = 0.02% latency, 46.76 TFLOPS
  335. gpu009: (gate_proj): Linear(0 = 0% Params, 43.76 GMACs = 0.01% MACs, 989.44 us = 0% latency, 88.46 TFLOPS, in_features=3584, out_features=2560, bias=False)
  336. gpu009: (up_proj): Linear(0 = 0% Params, 43.76 GMACs = 0.01% MACs, 957.49 us = 0% latency, 91.42 TFLOPS, in_features=3584, out_features=2560, bias=False)
  337. gpu009: (down_proj): Linear(0 = 0% Params, 43.76 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 85.98 TFLOPS, in_features=2560, out_features=3584, bias=False)
  338. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 827.79 us = 0% latency, 7.38 GFLOPS)
  339. gpu009: )
  340. gpu009: (36): Qwen2MoeMLP(
  341. gpu009: 0 = 0% Params, 120.95 GMACs = 0.03% MACs, 9.07 ms = 0.04% latency, 26.66 TFLOPS
  342. gpu009: (gate_proj): Linear(0 = 0% Params, 40.32 GMACs = 0.01% MACs, 1 ms = 0% latency, 80.29 TFLOPS, in_features=3584, out_features=2560, bias=False)
  343. gpu009: (up_proj): Linear(0 = 0% Params, 40.32 GMACs = 0.01% MACs, 986.34 us = 0% latency, 81.75 TFLOPS, in_features=3584, out_features=2560, bias=False)
  344. gpu009: (down_proj): Linear(0 = 0% Params, 40.32 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 78.98 TFLOPS, in_features=2560, out_features=3584, bias=False)
  345. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 859.98 us = 0% latency, 6.54 GFLOPS)
  346. gpu009: )
  347. gpu009: (37): Qwen2MoeMLP(
  348. gpu009: 0 = 0% Params, 119.9 GMACs = 0.03% MACs, 9.09 ms = 0.04% latency, 26.38 TFLOPS
  349. gpu009: (gate_proj): Linear(0 = 0% Params, 39.97 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 78.11 TFLOPS, in_features=3584, out_features=2560, bias=False)
  350. gpu009: (up_proj): Linear(0 = 0% Params, 39.97 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 77.97 TFLOPS, in_features=3584, out_features=2560, bias=False)
  351. gpu009: (down_proj): Linear(0 = 0% Params, 39.97 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 76.88 TFLOPS, in_features=2560, out_features=3584, bias=False)
  352. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 883.1 us = 0% latency, 6.31 GFLOPS)
  353. gpu009: )
  354. gpu009: (38): Qwen2MoeMLP(
  355. gpu009: 0 = 0% Params, 132.29 GMACs = 0.03% MACs, 12.38 ms = 0.05% latency, 21.38 TFLOPS
  356. gpu009: (gate_proj): Linear(0 = 0% Params, 44.1 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 84.8 TFLOPS, in_features=3584, out_features=2560, bias=False)
  357. gpu009: (up_proj): Linear(0 = 0% Params, 44.1 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 87.49 TFLOPS, in_features=3584, out_features=2560, bias=False)
  358. gpu009: (down_proj): Linear(0 = 0% Params, 44.1 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 86.53 TFLOPS, in_features=2560, out_features=3584, bias=False)
  359. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 868.56 us = 0% latency, 7.08 GFLOPS)
  360. gpu009: )
  361. gpu009: (39): Qwen2MoeMLP(
  362. gpu009: 0 = 0% Params, 126.56 GMACs = 0.03% MACs, 20.56 ms = 0.09% latency, 12.31 TFLOPS
  363. gpu009: (gate_proj): Linear(0 = 0% Params, 42.19 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 82.78 TFLOPS, in_features=3584, out_features=2560, bias=False)
  364. gpu009: (up_proj): Linear(0 = 0% Params, 42.19 GMACs = 0.01% MACs, 984.43 us = 0% latency, 85.71 TFLOPS, in_features=3584, out_features=2560, bias=False)
  365. gpu009: (down_proj): Linear(0 = 0% Params, 42.19 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 82.84 TFLOPS, in_features=2560, out_features=3584, bias=False)
  366. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 869.27 us = 0% latency, 6.77 GFLOPS)
  367. gpu009: )
  368. gpu009: (40): Qwen2MoeMLP(
  369. gpu009: 0 = 0% Params, 120.01 GMACs = 0.03% MACs, 15.5 ms = 0.07% latency, 15.49 TFLOPS
  370. gpu009: (gate_proj): Linear(0 = 0% Params, 40 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 78.39 TFLOPS, in_features=3584, out_features=2560, bias=False)
  371. gpu009: (up_proj): Linear(0 = 0% Params, 40 GMACs = 0.01% MACs, 998.97 us = 0% latency, 80.09 TFLOPS, in_features=3584, out_features=2560, bias=False)
  372. gpu009: (down_proj): Linear(0 = 0% Params, 40 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 77.62 TFLOPS, in_features=2560, out_features=3584, bias=False)
  373. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 876.19 us = 0% latency, 6.37 GFLOPS)
  374. gpu009: )
  375. gpu009: (41): Qwen2MoeMLP(
  376. gpu009: 0 = 0% Params, 112.36 GMACs = 0.03% MACs, 18.06 ms = 0.08% latency, 12.44 TFLOPS
  377. gpu009: (gate_proj): Linear(0 = 0% Params, 37.45 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 72.96 TFLOPS, in_features=3584, out_features=2560, bias=False)
  378. gpu009: (up_proj): Linear(0 = 0% Params, 37.45 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 73.65 TFLOPS, in_features=3584, out_features=2560, bias=False)
  379. gpu009: (down_proj): Linear(0 = 0% Params, 37.45 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 72.89 TFLOPS, in_features=2560, out_features=3584, bias=False)
  380. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 875.47 us = 0% latency, 5.97 GFLOPS)
  381. gpu009: )
  382. gpu009: (42): Qwen2MoeMLP(
  383. gpu009: 0 = 0% Params, 133.72 GMACs = 0.03% MACs, 9.34 ms = 0.04% latency, 28.64 TFLOPS
  384. gpu009: (gate_proj): Linear(0 = 0% Params, 44.57 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 87.12 TFLOPS, in_features=3584, out_features=2560, bias=False)
  385. gpu009: (up_proj): Linear(0 = 0% Params, 44.57 GMACs = 0.01% MACs, 998.02 us = 0% latency, 89.32 TFLOPS, in_features=3584, out_features=2560, bias=False)
  386. gpu009: (down_proj): Linear(0 = 0% Params, 44.57 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 87.05 TFLOPS, in_features=2560, out_features=3584, bias=False)
  387. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 872.85 us = 0% latency, 7.12 GFLOPS)
  388. gpu009: )
  389. gpu009: (43): Qwen2MoeMLP(
  390. gpu009: 0 = 0% Params, 126.01 GMACs = 0.03% MACs, 9.34 ms = 0.04% latency, 26.98 TFLOPS
  391. gpu009: (gate_proj): Linear(0 = 0% Params, 42 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 82.44 TFLOPS, in_features=3584, out_features=2560, bias=False)
  392. gpu009: (up_proj): Linear(0 = 0% Params, 42 GMACs = 0.01% MACs, 982.28 us = 0% latency, 85.52 TFLOPS, in_features=3584, out_features=2560, bias=False)
  393. gpu009: (down_proj): Linear(0 = 0% Params, 42 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 82.38 TFLOPS, in_features=2560, out_features=3584, bias=False)
  394. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 895.98 us = 0% latency, 6.54 GFLOPS)
  395. gpu009: )
  396. gpu009: (44): Qwen2MoeMLP(
  397. gpu009: 0 = 0% Params, 121.44 GMACs = 0.03% MACs, 9.39 ms = 0.04% latency, 25.88 TFLOPS
  398. gpu009: (gate_proj): Linear(0 = 0% Params, 40.48 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 79.92 TFLOPS, in_features=3584, out_features=2560, bias=False)
  399. gpu009: (up_proj): Linear(0 = 0% Params, 40.48 GMACs = 0.01% MACs, 993.01 us = 0% latency, 81.53 TFLOPS, in_features=3584, out_features=2560, bias=False)
  400. gpu009: (down_proj): Linear(0 = 0% Params, 40.48 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 77.65 TFLOPS, in_features=2560, out_features=3584, bias=False)
  401. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 880 us = 0% latency, 6.42 GFLOPS)
  402. gpu009: )
  403. gpu009: (45): Qwen2MoeMLP(
  404. gpu009: 0 = 0% Params, 114.23 GMACs = 0.03% MACs, 9.05 ms = 0.04% latency, 25.23 TFLOPS
  405. gpu009: (gate_proj): Linear(0 = 0% Params, 38.08 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 74.02 TFLOPS, in_features=3584, out_features=2560, bias=False)
  406. gpu009: (up_proj): Linear(0 = 0% Params, 38.08 GMACs = 0.01% MACs, 993.49 us = 0% latency, 76.65 TFLOPS, in_features=3584, out_features=2560, bias=False)
  407. gpu009: (down_proj): Linear(0 = 0% Params, 38.08 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 75.39 TFLOPS, in_features=2560, out_features=3584, bias=False)
  408. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 882.86 us = 0% latency, 6.02 GFLOPS)
  409. gpu009: )
  410. gpu009: (46): Qwen2MoeMLP(
  411. gpu009: 0 = 0% Params, 123.53 GMACs = 0.03% MACs, 9.4 ms = 0.04% latency, 26.28 TFLOPS
  412. gpu009: (gate_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 80.82 TFLOPS, in_features=3584, out_features=2560, bias=False)
  413. gpu009: (up_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 81.6 TFLOPS, in_features=3584, out_features=2560, bias=False)
  414. gpu009: (down_proj): Linear(0 = 0% Params, 41.18 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 80.48 TFLOPS, in_features=2560, out_features=3584, bias=False)
  415. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 870.7 us = 0% latency, 6.6 GFLOPS)
  416. gpu009: )
  417. gpu009: (47): Qwen2MoeMLP(
  418. gpu009: 0 = 0% Params, 105.37 GMACs = 0.02% MACs, 9.05 ms = 0.04% latency, 23.28 TFLOPS
  419. gpu009: (gate_proj): Linear(0 = 0% Params, 35.12 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 69.24 TFLOPS, in_features=3584, out_features=2560, bias=False)
  420. gpu009: (up_proj): Linear(0 = 0% Params, 35.12 GMACs = 0.01% MACs, 998.26 us = 0% latency, 70.37 TFLOPS, in_features=3584, out_features=2560, bias=False)
  421. gpu009: (down_proj): Linear(0 = 0% Params, 35.12 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 69.02 TFLOPS, in_features=2560, out_features=3584, bias=False)
  422. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 870.23 us = 0% latency, 5.63 GFLOPS)
  423. gpu009: )
  424. gpu009: (48): Qwen2MoeMLP(
  425. gpu009: 0 = 0% Params, 118.91 GMACs = 0.03% MACs, 8.6 ms = 0.04% latency, 27.66 TFLOPS
  426. gpu009: (gate_proj): Linear(0 = 0% Params, 39.64 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 77.58 TFLOPS, in_features=3584, out_features=2560, bias=False)
  427. gpu009: (up_proj): Linear(0 = 0% Params, 39.64 GMACs = 0.01% MACs, 997.54 us = 0% latency, 79.47 TFLOPS, in_features=3584, out_features=2560, bias=False)
  428. gpu009: (down_proj): Linear(0 = 0% Params, 39.64 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 78.79 TFLOPS, in_features=2560, out_features=3584, bias=False)
  429. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 866.65 us = 0% latency, 6.38 GFLOPS)
  430. gpu009: )
  431. gpu009: (49): Qwen2MoeMLP(
  432. gpu009: 0 = 0% Params, 110.43 GMACs = 0.02% MACs, 8.55 ms = 0.04% latency, 25.83 TFLOPS
  433. gpu009: (gate_proj): Linear(0 = 0% Params, 36.81 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 71.91 TFLOPS, in_features=3584, out_features=2560, bias=False)
  434. gpu009: (up_proj): Linear(0 = 0% Params, 36.81 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 72.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  435. gpu009: (down_proj): Linear(0 = 0% Params, 36.81 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 72.52 TFLOPS, in_features=2560, out_features=3584, bias=False)
  436. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 894.55 us = 0% latency, 5.74 GFLOPS)
  437. gpu009: )
  438. gpu009: (50): Qwen2MoeMLP(
  439. gpu009: 0 = 0% Params, 111.75 GMACs = 0.02% MACs, 8.58 ms = 0.04% latency, 26.06 TFLOPS
  440. gpu009: (gate_proj): Linear(0 = 0% Params, 37.25 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 72.42 TFLOPS, in_features=3584, out_features=2560, bias=False)
  441. gpu009: (up_proj): Linear(0 = 0% Params, 37.25 GMACs = 0.01% MACs, 987.53 us = 0% latency, 75.44 TFLOPS, in_features=3584, out_features=2560, bias=False)
  442. gpu009: (down_proj): Linear(0 = 0% Params, 37.25 GMACs = 0.01% MACs, 993.97 us = 0% latency, 74.95 TFLOPS, in_features=2560, out_features=3584, bias=False)
  443. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 868.8 us = 0% latency, 5.98 GFLOPS)
  444. gpu009: )
  445. gpu009: (51): Qwen2MoeMLP(
  446. gpu009: 0 = 0% Params, 113.46 GMACs = 0.03% MACs, 8.57 ms = 0.04% latency, 26.49 TFLOPS
  447. gpu009: (gate_proj): Linear(0 = 0% Params, 37.82 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 74.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  448. gpu009: (up_proj): Linear(0 = 0% Params, 37.82 GMACs = 0.01% MACs, 988.48 us = 0% latency, 76.52 TFLOPS, in_features=3584, out_features=2560, bias=False)
  449. gpu009: (down_proj): Linear(0 = 0% Params, 37.82 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 74.4 TFLOPS, in_features=2560, out_features=3584, bias=False)
  450. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 869.51 us = 0% latency, 6.07 GFLOPS)
  451. gpu009: )
  452. gpu009: (52): Qwen2MoeMLP(
  453. gpu009: 0 = 0% Params, 134.82 GMACs = 0.03% MACs, 8.65 ms = 0.04% latency, 31.17 TFLOPS
  454. gpu009: (gate_proj): Linear(0 = 0% Params, 44.94 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 88.02 TFLOPS, in_features=3584, out_features=2560, bias=False)
  455. gpu009: (up_proj): Linear(0 = 0% Params, 44.94 GMACs = 0.01% MACs, 989.44 us = 0% latency, 90.84 TFLOPS, in_features=3584, out_features=2560, bias=False)
  456. gpu009: (down_proj): Linear(0 = 0% Params, 44.94 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 88.97 TFLOPS, in_features=2560, out_features=3584, bias=False)
  457. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 875 us = 0% latency, 7.17 GFLOPS)
  458. gpu009: )
  459. gpu009: (53): Qwen2MoeMLP(
  460. gpu009: 0 = 0% Params, 103 GMACs = 0.02% MACs, 8.52 ms = 0.04% latency, 24.18 TFLOPS
  461. gpu009: (gate_proj): Linear(0 = 0% Params, 34.33 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 66.47 TFLOPS, in_features=3584, out_features=2560, bias=False)
  462. gpu009: (up_proj): Linear(0 = 0% Params, 34.33 GMACs = 0.01% MACs, 991.11 us = 0% latency, 69.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
  463. gpu009: (down_proj): Linear(0 = 0% Params, 34.33 GMACs = 0.01% MACs, 1 ms = 0% latency, 68.56 TFLOPS, in_features=2560, out_features=3584, bias=False)
  464. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 860.21 us = 0% latency, 5.57 GFLOPS)
  465. gpu009: )
  466. gpu009: (54): Qwen2MoeMLP(
  467. gpu009: 0 = 0% Params, 109.33 GMACs = 0.02% MACs, 12.27 ms = 0.05% latency, 17.82 TFLOPS
  468. gpu009: (gate_proj): Linear(0 = 0% Params, 36.44 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 72.12 TFLOPS, in_features=3584, out_features=2560, bias=False)
  469. gpu009: (up_proj): Linear(0 = 0% Params, 36.44 GMACs = 0.01% MACs, 990.87 us = 0% latency, 73.56 TFLOPS, in_features=3584, out_features=2560, bias=False)
  470. gpu009: (down_proj): Linear(0 = 0% Params, 36.44 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 71.8 TFLOPS, in_features=2560, out_features=3584, bias=False)
  471. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 865.22 us = 0% latency, 5.88 GFLOPS)
  472. gpu009: )
  473. gpu009: (55): Qwen2MoeMLP(
  474. gpu009: 0 = 0% Params, 108.72 GMACs = 0.02% MACs, 8.58 ms = 0.04% latency, 25.34 TFLOPS
  475. gpu009: (gate_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 71.77 TFLOPS, in_features=3584, out_features=2560, bias=False)
  476. gpu009: (up_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 994.92 us = 0% latency, 72.85 TFLOPS, in_features=3584, out_features=2560, bias=False)
  477. gpu009: (down_proj): Linear(0 = 0% Params, 36.24 GMACs = 0.01% MACs, 1 ms = 0% latency, 72.3 TFLOPS, in_features=2560, out_features=3584, bias=False)
  478. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 872.61 us = 0% latency, 5.79 GFLOPS)
  479. gpu009: )
  480. gpu009: (56): Qwen2MoeMLP(
  481. gpu009: 0 = 0% Params, 104.27 GMACs = 0.02% MACs, 8.45 ms = 0.04% latency, 24.69 TFLOPS
  482. gpu009: (gate_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 68.78 TFLOPS, in_features=3584, out_features=2560, bias=False)
  483. gpu009: (up_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 989.68 us = 0% latency, 70.24 TFLOPS, in_features=3584, out_features=2560, bias=False)
  484. gpu009: (down_proj): Linear(0 = 0% Params, 34.76 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 68.65 TFLOPS, in_features=2560, out_features=3584, bias=False)
  485. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 871.9 us = 0% latency, 5.56 GFLOPS)
  486. gpu009: )
  487. gpu009: (57): Qwen2MoeMLP(
  488. gpu009: 0 = 0% Params, 104.98 GMACs = 0.02% MACs, 8.56 ms = 0.04% latency, 24.52 TFLOPS
  489. gpu009: (gate_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 69.27 TFLOPS, in_features=3584, out_features=2560, bias=False)
  490. gpu009: (up_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 983.72 us = 0% latency, 71.15 TFLOPS, in_features=3584, out_features=2560, bias=False)
  491. gpu009: (down_proj): Linear(0 = 0% Params, 34.99 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 68.75 TFLOPS, in_features=2560, out_features=3584, bias=False)
  492. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 859.02 us = 0% latency, 5.68 GFLOPS)
  493. gpu009: )
  494. gpu009: (58): Qwen2MoeMLP(
  495. gpu009: 0 = 0% Params, 118.8 GMACs = 0.03% MACs, 8.67 ms = 0.04% latency, 27.42 TFLOPS
  496. gpu009: (gate_proj): Linear(0 = 0% Params, 39.6 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 77.13 TFLOPS, in_features=3584, out_features=2560, bias=False)
  497. gpu009: (up_proj): Linear(0 = 0% Params, 39.6 GMACs = 0.01% MACs, 997.3 us = 0% latency, 79.41 TFLOPS, in_features=3584, out_features=2560, bias=False)
  498. gpu009: (down_proj): Linear(0 = 0% Params, 39.6 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 78.59 TFLOPS, in_features=2560, out_features=3584, bias=False)
  499. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 866.89 us = 0% latency, 6.37 GFLOPS)
  500. gpu009: )
  501. gpu009: (59): Qwen2MoeMLP(
  502. gpu009: 0 = 0% Params, 114.56 GMACs = 0.03% MACs, 8.8 ms = 0.04% latency, 26.03 TFLOPS
  503. gpu009: (gate_proj): Linear(0 = 0% Params, 38.19 GMACs = 0.01% MACs, 1 ms = 0% latency, 76.18 TFLOPS, in_features=3584, out_features=2560, bias=False)
  504. gpu009: (up_proj): Linear(0 = 0% Params, 38.19 GMACs = 0.01% MACs, 982.28 us = 0% latency, 77.75 TFLOPS, in_features=3584, out_features=2560, bias=False)
  505. gpu009: (down_proj): Linear(0 = 0% Params, 38.19 GMACs = 0.01% MACs, 1 ms = 0% latency, 76.09 TFLOPS, in_features=2560, out_features=3584, bias=False)
  506. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 864.98 us = 0% latency, 6.16 GFLOPS)
  507. gpu009: )
  508. gpu009: (60): Qwen2MoeMLP(
  509. gpu009: 0 = 0% Params, 127.28 GMACs = 0.03% MACs, 8.7 ms = 0.04% latency, 29.26 TFLOPS
  510. gpu009: (gate_proj): Linear(0 = 0% Params, 42.43 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 84.31 TFLOPS, in_features=3584, out_features=2560, bias=False)
  511. gpu009: (up_proj): Linear(0 = 0% Params, 42.43 GMACs = 0.01% MACs, 980.38 us = 0% latency, 86.55 TFLOPS, in_features=3584, out_features=2560, bias=False)
  512. gpu009: (down_proj): Linear(0 = 0% Params, 42.43 GMACs = 0.01% MACs, 1 ms = 0% latency, 84.55 TFLOPS, in_features=2560, out_features=3584, bias=False)
  513. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 865.46 us = 0% latency, 6.84 GFLOPS)
  514. gpu009: )
  515. gpu009: (61): Qwen2MoeMLP(
  516. gpu009: 0 = 0% Params, 106.8 GMACs = 0.02% MACs, 8.57 ms = 0.04% latency, 24.92 TFLOPS
  517. gpu009: (gate_proj): Linear(0 = 0% Params, 35.6 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 70.25 TFLOPS, in_features=3584, out_features=2560, bias=False)
  518. gpu009: (up_proj): Linear(0 = 0% Params, 35.6 GMACs = 0.01% MACs, 1 ms = 0% latency, 70.85 TFLOPS, in_features=3584, out_features=2560, bias=False)
  519. gpu009: (down_proj): Linear(0 = 0% Params, 35.6 GMACs = 0.01% MACs, 996.83 us = 0% latency, 71.42 TFLOPS, in_features=2560, out_features=3584, bias=False)
  520. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 869.04 us = 0% latency, 5.71 GFLOPS)
  521. gpu009: )
  522. gpu009: (62): Qwen2MoeMLP(
  523. gpu009: 0 = 0% Params, 114.12 GMACs = 0.03% MACs, 8.61 ms = 0.04% latency, 26.51 TFLOPS
  524. gpu009: (gate_proj): Linear(0 = 0% Params, 38.04 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 74.89 TFLOPS, in_features=3584, out_features=2560, bias=False)
  525. gpu009: (up_proj): Linear(0 = 0% Params, 38.04 GMACs = 0.01% MACs, 987.77 us = 0% latency, 77.02 TFLOPS, in_features=3584, out_features=2560, bias=False)
  526. gpu009: (down_proj): Linear(0 = 0% Params, 38.04 GMACs = 0.01% MACs, 998.02 us = 0% latency, 76.23 TFLOPS, in_features=2560, out_features=3584, bias=False)
  527. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 857.11 us = 0% latency, 6.19 GFLOPS)
  528. gpu009: )
  529. gpu009: (63): Qwen2MoeMLP(
  530. gpu009: 0 = 0% Params, 109.72 GMACs = 0.02% MACs, 11.99 ms = 0.05% latency, 18.3 TFLOPS
  531. gpu009: (gate_proj): Linear(0 = 0% Params, 36.57 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 72.36 TFLOPS, in_features=3584, out_features=2560, bias=False)
  532. gpu009: (up_proj): Linear(0 = 0% Params, 36.57 GMACs = 0.01% MACs, 989.68 us = 0% latency, 73.91 TFLOPS, in_features=3584, out_features=2560, bias=False)
  533. gpu009: (down_proj): Linear(0 = 0% Params, 36.57 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 71.2 TFLOPS, in_features=2560, out_features=3584, bias=False)
  534. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 866.65 us = 0% latency, 5.89 GFLOPS)
  535. gpu009: )
  536. gpu009: )
  537. gpu009: (shared_expert): Qwen2MoeMLP(
  538. gpu009: 0 = 0% Params, 7.22 TMACs = 1.61% MACs, 41.08 ms = 0.17% latency, 351.26 TFLOPS
  539. gpu009: (gate_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.57 ms = 0.04% latency, 561.51 TFLOPS, in_features=3584, out_features=20480, bias=False)
  540. gpu009: (up_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.57 ms = 0.04% latency, 560.98 TFLOPS, in_features=3584, out_features=20480, bias=False)
  541. gpu009: (down_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.35 ms = 0.04% latency, 575.79 TFLOPS, in_features=20480, out_features=3584, bias=False)
  542. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 897.41 us = 0% latency, 373.9 GFLOPS)
  543. gpu009: )
  544. gpu009: (shared_expert_gate): Linear(3.58 K = 0.83% Params, 117.44 MMACs = 0% MACs, 1.07 ms = 0% latency, 219.17 GFLOPS, in_features=3584, out_features=1, bias=False)
  545. gpu009: )
  546. gpu009: (input_layernorm): Qwen2MoeRMSNorm(3.58 K = 0.83% Params, 0 MACs = 0% MACs, 1.61 ms = 0.01% latency, 0 FLOPS)
  547. gpu009: (post_attention_layernorm): Qwen2MoeRMSNorm(3.58 K = 0.83% Params, 0 MACs = 0% MACs, 1.61 ms = 0.01% latency, 0 FLOPS)
  548. gpu009: )
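A note on reading the dump above: within each `Qwen2MoeSparseMoeBlock`, the 64 routed experts show very different GMAC counts (e.g. 61.38 vs. 284.88 GMACs in layer 1) because an expert's compute is proportional to how many tokens the router assigns to it, while the `shared_expert` (intermediate size 20480) processes every token, which is why its 7.22 TMACs dwarf any single routed expert. Below is a minimal, stdlib-only sketch of the top-k routing implied by the config (`num_experts=64`, `num_experts_per_tok=8`, `norm_topk_prob=false`); it is an illustration of the mechanism, not the actual `Qwen2MoeSparseMoeBlock` implementation, which does the same thing with batched torch ops.

```python
import math

def top_k_route(router_logits, k=8):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs. With norm_topk_prob=false
    (as in the config above), the weights are softmax probabilities
    over ALL experts and are NOT renormalized over the chosen k,
    so they generally sum to less than 1.
    """
    # Numerically stable softmax over all 64 expert logits.
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Indices of the k largest routing probabilities.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, probs[i]) for i in top]

# Hypothetical token whose router logits favor experts 3 and 17;
# all other 62 experts get a logit of 0.
logits = [0.0] * 64
logits[3] = 2.0
logits[17] = 1.0
chosen = top_k_route(logits, k=8)
print(chosen[0][0])  # expert 3 gets the largest weight
```

Each token's hidden state is then sent through its 8 chosen expert MLPs, their outputs are combined with these weights, and the shared expert's output (scaled by `shared_expert_gate`) is added on top. Tokens cluster unevenly across experts, which is exactly the GMAC imbalance the profiler records per expert.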
  549. gpu009: (1): Qwen2MoeDecoderLayer(
  550. gpu009: 15.36 K = 3.54% Params, 15.4 TMACs = 3.43% MACs, 841.01 ms = 3.54% latency, 36.63 TFLOPS
  551. gpu009: (self_attn): Qwen2MoeFlashAttention2(
  552. gpu009: 4.61 K = 1.06% Params, 962.07 GMACs = 0.21% MACs, 28.23 ms = 0.12% latency, 68.17 TFLOPS
  553. gpu009: (q_proj): Linear(3.58 K = 0.83% Params, 420.91 GMACs = 0.09% MACs, 1.88 ms = 0.01% latency, 446.88 TFLOPS, in_features=3584, out_features=3584, bias=True)
  554. gpu009: (k_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.17 ms = 0% latency, 102.54 TFLOPS, in_features=3584, out_features=512, bias=True)
  555. gpu009: (v_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.17 ms = 0% latency, 102.35 TFLOPS, in_features=3584, out_features=512, bias=True)
  556. gpu009: (o_proj): Linear(0 = 0% Params, 420.91 GMACs = 0.09% MACs, 1.61 ms = 0.01% latency, 521.46 TFLOPS, in_features=3584, out_features=3584, bias=False)
  557. gpu009: (rotary_emb): Qwen2MoeRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 989.91 us = 0% latency, 0 FLOPS)
  558. gpu009: )
  559. gpu009: (mlp): Qwen2MoeSparseMoeBlock(
  560. gpu009: 3.58 K = 0.83% Params, 14.44 TMACs = 3.22% MACs, 795.47 ms = 3.35% latency, 36.3 TFLOPS
  561. gpu009: (gate): Linear(0 = 0% Params, 7.52 GMACs = 0% MACs, 1.1 ms = 0% latency, 13.63 TFLOPS, in_features=3584, out_features=64, bias=False)
  562. gpu009: (experts): ModuleList(
  563. gpu009: (0): Qwen2MoeMLP(
  564. gpu009: 0 = 0% Params, 61.38 GMACs = 0.01% MACs, 9.05 ms = 0.04% latency, 13.56 TFLOPS
  565. gpu009: (gate_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 1.12 ms = 0% latency, 36.4 TFLOPS, in_features=3584, out_features=2560, bias=False)
  566. gpu009: (up_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 1.09 ms = 0% latency, 37.6 TFLOPS, in_features=3584, out_features=2560, bias=False)
  567. gpu009: (down_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 1.14 ms = 0% latency, 36 TFLOPS, in_features=2560, out_features=3584, bias=False)
  568. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 969.17 us = 0% latency, 2.95 GFLOPS)
  569. gpu009: )
  570. gpu009: (1): Qwen2MoeMLP(
  571. gpu009: 0 = 0% Params, 128.98 GMACs = 0.03% MACs, 9.27 ms = 0.04% latency, 27.83 TFLOPS
  572. gpu009: (gate_proj): Linear(0 = 0% Params, 42.99 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 76.17 TFLOPS, in_features=3584, out_features=2560, bias=False)
  573. gpu009: (up_proj): Linear(0 = 0% Params, 42.99 GMACs = 0.01% MACs, 1.09 ms = 0% latency, 78.8 TFLOPS, in_features=3584, out_features=2560, bias=False)
  574. gpu009: (down_proj): Linear(0 = 0% Params, 42.99 GMACs = 0.01% MACs, 1.27 ms = 0.01% latency, 67.45 TFLOPS, in_features=2560, out_features=3584, bias=False)
  575. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 971.08 us = 0% latency, 6.18 GFLOPS)
  576. gpu009: )
  577. gpu009: (2): Qwen2MoeMLP(
  578. gpu009: 0 = 0% Params, 21.47 GMACs = 0% MACs, 12.64 ms = 0.05% latency, 3.4 TFLOPS
  579. gpu009: (gate_proj): Linear(0 = 0% Params, 7.16 GMACs = 0% MACs, 1.12 ms = 0% latency, 12.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  580. gpu009: (up_proj): Linear(0 = 0% Params, 7.16 GMACs = 0% MACs, 1.12 ms = 0% latency, 12.73 TFLOPS, in_features=3584, out_features=2560, bias=False)
  581. gpu009: (down_proj): Linear(0 = 0% Params, 7.16 GMACs = 0% MACs, 1.17 ms = 0% latency, 12.22 TFLOPS, in_features=2560, out_features=3584, bias=False)
  582. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 991.82 us = 0% latency, 1.01 GFLOPS)
  583. gpu009: )
  584. gpu009: (3): Qwen2MoeMLP(
  585. gpu009: 0 = 0% Params, 58.9 GMACs = 0.01% MACs, 9.17 ms = 0.04% latency, 12.85 TFLOPS
  586. gpu009: (gate_proj): Linear(0 = 0% Params, 19.63 GMACs = 0% MACs, 1.16 ms = 0% latency, 33.93 TFLOPS, in_features=3584, out_features=2560, bias=False)
  587. gpu009: (up_proj): Linear(0 = 0% Params, 19.63 GMACs = 0% MACs, 1.13 ms = 0% latency, 34.88 TFLOPS, in_features=3584, out_features=2560, bias=False)
  588. gpu009: (down_proj): Linear(0 = 0% Params, 19.63 GMACs = 0% MACs, 1.16 ms = 0% latency, 33.84 TFLOPS, in_features=2560, out_features=3584, bias=False)
  589. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 981.09 us = 0% latency, 2.79 GFLOPS)
  590. gpu009: )
  591. gpu009: (4): Qwen2MoeMLP(
  592. gpu009: 0 = 0% Params, 86.92 GMACs = 0.02% MACs, 9.2 ms = 0.04% latency, 18.91 TFLOPS
  593. gpu009: (gate_proj): Linear(0 = 0% Params, 28.97 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 50.31 TFLOPS, in_features=3584, out_features=2560, bias=False)
  594. gpu009: (up_proj): Linear(0 = 0% Params, 28.97 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 51.57 TFLOPS, in_features=3584, out_features=2560, bias=False)
  595. gpu009: (down_proj): Linear(0 = 0% Params, 28.97 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 51.84 TFLOPS, in_features=2560, out_features=3584, bias=False)
  596. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 971.56 us = 0% latency, 4.16 GFLOPS)
  597. gpu009: )
  598. gpu009: (5): Qwen2MoeMLP(
  599. gpu009: 0 = 0% Params, 128.32 GMACs = 0.03% MACs, 9.25 ms = 0.04% latency, 27.76 TFLOPS
  600. gpu009: (gate_proj): Linear(0 = 0% Params, 42.77 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 75.67 TFLOPS, in_features=3584, out_features=2560, bias=False)
  601. gpu009: (up_proj): Linear(0 = 0% Params, 42.77 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 77.58 TFLOPS, in_features=3584, out_features=2560, bias=False)
  602. gpu009: (down_proj): Linear(0 = 0% Params, 42.77 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 76.69 TFLOPS, in_features=2560, out_features=3584, bias=False)
  603. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 987.29 us = 0% latency, 6.04 GFLOPS)
  604. gpu009: )
  605. gpu009: (6): Qwen2MoeMLP(
  606. gpu009: 0 = 0% Params, 168.62 GMACs = 0.04% MACs, 9.07 ms = 0.04% latency, 37.18 TFLOPS
  607. gpu009: (gate_proj): Linear(0 = 0% Params, 56.21 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 98.7 TFLOPS, in_features=3584, out_features=2560, bias=False)
  608. gpu009: (up_proj): Linear(0 = 0% Params, 56.21 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 100.27 TFLOPS, in_features=3584, out_features=2560, bias=False)
  609. gpu009: (down_proj): Linear(0 = 0% Params, 56.21 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 99.91 TFLOPS, in_features=2560, out_features=3584, bias=False)
  610. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 975.13 us = 0% latency, 8.04 GFLOPS)
  611. gpu009: )
  612. gpu009: (7): Qwen2MoeMLP(
  613. gpu009: 0 = 0% Params, 284.88 GMACs = 0.06% MACs, 9.01 ms = 0.04% latency, 63.22 TFLOPS
  614. gpu009: (gate_proj): Linear(0 = 0% Params, 94.96 GMACs = 0.02% MACs, 1.12 ms = 0% latency, 170.07 TFLOPS, in_features=3584, out_features=2560, bias=False)
  615. gpu009: (up_proj): Linear(0 = 0% Params, 94.96 GMACs = 0.02% MACs, 1.1 ms = 0% latency, 173.14 TFLOPS, in_features=3584, out_features=2560, bias=False)
  616. gpu009: (down_proj): Linear(0 = 0% Params, 94.96 GMACs = 0.02% MACs, 1.12 ms = 0% latency, 169.52 TFLOPS, in_features=2560, out_features=3584, bias=False)
  617. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 963.45 us = 0% latency, 13.75 GFLOPS)
  618. gpu009: )
  619. gpu009: (8): Qwen2MoeMLP(
  620. gpu009: 0 = 0% Params, 48.83 GMACs = 0.01% MACs, 9.15 ms = 0.04% latency, 10.67 TFLOPS
  621. gpu009: (gate_proj): Linear(0 = 0% Params, 16.28 GMACs = 0% MACs, 1.13 ms = 0% latency, 28.88 TFLOPS, in_features=3584, out_features=2560, bias=False)
  622. gpu009: (up_proj): Linear(0 = 0% Params, 16.28 GMACs = 0% MACs, 1.09 ms = 0% latency, 29.73 TFLOPS, in_features=3584, out_features=2560, bias=False)
  623. gpu009: (down_proj): Linear(0 = 0% Params, 16.28 GMACs = 0% MACs, 1.12 ms = 0% latency, 29.09 TFLOPS, in_features=2560, out_features=3584, bias=False)
  624. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 986.1 us = 0% latency, 2.3 GFLOPS)
  625. gpu009: )
  626. gpu009: (9): Qwen2MoeMLP(
  627. gpu009: 0 = 0% Params, 168.29 GMACs = 0.04% MACs, 9.49 ms = 0.04% latency, 35.48 TFLOPS
  628. gpu009: (gate_proj): Linear(0 = 0% Params, 56.1 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 99.76 TFLOPS, in_features=3584, out_features=2560, bias=False)
  629. gpu009: (up_proj): Linear(0 = 0% Params, 56.1 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 101.88 TFLOPS, in_features=3584, out_features=2560, bias=False)
  630. gpu009: (down_proj): Linear(0 = 0% Params, 56.1 GMACs = 0.01% MACs, 1.18 ms = 0% latency, 95.39 TFLOPS, in_features=2560, out_features=3584, bias=False)
  631. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 995.4 us = 0% latency, 7.86 GFLOPS)
  632. gpu009: )
  633. gpu009: (10): Qwen2MoeMLP(
  634. gpu009: 0 = 0% Params, 37.49 GMACs = 0.01% MACs, 9.15 ms = 0.04% latency, 8.2 TFLOPS
  635. gpu009: (gate_proj): Linear(0 = 0% Params, 12.5 GMACs = 0% MACs, 1.15 ms = 0% latency, 21.75 TFLOPS, in_features=3584, out_features=2560, bias=False)
  636. gpu009: (up_proj): Linear(0 = 0% Params, 12.5 GMACs = 0% MACs, 1.1 ms = 0% latency, 22.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  637. gpu009: (down_proj): Linear(0 = 0% Params, 12.5 GMACs = 0% MACs, 1.14 ms = 0% latency, 21.88 TFLOPS, in_features=2560, out_features=3584, bias=False)
  638. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 982.52 us = 0% latency, 1.77 GFLOPS)
  639. gpu009: )
  640. gpu009: (11): Qwen2MoeMLP(
  641. gpu009: 0 = 0% Params, 126.67 GMACs = 0.03% MACs, 14.19 ms = 0.06% latency, 17.86 TFLOPS
  642. gpu009: (gate_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 1.71 ms = 0.01% latency, 49.5 TFLOPS, in_features=3584, out_features=2560, bias=False)
  643. gpu009: (up_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 1.62 ms = 0.01% latency, 52.22 TFLOPS, in_features=3584, out_features=2560, bias=False)
  644. gpu009: (down_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 1.69 ms = 0.01% latency, 49.96 TFLOPS, in_features=2560, out_features=3584, bias=False)
  645. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.59 ms = 0.01% latency, 3.69 GFLOPS)
  646. gpu009: )
  647. gpu009: (12): Qwen2MoeMLP(
  648. gpu009: 0 = 0% Params, 34.19 GMACs = 0.01% MACs, 11.55 ms = 0.05% latency, 5.92 TFLOPS
  649. gpu009: (gate_proj): Linear(0 = 0% Params, 11.4 GMACs = 0% MACs, 1.57 ms = 0.01% latency, 14.5 TFLOPS, in_features=3584, out_features=2560, bias=False)
  650. gpu009: (up_proj): Linear(0 = 0% Params, 11.4 GMACs = 0% MACs, 1.3 ms = 0.01% latency, 17.48 TFLOPS, in_features=3584, out_features=2560, bias=False)
  651. gpu009: (down_proj): Linear(0 = 0% Params, 11.4 GMACs = 0% MACs, 1.7 ms = 0.01% latency, 13.41 TFLOPS, in_features=2560, out_features=3584, bias=False)
  652. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.46 ms = 0.01% latency, 1.09 GFLOPS)
  653. gpu009: )
  654. gpu009: (13): Qwen2MoeMLP(
  655. gpu009: 0 = 0% Params, 34.57 GMACs = 0.01% MACs, 11.93 ms = 0.05% latency, 5.8 TFLOPS
  656. gpu009: (gate_proj): Linear(0 = 0% Params, 11.52 GMACs = 0% MACs, 1.7 ms = 0.01% latency, 13.59 TFLOPS, in_features=3584, out_features=2560, bias=False)
  657. gpu009: (up_proj): Linear(0 = 0% Params, 11.52 GMACs = 0% MACs, 1.66 ms = 0.01% latency, 13.9 TFLOPS, in_features=3584, out_features=2560, bias=False)
  658. gpu009: (down_proj): Linear(0 = 0% Params, 11.52 GMACs = 0% MACs, 1.69 ms = 0.01% latency, 13.6 TFLOPS, in_features=2560, out_features=3584, bias=False)
  659. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.51 ms = 0.01% latency, 1.06 GFLOPS)
  660. gpu009: )
  661. gpu009: (14): Qwen2MoeMLP(
  662. gpu009: 0 = 0% Params, 69.25 GMACs = 0.02% MACs, 9.85 ms = 0.04% latency, 14.06 TFLOPS
  663. gpu009: (gate_proj): Linear(0 = 0% Params, 23.08 GMACs = 0.01% MACs, 1.63 ms = 0.01% latency, 28.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
  664. gpu009: (up_proj): Linear(0 = 0% Params, 23.08 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 41.47 TFLOPS, in_features=3584, out_features=2560, bias=False)
  665. gpu009: (down_proj): Linear(0 = 0% Params, 23.08 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 41.06 TFLOPS, in_features=2560, out_features=3584, bias=False)
  666. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.17 ms = 0% latency, 2.74 GFLOPS)
  667. gpu009: )
  668. gpu009: (15): Qwen2MoeMLP(
  669. gpu009: 0 = 0% Params, 36.66 GMACs = 0.01% MACs, 8.89 ms = 0.04% latency, 8.25 TFLOPS
  670. gpu009: (gate_proj): Linear(0 = 0% Params, 12.22 GMACs = 0% MACs, 1.12 ms = 0% latency, 21.92 TFLOPS, in_features=3584, out_features=2560, bias=False)
  671. gpu009: (up_proj): Linear(0 = 0% Params, 12.22 GMACs = 0% MACs, 1.1 ms = 0% latency, 22.31 TFLOPS, in_features=3584, out_features=2560, bias=False)
  672. gpu009: (down_proj): Linear(0 = 0% Params, 12.22 GMACs = 0% MACs, 1.12 ms = 0% latency, 21.91 TFLOPS, in_features=2560, out_features=3584, bias=False)
  673. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 970.13 us = 0% latency, 1.76 GFLOPS)
  674. gpu009: )
  675. gpu009: (16): Qwen2MoeMLP(
  676. gpu009: 0 = 0% Params, 102.67 GMACs = 0.02% MACs, 8.98 ms = 0.04% latency, 22.86 TFLOPS
  677. gpu009: (gate_proj): Linear(0 = 0% Params, 34.22 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 61.33 TFLOPS, in_features=3584, out_features=2560, bias=False)
  678. gpu009: (up_proj): Linear(0 = 0% Params, 34.22 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 61.98 TFLOPS, in_features=3584, out_features=2560, bias=False)
  679. gpu009: (down_proj): Linear(0 = 0% Params, 34.22 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 62.17 TFLOPS, in_features=2560, out_features=3584, bias=False)
  680. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 979.19 us = 0% latency, 4.88 GFLOPS)
  681. gpu009: )
  682. gpu009: (17): Qwen2MoeMLP(
  683. gpu009: 0 = 0% Params, 98.32 GMACs = 0.02% MACs, 9.09 ms = 0.04% latency, 21.63 TFLOPS
  684. gpu009: (gate_proj): Linear(0 = 0% Params, 32.77 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 57.59 TFLOPS, in_features=3584, out_features=2560, bias=False)
  685. gpu009: (up_proj): Linear(0 = 0% Params, 32.77 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 59.49 TFLOPS, in_features=3584, out_features=2560, bias=False)
  686. gpu009: (down_proj): Linear(0 = 0% Params, 32.77 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 59.08 TFLOPS, in_features=2560, out_features=3584, bias=False)
  687. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 977.28 us = 0% latency, 4.68 GFLOPS)
  688. gpu009: )
  689. gpu009: (18): Qwen2MoeMLP(
  690. gpu009: 0 = 0% Params, 31.32 GMACs = 0.01% MACs, 9.09 ms = 0.04% latency, 6.89 TFLOPS
  691. gpu009: (gate_proj): Linear(0 = 0% Params, 10.44 GMACs = 0% MACs, 1.12 ms = 0% latency, 18.7 TFLOPS, in_features=3584, out_features=2560, bias=False)
  692. gpu009: (up_proj): Linear(0 = 0% Params, 10.44 GMACs = 0% MACs, 1.09 ms = 0% latency, 19.08 TFLOPS, in_features=3584, out_features=2560, bias=False)
  693. gpu009: (down_proj): Linear(0 = 0% Params, 10.44 GMACs = 0% MACs, 1.13 ms = 0% latency, 18.44 TFLOPS, in_features=2560, out_features=3584, bias=False)
  694. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 979.66 us = 0% latency, 1.49 GFLOPS)
  695. gpu009: )
  696. gpu009: (19): Qwen2MoeMLP(
  697. gpu009: 0 = 0% Params, 245.47 GMACs = 0.05% MACs, 9.16 ms = 0.04% latency, 53.6 TFLOPS
  698. gpu009: (gate_proj): Linear(0 = 0% Params, 81.82 GMACs = 0.02% MACs, 1.13 ms = 0% latency, 144.35 TFLOPS, in_features=3584, out_features=2560, bias=False)
  699. gpu009: (up_proj): Linear(0 = 0% Params, 81.82 GMACs = 0.02% MACs, 1.1 ms = 0% latency, 149.02 TFLOPS, in_features=3584, out_features=2560, bias=False)
  700. gpu009: (down_proj): Linear(0 = 0% Params, 81.82 GMACs = 0.02% MACs, 1.12 ms = 0% latency, 145.57 TFLOPS, in_features=2560, out_features=3584, bias=False)
  701. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 972.99 us = 0% latency, 11.73 GFLOPS)
  702. gpu009: )
  703. gpu009: (20): Qwen2MoeMLP(
  704. gpu009: 0 = 0% Params, 230.11 GMACs = 0.05% MACs, 8.97 ms = 0.04% latency, 51.33 TFLOPS
  705. gpu009: (gate_proj): Linear(0 = 0% Params, 76.7 GMACs = 0.02% MACs, 1.13 ms = 0% latency, 135.37 TFLOPS, in_features=3584, out_features=2560, bias=False)
  706. gpu009: (up_proj): Linear(0 = 0% Params, 76.7 GMACs = 0.02% MACs, 1.09 ms = 0% latency, 140.61 TFLOPS, in_features=3584, out_features=2560, bias=False)
  707. gpu009: (down_proj): Linear(0 = 0% Params, 76.7 GMACs = 0.02% MACs, 1.13 ms = 0% latency, 135.92 TFLOPS, in_features=2560, out_features=3584, bias=False)
  708. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 966.55 us = 0% latency, 11.07 GFLOPS)
  709. gpu009: )
  710. gpu009: (21): Qwen2MoeMLP(
  711. gpu009: 0 = 0% Params, 169.72 GMACs = 0.04% MACs, 9.08 ms = 0.04% latency, 37.36 TFLOPS
  712. gpu009: (gate_proj): Linear(0 = 0% Params, 56.57 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 100.27 TFLOPS, in_features=3584, out_features=2560, bias=False)
  713. gpu009: (up_proj): Linear(0 = 0% Params, 56.57 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 103.3 TFLOPS, in_features=3584, out_features=2560, bias=False)
  714. gpu009: (down_proj): Linear(0 = 0% Params, 56.57 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 100.02 TFLOPS, in_features=2560, out_features=3584, bias=False)
  715. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 967.74 us = 0% latency, 8.16 GFLOPS)
  716. gpu009: )
  717. gpu009: (22): Qwen2MoeMLP(
  718. gpu009: 0 = 0% Params, 89.95 GMACs = 0.02% MACs, 8.94 ms = 0.04% latency, 20.12 TFLOPS
  719. gpu009: (gate_proj): Linear(0 = 0% Params, 29.98 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 53.63 TFLOPS, in_features=3584, out_features=2560, bias=False)
  720. gpu009: (up_proj): Linear(0 = 0% Params, 29.98 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 54.3 TFLOPS, in_features=3584, out_features=2560, bias=False)
  721. gpu009: (down_proj): Linear(0 = 0% Params, 29.98 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 53.81 TFLOPS, in_features=2560, out_features=3584, bias=False)
  722. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 955.58 us = 0% latency, 4.38 GFLOPS)
  723. gpu009: )
  724. gpu009: (23): Qwen2MoeMLP(
  725. gpu009: 0 = 0% Params, 11.18 GMACs = 0% MACs, 9.1 ms = 0.04% latency, 2.46 TFLOPS
  726. gpu009: (gate_proj): Linear(0 = 0% Params, 3.73 GMACs = 0% MACs, 1.13 ms = 0% latency, 6.61 TFLOPS, in_features=3584, out_features=2560, bias=False)
  727. gpu009: (up_proj): Linear(0 = 0% Params, 3.73 GMACs = 0% MACs, 1.1 ms = 0% latency, 6.78 TFLOPS, in_features=3584, out_features=2560, bias=False)
  728. gpu009: (down_proj): Linear(0 = 0% Params, 3.73 GMACs = 0% MACs, 1.12 ms = 0% latency, 6.64 TFLOPS, in_features=2560, out_features=3584, bias=False)
  729. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 972.75 us = 0% latency, 534.24 MFLOPS)
  730. gpu009: )
  731. gpu009: (24): Qwen2MoeMLP(
  732. gpu009: 0 = 0% Params, 124.8 GMACs = 0.03% MACs, 13 ms = 0.05% latency, 19.21 TFLOPS
  733. gpu009: (gate_proj): Linear(0 = 0% Params, 41.6 GMACs = 0.01% MACs, 1.15 ms = 0% latency, 72.23 TFLOPS, in_features=3584, out_features=2560, bias=False)
  734. gpu009: (up_proj): Linear(0 = 0% Params, 41.6 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 75.44 TFLOPS, in_features=3584, out_features=2560, bias=False)
  735. gpu009: (down_proj): Linear(0 = 0% Params, 41.6 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 73.36 TFLOPS, in_features=2560, out_features=3584, bias=False)
  736. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 972.99 us = 0% latency, 5.96 GFLOPS)
  737. gpu009: )
  738. gpu009: (25): Qwen2MoeMLP(
  739. gpu009: 0 = 0% Params, 66.61 GMACs = 0.01% MACs, 9.17 ms = 0.04% latency, 14.53 TFLOPS
  740. gpu009: (gate_proj): Linear(0 = 0% Params, 22.2 GMACs = 0% MACs, 1.13 ms = 0% latency, 39.15 TFLOPS, in_features=3584, out_features=2560, bias=False)
  741. gpu009: (up_proj): Linear(0 = 0% Params, 22.2 GMACs = 0% MACs, 1.11 ms = 0% latency, 39.88 TFLOPS, in_features=3584, out_features=2560, bias=False)
  742. gpu009: (down_proj): Linear(0 = 0% Params, 22.2 GMACs = 0% MACs, 1.13 ms = 0% latency, 39.44 TFLOPS, in_features=2560, out_features=3584, bias=False)
  743. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 974.89 us = 0% latency, 3.18 GFLOPS)
  744. gpu009: )
  745. gpu009: (26): Qwen2MoeMLP(
  746. gpu009: 0 = 0% Params, 71.68 GMACs = 0.02% MACs, 9.62 ms = 0.04% latency, 14.91 TFLOPS
  747. gpu009: (gate_proj): Linear(0 = 0% Params, 23.89 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 42.03 TFLOPS, in_features=3584, out_features=2560, bias=False)
  748. gpu009: (up_proj): Linear(0 = 0% Params, 23.89 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 43.55 TFLOPS, in_features=3584, out_features=2560, bias=False)
  749. gpu009: (down_proj): Linear(0 = 0% Params, 23.89 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 42.07 TFLOPS, in_features=2560, out_features=3584, bias=False)
  750. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 977.04 us = 0% latency, 3.41 GFLOPS)
  751. gpu009: )
  752. gpu009: (27): Qwen2MoeMLP(
  753. gpu009: 0 = 0% Params, 26.81 GMACs = 0.01% MACs, 10.23 ms = 0.04% latency, 5.24 TFLOPS
  754. gpu009: (gate_proj): Linear(0 = 0% Params, 8.94 GMACs = 0% MACs, 1.13 ms = 0% latency, 15.79 TFLOPS, in_features=3584, out_features=2560, bias=False)
  755. gpu009: (up_proj): Linear(0 = 0% Params, 8.94 GMACs = 0% MACs, 1.1 ms = 0% latency, 16.18 TFLOPS, in_features=3584, out_features=2560, bias=False)
  756. gpu009: (down_proj): Linear(0 = 0% Params, 8.94 GMACs = 0% MACs, 1.15 ms = 0% latency, 15.59 TFLOPS, in_features=2560, out_features=3584, bias=False)
  757. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 984.43 us = 0% latency, 1.27 GFLOPS)
  758. gpu009: )
  759. gpu009: (28): Qwen2MoeMLP(
  760. gpu009: 0 = 0% Params, 90.72 GMACs = 0.02% MACs, 6.59 ms = 0.03% latency, 27.52 TFLOPS
  761. gpu009: (gate_proj): Linear(0 = 0% Params, 30.24 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 54.37 TFLOPS, in_features=3584, out_features=2560, bias=False)
  762. gpu009: (up_proj): Linear(0 = 0% Params, 30.24 GMACs = 0.01% MACs, 1.09 ms = 0% latency, 55.57 TFLOPS, in_features=3584, out_features=2560, bias=False)
  763. gpu009: (down_proj): Linear(0 = 0% Params, 30.24 GMACs = 0.01% MACs, 1.09 ms = 0% latency, 55.32 TFLOPS, in_features=2560, out_features=3584, bias=False)
  764. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 971.08 us = 0% latency, 4.34 GFLOPS)
  765. gpu009: )
  766. gpu009: (29): Qwen2MoeMLP(
  767. gpu009: 0 = 0% Params, 215.41 GMACs = 0.05% MACs, 6.13 ms = 0.03% latency, 70.32 TFLOPS
  768. gpu009: (gate_proj): Linear(0 = 0% Params, 71.8 GMACs = 0.02% MACs, 1.11 ms = 0% latency, 128.98 TFLOPS, in_features=3584, out_features=2560, bias=False)
  769. gpu009: (up_proj): Linear(0 = 0% Params, 71.8 GMACs = 0.02% MACs, 1.09 ms = 0% latency, 132.24 TFLOPS, in_features=3584, out_features=2560, bias=False)
  770. gpu009: (down_proj): Linear(0 = 0% Params, 71.8 GMACs = 0.02% MACs, 1.08 ms = 0% latency, 133.35 TFLOPS, in_features=2560, out_features=3584, bias=False)
  771. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 936.75 us = 0% latency, 10.69 GFLOPS)
  772. gpu009: )
  773. gpu009: (30): Qwen2MoeMLP(
  774. gpu009: 0 = 0% Params, 198.51 GMACs = 0.04% MACs, 20.09 ms = 0.08% latency, 19.76 TFLOPS
  775. gpu009: (gate_proj): Linear(0 = 0% Params, 66.17 GMACs = 0.01% MACs, 1.06 ms = 0% latency, 124.65 TFLOPS, in_features=3584, out_features=2560, bias=False)
  776. gpu009: (up_proj): Linear(0 = 0% Params, 66.17 GMACs = 0.01% MACs, 1.06 ms = 0% latency, 124.6 TFLOPS, in_features=3584, out_features=2560, bias=False)
  777. gpu009: (down_proj): Linear(0 = 0% Params, 66.17 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 119.99 TFLOPS, in_features=2560, out_features=3584, bias=False)
  778. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 913.14 us = 0% latency, 10.11 GFLOPS)
  779. gpu009: )
  780. gpu009: (31): Qwen2MoeMLP(
  781. gpu009: 0 = 0% Params, 201.92 GMACs = 0.04% MACs, 5.91 ms = 0.02% latency, 68.28 TFLOPS
  782. gpu009: (gate_proj): Linear(0 = 0% Params, 67.31 GMACs = 0.01% MACs, 1.06 ms = 0% latency, 126.51 TFLOPS, in_features=3584, out_features=2560, bias=False)
  783. gpu009: (up_proj): Linear(0 = 0% Params, 67.31 GMACs = 0.01% MACs, 1.05 ms = 0% latency, 128 TFLOPS, in_features=3584, out_features=2560, bias=False)
  784. gpu009: (down_proj): Linear(0 = 0% Params, 67.31 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 130.79 TFLOPS, in_features=2560, out_features=3584, bias=False)
  785. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 903.37 us = 0% latency, 10.39 GFLOPS)
  786. gpu009: )
  787. gpu009: (32): Qwen2MoeMLP(
  788. gpu009: 0 = 0% Params, 285.66 GMACs = 0.06% MACs, 6.2 ms = 0.03% latency, 92.22 TFLOPS
  789. gpu009: (gate_proj): Linear(0 = 0% Params, 95.22 GMACs = 0.02% MACs, 1.07 ms = 0% latency, 177.38 TFLOPS, in_features=3584, out_features=2560, bias=False)
  790. gpu009: (up_proj): Linear(0 = 0% Params, 95.22 GMACs = 0.02% MACs, 1.01 ms = 0% latency, 188.61 TFLOPS, in_features=3584, out_features=2560, bias=False)
  791. gpu009: (down_proj): Linear(0 = 0% Params, 95.22 GMACs = 0.02% MACs, 1.02 ms = 0% latency, 186.28 TFLOPS, in_features=2560, out_features=3584, bias=False)
  792. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 887.87 us = 0% latency, 14.96 GFLOPS)
  793. gpu009: )
  794. gpu009: (33): Qwen2MoeMLP(
  795. gpu009: 0 = 0% Params, 111.64 GMACs = 0.02% MACs, 18.6 ms = 0.08% latency, 12 TFLOPS
  796. gpu009: (gate_proj): Linear(0 = 0% Params, 37.21 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 73.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
  797. gpu009: (up_proj): Linear(0 = 0% Params, 37.21 GMACs = 0.01% MACs, 1 ms = 0% latency, 74.06 TFLOPS, in_features=3584, out_features=2560, bias=False)
  798. gpu009: (down_proj): Linear(0 = 0% Params, 37.21 GMACs = 0.01% MACs, 996.35 us = 0% latency, 74.7 TFLOPS, in_features=2560, out_features=3584, bias=False)
  799. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 877.38 us = 0% latency, 5.92 GFLOPS)
  800. gpu009: )
  801. gpu009: (34): Qwen2MoeMLP(
  802. gpu009: 0 = 0% Params, 155.24 GMACs = 0.03% MACs, 6.2 ms = 0.03% latency, 50.05 TFLOPS
  803. gpu009: (gate_proj): Linear(0 = 0% Params, 51.75 GMACs = 0.01% MACs, 1 ms = 0% latency, 103.04 TFLOPS, in_features=3584, out_features=2560, bias=False)
  804. gpu009: (up_proj): Linear(0 = 0% Params, 51.75 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 91.93 TFLOPS, in_features=3584, out_features=2560, bias=False)
  805. gpu009: (down_proj): Linear(0 = 0% Params, 51.75 GMACs = 0.01% MACs, 980.38 us = 0% latency, 105.57 TFLOPS, in_features=2560, out_features=3584, bias=False)
  806. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 854.73 us = 0% latency, 8.45 GFLOPS)
  807. gpu009: )
  808. gpu009: (35): Qwen2MoeMLP(
  809. gpu009: 0 = 0% Params, 227.25 GMACs = 0.05% MACs, 5.62 ms = 0.02% latency, 80.81 TFLOPS
  810. gpu009: (gate_proj): Linear(0 = 0% Params, 75.75 GMACs = 0.02% MACs, 992.3 us = 0% latency, 152.67 TFLOPS, in_features=3584, out_features=2560, bias=False)
  811. gpu009: (up_proj): Linear(0 = 0% Params, 75.75 GMACs = 0.02% MACs, 971.79 us = 0% latency, 155.9 TFLOPS, in_features=3584, out_features=2560, bias=False)
  812. gpu009: (down_proj): Linear(0 = 0% Params, 75.75 GMACs = 0.02% MACs, 999.93 us = 0% latency, 151.51 TFLOPS, in_features=2560, out_features=3584, bias=False)
  813. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 819.21 us = 0% latency, 12.9 GFLOPS)
  814. gpu009: )
  815. gpu009: (36): Qwen2MoeMLP(
  816. gpu009: 0 = 0% Params, 114.94 GMACs = 0.03% MACs, 8.39 ms = 0.04% latency, 27.4 TFLOPS
  817. gpu009: (gate_proj): Linear(0 = 0% Params, 38.31 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 75.43 TFLOPS, in_features=3584, out_features=2560, bias=False)
  818. gpu009: (up_proj): Linear(0 = 0% Params, 38.31 GMACs = 0.01% MACs, 1 ms = 0% latency, 76.42 TFLOPS, in_features=3584, out_features=2560, bias=False)
  819. gpu009: (down_proj): Linear(0 = 0% Params, 38.31 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 75.08 TFLOPS, in_features=2560, out_features=3584, bias=False)
  820. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 864.27 us = 0% latency, 6.18 GFLOPS)
  821. gpu009: )
  822. gpu009: (37): Qwen2MoeMLP(
  823. gpu009: 0 = 0% Params, 196.14 GMACs = 0.04% MACs, 9.52 ms = 0.04% latency, 41.22 TFLOPS
  824. gpu009: (gate_proj): Linear(0 = 0% Params, 65.38 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 128.23 TFLOPS, in_features=3584, out_features=2560, bias=False)
  825. gpu009: (up_proj): Linear(0 = 0% Params, 65.38 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 130 TFLOPS, in_features=3584, out_features=2560, bias=False)
  826. gpu009: (down_proj): Linear(0 = 0% Params, 65.38 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 127.02 TFLOPS, in_features=2560, out_features=3584, bias=False)
  827. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 878.1 us = 0% latency, 10.39 GFLOPS)
  828. gpu009: )
  829. gpu009: (38): Qwen2MoeMLP(
  830. gpu009: 0 = 0% Params, 325.02 GMACs = 0.07% MACs, 8.62 ms = 0.04% latency, 75.43 TFLOPS
  831. gpu009: (gate_proj): Linear(0 = 0% Params, 108.34 GMACs = 0.02% MACs, 1.01 ms = 0% latency, 214.04 TFLOPS, in_features=3584, out_features=2560, bias=False)
  832. gpu009: (up_proj): Linear(0 = 0% Params, 108.34 GMACs = 0.02% MACs, 993.25 us = 0% latency, 218.15 TFLOPS, in_features=3584, out_features=2560, bias=False)
  833. gpu009: (down_proj): Linear(0 = 0% Params, 108.34 GMACs = 0.02% MACs, 1.03 ms = 0% latency, 210.91 TFLOPS, in_features=2560, out_features=3584, bias=False)
  834. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 864.03 us = 0% latency, 17.49 GFLOPS)
  835. gpu009: )
  836. gpu009: (39): Qwen2MoeMLP(
  837. gpu009: 0 = 0% Params, 109.61 GMACs = 0.02% MACs, 8.92 ms = 0.04% latency, 24.57 TFLOPS
  838. gpu009: (gate_proj): Linear(0 = 0% Params, 36.54 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 71.21 TFLOPS, in_features=3584, out_features=2560, bias=False)
  839. gpu009: (up_proj): Linear(0 = 0% Params, 36.54 GMACs = 0.01% MACs, 995.64 us = 0% latency, 73.39 TFLOPS, in_features=3584, out_features=2560, bias=False)
  840. gpu009: (down_proj): Linear(0 = 0% Params, 36.54 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 72.49 TFLOPS, in_features=2560, out_features=3584, bias=False)
  841. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 873.09 us = 0% latency, 5.84 GFLOPS)
  842. gpu009: )
  843. gpu009: (40): Qwen2MoeMLP(
  844. gpu009: 0 = 0% Params, 254.94 GMACs = 0.06% MACs, 8.64 ms = 0.04% latency, 59.03 TFLOPS
  845. gpu009: (gate_proj): Linear(0 = 0% Params, 84.98 GMACs = 0.02% MACs, 1.02 ms = 0% latency, 167.3 TFLOPS, in_features=3584, out_features=2560, bias=False)
  846. gpu009: (up_proj): Linear(0 = 0% Params, 84.98 GMACs = 0.02% MACs, 984.67 us = 0% latency, 172.6 TFLOPS, in_features=3584, out_features=2560, bias=False)
  847. gpu009: (down_proj): Linear(0 = 0% Params, 84.98 GMACs = 0.02% MACs, 1.03 ms = 0% latency, 165.55 TFLOPS, in_features=2560, out_features=3584, bias=False)
  848. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 859.5 us = 0% latency, 13.79 GFLOPS)
  849. gpu009: )
  850. gpu009: (41): Qwen2MoeMLP(
  851. gpu009: 0 = 0% Params, 220.97 GMACs = 0.05% MACs, 8.62 ms = 0.04% latency, 51.27 TFLOPS
  852. gpu009: (gate_proj): Linear(0 = 0% Params, 73.66 GMACs = 0.02% MACs, 1.02 ms = 0% latency, 144.26 TFLOPS, in_features=3584, out_features=2560, bias=False)
  853. gpu009: (up_proj): Linear(0 = 0% Params, 73.66 GMACs = 0.02% MACs, 986.58 us = 0% latency, 149.32 TFLOPS, in_features=3584, out_features=2560, bias=False)
  854. gpu009: (down_proj): Linear(0 = 0% Params, 73.66 GMACs = 0.02% MACs, 1.01 ms = 0% latency, 145.42 TFLOPS, in_features=2560, out_features=3584, bias=False)
  855. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 860.93 us = 0% latency, 11.94 GFLOPS)
  856. gpu009: )
  857. gpu009: (42): Qwen2MoeMLP(
  858. gpu009: 0 = 0% Params, 240.35 GMACs = 0.05% MACs, 8.67 ms = 0.04% latency, 55.44 TFLOPS
  859. gpu009: (gate_proj): Linear(0 = 0% Params, 80.12 GMACs = 0.02% MACs, 1.04 ms = 0% latency, 154.04 TFLOPS, in_features=3584, out_features=2560, bias=False)
  860. gpu009: (up_proj): Linear(0 = 0% Params, 80.12 GMACs = 0.02% MACs, 1.05 ms = 0% latency, 153.26 TFLOPS, in_features=3584, out_features=2560, bias=False)
  861. gpu009: (down_proj): Linear(0 = 0% Params, 80.12 GMACs = 0.02% MACs, 1.05 ms = 0% latency, 153.26 TFLOPS, in_features=2560, out_features=3584, bias=False)
  862. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 884.29 us = 0% latency, 12.64 GFLOPS)
  863. gpu009: )
  864. gpu009: (43): Qwen2MoeMLP(
  865. gpu009: 0 = 0% Params, 9.52 GMACs = 0% MACs, 8.92 ms = 0.04% latency, 2.14 TFLOPS
  866. gpu009: (gate_proj): Linear(0 = 0% Params, 3.17 GMACs = 0% MACs, 1.01 ms = 0% latency, 6.31 TFLOPS, in_features=3584, out_features=2560, bias=False)
  867. gpu009: (up_proj): Linear(0 = 0% Params, 3.17 GMACs = 0% MACs, 983 us = 0% latency, 6.46 TFLOPS, in_features=3584, out_features=2560, bias=False)
  868. gpu009: (down_proj): Linear(0 = 0% Params, 3.17 GMACs = 0% MACs, 1.22 ms = 0.01% latency, 5.22 TFLOPS, in_features=2560, out_features=3584, bias=False)
  869. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 862.84 us = 0% latency, 513.28 MFLOPS)
  870. gpu009: )
  871. gpu009: (44): Qwen2MoeMLP(
  872. gpu009: 0 = 0% Params, 118.63 GMACs = 0.03% MACs, 8.63 ms = 0.04% latency, 27.5 TFLOPS
  873. gpu009: (gate_proj): Linear(0 = 0% Params, 39.54 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 77.71 TFLOPS, in_features=3584, out_features=2560, bias=False)
  874. gpu009: (up_proj): Linear(0 = 0% Params, 39.54 GMACs = 0.01% MACs, 999.93 us = 0% latency, 79.09 TFLOPS, in_features=3584, out_features=2560, bias=False)
  875. gpu009: (down_proj): Linear(0 = 0% Params, 39.54 GMACs = 0.01% MACs, 993.73 us = 0% latency, 79.59 TFLOPS, in_features=2560, out_features=3584, bias=False)
  876. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 853.54 us = 0% latency, 6.46 GFLOPS)
  877. gpu009: )
  878. gpu009: (45): Qwen2MoeMLP(
  879. gpu009: 0 = 0% Params, 142.14 GMACs = 0.03% MACs, 8.45 ms = 0.04% latency, 33.63 TFLOPS
  880. gpu009: (gate_proj): Linear(0 = 0% Params, 47.38 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 92.05 TFLOPS, in_features=3584, out_features=2560, bias=False)
  881. gpu009: (up_proj): Linear(0 = 0% Params, 47.38 GMACs = 0.01% MACs, 982.76 us = 0% latency, 96.42 TFLOPS, in_features=3584, out_features=2560, bias=False)
  882. gpu009: (down_proj): Linear(0 = 0% Params, 47.38 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 93.87 TFLOPS, in_features=2560, out_features=3584, bias=False)
  883. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 859.02 us = 0% latency, 7.69 GFLOPS)
  884. gpu009: )
  885. gpu009: (46): Qwen2MoeMLP(
  886. gpu009: 0 = 0% Params, 105.81 GMACs = 0.02% MACs, 8.58 ms = 0.04% latency, 24.65 TFLOPS
  887. gpu009: (gate_proj): Linear(0 = 0% Params, 35.27 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 67.69 TFLOPS, in_features=3584, out_features=2560, bias=False)
  888. gpu009: (up_proj): Linear(0 = 0% Params, 35.27 GMACs = 0.01% MACs, 992.77 us = 0% latency, 71.05 TFLOPS, in_features=3584, out_features=2560, bias=False)
  889. gpu009: (down_proj): Linear(0 = 0% Params, 35.27 GMACs = 0.01% MACs, 1.03 ms = 0% latency, 68.77 TFLOPS, in_features=2560, out_features=3584, bias=False)
  890. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 869.75 us = 0% latency, 5.66 GFLOPS)
  891. gpu009: )
  892. gpu009: (47): Qwen2MoeMLP(
  893. gpu009: 0 = 0% Params, 142.03 GMACs = 0.03% MACs, 8.55 ms = 0.04% latency, 33.21 TFLOPS
  894. gpu009: (gate_proj): Linear(0 = 0% Params, 47.34 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 92.86 TFLOPS, in_features=3584, out_features=2560, bias=False)
  895. gpu009: (up_proj): Linear(0 = 0% Params, 47.34 GMACs = 0.01% MACs, 991.11 us = 0% latency, 95.54 TFLOPS, in_features=3584, out_features=2560, bias=False)
  896. gpu009: (down_proj): Linear(0 = 0% Params, 47.34 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 92.94 TFLOPS, in_features=2560, out_features=3584, bias=False)
  897. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 855.45 us = 0% latency, 7.72 GFLOPS)
  898. gpu009: )
  899. gpu009: (48): Qwen2MoeMLP(
  900. gpu009: 0 = 0% Params, 126.67 GMACs = 0.03% MACs, 8.68 ms = 0.04% latency, 29.18 TFLOPS
  901. gpu009: (gate_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 991.58 us = 0% latency, 85.16 TFLOPS, in_features=3584, out_features=2560, bias=False)
  902. gpu009: (up_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 980.14 us = 0% latency, 86.16 TFLOPS, in_features=3584, out_features=2560, bias=False)
  903. gpu009: (down_proj): Linear(0 = 0% Params, 42.22 GMACs = 0.01% MACs, 1 ms = 0% latency, 84.05 TFLOPS, in_features=2560, out_features=3584, bias=False)
  904. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 859.5 us = 0% latency, 6.85 GFLOPS)
  905. gpu009: )
  906. gpu009: (49): Qwen2MoeMLP(
  907. gpu009: 0 = 0% Params, 4.9 GMACs = 0% MACs, 8.53 ms = 0.04% latency, 1.15 TFLOPS
  908. gpu009: (gate_proj): Linear(0 = 0% Params, 1.63 GMACs = 0% MACs, 1.02 ms = 0% latency, 3.2 TFLOPS, in_features=3584, out_features=2560, bias=False)
  909. gpu009: (up_proj): Linear(0 = 0% Params, 1.63 GMACs = 0% MACs, 1.01 ms = 0% latency, 3.24 TFLOPS, in_features=3584, out_features=2560, bias=False)
  910. gpu009: (down_proj): Linear(0 = 0% Params, 1.63 GMACs = 0% MACs, 1.01 ms = 0% latency, 3.22 TFLOPS, in_features=2560, out_features=3584, bias=False)
  911. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 864.98 us = 0% latency, 263.4 MFLOPS)
  912. gpu009: )
  913. gpu009: (50): Qwen2MoeMLP(
  914. gpu009: 0 = 0% Params, 1.27 GMACs = 0% MACs, 8.56 ms = 0.04% latency, 295.86 GFLOPS
  915. gpu009: (gate_proj): Linear(0 = 0% Params, 422.05 MMACs = 0% MACs, 1.1 ms = 0% latency, 769.16 GFLOPS, in_features=3584, out_features=2560, bias=False)
  916. gpu009: (up_proj): Linear(0 = 0% Params, 422.05 MMACs = 0% MACs, 984.43 us = 0% latency, 857.45 GFLOPS, in_features=3584, out_features=2560, bias=False)
  917. gpu009: (down_proj): Linear(0 = 0% Params, 422.05 MMACs = 0% MACs, 1.02 ms = 0% latency, 829.92 GFLOPS, in_features=2560, out_features=3584, bias=False)
  918. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 883.1 us = 0% latency, 66.67 MFLOPS)
  919. gpu009: )
  920. gpu009: (51): Qwen2MoeMLP(
  921. gpu009: 0 = 0% Params, 6.44 GMACs = 0% MACs, 8.52 ms = 0.04% latency, 1.51 TFLOPS
  922. gpu009: (gate_proj): Linear(0 = 0% Params, 2.15 GMACs = 0% MACs, 1.03 ms = 0% latency, 4.19 TFLOPS, in_features=3584, out_features=2560, bias=False)
  923. gpu009: (up_proj): Linear(0 = 0% Params, 2.15 GMACs = 0% MACs, 993.25 us = 0% latency, 4.32 TFLOPS, in_features=3584, out_features=2560, bias=False)
  924. gpu009: (down_proj): Linear(0 = 0% Params, 2.15 GMACs = 0% MACs, 1.02 ms = 0% latency, 4.23 TFLOPS, in_features=2560, out_features=3584, bias=False)
  925. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 867.13 us = 0% latency, 345.42 MFLOPS)
  926. gpu009: )
  927. gpu009: (52): Qwen2MoeMLP(
  928. gpu009: 0 = 0% Params, 61.38 GMACs = 0.01% MACs, 8.58 ms = 0.04% latency, 14.32 TFLOPS
  929. gpu009: (gate_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 995.4 us = 0% latency, 41.11 TFLOPS, in_features=3584, out_features=2560, bias=False)
  930. gpu009: (up_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 981.81 us = 0% latency, 41.68 TFLOPS, in_features=3584, out_features=2560, bias=False)
  931. gpu009: (down_proj): Linear(0 = 0% Params, 20.46 GMACs = 0% MACs, 1 ms = 0% latency, 40.78 TFLOPS, in_features=2560, out_features=3584, bias=False)
  932. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 870.23 us = 0% latency, 3.28 GFLOPS)
  933. gpu009: )
  934. gpu009: (53): Qwen2MoeMLP(
  935. gpu009: 0 = 0% Params, 81.75 GMACs = 0.02% MACs, 8.75 ms = 0.04% latency, 18.68 TFLOPS
  936. gpu009: (gate_proj): Linear(0 = 0% Params, 27.25 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 53.35 TFLOPS, in_features=3584, out_features=2560, bias=False)
  937. gpu009: (up_proj): Linear(0 = 0% Params, 27.25 GMACs = 0.01% MACs, 990.87 us = 0% latency, 55 TFLOPS, in_features=3584, out_features=2560, bias=False)
  938. gpu009: (down_proj): Linear(0 = 0% Params, 27.25 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 53.41 TFLOPS, in_features=2560, out_features=3584, bias=False)
  939. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 876.67 us = 0% latency, 4.34 GFLOPS)
  940. gpu009: )
  941. gpu009: (54): Qwen2MoeMLP(
  942. gpu009: 0 = 0% Params, 140.38 GMACs = 0.03% MACs, 8.74 ms = 0.04% latency, 32.12 TFLOPS
  943. gpu009: (gate_proj): Linear(0 = 0% Params, 46.79 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 90.09 TFLOPS, in_features=3584, out_features=2560, bias=False)
  944. gpu009: (up_proj): Linear(0 = 0% Params, 46.79 GMACs = 0.01% MACs, 1 ms = 0% latency, 93.57 TFLOPS, in_features=3584, out_features=2560, bias=False)
  945. gpu009: (down_proj): Linear(0 = 0% Params, 46.79 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 91.52 TFLOPS, in_features=2560, out_features=3584, bias=False)
  946. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 869.51 us = 0% latency, 7.51 GFLOPS)
  947. gpu009: )
  948. gpu009: (55): Qwen2MoeMLP(
  949. gpu009: 0 = 0% Params, 129.92 GMACs = 0.03% MACs, 8.58 ms = 0.04% latency, 30.29 TFLOPS
  950. gpu009: (gate_proj): Linear(0 = 0% Params, 43.31 GMACs = 0.01% MACs, 1.06 ms = 0% latency, 81.73 TFLOPS, in_features=3584, out_features=2560, bias=False)
  951. gpu009: (up_proj): Linear(0 = 0% Params, 43.31 GMACs = 0.01% MACs, 978.95 us = 0% latency, 88.48 TFLOPS, in_features=3584, out_features=2560, bias=False)
  952. gpu009: (down_proj): Linear(0 = 0% Params, 43.31 GMACs = 0.01% MACs, 1 ms = 0% latency, 86.27 TFLOPS, in_features=2560, out_features=3584, bias=False)
  953. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 857.83 us = 0% latency, 7.04 GFLOPS)
  954. gpu009: )
  955. gpu009: (56): Qwen2MoeMLP(
  956. gpu009: 0 = 0% Params, 3.69 GMACs = 0% MACs, 8.55 ms = 0.04% latency, 862.63 GFLOPS
  957. gpu009: (gate_proj): Linear(0 = 0% Params, 1.23 GMACs = 0% MACs, 1.03 ms = 0% latency, 2.4 TFLOPS, in_features=3584, out_features=2560, bias=False)
  958. gpu009: (up_proj): Linear(0 = 0% Params, 1.23 GMACs = 0% MACs, 980.85 us = 0% latency, 2.51 TFLOPS, in_features=3584, out_features=2560, bias=False)
  959. gpu009: (down_proj): Linear(0 = 0% Params, 1.23 GMACs = 0% MACs, 1.01 ms = 0% latency, 2.44 TFLOPS, in_features=2560, out_features=3584, bias=False)
  960. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 863.79 us = 0% latency, 198.57 MFLOPS)
  961. gpu009: )
  962. gpu009: (57): Qwen2MoeMLP(
  963. gpu009: 0 = 0% Params, 75.97 GMACs = 0.02% MACs, 12.62 ms = 0.05% latency, 12.04 TFLOPS
  964. gpu009: (gate_proj): Linear(0 = 0% Params, 25.32 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 49.59 TFLOPS, in_features=3584, out_features=2560, bias=False)
  965. gpu009: (up_proj): Linear(0 = 0% Params, 25.32 GMACs = 0.01% MACs, 990.15 us = 0% latency, 51.15 TFLOPS, in_features=3584, out_features=2560, bias=False)
  966. gpu009: (down_proj): Linear(0 = 0% Params, 25.32 GMACs = 0.01% MACs, 1.04 ms = 0% latency, 48.86 TFLOPS, in_features=2560, out_features=3584, bias=False)
  967. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 4.8 ms = 0.02% latency, 736.28 MFLOPS)
  968. gpu009: )
  969. gpu009: (58): Qwen2MoeMLP(
  970. gpu009: 0 = 0% Params, 205.17 GMACs = 0.05% MACs, 8.48 ms = 0.04% latency, 48.4 TFLOPS
  971. gpu009: (gate_proj): Linear(0 = 0% Params, 68.39 GMACs = 0.02% MACs, 1.02 ms = 0% latency, 134.26 TFLOPS, in_features=3584, out_features=2560, bias=False)
  972. gpu009: (up_proj): Linear(0 = 0% Params, 68.39 GMACs = 0.02% MACs, 989.91 us = 0% latency, 138.18 TFLOPS, in_features=3584, out_features=2560, bias=False)
  973. gpu009: (down_proj): Linear(0 = 0% Params, 68.39 GMACs = 0.02% MACs, 1.02 ms = 0% latency, 134.61 TFLOPS, in_features=2560, out_features=3584, bias=False)
  974. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 866.17 us = 0% latency, 11.02 GFLOPS)
  975. gpu009: )
  976. gpu009: (59): Qwen2MoeMLP(
  977. gpu009: 0 = 0% Params, 99.37 GMACs = 0.02% MACs, 8.37 ms = 0.04% latency, 23.75 TFLOPS
  978. gpu009: (gate_proj): Linear(0 = 0% Params, 33.12 GMACs = 0.01% MACs, 1.02 ms = 0% latency, 64.8 TFLOPS, in_features=3584, out_features=2560, bias=False)
  979. gpu009: (up_proj): Linear(0 = 0% Params, 33.12 GMACs = 0.01% MACs, 984.67 us = 0% latency, 67.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
  980. gpu009: (down_proj): Linear(0 = 0% Params, 33.12 GMACs = 0.01% MACs, 1.01 ms = 0% latency, 65.79 TFLOPS, in_features=2560, out_features=3584, bias=False)
  981. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 860.21 us = 0% latency, 5.37 GFLOPS)
  982. gpu009: )
  983. gpu009: (60): Qwen2MoeMLP(
  984. gpu009: 0 = 0% Params, 1.05 GMACs = 0% MACs, 8.56 ms = 0.04% latency, 244.29 GFLOPS
  985. gpu009: (gate_proj): Linear(0 = 0% Params, 348.65 MMACs = 0% MACs, 1.02 ms = 0% latency, 682.7 GFLOPS, in_features=3584, out_features=2560, bias=False)
  986. gpu009: (up_proj): Linear(0 = 0% Params, 348.65 MMACs = 0% MACs, 980.14 us = 0% latency, 711.43 GFLOPS, in_features=3584, out_features=2560, bias=False)
  987. gpu009: (down_proj): Linear(0 = 0% Params, 348.65 MMACs = 0% MACs, 1.03 ms = 0% latency, 675.76 GFLOPS, in_features=2560, out_features=3584, bias=False)
  988. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 866.89 us = 0% latency, 56.11 MFLOPS)
  989. gpu009: )
  990. gpu009: (61): Qwen2MoeMLP(
  991. gpu009: 0 = 0% Params, 33.36 GMACs = 0.01% MACs, 8.84 ms = 0.04% latency, 7.55 TFLOPS
  992. gpu009: (gate_proj): Linear(0 = 0% Params, 11.12 GMACs = 0% MACs, 1.03 ms = 0% latency, 21.49 TFLOPS, in_features=3584, out_features=2560, bias=False)
  993. gpu009: (up_proj): Linear(0 = 0% Params, 11.12 GMACs = 0% MACs, 989.91 us = 0% latency, 22.47 TFLOPS, in_features=3584, out_features=2560, bias=False)
  994. gpu009: (down_proj): Linear(0 = 0% Params, 11.12 GMACs = 0% MACs, 1.01 ms = 0% latency, 21.99 TFLOPS, in_features=2560, out_features=3584, bias=False)
  995. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 867.13 us = 0% latency, 1.79 GFLOPS)
  996. gpu009: )
  997. gpu009: (62): Qwen2MoeMLP(
  998. gpu009: 0 = 0% Params, 9.63 GMACs = 0% MACs, 8.66 ms = 0.04% latency, 2.22 TFLOPS
  999. gpu009: (gate_proj): Linear(0 = 0% Params, 3.21 GMACs = 0% MACs, 1.03 ms = 0% latency, 6.24 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1000. gpu009: (up_proj): Linear(0 = 0% Params, 3.21 GMACs = 0% MACs, 983.48 us = 0% latency, 6.53 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1001. gpu009: (down_proj): Linear(0 = 0% Params, 3.21 GMACs = 0% MACs, 1.02 ms = 0% latency, 6.32 TFLOPS, in_features=2560, out_features=3584, bias=False)
  1002. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 862.36 us = 0% latency, 519.5 MFLOPS)
  1003. gpu009: )
  1004. gpu009: (63): Qwen2MoeMLP(
  1005. gpu009: 0 = 0% Params, 4.07 GMACs = 0% MACs, 8.57 ms = 0.04% latency, 951.01 GFLOPS
  1006. gpu009: (gate_proj): Linear(0 = 0% Params, 1.36 GMACs = 0% MACs, 1.02 ms = 0% latency, 2.68 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1007. gpu009: (up_proj): Linear(0 = 0% Params, 1.36 GMACs = 0% MACs, 983 us = 0% latency, 2.76 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1008. gpu009: (down_proj): Linear(0 = 0% Params, 1.36 GMACs = 0% MACs, 1.03 ms = 0% latency, 2.65 TFLOPS, in_features=2560, out_features=3584, bias=False)
  1009. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 867.13 us = 0% latency, 218.47 MFLOPS)
  1010. gpu009: )
  1011. gpu009: )
  1012. gpu009: (shared_expert): Qwen2MoeMLP(
  1013. gpu009: 0 = 0% Params, 7.22 TMACs = 1.61% MACs, 41.25 ms = 0.17% latency, 349.88 TFLOPS
  1014. gpu009: (gate_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.54 ms = 0.04% latency, 563 TFLOPS, in_features=3584, out_features=20480, bias=False)
  1015. gpu009: (up_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.54 ms = 0.04% latency, 562.95 TFLOPS, in_features=3584, out_features=20480, bias=False)
  1016. gpu009: (down_proj): Linear(0 = 0% Params, 2.41 TMACs = 0.54% MACs, 8.35 ms = 0.04% latency, 576.38 TFLOPS, in_features=20480, out_features=3584, bias=False)
  1017. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 909.81 us = 0% latency, 368.81 GFLOPS)
  1018. gpu009: )
  1019. gpu009: (shared_expert_gate): Linear(3.58 K = 0.83% Params, 117.44 MMACs = 0% MACs, 1.05 ms = 0% latency, 223.04 GFLOPS, in_features=3584, out_features=1, bias=False)
  1020. gpu009: )
  1021. gpu009: (input_layernorm): Qwen2MoeRMSNorm(3.58 K = 0.83% Params, 0 MACs = 0% MACs, 1.63 ms = 0.01% latency, 0 FLOPS)
  1022. gpu009: (post_attention_layernorm): Qwen2MoeRMSNorm(3.58 K = 0.83% Params, 0 MACs = 0% MACs, 1.57 ms = 0.01% latency, 0 FLOPS)
  1023. gpu009: )
  1024. gpu009: (2): Qwen2MoeDecoderLayer(
  1025. gpu009: 15.36 K = 3.54% Params, 15.4 TMACs = 3.43% MACs, 822.91 ms = 3.47% latency, 37.43 TFLOPS
  1026. gpu009: (self_attn): Qwen2MoeFlashAttention2(
  1027. gpu009: 4.61 K = 1.06% Params, 962.07 GMACs = 0.21% MACs, 27.81 ms = 0.12% latency, 69.2 TFLOPS
  1028. gpu009: (q_proj): Linear(3.58 K = 0.83% Params, 420.91 GMACs = 0.09% MACs, 1.9 ms = 0.01% latency, 442.51 TFLOPS, in_features=3584, out_features=3584, bias=True)
  1029. gpu009: (k_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.17 ms = 0% latency, 102.63 TFLOPS, in_features=3584, out_features=512, bias=True)
  1030. gpu009: (v_proj): Linear(512 = 0.12% Params, 60.13 GMACs = 0.01% MACs, 1.16 ms = 0% latency, 103.32 TFLOPS, in_features=3584, out_features=512, bias=True)
  1031. gpu009: (o_proj): Linear(0 = 0% Params, 420.91 GMACs = 0.09% MACs, 1.62 ms = 0.01% latency, 520.23 TFLOPS, in_features=3584, out_features=3584, bias=False)
  1032. gpu009: (rotary_emb): Qwen2MoeRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 994.92 us = 0% latency, 0 FLOPS)
  1033. gpu009: )
  1034. gpu009: (mlp): Qwen2MoeSparseMoeBlock(
  1035. gpu009: 3.58 K = 0.83% Params, 14.44 TMACs = 3.22% MACs, 777.17 ms = 3.28% latency, 37.16 TFLOPS
  1036. gpu009: (gate): Linear(0 = 0% Params, 7.52 GMACs = 0% MACs, 1.11 ms = 0% latency, 13.58 TFLOPS, in_features=3584, out_features=64, bias=False)
  1037. gpu009: (experts): ModuleList(
  1038. gpu009: (0): Qwen2MoeMLP(
  1039. gpu009: 0 = 0% Params, 70.46 GMACs = 0.02% MACs, 9.16 ms = 0.04% latency, 15.39 TFLOPS
  1040. gpu009: (gate_proj): Linear(0 = 0% Params, 23.49 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 41.15 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1041. gpu009: (up_proj): Linear(0 = 0% Params, 23.49 GMACs = 0.01% MACs, 1.1 ms = 0% latency, 42.56 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1042. gpu009: (down_proj): Linear(0 = 0% Params, 23.49 GMACs = 0.01% MACs, 1.13 ms = 0% latency, 41.68 TFLOPS, in_features=2560, out_features=3584, bias=False)
  1043. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 1.05 ms = 0% latency, 3.12 GFLOPS)
  1044. gpu009: )
  1045. gpu009: (1): Qwen2MoeMLP(
  1046. gpu009: 0 = 0% Params, 85.82 GMACs = 0.02% MACs, 9.17 ms = 0.04% latency, 18.72 TFLOPS
  1047. gpu009: (gate_proj): Linear(0 = 0% Params, 28.61 GMACs = 0.01% MACs, 1.14 ms = 0% latency, 50.28 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1048. gpu009: (up_proj): Linear(0 = 0% Params, 28.61 GMACs = 0.01% MACs, 1.11 ms = 0% latency, 51.44 TFLOPS, in_features=3584, out_features=2560, bias=False)
  1049. gpu009: (down_proj): Linear(0 = 0% Params, 28.61 GMACs = 0.01% MACs, 1.12 ms = 0% latency, 51.23 TFLOPS, in_features=2560, out_features=3584, bias=False)
  1050. gpu009: (act_fn): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 967.03 us = 0% latency, 4.13 GFLOPS)
  1051. gpu009: )

gpt4o-128k: Qwen2-57B-A14B Mixture-of-Experts (MoE) Model Explained

Overview

Qwen2-57B-A14B is a Mixture-of-Experts model in the Qwen2 series released by Alibaba Group. The model has 57B (57 billion) total parameters, but only 14B (14 billion) of them are activated per token. This architecture is designed to balance model capacity against computational cost, improving performance across a range of tasks.

Key Parameters

  1. **Total Parameters**: 57B, i.e. 57 billion parameters.
  2. **Activated Parameters**: 14B, i.e. 14 billion parameters. Only 14B parameters are active when computing each token.
  3. **Hidden Size**: 3584
  4. **# Layers**: 28
  5. **# Query Heads**: 28
  6. **# KV Heads**: 4
  7. **Head Size**: 128
  8. **MoE Intermediate Size (per expert)**: 2560
  9. **# Experts**: 64, of which 8 are activated per token, plus a shared expert.
  10. **Vocabulary Size**: 151,646 (the config's `vocab_size` of 151,936 includes padding/special tokens)
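The derived sizes in the list above (head size, grouped-query attention grouping, expert sparsity) follow directly from the config printed at the top of the post; a minimal plain-Python sketch:

```python
# Derived quantities from the Qwen2MoeConfig printed at the top of this post.
# Keys mirror the Hugging Face config field names.
config = {
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "num_experts": 64,
    "num_experts_per_tok": 8,
    "moe_intermediate_size": 2560,
}

# Head size = hidden_size / num_attention_heads
head_size = config["hidden_size"] // config["num_attention_heads"]

# Grouped-query attention: number of query heads sharing each KV head
gqa_group = config["num_attention_heads"] // config["num_key_value_heads"]

# Fraction of routed experts active per token
active_frac = config["num_experts_per_tok"] / config["num_experts"]

print(head_size, gqa_group, active_frac)  # 128 7 0.125
```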
Model Architecture

Qwen2-57B-A14B uses a Transformer architecture combined with a Mixture-of-Experts (MoE) mechanism. Its core components include:

DeepSpeed Flops Profiler Analysis

Key performance figures reported by the DeepSpeed Flops Profiler:

  1. Embedding layer: produces the embedding representation of the input tokens.

  2. Qwen2MoeDecoderLayer: contains the self-attention mechanism (Qwen2MoeFlashAttention2) and the MoE block (Qwen2MoeSparseMoeBlock).

Key Components

  3. **Self-Attention (Qwen2MoeFlashAttention2)**: linear projections (Linear) for queries, keys, and values, plus rotary embeddings (RotaryEmbedding). Reducing the number of KV heads lowers memory use and speeds up inference.

  4. **Mixture-of-Experts block (Qwen2MoeSparseMoeBlock)**: multiple experts, each an independent feed-forward network (Qwen2MoeMLP), with a gating network (gate) that selects which experts to activate.

  5. **Feed-forward network (Qwen2MoeMLP)**: the three linear layers gate_proj, up_proj, and down_proj, with SiLU as the activation function.

  6. **Normalization (Qwen2MoeRMSNorm)**: normalizes the input and the post-attention activations.

  7. **World Size**: 16

  8. **Data Parallel Size**: 16

  9. **Model Parallel Size**: 1

  10. **Batch Size per GPU**: 16

  11. **Params per GPU**: 433.66 K

  12. **Fwd MACs per GPU**: 449.07 TMACs

  13. **Fwd FLOPs per GPU**: 898.15 T

  14. **Fwd Latency**: 23.87 s

  15. **Fwd+Bwd Throughput per GPU**: 31.5 TFLOPS

  16. **Iter Latency**: 86.9 s

  17. Fwd MACs per GPU (449.07 TMACs): the number of multiply-accumulate (MAC) operations performed during a forward pass on each GPU, in teraMACs (trillions of MACs).
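As a rough sanity check, the forward-only throughput can be recomputed from the figures above (a sketch; DeepSpeed's own accounting of fwd+bwd FLOPs may differ slightly):

```python
# Recompute per-GPU forward throughput from the profiler numbers above.
fwd_flops = 898.15e12    # Fwd FLOPs per GPU (= 2 x 449.07 TMACs)
fwd_latency_s = 23.87    # Fwd latency in seconds

fwd_tflops = fwd_flops / fwd_latency_s / 1e12
print(f"forward throughput: {fwd_tflops:.1f} TFLOPS")  # ≈ 37.6 TFLOPS

# The reported fwd+bwd figure (31.5 TFLOPS) is lower, since the backward
# pass (roughly 2x the forward FLOPs) typically runs at lower efficiency.
```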

Detailed Analysis

Performance Details

  1. Qwen2MoeForCausalLM: the top-level module, containing all parameters and computation.
  2. Qwen2MoeModel: contains the embedding layer and the stack of decoder layers (Qwen2MoeDecoderLayer).
  3. Qwen2MoeDecoderLayer: each layer contains a self-attention mechanism and an MoE block. The self-attention (Qwen2MoeFlashAttention2) handles the query, key, and value projections and includes rotary embeddings; the MoE block (Qwen2MoeSparseMoeBlock) contains multiple feed-forward networks (Qwen2MoeMLP), each with gate_proj, up_proj, and down_proj linear layers.
  4. Qwen2MoeMLP: the core feed-forward component; uses SiLU as the activation function and computes through the gate_proj, up_proj, and down_proj linear layers.
  5. Qwen2MoeRMSNorm: layer normalization, for training stability.
  6. Forward and backward throughput: about 31.5 TFLOPS per GPU, a reasonably high utilization.
  7. Memory and compute optimization: fewer KV heads and an efficient attention implementation reduce memory use and computation.
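The routing-plus-expert computation described in items 3–4 can be sketched with NumPy. Dimensions are scaled down for readability (the real model uses hidden 3584, expert intermediate 2560, 64 experts, top-8), and the weights are random placeholders rather than real model weights:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inter, n_experts, top_k = 64, 48, 16, 4  # scaled-down stand-ins

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU(x) = x * sigmoid(x)

def qwen2_moe_mlp(x, w_gate, w_up, w_down):
    # Qwen2MoeMLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# One token's hidden state, a router, and per-expert weights (all random)
x = rng.standard_normal(hidden)
router = rng.standard_normal((hidden, n_experts))
experts = [
    (rng.standard_normal((hidden, inter)) * 0.1,
     rng.standard_normal((hidden, inter)) * 0.1,
     rng.standard_normal((inter, hidden)) * 0.1)
    for _ in range(n_experts)
]

# Routing: softmax over all expert logits, then keep the top-k.
logits = x @ router
probs = np.exp(logits - logits.max())
probs /= probs.sum()
top = np.argsort(probs)[-top_k:]
w = probs[top]  # norm_topk_prob=false in the config: weights not renormalized

out = sum(wi * qwen2_moe_mlp(x, *experts[i]) for wi, i in zip(w, top))
print(out.shape)  # (64,)
```

The shared expert (shared_expert / shared_expert_gate in the profiler dump) runs the same MLP shape on every token in addition to the routed experts.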

In summary, Qwen2-57B-A14B combines a Mixture-of-Experts mechanism with an optimized Transformer architecture to balance model scale against computational efficiency, performing well across a variety of tasks with high compute throughput.

  1. {'loss': 1.4182, 'grad_norm': 0.27865862445647444, 'learning_rate': 9.996199949332658e-06, 'epoch': 0.0}
  2. gpu009: [2024-07-16 20:41:53,521] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 632.55
  3. gpu009: [2024-07-16 20:41:53,522] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 13170.53 | bwd_microstep: 50444.48 | bwd_inner_microstep: 50416.57 | bwd_allreduce_microstep: 27.77 | step_microstep: 875.33
  4. gpu009: [2024-07-16 20:41:53,523] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 13170.52 | bwd: 50444.48 | bwd_inner: 50416.57 | bwd_allreduce: 27.82 | step: 862.07
  5. gpu009: [2024-07-16 20:42:59,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 589.17
  6. gpu009: [2024-07-16 20:42:59,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 13154.64 | bwd_microstep: 50845.34 | bwd_inner_microstep: 50790.42 | bwd_allreduce_microstep: 54.82 | step_microstep: 920.00
  7. gpu009: [2024-07-16 20:42:59,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 13154.64 | bwd: 50845.34 | bwd_inner: 50790.41 | bwd_allreduce: 54.84 | step: 920.00
  8. gpu009: [2024-07-16 20:44:05,911] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 597.05
  9. gpu009: [2024-07-16 20:44:05,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 13262.96 | bwd_microstep: 50931.91 | bwd_inner_microstep: 50892.46 | bwd_allreduce_microstep: 39.36 | step_microstep: 894.01
  10. gpu009: [2024-07-16 20:44:05,912] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 13262.94 | bwd: 50931.90 | bwd_inner: 50892.46 | bwd_allreduce: 39.37 | step: 894.01

Claude-3.5-Sonnet (via Poe)

This training log gives detailed information about the training run. Item by item:

These logs are useful for monitoring training progress, diagnosing performance bottlenecks, and optimizing the training process: they provide a detailed time breakdown of each phase, helping developers see which parts may need optimization.

  1. First, the training metrics: {'loss': 1.4182, 'grad_norm': 0.27865862445647444, 'learning_rate': 9.996199949332658e-06, 'epoch': 0.0}
     - loss: loss on the current batch, 1.4182
     - grad_norm: gradient norm, 0.2787
     - learning_rate: current learning rate, about 1e-5
     - epoch: current epoch, 0.0 (training has just started)
  2. Next, per-iteration timing statistics (in milliseconds). For each iteration the log records:
     - optimizer_step: time for the optimizer to update the parameters
     - fwd_microstep: forward-pass time
     - bwd_microstep: total backward-pass time
     - bwd_inner_microstep: backward-pass compute time
     - bwd_allreduce_microstep: gradient all-reduce time (used to synchronize gradients in distributed training)
     - step_microstep: time for the parameter-update step
  3. For example, in the first iteration the forward pass took about 13.17 s, the backward pass about 50.44 s, the gradient all-reduce about 27.77 ms, and the parameter-update step about 875.33 ms.
  4. The log shows three consecutive training iterations; timings vary slightly between them but are broadly consistent.
  5. "gpu009" and "[Rank 0]" indicate these logs come from the node gpu009, rank 0 (the main process) of the distributed run.
  6. Timestamps (e.g. [2024-07-16 20:41:53,521]) record the exact time of each log entry.
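The `time (ms) | key: value` lines can also be parsed programmatically, for example to compute the backward/forward time ratio across iterations (a sketch that assumes the exact DeepSpeed log format shown above):

```python
import re

# One timing line copied from the log above
line = ("gpu009: [2024-07-16 20:41:53,522] [INFO] [logging.py:96:log_dist] "
        "[Rank 0] time (ms) | fwd_microstep: 13170.53 | bwd_microstep: 50444.48 | "
        "bwd_inner_microstep: 50416.57 | bwd_allreduce_microstep: 27.77 | "
        "step_microstep: 875.33")

# Grab every "name: value" pair after the "time (ms)" marker
pairs = re.findall(r"(\w+):\s*([\d.]+)", line.split("time (ms)")[1])
timings = {name: float(value) for name, value in pairs}

fwd_s = timings["fwd_microstep"] / 1000
bwd_s = timings["bwd_microstep"] / 1000
print(f"fwd {fwd_s:.2f}s, bwd {bwd_s:.2f}s, bwd/fwd = {bwd_s / fwd_s:.1f}x")
# fwd 13.17s, bwd 50.44s, bwd/fwd = 3.8x
```

The roughly 3.8x backward/forward ratio (rather than the textbook ~2x) is one of the things a breakdown like this makes visible.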

Reposted from: https://blog.csdn.net/sinat_37574187/article/details/140440676
Copyright belongs to the original author, AI生成曾小健. If there is any infringement, please contact us for removal.
