Book Facts
Only verified fields from this page are shown here.
- Title: DeepSeek-V3 Technical Report
- Author: DeepSeek-AI
- Reading Time: 15 minutes
- Category: Technology & The Future
- Audio: Not available
Quick Answers
Start with the most useful search-style answers about DeepSeek-V3 Technical Report.
What is DeepSeek-V3 Technical Report about?
This report introduces DeepSeek-V3, a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token.
Who is DeepSeek-AI?
DeepSeek-AI is a company focused on artificial intelligence research and development. The team consists of researchers and engineers dedicated to advancing the field.
Who should read DeepSeek-V3 Technical Report?
AI researchers, machine learning engineers, and developers interested in large language models, mixture-of-experts architectures, and efficient training techniques.
What is the background behind DeepSeek-V3 Technical Report?
The development of DeepSeek-V3 occurs within a broader context of rapid advancements in Large Language Models (LLMs).
Key Points
DeepSeek-V3 Technical Report
This report introduces DeepSeek-V3, a powerful Mixture-of-Experts language model with 671B parameters. It details the architecture, training infrastructure, pre-training, and post-training processes. Evaluations show DeepSeek-V3 rivals closed-source models in performance while maintaining cost-effectiveness.
By reading this, you'll:
- Understand the architecture and training of a state-of-the-art language model.
- Learn about innovative techniques for efficient training and inference.
- Gain insights into the performance and capabilities of DeepSeek-V3.
Core Content:
1. Architecture Innovations:
- Multi-head Latent Attention (MLA): Uses low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference, maintaining performance while minimizing memory usage.
- DeepSeekMoE: Employs finer-grained experts and isolates some as shared ones for cost-effective training.
- Auxiliary-Loss-Free Load Balancing: Pioneers a strategy that introduces a bias term for each expert, dynamically adjusted during training to keep expert load balanced, avoiding the performance degradation that auxiliary balancing losses can introduce.
- Multi-Token Prediction (MTP): Extends the prediction scope to multiple future tokens at each position to densify training signals and pre-plan representations.
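The auxiliary-loss-free balancing idea above can be sketched in a few lines. The toy setup below is an assumed, minimal version (not the report's implementation): tokens are routed by affinity plus a per-expert bias, and each bias is nudged up or down depending on whether that expert is under- or over-loaded.

```python
import random

random.seed(0)

NUM_EXPERTS = 8
TOP_K = 2
BIAS_UPDATE_SPEED = 0.005  # assumed value for illustration

bias = [0.0] * NUM_EXPERTS  # used only for routing, never for gating weights

def route(affinities, bias, k):
    """Pick top-k experts by (affinity + bias)."""
    ranked = sorted(range(len(affinities)),
                    key=lambda e: affinities[e] + bias[e], reverse=True)
    return ranked[:k]

def train_step(batch_affinities, bias):
    load = [0] * NUM_EXPERTS
    for aff in batch_affinities:
        for e in route(aff, bias, TOP_K):
            load[e] += 1
    avg = sum(load) / NUM_EXPERTS
    # overloaded experts get their bias decreased, underloaded increased
    for e in range(NUM_EXPERTS):
        bias[e] += BIAS_UPDATE_SPEED * (1 if load[e] < avg else -1)
    return load

# skewed affinities: expert 0 is systematically favoured
def make_batch(n=256):
    return [[random.random() + (0.5 if e == 0 else 0.0)
             for e in range(NUM_EXPERTS)]
            for _ in range(n)]

for step in range(600):
    load = train_step(make_batch(), bias)

print("final per-expert load:", load)
print("final bias:", [round(b, 3) for b in bias])
```

After training, the favoured expert ends up with the most negative bias, which pulls its routed load back toward the average without any auxiliary loss term in the objective.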
2. Infrastructure Optimizations:
- DualPipe: An efficient pipeline parallelism algorithm that overlaps forward and backward computation-communication phases to accelerate model training and reduce pipeline bubbles.
- Efficient Cross-Node All-to-All Communication Kernels: Custom kernels designed to fully utilize InfiniBand (IB) and NVLink bandwidths, conserving Streaming Multiprocessors (SMs) dedicated to communication.
- Memory Saving Techniques: Recomputation of RMSNorm and MLA up-projections during backpropagation, storage of the exponential moving average (EMA) of model parameters in CPU memory, and a shared embedding and output head for multi-token prediction, all reducing the training memory footprint.
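Why overlapping computation with communication pays off can be seen with back-of-envelope timing. The toy model below uses illustrative numbers only and is not the actual DualPipe schedule; it compares a step where compute and all-to-all communication alternate against one where communication hides behind the next microbatch's compute.

```python
# Toy timing model: per microbatch, compute takes `c` ms
# and all-to-all communication takes `m` ms.

def serialized_time(n_microbatches, c, m):
    # compute and communication alternate, nothing overlaps
    return n_microbatches * (c + m)

def overlapped_time(n_microbatches, c, m):
    # communication of microbatch i hides behind compute of microbatch i+1;
    # only the first compute and the last communication stick out
    return c + (n_microbatches - 1) * max(c, m) + m

n, c, m = 8, 10.0, 6.0
print(serialized_time(n, c, m))  # 128.0
print(overlapped_time(n, c, m))  # 86.0
```

When communication is fully hidden (m <= c), the overlapped step time is dominated by compute alone, which is the effect DualPipe's bidirectional scheduling aims for at scale.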
3. FP8 Mixed Precision Training Framework:
- Fine-Grained Quantization: Applies scaling at a more granular level (tile-wise for activations, block-wise for weights) to better accommodate outliers and improve quantization accuracy.
- Increased Accumulation Precision: Periodically promotes partial sums from Tensor Cores to FP32 registers on CUDA Cores for high-precision accumulation, addressing underflow issues in low-precision GEMM operations.
- Low-Precision Storage and Communication: Compresses cached activations and optimizer states into lower-precision formats (BF16 for optimizer states, E5M6 for specific activations) to reduce memory consumption and communication overhead.
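To see why finer-grained scaling helps with outliers, here is a simplified sketch using int8-style quantization as a stand-in for FP8; the tile size and data are made up for illustration.

```python
import random

# Int8-style stand-in for FP8: pick a scale so the max value maps to 127.
def quantize(values, qmax=127):
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(1)
# one row of activations with a single outlier, as often happens in practice
row = [random.uniform(-1, 1) for _ in range(256)]
row[7] = 80.0  # outlier

# coarse: one scale for the whole row (per-tensor style)
q, s = quantize(row)
err_coarse = mean_abs_error(row, dequantize(q, s))

# fine-grained: one scale per 32-element tile
TILE = 32
recon = []
for i in range(0, len(row), TILE):
    q, s = quantize(row[i:i + TILE])
    recon.extend(dequantize(q, s))
err_tile = mean_abs_error(row, recon)

print(f"per-tensor error: {err_coarse:.4f}, per-tile error: {err_tile:.4f}")
```

With one global scale, the outlier stretches the quantization range and crushes every small value's precision; with per-tile scales, only the tile containing the outlier pays that cost.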
4. Pre-Training Strategies:
- Optimized Data Corpus: Enhanced ratio of mathematical and programming samples and expanded multilingual coverage.
- Document Packing and Fill-in-the-Middle (FIM): Implements document packing for data integrity and incorporates the FIM strategy to enable the model to accurately predict middle text based on contextual cues.
- Long Context Extension: Applies YaRN for context extension and performs two training phases to progressively expand the context window from 4K to 128K.
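A minimal sketch of FIM sample construction, following the prefix-suffix-middle framing: the sentinel strings below are illustrative stand-ins, since a real tokenizer uses dedicated special-token IDs rather than literal text.

```python
# Build a prefix-suffix-middle (PSM) training sample: the model sees the
# prefix and suffix, and must generate the middle after the end sentinel.
def to_fim(document, hole_start, hole_end):
    prefix = document[:hole_start]
    middle = document[hole_start:hole_end]
    suffix = document[hole_end:]
    return f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>{middle}"

sample = "def add(a, b):\n    return a + b\n"
packed = to_fim(sample, hole_start=15, hole_end=31)
print(packed)
```

Training on such samples teaches the model to fill a hole from both-sided context, which is particularly useful for code completion inside an existing file.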
5. Post-Training Methodologies:
- Supervised Fine-Tuning (SFT): Curates instruction-tuning datasets across multiple domains, leveraging an internal DeepSeek-R1 model for reasoning data and DeepSeek-V2.5 for non-reasoning data.
- Reinforcement Learning (RL): Employs a rule-based Reward Model (RM) and a model-based RM, utilizing Group Relative Policy Optimization (GRPO) to align the model with human preferences and enhance performance.
- Distillation from DeepSeek-R1: Distills reasoning capabilities from the long Chain-of-Thought (CoT) model DeepSeek-R1, incorporating verification and reflection patterns into DeepSeek-V3.
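The group-relative part of GRPO can be sketched in a few lines: advantages come from normalizing each sampled response's reward against its own group, so no separate critic (value) model is needed. This is a minimal sketch of the advantage computation, not the full GRPO objective.

```python
# GRPO-style advantages: normalize each reward against the group of
# responses sampled for the same prompt.
def group_relative_advantages(rewards):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# e.g. rule-based 0/1 rewards for four sampled responses to one prompt
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print(adv)  # responses above the group mean get positive advantage
```

Responses better than their group average are reinforced and worse ones suppressed, using only relative comparisons within the group.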
Q&A
Q: What is Multi-head Latent Attention (MLA)?
A: MLA is an attention mechanism that uses low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference, maintaining performance while minimizing memory usage.
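A back-of-envelope comparison of per-token KV-cache size shows the effect. The dimensions below approximate those reported for DeepSeek-V3 and should be treated as illustrative.

```python
# Per-token, per-layer KV-cache size in elements, comparing standard
# multi-head attention against MLA's compressed latent cache.
N_HEADS = 128    # attention heads (approximate, per the report)
HEAD_DIM = 128   # per-head dimension
D_LATENT = 512   # MLA compressed KV dimension
D_ROPE = 64      # decoupled RoPE key dimension

standard_kv = 2 * N_HEADS * HEAD_DIM  # full keys + values for every head
mla_kv = D_LATENT + D_ROPE            # one shared latent vector + RoPE key

print(standard_kv, mla_kv)
print(f"cache reduction: {standard_kv / mla_kv:.1f}x")
```

Because only the small latent vector (plus the decoupled RoPE key) is cached per token, long-context inference needs far less memory than caching full per-head keys and values.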
Q: What is Auxiliary-Loss-Free Load Balancing?
A: It's a load balancing strategy that introduces a bias term for each expert, dynamically adjusted during training to maintain balance without relying on auxiliary losses, minimizing the performance impact of encouraging load balancing.
Q: How does DualPipe improve training efficiency?
A: DualPipe is a pipeline parallelism algorithm that overlaps forward and backward computation-communication phases, reducing pipeline bubbles and accelerating model training.
Q: What is fine-grained quantization in FP8 training?
A: Fine-grained quantization applies scaling at a granular level (tile-wise for activations, block-wise for weights) to better accommodate outliers and improve the accuracy of FP8 training.
Q: What role does DeepSeek-R1 play in post-training?
A: DeepSeek-R1 is used as a model to distill reasoning capabilities, incorporating verification and reflection patterns into DeepSeek-V3 through supervised fine-tuning and reinforcement learning.
Target Audience
AI researchers, machine learning engineers, and developers interested in large language models, mixture-of-experts architectures, and efficient training techniques. It is also relevant to those studying the advancements and capabilities of open-source language models compared to closed-source alternatives.
Historical Context
The development of DeepSeek-V3 occurs within a broader context of rapid advancements in Large Language Models (LLMs). Both closed-source and open-source models are striving to achieve Artificial General Intelligence (AGI). DeepSeek-V3 builds upon previous iterations, incorporating architectural improvements and training strategies to push the boundaries of open-source model capabilities.