
How DeepSeek Improved Transformer Architecture for Long-Context Inference

DeepSeek recently released DeepSeek v3, which is currently state of the art in benchmark performance among open-weight models, and did so using only 2.8 million H800 hours of training hardware time. The key architectural improvement highlighted in their report is multi-head latent attention (MLA), which shrinks the KV cache relative to standard attention. The KV cache stores the key and value vectors of past tokens so that each new token can attend to them without recomputing the entire prefix; this is what makes sequential generation efficient, but the cache's memory footprint grows linearly with context length, which is why it dominates the cost of long-context inference. Traditional remedies such as grouped-query and multi-query attention shrink the cache by sharing key/value heads across query heads, at some cost to quality; MLA instead compresses keys and values into a low-dimensional latent vector and caches only that, cutting memory further while preserving modeling quality. A minimal sketch of the KV cache mechanism and a rough cache-size comparison follow below.

The DeepSeek v3 architecture also introduces DeepSeekMoE, their fine-grained mixture-of-experts design, and multi-token prediction. Overall, DeepSeek's improvements result in better performance than a vanilla Transformer. For more detailed insights, it is worth reading the full technical report available on their website.
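To make the KV cache idea concrete, here is a minimal sketch of a single decode step with cached keys and values. This is not DeepSeek's implementation; the model width, head count, and weight initialization are illustrative assumptions.

```python
# Minimal sketch of autoregressive decoding with a KV cache.
# Dimensions and weights are illustrative assumptions, not DeepSeek v3's.
import torch

d_model, n_heads = 512, 8
d_head = d_model // n_heads
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x_t: torch.Tensor) -> torch.Tensor:
    """Attend the newest token to all cached keys/values instead of
    re-projecting the whole prefix on every step."""
    q = (x_t @ W_q).view(n_heads, d_head)
    k_cache.append((x_t @ W_k).view(n_heads, d_head))
    v_cache.append((x_t @ W_v).view(n_heads, d_head))
    K = torch.stack(k_cache, dim=1)                      # (heads, seq, d_head)
    V = torch.stack(v_cache, dim=1)
    scores = (q.unsqueeze(1) * K).sum(-1) / d_head**0.5  # (heads, seq)
    attn = torch.softmax(scores, dim=-1)
    out = (attn.unsqueeze(-1) * V).sum(dim=1)            # (heads, d_head)
    return out.reshape(d_model)

for _ in range(5):  # pretend we generate 5 tokens
    _ = decode_step(torch.randn(d_model))
print(len(k_cache), "tokens cached")  # cache length equals tokens processed
```

The cache grows with every token, so at long context lengths its memory traffic, not the matrix multiplies, becomes the bottleneck.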
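And here is a back-of-the-envelope comparison of how many values each approach must cache per token. The head counts, group count, and latent width below are illustrative assumptions rather than the figures from the DeepSeek v3 report, and the MLA line is simplified (it ignores details such as the separate positional-encoding key).

```python
# Rough per-token KV-cache size comparison (values cached per token).
# All numbers are illustrative assumptions, not DeepSeek v3's configuration.
n_heads, d_head = 128, 128
n_kv_groups = 8      # hypothetical grouped-query setting
d_latent = 512       # hypothetical MLA latent width

mha = 2 * n_heads * d_head      # full keys + values for every head
gqa = 2 * n_kv_groups * d_head  # one K/V pair shared per group of heads
mla = d_latent                  # one compressed latent per token,
                                # up-projected to K and V when needed
print(f"MHA: {mha}, GQA: {gqa}, MLA: {mla} cached values per token")
```

The point of the comparison is that grouped-query attention saves memory by sharing heads, whereas MLA saves memory by caching a single low-rank latent from which keys and values are reconstructed, which is why it can compress further without the same quality trade-off.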