LONGNET: Scaling Transformers to 1,000,000,000 Tokens

Scaling sequence length has become a critical demand in the era of large language models. In this work, we introduce LONGNET, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. LONGNET has significant advantages: linear computation complexity, a dilated attention mechanism, and the ability to serve as a distributed trainer for extremely long sequences. Experimental results demonstrate strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences; code is available at https://aka.ms/LongNet.
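
To make the idea concrete, here is a minimal sketch of a dilated-attention-style pattern: the sequence is split into fixed-length segments and each segment is sparsified by a dilation rate before a dense attention is computed inside it, which is why the cost grows linearly with sequence length. The function name, the single `(segment_len, dilation)` pair, and the zero-filling of skipped positions are illustrative assumptions, not the official LONGNET implementation (which mixes several segment lengths and dilation rates and combines their outputs).

```python
# Minimal sketch of a dilated-attention-style pattern (assumptions noted above;
# not the official LONGNET code).
import torch

def dilated_attention(q, k, v, segment_len=4, dilation=2):
    """q, k, v: (batch, seq_len, dim); seq_len must be divisible by segment_len."""
    b, n, d = q.shape
    # 1) Split the sequence into non-overlapping segments of length `segment_len`.
    q = q.view(b, n // segment_len, segment_len, d)
    k = k.view(b, n // segment_len, segment_len, d)
    v = v.view(b, n // segment_len, segment_len, d)
    # 2) Sparsify each segment by keeping every `dilation`-th position.
    idx = torch.arange(0, segment_len, dilation)
    qs, ks, vs = q[:, :, idx], k[:, :, idx], v[:, :, idx]
    # 3) Dense attention inside each sparsified segment; each token attends
    #    only within its own segment, so total cost is linear in seq_len.
    attn = torch.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
    out_sparse = attn @ vs
    # 4) Scatter outputs back to their original positions; in this simplified
    #    sketch the skipped positions are left as zeros.
    out = torch.zeros_like(q)
    out[:, :, idx] = out_sparse
    return out.reshape(b, n, d)

# Usage: tiny example with batch=1, seq_len=8, dim=16.
x = torch.randn(1, 8, 16)
y = dilated_attention(x, x, x, segment_len=4, dilation=2)
print(y.shape)  # torch.Size([1, 8, 16])
```

In the full method, several such sparsified views with different segment lengths and dilation rates are computed and mixed, so that nearby tokens get fine-grained attention while distant tokens are still reachable through the coarser, more dilated views.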