Large World Model (LWM) is a general-purpose, large-context, multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.

Current language models fall short in understanding aspects of the world that are not easily described in words, and they struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent from language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. To address these challenges, we curate a large dataset of diverse videos and books, use the RingAttention technique to scalably train on long sequences, and gradually increase the context size from 4K to 1M tokens.

This paper makes the following contributions: (a) Largest-context neural network: we train one of the largest-context transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long-sequence chat. (c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on multimodal sequences millions of tokens long (illustrative sketches of several of these components are given below). (d) Fully open-sourcing a family of 7B-parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.

LWM can retrieve facts across a 1M-token context with high accuracy, answer questions about hour-long YouTube videos, chat about images, and generate videos and images from text. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and to enable broader capabilities.

The codebase is supported on Ubuntu and has not been tested on Windows or macOS. TPUs are recommended for training and inference, although GPUs can also be used. The code is highly optimized with JAX's Pallas for TPUs, achieving high MFU with RingAttention at large context sizes. On GPUs, the code is based on XLA and is not as optimized as it is for TPUs. Installation involves creating a conda environment with Python 3.10, installing jax[cuda12_pip]==0.4.23 and the remaining dependencies with pip, and setting up a TPU VM with the provided script.

Available models include language-only and video-language versions with context sizes ranging from 32K to 1M tokens, offered in both JAX and PyTorch. The names of the available models are listed along with their corresponding context sizes.
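To make the RingAttention idea concrete, below is a minimal, single-head, non-causal sketch in JAX, assuming each device holds one block of queries, keys, and values: key/value blocks are rotated around the device ring with jax.lax.ppermute while a streaming softmax accumulates the output. The function and argument names are our own, not the LWM API, and the actual implementation uses fused Pallas kernels with causal masking.

```python
# Minimal single-head, non-causal RingAttention sketch (illustrative only;
# function and argument names are ours, not the LWM codebase's API).
import functools
import jax
import jax.numpy as jnp


def ring_attention_block(q, k, v, axis_size, axis_name="ring"):
    """q, k, v: (block_len, head_dim) blocks held by the local device.

    K/V blocks are rotated around the device ring; a running max,
    numerator, and denominator implement a numerically stable streaming
    softmax so the full (seq x seq) score matrix never materializes.
    """
    scale = q.shape[-1] ** -0.5
    perm = [(j, (j + 1) % axis_size) for j in range(axis_size)]

    def body(_, carry):
        m, num, den, k, v = carry
        logits = (q @ k.T) * scale                      # (q_len, kv_len)
        m_new = jnp.maximum(m, logits.max(axis=-1))
        correction = jnp.exp(m - m_new)
        p = jnp.exp(logits - m_new[:, None])
        num = num * correction[:, None] + p @ v
        den = den * correction + p.sum(axis=-1)
        # Hand the local K/V block to the next device in the ring.
        k = jax.lax.ppermute(k, axis_name, perm)
        v = jax.lax.ppermute(v, axis_name, perm)
        return m_new, num, den, k, v

    init = (jnp.full(q.shape[:-1], -jnp.inf, q.dtype),  # running max
            jnp.zeros_like(q),                          # running numerator
            jnp.zeros(q.shape[:-1], q.dtype))           # running denominator
    m, num, den, k, v = jax.lax.fori_loop(0, axis_size, body, init + (k, v))
    return num / den[:, None]


# Example: shard one long sequence across all local devices.
# attn = jax.pmap(
#     functools.partial(ring_attention_block, axis_size=jax.device_count()),
#     axis_name="ring",
# )(q_blocks, k_blocks, v_blocks)  # each input: (n_devices, block_len, head_dim)
```

Because each device only ever holds one key/value block at a time, per-device memory stays constant while the effective context length grows with the size of the ring.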
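Masked sequence packing, mentioned in contribution (b), can be pictured as follows: variable-length examples are concatenated into one fixed-length training row, and the attention mask is restricted so tokens never attend across example boundaries. The sketch below uses our own helper names, not LWM's data pipeline.

```python
# Minimal masked sequence packing sketch (helper names are ours).
import numpy as np


def pack_sequences(sequences, max_len, pad_id=0):
    """Greedily pack variable-length token sequences into one row of length
    max_len, returning the packed tokens and per-token segment ids
    (segment id 0 marks padding)."""
    tokens = np.full(max_len, pad_id, dtype=np.int32)
    segment_ids = np.zeros(max_len, dtype=np.int32)
    pos = 0
    for seg, seq in enumerate(sequences, start=1):
        seq = seq[: max_len - pos]
        tokens[pos : pos + len(seq)] = seq
        segment_ids[pos : pos + len(seq)] = seg
        pos += len(seq)
        if pos >= max_len:
            break
    return tokens, segment_ids


def packing_attention_mask(segment_ids):
    """Boolean (len, len) mask: causal AND same segment AND not padding."""
    n = segment_ids.shape[0]
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    not_pad = (segment_ids != 0)[:, None] & (segment_ids != 0)[None, :]
    return causal & same_segment & not_pad


# tokens, seg = pack_sequences([[5, 6, 7], [8, 9]], max_len=8)
# mask = packing_attention_mask(seg)  # positions 3-4 cannot attend to 0-2
```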
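The loss weighting used to balance language and vision can likewise be sketched as a per-token weighted cross-entropy; the weight values below are placeholders, not the ones used to train LWM.

```python
# Per-token weighted cross-entropy balancing text and vision tokens
# (the weights here are placeholders, not LWM's training values).
import jax
import jax.numpy as jnp


def weighted_token_loss(logits, targets, is_vision,
                        text_weight=1.0, vision_weight=0.5):
    """logits: (seq, vocab), targets: (seq,) int, is_vision: (seq,) bool."""
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, targets[:, None], axis=-1)[:, 0]
    weights = jnp.where(is_vision, vision_weight, text_weight)
    return (nll * weights).sum() / weights.sum()
```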
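The installation steps described above correspond roughly to the following shell commands; the requirements file name is an assumption about the repo layout, and the TPU VM setup script is not named here, so consult the repository for the exact paths.

```bash
# GPU setup (sketch; see the repository for the authoritative instructions).
conda create -n lwm python=3.10 -y
conda activate lwm
pip install "jax[cuda12_pip]==0.4.23" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install -r requirements.txt   # remaining dependencies (file name assumed)

# TPU setup: run the TPU VM setup script provided in the repository.
```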