Imagine a professional musician exploring new compositions without having to play a single note on an instrument. Or an indie game developer populating virtual worlds with realistic sound effects and ambient noise on a shoestring budget. Or a small business owner easily adding a soundtrack to their latest Instagram post. That’s the promise of AudioCraft: our simple framework that generates high-quality, realistic audio and music from text-based user inputs, and that is trained on raw audio signals rather than symbolic representations such as MIDI or piano rolls.

AudioCraft consists of three models: MusicGen, AudioGen, and EnCodec. MusicGen generates music from text prompts, AudioGen generates environmental sounds and sound effects from text prompts, and EnCodec is the neural audio codec that compresses audio into discrete tokens and decodes those tokens back into waveforms. We’re excited to release an improved version of our EnCodec decoder, our pre-trained AudioGen model, and all of the AudioCraft model weights and code. Researchers and practitioners can now train their own models on their own datasets and help advance the state of the art in generative audio.
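For readers who want to try the released code, here is a minimal sketch of generating a short music clip with the audiocraft Python package, following the usage pattern documented in the AudioCraft repository. The checkpoint name, prompt text, and duration are illustrative choices, not the only options.

```python
# Minimal text-to-music sketch with AudioCraft.
# Assumes `pip install audiocraft` and a working PyTorch setup;
# the checkpoint, prompt, and duration below are illustrative.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pre-trained MusicGen checkpoint (weights download on first use).
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Generate 8 seconds of audio per prompt.
model.set_generation_params(duration=8)

descriptions = ['an upbeat acoustic folk tune with hand claps']
wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]

# Write each clip to disk with loudness normalization.
for idx, one_wav in enumerate(wav):
    audio_write(f'clip_{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```

The same pattern applies to sound effects: swapping in `AudioGen.get_pretrained('facebook/audiogen-medium')` from `audiocraft.models` generates audio such as barking dogs or wailing sirens from a comparable text prompt.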