Language models exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions. Paradoxically, however, they struggle with basic functionality, such as arithmetic or factual lookup, where much simpler models excel. Giving language models the ability to use external tools such as search engines, calculators, or calendars offers a simple way to overcome these limitations and further improve their performance. Existing approaches, however, either rely on large amounts of human annotations or limit tool use to task-specific settings, hindering more widespread adoption. In this paper, we introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.
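
The abstract does not fix a concrete interface, but the mechanism it describes, namely emitting an API call inline during generation, executing it, and splicing the result back into the context for future token prediction, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `[Tool(args) -> result]` markup, the `TOOLS` registry, and the `calculator`/`calendar` helpers are all hypothetical names chosen for this sketch.

```python
import re
import datetime

# Hypothetical tool implementations; the registry and tool names
# are illustrative, not the paper's actual API set.
def calculator(expr: str) -> str:
    # Evaluate a simple arithmetic expression, rejecting anything
    # outside a small whitelist of characters before eval.
    allowed = set("0123456789+-*/(). ")
    if not set(expr) <= allowed:
        raise ValueError(f"unsupported expression: {expr!r}")
    return str(round(eval(expr), 2))

def calendar(_: str) -> str:
    # Return today's date, in the spirit of a calendar tool.
    return datetime.date.today().strftime("%A, %B %d, %Y")

TOOLS = {"Calculator": calculator, "Calendar": calendar}

# Matches an inline call such as "[Calculator(400 / 1400)]".
CALL = re.compile(r"\[(\w+)\((.*?)\)\]")

def execute_api_calls(text: str) -> str:
    """Replace each "[Tool(args)]" span with "[Tool(args) -> result]"
    so the tool's output becomes context for the tokens that follow."""
    def run(match: re.Match) -> str:
        name, args = match.group(1), match.group(2)
        result = TOOLS[name](args)
        return f"[{name}({args}) -> {result}]"
    return CALL.sub(run, text)

if __name__ == "__main__":
    generated = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed."
    print(execute_api_calls(generated))
    # -> Out of 1400 participants, 400 [Calculator(400 / 1400) -> 0.29] passed.
```

In this framing, the model's learned behavior is deciding where to emit such a call and with what arguments; the executor above only handles the mechanical step of running the call and inserting its result into the text stream.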