Instead of performing 10 major software upgrades to our infrastructure every year, what if we did 10,000 small ones? Across our entire codebase? That’s the idea behind Fleet Management: by building automation tools that can safely make changes to thousands of repos at once, we can maintain the health of our tech infrastructure continuously (instead of slowly and laboriously). More importantly, removing this low-level work from our developers’ to-do lists allows product teams to focus on solving problems way more interesting than migrating from Java 17.0.4 to 17.0.5.

A healthier, more secure codebase, plus happier, more productive engineers. What’s not to like? In this first post about Fleet Management at Spotify, we describe what it means to adopt a fleet-first mindset, and the benefits we’ve seen so far.

The problem of maintaining speed at scale

Since shipping the very first app, Spotify has experienced nearly constant growth, be that in the number of users we serve, the size and breadth of our catalog (first music, then podcasts, now audiobooks), or the number of teams working on our codebase. It’s critical that our architecture supports innovation and experimentation both at a large scale and a fast pace.

Many small squads, many more components

We’ve found it powerful to divide our software into many small components that each of our teams can fully design, build, and operate. Teams own their own components and can independently develop and deploy them as they see fit. This is a fairly regular microservice architecture (although our architecture predates the term), applied to all types of components, be those mobile features, data pipelines, services, websites, and so on.

As we’ve scaled up and expanded our business, the number of distinct components we run in production has grown and is now on the order of thousands.

The number of Spotify engineers (green) vs. the number of software components (violet). Components grew at a much faster rate over time.
The small stuff adds up quickly

Maintaining thousands of components, even for minor updates, quickly gets arduous. More complex migrations (e.g., upgrading from Python 2 to 3 or expanding the cloud regions we’re in) take significant engineering investment from hundreds of teams over months or even years. Similarly, urgent security or reliability fixes would turn into intense coordination efforts to make sure we would patch our production environment in a timely fashion.

The graph below shows the progression of a typical migration, in this case upgrading our Java runtime, pre–Fleet Management at Spotify. All in all, this single migration took eight months, about 2,000 semiautomated pull requests, and a significant amount of engineering work.

A slow, hard slog: In the days before Fleet Management, we typically measured software migrations, like this update to our Java runtime, over many months.

In addition to the toll this takes on a developer’s time, it also takes its toll on developer experien