Shifting cost optimisation left: Spotify Backstage Cost Insights

Spotify is very much an engineering led company. It is also a product led company. Therefore when it decided it needed to get a better handle on its cloud spend, it decided to build an internal product designed to be used by its engineers. The result is Cost Insights, a plugin for Spotify’s Backstage developer portal. The idea is that engineers and engineering teams are incentivised to take more responsibility for the costs associated with the products they’re building. Modelling cost becomes part of the engineering process, rather than being a separate process for finance teams to manage.

Spotify cut its annual cloud spend by millions of dollars by helping engineers to make better decisions about resource allocation. Why was this so important? The answer is pretty simple. Infrastructure costs were outpacing user acquisition so Spotify set out to bend that curve to better align engineering and business goals.

When forecasted cloud costs are growing at a rate faster than income and revenue, it’s time to make changes. Naturally, Spotify thought it through as an engineering problem.

Spotify has been successful with its approach and is now ready to help other companies that are having the same problem. Spotify therefore open sourced Cost Insights in October 2020 as part of its broader open source and community efforts with Backstage.

Spotify certainly isn’t alone. As organisations migrate more and more workloads to the cloud, and build new applications and services there, they’re finding that cost sprawl is very real. Service ownership is a problem – who is responsible for the bill, and keeping costs down?

I spoke to the Spotify Cost Insights team in some depth to learn more. In this post we will look at Spotify’s culture, approach, technology and what it hopes to achieve with Cost Insights, but also how other companies can begin to adopt some of the same approaches based on Spotify’s practices. This research is based on interviews with Spotify employees, which are found in the companion videos, embedded below. These include a RedMonk Conversation about the issues raised, but also a video in our What Is/How To format, which includes a demo (see below for that sweet, sweet video content.)

At Spotify, speed is a core cultural principle.

CEO and founder Daniel Ek sums it up:

We aim to make mistakes faster than anyone else.

With cloud computing though, focusing on speed means costs can spiral out of control. The Spotify cost optimisation team has spent a lot of time ensuring the tools it builds are relevant to both new product teams and Spotify’s more mature businesses, while reconciling that speed with sustainable cloud costs.

Spotify has almost five hundred engineering teams today. It does about twenty thousand deployments a day across thousands of microservices. That’s a huge amount of complexity to manage, and it’s not really amenable to simply pointing at spreadsheets and saying: “we need to fix this”. Instead Spotify hacked the culture. It went back to basics and thought about cost as a problem to be solved with its engineering culture.

Spotify engineering teams are used to having a lot of autonomy, so the company couldn’t just introduce new cost guardrails as a top down concern. Autonomy is a core value at the firm, so Cost Insights was built with that in mind. Therefore the Cost team tried to foster a culture where optimisation would be fun.

Cost Insights already had a natural home.

Backstage is Spotify’s developer portal. The company’s entire software delivery supply chain is managed with Backstage – all components, data, pipelines, and services are managed using the platform, from idea to production, including monitoring and observability. The core idea is to provide a single, consistent UI for all infrastructure tooling, services, and documentation. Backstage is the company’s front end for all engineering tasks. Both developer and operational workflows are integrated into one console. Backstage is built around a service catalogue enabling teams to track the infrastructure underpinning the services and products they’re building.

Core Backstage use cases include:

  • Creating a new microservice
  • Following a pull request from review to production
  • Centralizing technical documentation
  • Reviewing the performance of a team’s mobile features

Backstage abstracts a lot of complexity, whether starting a Kubernetes cluster, or provisioning a pipeline. For example – Spotify uses Jenkins under the hood for CI, but engineers don’t need to know that: they just use GitHub Enterprise and Backstage.

Backstage is not just a developer portal – a better description might be a platform for building developer portals and managing software delivery. Spotify is open sourcing these components of its own Developer Experience (DX) infrastructure, for use by the engineering community. Given that many organizations see Spotify culture and processes as aspirational – the much vaunted “Spotify Model” – it makes sense that folks would adopt its tools. We’ve seen this with Netflix, and open source tools such as Chaos Monkey, Spinnaker and Zuul.

Backstage was adopted as a Sandbox Project by the Cloud Native Computing Foundation in September 2020, and there has been a steady uptick in contributors and folks building their own plugins. Spotify is being intentional about building third party interest: more than 40% of pull requests are now from external, non-Spotify, contributors.

Backstage also brings an opinion about design to the table – what plugins should look like, and how they should be built, which should be helpful as Spotify grows the community. Great user experiences can definitely be supported with effective design systems.

We believe an elegant, cohesive UX is vital to what makes Backstage such a productive, end-to-end development environment.

Backstage is a modern stack, written primarily in TypeScript. The front end is built with React, which also defines the platform’s plugin architecture. It uses the Yarn package manager and the Lerna monorepo tool. Code is formatted with Prettier and linted with ESLint. Being a modern stack, the API gateway is of course GraphQL-based.

In terms of deployment, Backstage does a solid job of supporting Kubernetes environments, notably GKE, given that’s what Spotify uses in-house. For documentation Spotify has embraced a docs-as-code approach with MkDocs.

The fact that Backstage is already central to how engineers and engineering teams work at Spotify made it the natural home for Cost Insights. Spotify wanted cost optimisation in a place where engineers are already spending the bulk of their time, as opposed to trying to encourage them to use a different third party system.

Backstage was also a place to bootstrap Cost Insights because it is based on a service catalogue model, with the graph of services already in place. Third party organisations wanting to take advantage of this functionality are going to have to do some upfront work modelling and mapping their internal APIs and services to their own products and services, but building this kind of service map is a useful exercise for any organisation.

Cost Insights also supports labels. Engineers can label Cloud provider resources so that they match their own internal terms, components and service names, rather than relying on billing info from the Cloud provider itself.

One useful aspect here is that often in an organisation a shared infrastructure service used by multiple teams is not properly represented in terms of billing. One team ends up paying for that service, rather than it being properly allocated. Backstage modelling helps with internal chargeback.
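
To make the allocation point concrete, here is a minimal sketch of splitting the bill for a shared service across the teams that use it, in proportion to their usage. It is not how Cost Insights does it: the `TeamUsage` shape, the `allocateSharedCost` function, the team names and the numbers are all hypothetical, for illustration only.

```typescript
// Hypothetical record of how much of a shared service each team consumed.
interface TeamUsage {
  team: string;
  usageUnits: number; // any consistent unit, e.g. requests or CPU-hours
}

// Split a shared service's cost across teams in proportion to their usage.
function allocateSharedCost(
  totalCost: number,
  usage: TeamUsage[],
): Record<string, number> {
  const totalUnits = usage.reduce((sum, u) => sum + u.usageUnits, 0);
  const allocation: Record<string, number> = {};
  for (const u of usage) {
    // If no usage was recorded at all, fall back to an even split.
    const share = totalUnits > 0 ? u.usageUnits / totalUnits : 1 / usage.length;
    allocation[u.team] = (allocation[u.team] || 0) + totalCost * share;
  }
  return allocation;
}

// Illustrative figures only: a 1,000 unit bill shared by three teams.
const sharedQueueBill = allocateSharedCost(1000, [
  { team: 'playlists', usageUnits: 600 },
  { team: 'search', usageUnits: 300 },
  { team: 'podcasts', usageUnits: 100 },
]);
```

The choice of usage unit is the hard part in practice. The arithmetic only works if every team is measured the same way.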

Spotify largely views cost through the lens of engineering resources. If you invest X to automate a process or optimise a system, what are the returns in terms of Full time Employees (FTEs) – at Spotify that basically means Engineers – you could potentially hire with the savings.

Early experiences with Cost Insights allowed Spotify to fund the equivalent of 25 teams across the company. One interesting part of the Spotify cost optimisation journey is that it is not based on driving specific team OKRs. Engineers and teams don’t get the money back that they save, to spend on their own product, but rather it goes back into the company. The incentives are communal and cultural rather than directly financial. Cost Insights could be used as part of engineering bonus structures, at more financially oriented companies. But for Spotify at least the idea is simply to optimise so that everyone in the company benefits. This communitarianism probably reflects the fact Spotify is a Swedish company, with Scandinavian values.

Developers generally like specificity however, regardless of where they’re from. An engineer has trade-offs to make. If a particular service is becoming too expensive, for example, is there an alternative approach from the Cloud provider that might save money in the long run?

Is this migration worthwhile? How much will it save? How much engineering effort should they put in it? Spotify wants an engineer to be easily able to make a calculation based on how many engineers they need to use on a migration or automation in order to save further engineering resources.

If it’s going to take months of effort and only save half an engineer headcount then an “optimisation” is probably not worthwhile. It’s not enough to simply present cost information. It needs to be granular and mapped to services. Engineers want to know what’s the problem and where is it coming from? That way they can make better decisions.
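
The calculation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Spotify’s formula: the `OptimisationProposal` shape, both function names, and every figure (including the assumed annual cost of an engineer) are made up for the example.

```typescript
// All names and figures here are illustrative assumptions, not Spotify data.
interface OptimisationProposal {
  annualCloudSavings: number;   // expected yearly saving, in currency units
  effortEngineerMonths: number; // engineering time needed to deliver it
}

// Express a yearly saving as the number of engineers it could fund.
function savingsInEngineers(
  annualSavings: number,
  annualCostPerEngineer: number,
): number {
  return annualSavings / annualCostPerEngineer;
}

// Months until the saving has repaid the engineering time spent on it.
function paybackMonths(
  proposal: OptimisationProposal,
  annualCostPerEngineer: number,
): number {
  const effortCost = (proposal.effortEngineerMonths / 12) * annualCostPerEngineer;
  const monthlySavings = proposal.annualCloudSavings / 12;
  return monthlySavings > 0 ? effortCost / monthlySavings : Infinity;
}

// A migration taking six engineer-months that saves "half an engineer" a year.
const assumedCostPerEngineer = 200000;
const migration: OptimisationProposal = {
  annualCloudSavings: 100000,
  effortEngineerMonths: 6,
};
const engineersFunded = savingsInEngineers(
  migration.annualCloudSavings,
  assumedCostPerEngineer,
); // 0.5 engineers
const monthsToBreakEven = paybackMonths(migration, assumedCostPerEngineer); // about 12
```

With these made-up numbers the work takes around a year just to pay for itself, which is the kind of “optimisation” an engineer might reasonably decide to skip.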

One facet of Cost Insights that really strikes a chord is the focus on encouraging engineers to do more of what they already like to do.

According to Janisa Anandamohan, Spotify Senior Product Manager, Cost Engineering:

We know engineers are natural optimisers when it comes to reliability, security, performance, et cetera. And now we’re telling them, hey, add costs into the mix. And they were super excited about that. We had many teams that were just able to tweak their services and data pipelines and to make them more efficient.

And we know efficiency doesn’t just help cost, but helps reliability and performance as well. So we were getting double, triple wins along the way.

Spotify further encouraged its developers in a couple of ways. One is to provide recommendations within the portal – for example, with advice about auto-scaling strategies. The second is with an internally crowdsourced document called Our Cookbook, for engineers to share what worked for them in terms of system optimisations, to help other teams.

Spotify realised that developers were treating cost optimisation like a game, showing off their wins, and encouraging other teams to do the same thing. This helped create a virtuous circle, to “bend the curve”. Leader board functionality is therefore on the roadmap, to help the social, competitive aspects of cost control, as teams compete to reduce costs across services, and better learn from their peers.

While Spotify expects engineers to engage and take some responsibility for cost management, sometimes they need a nudge, and different teams take to it differently. At Spotify cost management is a conversation between engineers and the business, which is represented by a specialist cost management function.

Some teams got access to Cost Insights and immediately started using it to figure out optimisations and quickly put them into place. With other teams, the Spotify cost management team gets involved and asks why a service is costing so much. Often the team has a good reason for cost expansion though. For example a new service is likely to incur extra cost while it’s in startup mode. A more mature service though should see margins improving.

At times when the cost management organization feels it needs to intervene, it takes a collaborative approach to optimisation, helping the team in question to figure out how to optimise, for example, a workflow that’s running data processing expensively. It’s a consultancy model.

Spotify argues that having these conversations, bridging bottom up and top down, helps improve alignment in the business. Cost Insights takes finger-pointing out of the equation.

Spotify hasn’t just built some code for itself, licensed it as open source, and hoped it sticks. It is explicitly managing the Backstage development process in terms of product management, and has a full team working on it. The business model isn’t clear yet, but Spotify clearly believes it needs to build a business model for the product in order to encourage sustainable adoption. The product has its own logo.

In the case of Backstage, plans arguably go further than, for example, Netflix’s with Chaos Monkey. Spotify wants to take more responsibility for third party use of the software. Spinnaker might be an interesting parallel: it began as a project at Netflix and now has a startup, Armory, building products and services around it. Or Kafka, which began as a project at LinkedIn, and now has companies built on the technology, notably Confluent. Commercial Open Source is a business in its own right, and Spotify is not in that business, so it will be interesting to see how this plays out. One startup has already emerged with a hosted support model for Backstage, with a cool name. Who doesn’t want to be a Roadie and have a backstage pass at a great gig?

For example, while Cost Insights–like Backstage–is intended for use by other companies, it was initially designed for use at Spotify. This creates some interesting product management challenges. Notably, Cost Insights is primarily designed for Google Cloud, given Spotify primarily runs on GCP, but AWS support is planned. Who will pay for functionality that isn’t core to Spotify’s needs? That’s where a sustainable open source model comes in.

Cost Insights is not for the average enterprise. It’s explicitly positioned for use by high scale companies like Spotify – with its 2k+ microservices and 4k data pipelines – to help costs be attributed to specific teams, products and services.

Spotify also shares notes on a regular basis with peer cloud companies, the biggest customers of hyperscale cloud providers, about best practices in high scale operations, technology and spend. It also uses the term FinOps, shorthand for Cloud Financial Management. The FinOps Foundation is run by the Linux Foundation.

The FinOps argument is interesting, because there is quite a difference between the needs of Spotify and its peers and those of more traditional enterprises. Even the most digital enterprise is not using anywhere near the kind of capacity of a Netflix or Uber.

There are a host of FinOps vendors out there, providing Cloud Financial Management. Cloudability, CloudHealth and ParkMyCloud are well-known vendors. And of course you have AWS Cost Explorer, Amazon’s own tool.

AWS revenues in Q4 2020 were $12.7 billion. Microsoft reported total revenues of $43.1 billion for the second quarter of fiscal 2021, of which its commercial cloud business accounted for $16.7 billion. Google Cloud’s revenues for 2020 were $13.06 billion.

Cloud is taking an increasing share of customer IT budgets in almost all product categories. That’s a lot of cost to optimise, particularly given the huge number of AWS services that customers are using, with new ones constantly being introduced. FinOps vendors generally come at the world from the perspective of the finance person rather than the engineer, though. As Spotify looked to reduce its spend, its first initiative – before engineering-centred initiatives like Cost Insights – was top down, focusing on better forecasting and pre-booking GCP resources.

Not everyone is a fan of the “FinOps” movement. Corey Quinn, co-founder of Duckbill Group, is almost certainly the best cost optimisation expert on the planet when it comes to high scale use of AWS services. If you’re spending tens of millions of dollars a year on AWS you should contact Corey to reduce your bill before you finish reading this post. Duckbill offers fixed fee spend management advice with a solid engineering understanding of AWS services and how to get the best out of them. Corey believes that FinOps needlessly complicates things, and that good practices by developers – for example turn things off when you’re not using them! – is the best way forward.

You should be enabling engineers to better understand their total cost of service, which brings us back to tools like Cost Insights.

We discussed labels above. One of Spotify’s earliest efforts in the space was to apply labels to every cloud resource that it could in Spotify terms and language, rather than that of the cloud provider. This means that when an engineer looks at this tool, they can immediately see things that they understand in terms of their day to day work, rather than trying to work out what it means in GCP terms.
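
As a rough illustration of why labels matter, the sketch below rolls up billing rows by a label value instead of by the provider’s product name. The `BillingRow` shape, the `costByLabel` function and the `service` label key are assumptions made for the example; real billing exports carry many more fields, and this is not Cost Insights’ own code.

```typescript
// Hypothetical billing row; a real cloud billing export has more fields.
interface BillingRow {
  cloudProduct: string;           // the provider's product name
  cost: number;
  labels: Record<string, string>; // labels attached to the resource
}

// Sum costs by the value of one label, e.g. an internal service name.
function costByLabel(
  rows: BillingRow[],
  labelKey: string,
): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    // Keep unlabelled spend visible rather than silently dropping it.
    const owner = row.labels[labelKey] || 'unlabelled';
    totals[owner] = (totals[owner] || 0) + row.cost;
  }
  return totals;
}

// Illustrative rows: two products, one internal service, one stray resource.
const exampleRows: BillingRow[] = [
  { cloudProduct: 'Compute Engine', cost: 120, labels: { service: 'playlist-api' } },
  { cloudProduct: 'Cloud Storage', cost: 30, labels: { service: 'playlist-api' } },
  { cloudProduct: 'Compute Engine', cost: 45, labels: {} },
];
const costByService = costByLabel(exampleRows, 'service');
```

The `unlabelled` bucket is a deliberate design choice in this sketch: spend that nobody owns is usually the first thing worth chasing.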

These data are quite granular, showing cost by cloud product: for Google Cloud Dataflow or data processing, the developer sees costs broken down by pipeline, while for Google Compute Engine the dev sees Spotify service names. That information can then be correlated with different business metrics, which are configurable – e.g., daily active users, or monthly subscribers. The platform is extensible because Spotify of course expects third party users to have their own KPIs and metrics.
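
Correlating cost with a business metric boils down to a unit cost, such as cost per daily active user. A minimal sketch, assuming a hypothetical `DailyPoint` shape and invented numbers (none of this is taken from the Cost Insights codebase):

```typescript
// Hypothetical daily data point pairing spend with a business metric.
interface DailyPoint {
  date: string;           // ISO date string
  cost: number;           // cloud cost attributed to the service that day
  businessMetric: number; // e.g. daily active users
}

interface UnitCost {
  date: string;
  costPerUnit: number | null; // null when the metric is zero that day
}

// Divide each day's cost by that day's business metric.
function costPerUnit(points: DailyPoint[]): UnitCost[] {
  return points.map((p) => ({
    date: p.date,
    costPerUnit: p.businessMetric > 0 ? p.cost / p.businessMetric : null,
  }));
}

// Invented figures: total cost rises, but cost per user falls.
const unitCosts = costPerUnit([
  { date: '2021-01-01', cost: 500, businessMetric: 1000 },
  { date: '2021-02-01', cost: 600, businessMetric: 1500 },
]);
```

In this made-up series the bill grows while cost per user drops, which is the “bend the curve” outcome: spend growing more slowly than the business it supports.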

Cost isn’t the only factor that can be optimised. One interesting idea is that Cost Insights can also be used to optimise the sustainability of a service, deciding in what region, how and when a service runs, and ensuring that it takes advantage of a data centre running on renewable energy.

Spotify includes language in Backstage recommendations about optimisation for sustainability. The team built a feature where the user can switch from its normal FTE (engineer) conversion to carbon offset tonnes.

RedMonk would like to see this sustainability view adopted by user organisations because it might help large cloud buyers to put more pressure on AWS to focus on renewable energy. AWS is on the record as saying more customer pressure will direct the firm to accelerate its sustainability efforts.

One question about shifting cost optimisation left is – what should the cadence look like? It’s not like unit testing, for example, which happens as a constant process of improvement.

Spotify found that the best way to get involvement was to encourage engineers to use Cost Insights in a cadence alongside existing quarterly planning. If there were issues that the Cost team felt needed attention, they’d alert a team before those meetings. That said, if costs are rapidly escalating out of control for a particular service, that’s something that should generate an alert, so anomaly detection is on the Cost Insights product roadmap.
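
Spotify didn’t describe how its planned anomaly detection will work, so the following is only a sketch of the general idea: flag a service when its latest daily cost is well above its recent average. The `isCostSpike` function, the window size and the threshold factor are all assumptions chosen for illustration.

```typescript
// Flag the latest day's cost if it exceeds the trailing average by a factor.
// The window size and factor are arbitrary knobs, not values from Spotify.
function isCostSpike(
  dailyCosts: number[],
  windowDays: number,
  factor: number,
): boolean {
  // Without a full window of history plus the latest day, do not alert.
  if (windowDays < 1 || dailyCosts.length < windowDays + 1) {
    return false;
  }
  const latest = dailyCosts[dailyCosts.length - 1];
  // The windowDays values immediately before the latest one.
  const baseline = dailyCosts.slice(-(windowDays + 1), -1);
  const mean = baseline.reduce((sum, c) => sum + c, 0) / baseline.length;
  return latest > mean * factor;
}

// Invented series: four steady days, then a jump.
const spikeDetected = isCostSpike([100, 100, 100, 100, 250], 4, 1.5);
```

A fixed multiplier like this is crude, since it ignores weekly seasonality and planned launches. It may still be enough to prompt the kind of conversation with the Cost team described above.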

All that said, Cost Insights does have a prime place in Backstage, so the information is always there for developers to consider as they make trade-offs and decisions about infrastructure choices.

The move to create a rich ecosystem around Backstage is going to be an interesting new challenge for Spotify, involving a lot of work and new trade-offs around what are, or are not, core assets. Building for others may, for example, slow down development of functions that Spotify itself needs, which is awkward when velocity is a core company value.

But Spotify is right that cloud costs are growing out of control at many companies as they increase their investments in the cloud. That being the case, they’ll be looking for tools that improve their effectiveness in that area. The market for Cost Insights won’t necessarily be huge, but it could be a very influential slice of the industry’s cloud spend.

[embedded content]

[embedded content]

RedMonk has done some work with Spotify, but this post was not sponsored, and is an independent piece of research. AWS, Google Cloud and Microsoft are all RedMonk clients.

(Read this and other great posts  @ RedMonk)