How Platform Engineering Works
In April 2023, I gave my first ever public conference talk at DevOps Days in Birmingham, Alabama. I shared the lessons I have learned in my first year at Sotheby's about how to make Platform Engineering work. It was an honor to be chosen as one of the few speakers at the event. What follows is the content I used for the talk, along with the slides I used to present.
Here's a recording of the talk, thanks to the DevOps Days Birmingham organizers!
Introduction
Hi! I'm Chad. I work at Sotheby's as a Staff Infrastructure Engineer on our Platform Engineering team, known internally as the ThunderCats.
Sotheby's is primarily an Auction house for luxury goods like art, jewelry, apparel, and cars. We have about 2,000 employees worldwide and did about 8 billion in sales in 2022. We have about 70 engineers within workstreams supporting our mobile app, auction experiences, financial services, and internal business tools. There are 4 of us in Platform Engineering.
I've been at Sotheby's for about a year and a half now, and I'd like to share with you what I've learned about how Platform Engineering works.
As a former Product Engineer, being a Platform Engineer at Sotheby's this past year has been a huge opportunity for me to learn and stretch my skills. The insights I have gained are forming the foundation upon which we're building our platform and our engineering organization as a whole. I'm excited to share with you what I've learned… about what defines the core of Platform Engineering and how I've learned to apply these principles in my day-to-day work.
Upon becoming a Platform Engineer, I asked my boss "What does Platform Engineering mean to you?" He succinctly responded "Velocity and Stability". This made sense to me. "Sure, I get it" I thought. But as I went about my work, I revisited this description over and over, and found there to be a lot of depth to these words.
So now a year later, as I think about what Platform Engineering means to me, this is what I think of.
Platform Engineering is the application of a Product Mindset to supporting your engineering organization's software delivery velocity and system stability.
Not as catchy, but I think bringing a Product Mindset into the definition is a critical aspect to doing our jobs well. So let's dive in and talk some about these three main qualities: velocity, stability, and a product mindset.
Velocity and Stability are the Core Outcomes of Platform Engineering
Velocity and Stability describe the core outcomes a Platform Engineering team is driving at.
Velocity represents how fast we can provide a solution to meet a customer's needs, from idea to delivery.
Many aspects of the Software Development Lifecycle can be changed to improve delivery velocity, such as project management or agile practices, the local development environment, technology choices, review processes, the deployment pipeline, and more. Many sub-disciplines of Platform Engineering exist that take improving delivery velocity as their primary goal, such as "Build Engineering", "DevOps", "Developer Experience", and "Developer Productivity" teams.
System stability, on the other hand, represents the work required to provide a consistent experience to your company's customers.
While mostly affected by operational practices, stability is also affected by cultural norms such as incident response processes and learning from incidents through post-mortem retrospectives. A common sub-discipline of Platform Engineering that specializes in Stability is "Infrastructure Engineering". Security practices commonly found as components of "DevSecOps" also support System Stability under the Platform Engineering banner, and SRE, while not necessarily a sub-discipline, is strongly aligned.
You may recognize the metrics referred to here as the DORA "4 key metrics".
The DORA State of DevOps Report has proven year after year that maintaining both velocity and stability is key to sustaining a high-performing software delivery organization. This is WHY it is so important to have a Platform Engineering discipline in your organization. Attention to maintaining these characteristics can slip as teams focus on delivering the company's core value. Having a Platform Engineering team supports our organization's ability to focus on delivering its value, while enabling a continuously improving delivery velocity and system stability for all teams.
But Platform Engineers have many challenges to overcome to be successful. They are doing primarily internal-facing work, have fewer engineers to work with, and often lack much of the management support other team archetypes receive such as product and project management. Platform Engineering is also typically an "engineering-led" discipline, and engineers are notorious for overbuilding, under-designing, "chasing the new and shiny tech", and falling into other pitfalls resulting from a lack of customer focus.
In order for engineers to lead Platform Engineering efforts successfully, we have to learn to wear many hats. But above all, we must be hyper-focused on our customers, pragmatic about what we can achieve, approach our improvements iteratively, and set our goals on outcomes, not outputs. In short, we must adopt a Product Mindset.
Adopting a Product Mindset is how Platform Engineering Works
Adopting a Product Mindset is how Platform Engineering works.
Founder engineers know this all too well - building a product as an engineer requires not only having a good technical idea and the skill to deliver it, but also Product-Market fit, which requires customer feedback, marketing and support. Founder engineers rarely have the luxury of being able to rely on others to do these things for them, and Platform Engineers often find themselves in this same boat.
Much of the buzz around Platform Engineering suggests we think of "Platform as a Product", as if we are building what amounts to an internal SaaS product like Heroku. While this may be what some organizations' platforms look like, the deeper meaning here is that we think about our Platform through the lens of a Product Manager. This means we need to slow down and think holistically about how to deliver value to our customers - the software engineers within the organization, but more broadly, all our colleagues and the customers our company is striving to serve. This results in better outcomes because it keeps the focus on solving the problems your organization is experiencing right now.
My team has been living with this charter of improving velocity and stability along with the understanding that we must adopt a product mindset to be successful, and I want to share how we have been applying these principles to our work. My hope is that our lessons learned can add some color to the theory and provide a peek behind the curtain of what Platform Engineering looks like day-to-day.
Lesson 1: Set Goals on Outcomes, Not Outputs
The most important lesson I've learned this past year is the importance of setting goals on outcomes, not outputs.
"Outcomes over output" is a common saying we've learned from the lean product management world that helps us set meaningful goals. We definitely need outputs to achieve our outcomes, but this is meant to remind us to avoid setting goals against activities; otherwise, we may end up in a sort of "productivity theater", where there is plenty of work getting done without meaningful results. We need to ensure that whatever outputs we are creating are achieving our desired outcomes, and if not, be willing to reassess and experiment with entirely different approaches.
At Sotheby's, our Platform Engineering team integrates this outcome-based goal setting into our yearly OKR process, our quarterly planning, and our story kickoff meetings.
Use OKRs for Strategic Planning
For our OKR process, we get together in person, typically in Iceland, where two of our team members are based. During this week, we spend time revisiting our team charter, reflecting on our last year's work, and building a vision for what outcomes we want to accomplish as a team and how we want to see the engineering organization progress.
Because our goals are outcome-oriented, it gives us the freedom to be creative in how we achieve them. Building them as a team draws ideas from everyone and aligns us around our chosen path forward. This engagement results in happier engineers which in turn drives better outcomes.
Perform Quarterly Tactical Planning
Our quarterly planning follows a similar process, but with more granularity. A quarter retrospective leads to collaborative brainstorming sessions, which leads to a final prioritization exercise to add projects to our roadmap. During our collaboration sessions, we examine feedback we've received throughout the quarter, discuss patterns we've been seeing, and dig into promising ideas. These sessions result in planned initiatives for the coming quarter.
Use Story Kickoffs for Alignment
Finally, our story kickoff is a meeting we have at the beginning of each major project. We clarify what outcome we're hoping to achieve by completing the work, what's in scope and out of scope, and who's participating in the project. Finally, we nominate a leader for the story that acts as a "delivery lead", keeping everyone on track and moving towards the goal. With this meeting at the beginning of each project, we make sure that everyone working on the project is aligned around the work and the outcome we are driving towards before digging in.
These 3 core iterative processes help us to self-organize around our goals. They keep our work focused on our desired outcomes while continuously inspecting and adapting throughout the year to ensure we're making progress and our goals are still relevant.
This outcome-based goal setting drives the importance of the next lesson I want to share with you, which is truly knowing your customer.
Lesson 2: Truly Know Your Customer
Even though we're engineers with engineers as our customers, this doesn't mean we can assume we know what problems matter most to them. As Platform Engineers, we're typically not doing the same class of work as other engineers, so it's very likely their challenges look different from ours. To stay connected to their real needs, we need to acknowledge this gap and work to bridge it.
At Sotheby's we're using three main strategies for keeping a finger on their pulse - providing engineers direct support, collaborating closely with them through short-term consulting engagements, and running a biannual survey.
Support Engineers on Slack
Platform Engineers are often some of the more senior engineers in the organization, especially at smaller places like ours. This means our help is often sought out, particularly in our areas of expertise like infrastructure, cloud, databases, and for tooling we support such as GitHub Actions, Terraform, and Kubernetes. To support these needs, we host a shared Slack channel. For each request, we guarantee a response within 1 business day but typically respond much sooner. This direct, low friction, and SLA-based approach to support gives our engineers the help they need in a timely manner to keep them from being blocked. Being a public channel, engineers can also help each other if they see someone ask a question they know the answer to.
This is an important way we learn about the sources of friction in their daily work, which feeds back into our prioritization and planning. But providing asynchronous support like this only gets us so far.
Consult, for a White-Glove Approach
Sometimes a white-glove approach is a better fit for supporting engineers in achieving our organizational goals.
For example, we recently "embedded" 4 engineers on 2 different teams for a month to support them in building out Service Level Objectives. Through active pair- and mob-programming sessions, we were able to explore implementation strategies together, which shared our mental model of SLOs. We also directly helped them to implement SLIs, datadog dashboards, and even new Platform services to support ingesting SLI telemetry from our frontend applications. This collaborative way of working strengthened relationships between our teams, building trust and mutual understanding.
It also gave the Platform team a closer look at life as a product engineer and respect for the challenges they are facing. We discovered they were having issues debugging their applications and jumping to implementations of framework code in their IDEs, which was hampering their day-to-day productivity. We also found testing out gRPC endpoints was frustrating to them. These discoveries have since fed back into our planned work, where we have enabled gRPC reflection on all services, and will be spending time writing IDE setup guides to clarify configuration of a productive build/debug/test flow.
With both low-touch and high-touch supporting activities, it gives us an opportunity to see first-hand what kinds of solutions are needed to improve the velocity of our engineering teams. But there is still a gap in our understanding of their needs - what about the people that are not actively asking for support, but still have feedback or unmet needs?
Survey for a Broad Perspective
This is where it can be really helpful to run an engineering-wide survey.
We ran our first engineering survey this spring to get an unbiased understanding of the state of various high-level aspects of our engineering culture. We decided to use Qualtrics for this as a lightweight way to get starting with surveying. Google or Microsoft Forms can also work. The hard part about running surveys is primarily knowing what questions to ask, but a combination of the SPACE framework, the DORA capabilities model, and a read through the books Accelerate by Nicole Forsgren et al. and Drive by Dan Pink can go a long way to informing what a high-performing engineering culture looks like. It can still be really helpful to get support from a UX researcher for this, even if only to ensure your approach to surveying is statistically sound. If you are looking for a turnkey solution to surveying and have some budget for it, I found through my research that DX @ getdx.com is a solid choice.
Prior to this survey, we felt we may have been over-prioritizing issues from a vocal minority, while other more pressing issues were silently hampering productivity. Our first set of survey results helped us to feel confident that our roadmap is aligned with the most pressing issues. We have since prioritized reducing the amount of toil around managing infrastructure, which was a top issue highlighted by the survey.
So through surveying, white-glove consulting, and collaborating with engineers day-to-day in Slack, we can piece together a relatively comprehensive picture of where we should be applying our effort. But problems don't always come paired with solutions. The problems that most need solving are often the ones without an obvious solution. That's why, in addition to understanding what problems we should solve, another important component of Platform Engineering is inventing on behalf of our engineering organization.
Lesson 3: Don't Just Respond, Invent
By inventing, I don't mean inventing something globally new, but something new to the organization. Because we can dedicate time, energy and resources to solving these cross-cutting concerns, we can bring great solutions that serve and delight our engineers. You could call this the fun part of Platform Engineering! Indeed, this is where you'll apply your creativity and technical chops to create real value for your organization. Some ways we can do this are by introducing known-good practices from the wider tech community, bringing new ideas to existing solutions for better outcomes, and experimenting with promising next-generation technology.
Introduce Known-Good Practices
When introducing known-good practices into your organization, it helps to have a broad understanding of the kinds of problems organizations face as well as a solid understanding of the solution space to pattern match against. Nothing is a substitute for experience with good practices, but some great way to bring more onto your radar is to read books, listen to podcasts, and explore the solutions offered by your cloud vendor of choice. I've written in depth about techniques for staying in touch with the tech community if you're interested in learning more.
At Sotheby's, we recognized a lack of knowledge sharing between senior engineers across teams, and a strong interest for it. We decided to introduce a discretionary "Request for Comments (RFC)" process. My team began sharing our big ideas as RFCs to both socialize them and solicit feedback. We also created an RFC template and some lightweight documentation that engineers could lean on if they wanted some help onboarding into the process. Finally, we created a Slack channel to socialize and discuss the proposals. We didn't invent the concept of RFCs nor did we even invent the specific RFC template we decided to use, but instead used one that had worked for me in the past at a previous organization. This in turn was drawn from the excellent work of Phil Calcado and his experience working at Meetup that he shared on his blog. Good ideas can, and should, spread.
Remember though that no matter what idea it is you are looking to introduce from outside your org as an innovation, it is important to tread lightly - be careful about introducing too much at once and make sure you are addressing problems that exist today - otherwise, your changes may be seen as cargo-culting and may not be taken seriously or adopted. In my case, I decided to introduce the RFC process as opt-in, to test the idea out and see if it caught on naturally. If instead I had tried to mandate it in some way, the initiative may have failed from the start or damaged my goodwill with engineers. Instead, it is slowly gaining traction as engineers begin see they are making better decisions with the help of this tool.
So as you're solving the problems of your engineers, look for gaps in process or tooling that you are aware of an existing, known good solution for, and introduce them! But the majority of your time will be spent iterating on existing solutions.
Iterate on Existing Solutions
Iterating is at the heart of a Product Mindset because it moves us away from the "one and done" project mindset of the past. As they say:
Software is never done, only abandoned
My team has iterated on a few of our offerings over the past year - some less exciting work such as keeping 3rd-party software up-to-date, but one project, in particular, stands out: our migration from Jenkins to GitHub Actions. In early 2022, we added Service Level Objectives to our CI/CD system. One of these was an objective to have fewer than 50 main branch build failures each month, which we found we were missing more than we liked. In response to this, we decided to migrate away from Jenkins and the Kubernetes plugin we were using for our builds, and start using GitHub Actions instead. This not only addressed our flaky builds but also removes the burden of managing Jenkins, provides a better UX for engineers for debugging their builds, and makes the build process easier to understand and customize. Our first iteration of this migration was to open up GitHub Actions use for what we call "Standalone" builds, which are builds outside our main Bazel build system. This smaller, "vertical slice" of GitHub Actions support gave us experience integrating with it and also let us test out how our engineers would respond to it. The feedback we received has been overwhelmingly positive, so this coming fall we plan to finish our migration by moving our Bazel builds over and turning down our Jenkins setup.
Of course, we won't always have prior experience to pull from, nor an existing solution to iterate on. Or maybe you feel like you're hitting a local maximum in your existing solution, and need to make a leap to an entirely new solution to the same problem. In this case we need to experiment.
Experiment With New Technology
It is healthy to consider any new change as an experiment. The riskier the change, the more experimentation eases tensions and encourages action. At Sotheby's, this year we plan to experiment with a new method of provisioning infrastructure that I have been calling "Manifest Infrastructure". This involves using Kubernetes Operators such as the open-source AWS Controllers for Kubernetes (ACK) or Crossplane to define the Infrastructure as Kubernetes manifests and have the infrastructure itself be managed by the operator. We're excited about this solution because it will provide the opportunity to build higher-level abstractions for engineers to interact with, leaving the plumbing an implementation detail. Whitney and Mauricio's keynote from KubeCon Detroit 2022 illustrates this well. We plan to try this on a subset of our applications, then deploy it more broadly if things go well.
Experimentation is a key tool in your toolbelt as a Platform Engineer, both for your own work and to empower your engineers. Consider how you can enable things like blue/green or canary rollouts, feature flagging, "two-way door" decisions, and incremental delivery. Your team and your engineers will thank you for the psychological safety, learning, delivery velocity, and system reliability that result.
So by applying these core principles we're making sure we're driving our work towards specific outcomes, we're informing our work with the voice of our engineers, and we're bringing new ideas to bear to solve for our organization's needs. The final key component to being an effective Platform Engineer is to make sure we continue to have bandwidth by scaling our impact.
Lesson 4: Scale Your Impact
The power of software engineering is the ability to "automate away" our problems. Traditional IT Operators were not trained in software engineering, but Platform Engineers are, so there is an expectation that this new discipline will scale sub-linearly with the size of the engineering organization. But we shouldn't be "building all the things" just because we can. For this model to work, Platform Engineers have to internalize that we should be biased towards "doing things that scale".
We can again apply our product mindset to this by being pragmatic about what we build vs. buy and guarding our time, so we can focus on what matters.
Make Good Build vs Buy Choices
While working to solve problems within our charter, we want to try to build as little software as possible, leaning on established products and open-source solutions when available and feasible. Building and maintaining software is labor-intensive, so we want to reserve this for only those problems where the benefit will far outstrip the cost and where no alternative exists. It can feel painful to accept a solution that's only ~80% of our vision of how we may want to solve a problem, but doing so can often be "good enough" for now and free us up to solve other problems. One particularly relevant example of this was our choice to adopt OpsLevel over a more custom tool such as Backstage. Our business goals for adopting such a tool were to build out a service catalog to provide a broader understanding of the systems in our organization and how they fit together. While not nearly as customizable as building a developer portal using the Backstage framework, adopting this tool solved for our primary business needs. And OpsLevel is constantly working to improve it and add features - work that our Platform Engineering team doesn't have to do. Our organization has taken this same approach elsewhere by adopting a mix of SaaS Products (such as Datadog, Algolia, and Auth0) and open-source tools (such as Linkerd, Jaeger, and Spinnaker) to provide robust solutions to common needs. We still build software, but it is typically for stitching services together for an integrated, polished feel, such as Slack bots, CI/CD customizations, and abstractions to create a great developer experience.
Knowing when it makes sense to build vs. buy is challenging to me, especially making the business case for buying a solution when it makes the most sense. The bigger the platform team or group, the more bandwidth there will be for custom development work, but this will always be a critical skill for a Platform Engineer to grow.
Avoid Interrupt Overload
Another important way we scale our impact is by protecting our time, so we can get planned work done. We prevent our shared support channel from distracting us too much by having only a subset of our team responding to questions each week. There will be days when we are "fire fighting", but we have to be mindful when this happens and prioritize addressing any recurring issues. We also constantly work to clarify and communicate our team charter so people know which conversations to bring us into - otherwise, our time gets drained by meetings and "side quests" that we contribute little to, leaving us little room to deliver on our broader initiatives.
Making wise build vs. buy decisions and protecting our time are only two of many possible ways you can ensure you scale your impact. If you find yourself down in the weeds on some super-gnarly problem, it might be totally okay! Your work may unlock huge value for the business and be totally worth the effort. But don't forget to come up for air sometimes to sanity check what's happening with your team, so you can decide how best to proceed, together.
Summary
I feel confident you will have a solid foundation upon which to build an effective Platform Engineering team if you follow these four principles:
- Set goals on outcomes, not outputs
- Truly know your customer
- Invent on behalf of your customer
- Scale your impact
Using these, Platform Engineering really works. We are able to support our engineers in delivering their software fast and stably, and do so at a sustainable rate with a happy, engaged platform engineering team.
After all this talk about the importance of a product mindset, I wouldn't fault you for thinking I'm a product manager in engineer's clothing đŸ˜…. But really, I just want to make sure what I'm working on is impactful. The results are worth the effort.
I hope my talk has helped to shape your perspective on how Platform Engineering works and, if you're a Platform Engineer, I hope it will help you more effectively support your own engineering organization.