Goals and Failure Modes for RFCs and Technical Design Documents

August 16, 2023

Lessons learned from experience steering technical design collaboration

Image generated by Midjourney: A scroll depicting an architecture diagram

At a certain scale and ecosystem complexity, software engineers need a way to socialize their bigger design ideas outside of code review. Sometimes this starts as engineers just informally writing their ideas down in a doc and sharing them with their team. After a while, someone realizes everyone keeps asking the same questions, and then software design document templates are born.

I’ve spent a little time in this space at LinkedIn, but I was in charge of Twitter’s technical design process and associated design template artifacts (let’s call them RFCs) for quite a few years. I’ve spent a lot of time thinking about software design collaboration, and I’ll share some lessons from stewarding this kind of process and culture.

The earlier days at Twitter

Twitter’s RFC templates were modeled closely after Google’s, resulting from an influx of senior engineers from Google in the company’s earlier days. The backend RFC Google Doc template was a beast: over time, it had accumulated required sections from various “special interests” and was in the ballpark of 14 pages — before being filled in. Included sections asked the author to describe:

 - Multi-region design and site failover considerations
 - Compute and other resource requirements for backend services
 - Strategies for maintaining availability and graceful degradation
 - Testing and validation strategies
 - i18n concerns
 - Online and offline storage design considerations
 - API design
 - Runbook and operations information
 - Alignment with endorsed technology recommendations
 - And more

For a while, Twitter Engineering had a centralized design review forum with an arm’s-length relationship to the CTO and their advisory group of ICs. Not everything went through this forum, arguably just the big-ish things, but the designs that did required explicit sign-off from a senior engineer who was an architectural authority in the space.

The collection and organization of engineering-wide RFCs and their review were partially supported by program management and partially by a loose collection of engineers.

Evolution of the process

I eventually became the de-facto owner of the RFC process. Over time, I rolled out a series of changes, which federated some of the design review structures around engineering, and cleaned up the RFC template. I organized a small committee of owners for the process, which incrementally updated the artifacts and process as needed.

I also redefined the role of the senior IC RFC review group as design outcome facilitators rather than design approvers. Instead of directly approving the designs, their role was to organize and steer the purpose-built design review committees, ensuring the conversation was moving forward constructively and efficiently (in theory).

Usage of the templates and engagement with the process generally seemed pretty high; these RFCs became one of the very few nearly-ubiquitous artifacts across engineering.

Adherence to this process was fairly consistent, but one thing never materialized: sufficient resources or attention to ensure that the process was efficient and effective for our developers and their time. We didn’t really track productivity metrics around RFC usage (or completion) or gather continuous feedback from our engineers about their ideas and experience with design reviews. Given the time investment typically associated with RFC development, we should have.

A few years later, for various reasons, we took another look at how RFCs were being used.

This time, we got data

I orchestrated a series of engineering interviews with a combination of 50+ RFC authors, reviewers, design facilitators, junior to principal engineers, and some cross-functional partners. Our RFC / SDLC team ran these interviews. We set out to understand a series of questions:

- What were the most frustrating and useful parts of the RFC process?
- What kinds of things were more effective to uncover during design review versus in code review?
- How could we engage and motivate authors and reviewers in a way that reduced design turnaround time?
- If we could do anything, what would the design process look like?

We got a lot of feedback — both positive and negative. We came away with a synthesized set of interesting lessons, a few of which are summarized below.

Image generated by Midjourney: technical puzzle

First, authors and reviewers felt that most of the RFC template was superfluous. The level of detail in most docs was too high, and much of the content, such as API or schema design or library usage, was better suited to code review than doc review.

There’s a set of valuable, common questions to ask every author of a system under design, which helps clarify the circumstances of what’s being built:

- Problem statement / why do we need to build something?
- Goals of the system presented in design; non-goals which are out of scope
- What success looks like, and how it can be measured
- High-level overview of solution proposal
- Details
- Alternatives considered
- Risks

Basically, everything else could be moved to an accompanying checklist. Authors could quickly knock that out, and reviewers could use it to efficiently assess whether their concerns were being addressed. Details about code and schema should be moved into code review tools.

Second, it turned out that the design facilitators didn’t actually like their role. This cohort of engineers tended to be more senior and more platformy, with strong opinions about how systems should be built easily and safely. Having a role-based structure around RFC review was good, but they felt that much of the facilitation work could be handled by program management. This included steering async review, convening the review committees, or helping manage stakeholders. Instead, they wanted to focus their time on partnering with authors to refine their designs.

Third, the RFC authors felt the experience and knowledge gained by running an RFC through a design review, with its structured thought process and senior engineering facilitator, was invaluable. Although it could be lengthy, it helped them build relationships across Engineering, learn how to use novel approaches or systems supporting their work, and understand how the stack worked end-to-end.

For reviewers, often from platform or supporting teams, it helped them understand where their customer teams were going and what kinds of problems they were trying to solve. In this sense, RFCs, in aggregate, acted as a proxy for customer team strategy.

Lastly, RFC reviewers wanted a better understanding of what kinds of assertions or behaviors in the design were being dictated by a corresponding product spec versus being made by the RFC author themselves. This would help avoid arguments over product definitions and tradeoffs in the RFC amongst the reviewers in Engineering. Those discussions are better to have directly with the product manager or requirements owner.

Regrettably, we could have gathered this feedback years earlier. Based on what we learned, we made a series of changes to the RFC materials and process. After aligning with our cross-functional (xfn) stakeholders, we rolled out the changes to a positive reception.

Takeaways and Lessons

There’s a set of considerations that can help make RFC templates and processes simultaneously more practical and ergonomic for design and review.

RFCs are tools — not outcomes

RFCs are tools for supporting better outcomes. They can help you reach those outcomes through a more maintainable design, a de-risked delivery plan — or realizing that you might not have to build anything. They also help teams understand gaps in capabilities provided by a technology stack and what should be done to close them. These things should be the focus of the process — not the RFC itself.

RFCs themselves deliver little intrinsic value. They capture a snapshot of design considerations at a point in time, support communication and shared understanding, and enable the nemawashi of those considerations within the system in question. If all of these things could be captured in code review for simpler changes, then the RFC is just overhead.

The amount of effort invested in any given RFC should be proportional to its likely improvement of outcomes, manifesting primarily through better design: simpler, more efficient, and less risky.

Use gating processes carefully

It is far more important that an RFC gets written for appropriate projects and reviewed by someone than that it goes unwritten because the author wants to avoid some procedural gating function. For the author, even laying out a design and plan “on paper” can help clarify the path forward and logically break up a project into sensible deliverables and milestones.

Some designs necessitate critical, blocking review by senior engineers, cross-functional partners, or both. Every company should have a process for determining what these are. When implementation risks of poor design far outweigh the cost of heavyweight review, spend the balance of time in design de-risking.

In many cases, this comes down to understanding the difference between reversible and irreversible decisions.
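As a rough illustration of a process for deciding which designs need blocking review, here is a minimal sketch in Python. The tiers, the two inputs (`reversible`, `cross_team_impact`), and the policy itself are all hypothetical, invented for this example; a real process would weigh many more factors and would be tuned to the company.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewTier(Enum):
    # Hypothetical tiers, from lightest to heaviest process.
    CODE_REVIEW_ONLY = "code review only"
    LIGHTWEIGHT_RFC = "lightweight RFC, non-blocking review"
    BLOCKING_REVIEW = "RFC with blocking senior review"


@dataclass
class Proposal:
    reversible: bool         # can the decision be undone cheaply after launch?
    cross_team_impact: bool  # does it change what other teams depend on?


def review_tier(p: Proposal) -> ReviewTier:
    """Hypothetical policy: the harder a decision is to undo, the heavier the review."""
    if not p.reversible:
        return ReviewTier.BLOCKING_REVIEW
    if p.cross_team_impact:
        return ReviewTier.LIGHTWEIGHT_RFC
    return ReviewTier.CODE_REVIEW_ONLY


# An irreversible choice gets blocking review regardless of its reach.
print(review_tier(Proposal(reversible=False, cross_team_impact=False)).value)
```

The point of writing the policy down, even this crudely, is that authors can predict which path their project will take before they start writing.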

Laundry lists of concerns in RFC templates indicate dysfunction

RFCs are where symptoms of organizational dysfunction commonly manifest:

- Ambiguous or non-existent product or customer requirements
- Ambiguous team or ownership structure for implementations
- Lack of standards or opinionated technology usage
- Lack of clarity about architectural layering and abstraction in the ecosystem

If these problems tie up engineers in RFC reviews, they should be addressed at the level of the engineering team or the company.

Sometimes, all kinds of technical scale, capacity, and idiosyncratic details of internal platform usage are demanded within RFCs, even for simple projects or product systems. This is a strong indication that your software ecosystem needs to be simplified. It’s offloading concerns to developers, which increases the scope and complexity of their designs. They’re being made to account for things the org should be providing to them.

Put checklists in an actual checklist

RFC authors want the flexibility to express their designs in ways that feel intuitive to them. But your design process and RFC templates may boil down to a series of yes/no sections such as:

- Did you do X
- Did you incorporate Y
- Did you consult Z

These are much better represented as a checklist, which allows you to build tools around the binary answers. It’s also much easier to evaluate whether a checklist contains the end-to-end “must do’s” you’d like every design to think through.
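To make the “build tools around binary answers” point concrete, here is a minimal sketch in Python of a checklist kept as structured data rather than prose. The `ChecklistItem` type, its `required` flag, the item keys, and the extra “W” question are all hypothetical names made up for this example; they are not taken from any real RFC tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChecklistItem:
    key: str                       # short identifier for tooling
    question: str                  # the yes/no question shown to the author
    answer: Optional[bool] = None  # None means "not answered yet"
    required: bool = True          # hypothetical flag: must be "yes" before review


def outstanding(items: list[ChecklistItem]) -> list[str]:
    """Return keys of required items that are unanswered or answered 'no'."""
    return [item.key for item in items if item.required and item.answer is not True]


checklist = [
    ChecklistItem("did_x", "Did you do X?", answer=True),
    ChecklistItem("incorporated_y", "Did you incorporate Y?", answer=False),
    ChecklistItem("consulted_z", "Did you consult Z?"),
    ChecklistItem("considered_w", "Did you consider W?", required=False),
]

print(outstanding(checklist))  # ['incorporated_y', 'consulted_z']
```

Once the answers are data, a reviewer can see at a glance what is still open, and the list of required items can itself be reviewed and versioned.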

I’ve found most authors like checklists because they don’t want to miss anything important. The problem is, they also don’t want to spend time cargo-culting written prose about how they went about doing X or consulting Z.

The RFC reviewer is a constituency involved in this process, too

It’s easy to over-index on the role of the RFC author and fail to realize that there are usually an order of magnitude more reviewers than authors involved in this process.

Efficient RFC review is supported by reviewers understanding their responsibilities in the RFC review context. It’s further improved by “no surprises” — reviewers know what to look for and how to understand it. Ensuring a minimally-disruptive and focused role for specialized reviewers makes it easier to secure their consultation in the future.

Reviewers have obligations, too: much like code review, their job is to offer constructive feedback that improves the overall design in a timely fashion. Nitpicking and sluggish review slow the entire process down and cost everybody involved. Standards for review conduct should be established.

This stuff is important; it needs to be owned and improved

If most of your engineers engage in an RFC-like design process that steers how the big technical ideas at the company are formulated and reviewed, it needs to be properly owned.

Engineers don’t like putting up with inadequate code review tools or processes; RFCs aren’t different. These processes should be measured so they can be iteratively improved and better support engineers in their design work. This ultimately helps the company deliver better, more time-efficient design outcomes.
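As one example of what measuring could look like, here is a minimal sketch in Python that computes review turnaround and completion rate from a list of RFC records. The record shape, the RFC identifiers, and the dates are made-up sample data for illustration only; they are not figures from Twitter or any real process.

```python
from datetime import date
from statistics import median

# Hypothetical records: (rfc_id, review requested, review concluded or None if open)
reviews = [
    ("rfc-1", date(2023, 1, 2), date(2023, 1, 16)),
    ("rfc-2", date(2023, 1, 9), date(2023, 2, 20)),
    ("rfc-3", date(2023, 2, 1), date(2023, 2, 8)),
    ("rfc-4", date(2023, 2, 15), None),  # still in review
]


def turnaround_days(records):
    """Days from review request to conclusion, for concluded reviews only."""
    return [(done - start).days for _, start, done in records if done is not None]


def completion_rate(records):
    """Fraction of RFCs whose review has concluded (0.0 for an empty list)."""
    if not records:
        return 0.0
    return sum(1 for _, _, done in records if done is not None) / len(records)


days = turnaround_days(reviews)
print(days)                      # [14, 42, 7]
print(median(days))              # 14
print(completion_rate(reviews))  # 0.75
```

Numbers like these are only a starting point; they are most useful alongside the kind of direct feedback from authors and reviewers described earlier.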

Goals and Failure Modes for RFCs and Technical Design Documents was originally published in Better Programming on Medium.