Somewhere a product manager is waking up, checking his email, and discovering that his SaaS application had a major outage overnight.  He mutters to himself, “Oh shit.”  His inbox is filling up with concerned messages from customer success managers, salespeople, and his boss.  The operations team is frantically trying to resolve the outage.  The issue might involve a new feature that was promoted to production last week.  As the product manager races to the office, he asks himself, “Did we allocate enough capacity to non-functional requirements in the last two Sprints?”  Once again he faces one of the major challenges of product management.  Do we invest in features or in preventing outages?  What is the right balance between them?


SaaS Outages Are Inevitable

“Five nines,” or 99.999%, is the platinum standard of availability for SaaS companies.  This translates to about 5.26 minutes of downtime per year.  While this is expected for safety-critical systems (air traffic control, nuclear power plant control systems, etc.), most SaaS applications aspire to three to four nines (99.9% to 99.99%), or roughly 8.76 hours to 52.6 minutes of downtime a year.
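The arithmetic behind these downtime budgets is easy to check.  The following is a minimal sketch, assuming a 365-day year; the function name downtime_minutes_per_year is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Print the budget for three, four, and five nines.
for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label} ({pct}%): {minutes:.2f} minutes per year")
```

Under that assumption, three nines allows about 525.6 minutes (8.76 hours) a year, four nines about 52.56 minutes, and five nines about 5.26 minutes.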

There are two types of downtime. Planned downtime is scheduled to implement upgrades and configuration changes. Unplanned downtime is unexpected, due to circumstances such as software defects, systemwide failures and power outages.

Even the mighty Google has reported 148 outages across 45 key services in the past year.  BigQuery, a key service for both Google internal apps and customer apps, reported nine outages that totaled over 240 hours of downtime. That is only 97.2% availability.
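The same arithmetic runs in reverse, from observed downtime to availability.  This is a rough sketch, assuming the downtime is measured against a full 365-day year; the function name availability_percent is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def availability_percent(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return availability as a percentage, given downtime over a period."""
    return 100 * (1 - downtime_hours / period_hours)


# 240 hours of downtime measured against one year.
print(f"{availability_percent(240):.2f}%")
```

Exactly 240 hours works out to roughly 97.26%, so “over 240 hours” lands at or a little below that, in line with the figure quoted above.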

The root cause of most outages is usually self-inflicted and can be traced back to how development chose to deal with non-functional versus functional requirements.  While hardware and network connectivity do fail, it is a rare thing, and redundant systems can usually take over.  A recent example of what happens when non-functional requirements are not dealt with promptly is the Kaseya ransomware outage.

Alex Holden is founder and chief technology officer of Milwaukee-based cyber intelligence firm Hold Security.  Holden said a vulnerability dating from 2015 was present on Kaseya’s customer portal until Saturday afternoon, allowing him to download the site’s “web.config” file, a server component that often contains sensitive information such as usernames and passwords and the locations of key databases.

“It’s not like they forgot to patch something that Microsoft fixed years ago,” Holden said. “It’s a patch for their own software. And it’s not zero-day. It’s from 2015!”

Imagine being Kaseya’s VP of Product Management and having to explain to the CEO why your team did not prioritize fixing this issue over some new features.  That would not be a pleasant conversation.

Technology Adoption Life Cycle & Technical Debt

As systems age, the amount of technical debt rises.  Technical debt is the idea that certain necessary work gets delayed during the development of a software project to hit a deliverable or deadline. Technical debt is the coding you must do tomorrow because you took a shortcut to deliver the software today. By the time a product reaches the early majority stage of the technology adoption life cycle, the amount of technical debt can be staggering.  

As noted by Junade Ali in Mastering PHP Design Patterns:

“The cost of never paying down this technical debt is clear; eventually the cost to deliver functionality will become so slow that it is easy for a well-designed competitive software product to overtake the badly-designed software in terms of features. In my experience, badly designed software can also lead to a more stressed engineering workforce, in turn leading higher staff churn (which in turn affects costs and productivity when delivering features). Additionally, due to the complexity in a given codebase, the ability to accurately estimate work will also disappear. In cases where development agencies charge on a feature-to-feature basis, the profit margin for delivering code will eventually deteriorate.”

Product managers face a huge challenge making backlog prioritization decisions.  How much capacity should be allocated in a Sprint to new functionality versus technical debt and non-functional requirements?

Strategies to Conquer the Features vs Outage Conundrum

There are several strategies product managers can use to conquer the problem of how to prioritize features versus non-functional requirements.

Build Empathy for Development and Operations Teams

Product managers focus on defining “what” the market needs by defining user stories.  Development is responsible for determining how those stories will be technically implemented.  One of the biggest sins a product manager can commit is to not only define what the product should do but also dictate technically how it should be implemented.

Product managers need to develop an understanding of how their decisions impact the development and operations teams.  When outages occur, product managers often blame the development and operations teams.  After all, those teams are responsible for determining ‘how’ to implement backlog items.

To conquer this natural bias, product managers should join outage investigations as silent observers.  There is nothing like getting woken up in the middle of the night to join a conference call for an outage.  It is not the product manager’s responsibility to resolve the issue, and back-seat driving in a crisis is rarely appreciated by development or operations.  After ten early morning calls in a month, most product managers will develop a new appreciation of the impact of prioritizing features over non-functional requirements and technical debt.

Product managers should also participate in outage post-mortem investigations and root cause analyses.  Again, they should be silent observers.  The goal of these activities is to understand why an outage happened and what can be done to prevent it in the future.  Root cause analysis can identify high-priority improvements that must be made to ensure the stability of your product.

Start With an Honest Inventory of Technical Debt

You cannot fix the problem of technical debt until you define it.  Begin by building an honest inventory of technical debt.  There is no universally accepted approach to defining technical debt.  Technical debt can be best expressed as non-functional requirements.  Efrain (Frank) Velazquez presents an excellent approach in Handling Non-Functional Requirements in Agile Projects that combines user stories and acceptance criteria:

NFR: The system shall implement authentication and authorization functionality as per corporate security policy XYZ.

Functional Requirement:

As a system administrator (SA), I need to restrict user access to files so that authenticated users only have access to files for which they have access permissions.

Acceptance Criteria:

– The SA can grant specific authorized users access to specific libraries or documents.
– The SA can revoke access to specific libraries or documents from specific authorized users.
– All file access permissions for specific users are automatically revoked if the SA revokes their authorization to access the system.

The task of creating an initial honest inventory of technical debt is daunting.  The development team should take the lead in writing user stories and acceptance criteria for non-functional requirements.  As Desmond Tutu said, the best way to eat an elephant is one bite at a time.

Make Technical Debt Visible to the Organization

Next, make the extent of technical debt visible to the entire organization.  There is constant demand for new features from executives, salespeople, and customers.  Technical debt and non-functional requirements constrain an organization from rapidly delivering new features.  Hard choices are made in each Sprint.  Development and Operations teams are naturally hesitant to expose the scale of technical debt.  They made the choices to incur the debt in order to meet some other critical goal.  At some point stakeholders need to understand the reality of the situation and the constraints it imposes.  A catastrophic outage like the one Kaseya incurred should not be the wake-up call stakeholders need to start looking at technical debt.

Make a Payment Plan

Technical debt should be treated like financial debt.  You pay interest on financial debt.  Almost all experts on technical debt recommend building and following a payment plan to reduce technical debt.  This means raising the priority of non-functional requirements in Sprint planning ceremonies.  Product managers should monitor the inventory of technical debt and make the hard decisions.
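One way to picture a payment plan is to reserve a fixed share of each Sprint’s capacity for technical debt before features compete for the rest.  The sketch below is only an illustration under stated assumptions: the BacklogItem class, the plan_sprint function, the item names, the point values, and the 20% share are all made up, and the right share for a real team is a judgment call.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    points: int
    is_debt: bool  # True for technical debt or non-functional work


def plan_sprint(backlog, capacity, debt_share=0.2):
    """Select items in backlog order, reserving a share of capacity for debt.

    debt_share is a hypothetical policy knob, not a recommended value.
    """
    debt_budget = capacity * debt_share
    feature_budget = capacity - debt_budget
    selected = []
    for item in backlog:  # backlog is assumed to be in priority order
        if item.is_debt and item.points <= debt_budget:
            debt_budget -= item.points
            selected.append(item)
        elif not item.is_debt and item.points <= feature_budget:
            feature_budget -= item.points
            selected.append(item)
    return selected


# Hypothetical backlog, listed in priority order.
backlog = [
    BacklogItem("New dashboard", 13, is_debt=False),
    BacklogItem("Patch known vulnerability", 5, is_debt=True),
    BacklogItem("Export to CSV", 8, is_debt=False),
    BacklogItem("Add database failover", 8, is_debt=True),
    BacklogItem("Bulk import", 13, is_debt=False),
]

for item in plan_sprint(backlog, capacity=40):
    print(item.name, item.points)
```

With a capacity of 40 points and a 20% share, 8 points are held back for debt, so the vulnerability patch is scheduled alongside two features, while the failover work waits for a later Sprint.  The point of the reserve is that debt items never have to outrank every feature to get worked on.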


Summary

Product managers have a tough job.  Balancing the priorities of functional and non-functional requirements is difficult.  Technical debt accumulates as a SaaS product moves through the technology adoption life cycle.  Technical debt constrains an organization’s ability to add new functionality to an application.  Sometimes it takes a catastrophic event like the Kaseya incident for an organization to wake up and deal with the issue.  Product managers do not have to choose between new features and outages.  Like paying off a college loan, principal and interest payments on technical debt must be made in each Sprint.


Also published on Medium.