Failures may be inevitable, but that doesn’t prevent success
David Tzemach
Posted On: November 24, 2022
Despite your best efforts, your system could eventually fail to function properly. It happens in small startups and in the biggest enterprises. Some failures will be minor, such as a typo on a web page. Others, such as code that corrupts client data or an outage that blocks customer access, will be more serious. Some failures are referred to as bugs or defects, while others are referred to as incidents. The distinction isn’t really significant. In any case, once the dust has settled and everything is back to normal, you must determine what went wrong and how you want to improve. This is called incident analysis.
The Elements of Failure
It’s easy to think of failure as a straightforward chain of cause and effect (A did this, which led to B, which led to C), but that’s not how it works. In reality, failure is the result of the entire development system in which work is carried out. (Your development system encompasses all aspects of software development, from tools to organizational structure. This is in contrast to your software system, which is the thing you are building.) Every failure, no matter how minor, reveals information about the nature and vulnerabilities of the development system.
Failure is the outcome of a series of interconnected events. Small problems arise all the time, but the system has norms that keep them within a safe margin. A programmer makes an off-by-one error, but their pairing partner suggests a test that catches it. An on-site customer explains a story incorrectly, but the miscommunication is discovered during customer review. A team member deletes a file by mistake, but continuous integration rejects the commit.
Failure happens when several things go wrong at the same time, not because of a single cause. A programmer makes a mistake, and their pairing partner was up all night and is too tired to notice, and the team is experimenting with less frequent pair swapping, and the canary server alerts were accidentally disabled. Failure occurs when the development system (people, procedures, and the business environment) allows problems to combine.
When no failures occur for a while, the team’s norms gradually relax. For example, the team may make pairing optional so that people have more flexibility in how they work. Their safety margins shrink. Eventually, the conditions for failure combine in precisely the right way to breach these narrower margins, and a failure occurs.
It’s difficult to notice this drift toward failure. Each change is small and improves something else, such as time, efficiency, accessibility, or customer satisfaction. You must stay vigilant to avoid drift. Past success does not guarantee future success. Major failures may appear to be the consequence of large mistakes, but that is not how failure works. There is no single cause, nor is there any proportionality. Large failures are caused by the same systemic faults that cause minor failures. That’s excellent news, since it means that minor failures serve as a “dress rehearsal” for major ones. They can teach you just as much as the larger ones do.
As a result, treat each failure as a chance to learn and improve. A typo is still a failure. A problem discovered before release is still a failure. If your team believes something is “done” but it later needs fixing, it is worth analyzing. It goes even further than that. Failures, as I previously stated, are a result of your development system, but so are successes. You can analyze them as well.
Analyzing the Data
An incident analysis session is a collective look back at your development system with the goal of learning and improving. An effective analysis therefore follows the five stages of a retrospective:
- Set the stage
- Gather data
- Generate insights
- Decide what to do
- Closing
Include your whole team in the analysis, as well as anybody else involved in the incident response. Avoid bringing in executives and other spectators; you want participants to feel free to speak up and acknowledge their mistakes, which means restricting attendance to those who need to be present. If there is a lot of outside interest in the analysis, you can produce an incident report afterward.
The length of the analysis session depends on the number of events leading up to the incident. A complicated outage may involve dozens of events and take many hours to analyze. A simple defect, on the other hand, may have only a few events and require 30-60 minutes. With practice, you’ll get faster.
A neutral facilitator should lead the session, especially when you are starting out with incident analysis and when the incident is a difficult one. The more sensitive the incident, the more experienced the facilitator must be.
Stage 1: Set the stage
Because incident analysis involves a critical look at both successes and failures, it is crucial that every participant feels safe to contribute, including having open conversations about the decisions they made. For that reason, begin the session by reminding everyone that the purpose is to use the incident to better understand the development system of people, processes, expectations, environment, and tools. You are not here to dwell on the failure itself or to assign blame; you are here to learn how to make your development system more resilient.
Assume good faith on the part of everyone involved in the incident, and ask everyone to confirm that they can commit to that goal. When I facilitate this meeting, I ask participants to agree that, regardless of what we find, we understand and truly believe that everyone did the best job they could given the circumstances, the information available at the time, their knowledge and skills, and the resources at their disposal.
In addition, I make sure right at the start of the meeting that everyone agrees to keep what is discussed during the analysis session private unless the participants approve otherwise. I also ask participants not to repeat any personal information shared in the session and not to record the meeting.
Stage 2: Gather data
Once you’ve set the stage, the next step is to figure out what happened. You will do this by building an annotated, visual timeline of events. At this point, people will be tempted to interpret the information, but it is essential to keep everybody focused on “just the facts.” They’ll most likely need several reminders as the stage progresses. With the benefit of hindsight, it’s easy to criticize other people’s actions, but that won’t help. A good analysis focuses on what people actually did and how your development system contributed to their doing it, rather than on what they could have done differently.
Begin by drawing a large horizontal line on your virtual whiteboard. If you’re doing the session in person, place blue tape on a large wall. Divide the timeline into columns that reflect different time periods. The columns don’t have to be consistent; weeks or months are frequently preferable for the earlier section of the timeline, while hours or days may be more appropriate for the moments leading up to the incident.
Ask participants to brainstorm simultaneously to come up with events related to the incident. For example, “Service X returned wrong exit code,” “Service was updated with new patch,” or “DB wasn’t accessible after restarting the service.” You may use people’s names if they are present and give their consent. Make a point of recording events that went well as well as those that did not. Software logs, incident response records, and version control history are all likely to be helpful. Write each event on a separate sticky note and place it on the board. Use the same color sticky for every event.
After that, ask everybody to take a step back and consider the big picture. What events aren’t included? Working in parallel, examine each occurrence and ask, “What came before this?” “What happened next?” Add another sticky note for each new event. You might find it useful to use arrows to demonstrate before/after relationships. Include events that involve humans, not just software. People’s choices have a big role in your development system.
For each event that involves automation your team controls or uses, add preceding events showing how people contributed to it. What role did people play in the automation? Who configured it? Who programmed it? Keep the tone of these events neutral and blame-free. Don’t speculate about what people should have done; instead, record what they did. For example, the event “Deploy script stops all DB instances” might be preceded by the event “Engineer accidentally changes deploy script to stop all instances when no --target parameter found,” and both might be preceded by “Team decides to clean up deploy script’s command-line processing.”
An event can have multiple predecessors, and each predecessor can occur at a different point in the timeline. For example, the event “Service A does not recognize -4 response code and crashes” could have three preceding events: “A Service A restart is done by…”; “Service A returns -4 response code” (right before the crash); and “Service A failed to start with error code 21.”
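If you want to keep the timeline after the session, the before/after relationships can be recorded as a small graph. The following is only a minimal Python sketch, not part of the process described here: the `Event` class and the `all_predecessors` function are hypothetical names, and the events are taken from the deploy script example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One sticky note on the timeline."""
    description: str
    # Events that directly led to this one; an event may have several.
    predecessors: list["Event"] = field(default_factory=list)


def all_predecessors(event: Event) -> list[str]:
    """Walk backwards from `event`, nearest predecessors first, listing each once."""
    seen: list[str] = []
    queue = deque(event.predecessors)
    while queue:
        current = queue.popleft()
        if current.description in seen:
            continue  # already reached through another path
        seen.append(current.description)
        queue.extend(current.predecessors)
    return seen


cleanup = Event("Team decides to clean up deploy script's command-line processing")
change = Event(
    "Engineer accidentally changes deploy script to stop all instances "
    "when no --target parameter found",
    [cleanup],
)
stop = Event("Deploy script stops all DB instances", [change, cleanup])

for description in all_predecessors(stop):
    print(description)
```

Walking backwards from the final event lists each contributing event once, even when it feeds into the incident through more than one path.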
Encourage people to give recollections of their thoughts and feelings at the time when events are added. You’re not here to lay blame, so don’t ask them to justify their behavior. Inquire as to what it was like to be present at the time the incident occurred. This will assist your team in understanding the social and organizational components of your development system—not only what decisions were taken, but why they were made.
Ask participants to add these thoughts on extra stickies in a different color. For example, if Dany says, “I had concerns about code quality, but I felt like I had to rush to meet our deadline,” he could make two sticky notes that read “Dany has concerns about code quality” and “Dany thinks he needs to rush to meet the deadline.” Don’t speculate about what people who aren’t present were thinking, but you can record something they said at the time, such as “Stefany says she has trouble remembering deploy script options.”
Keep these notes focused on what people were feeling and thinking at the time. Your purpose is to understand the process as it was, not to second-guess people. Finally, ask participants to highlight the events in the timeline that seem most relevant to the incident. Check that everyone has recorded all of their thoughts about the events.
Stage 3: Generate insights
It is now time to turn information into insights. At this stage, you’ll mine your timeline for clues about your development system. Give people some time to study the board before you begin. This is also a good time to take a break.
Begin by reminding participants of the nature of the failure. Problems arise all the time, but they seldom coalesce in a way that leads to failure. The events in your timeline are not the cause of the failure; rather, they are a symptom of how your development system works. What you want to examine is that deeper system.
Examine the events that you marked as significant during the “gather data” stage. Which of them involve people? To continue the example, you would select the events “A Service A restart is done by…” or “New code was integrated by…” but not “Deploy script stops all Service instances,” because that event occurred automatically.
Working simultaneously, assign one or more of the following categories to each event that involves people. Write each category on a sticky note of a third color, then attach the note to the timeline.
- Feedback and dialogue: Involves information and decisions from sources other than the people involved in the event. For instance, believing that a third-party service will never return error code -21.
- Knowledge and mental models: Involves information and decisions within the team involved in the event. For instance, believing that a service the team maintains will never return the kind of erroneous response seen in the incident, because the team is confident that, as the “ultimate” pros, it won’t happen to them.
- Attention: Involves the ability to focus on relevant information. For example, ignoring an alert because several other alerts are firing at the same time, or misjudging the importance of an alert because of fatigue.
- User experience: Involves interactions with computer interfaces. For instance, passing the wrong command-line parameter to a program.
- Conflicting objectives: Involves choosing between several objectives, some of which may be unstated. For example, choosing to meet a deadline over improving code quality.
- Procedural adaptation: Involves situations in which the established procedure does not fit the circumstances. For example, abandoning a checklist after one of its steps reports an error.
- Use your imagination: If an event does not fall into any of the categories I’ve listed, you can make up your own.
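If the timeline is recorded digitally, the category stickies can be tallied to see which parts of the development system come up most often. This is a hedged sketch only: the `Category` enum mirrors the list above, while `category_counts` and the three tagged events are hypothetical illustrations, not data from a real incident.

```python
from collections import Counter
from enum import Enum


class Category(Enum):
    FEEDBACK_AND_DIALOGUE = "feedback and dialogue"
    KNOWLEDGE_AND_MENTAL_MODELS = "knowledge and mental models"
    ATTENTION = "attention"
    USER_EXPERIENCE = "user experience"
    CONFLICTING_OBJECTIVES = "conflicting objectives"
    PROCEDURAL_ADAPTATION = "procedural adaptation"
    WRITE_IN = "write-in"  # for categories the team makes up


def category_counts(tagged: dict[str, set[Category]]) -> Counter:
    """Count how many events carry each category; an event may carry several."""
    counts: Counter = Counter()
    for categories in tagged.values():
        counts.update(categories)
    return counts


# Hypothetical events, each tagged with one or more categories.
tagged_events = {
    "Engineer passes wrong command-line parameter to deploy script": {
        Category.USER_EXPERIENCE,
    },
    "Engineer rushes change to meet deadline": {
        Category.CONFLICTING_OBJECTIVES,
        Category.ATTENTION,
    },
    "On-call engineer ignores alert during alert storm": {Category.ATTENTION},
}

counts = category_counts(tagged_events)
for category, total in counts.most_common():
    print(f"{category.value}: {total}")
```

The counts are a prompt for discussion about the system, not a score for any individual.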
The categories also apply to positive events. For example, “engineer develops backend to give safe fallback when service times out” is a “knowledge and mental models” event. After you’ve categorized the events, take a minute to look at the big picture, then break into small groups to discuss each event. What does each one tell you about your development system? And remember: concentrate on the system, not the individuals.
For example, the event “Engineer accidentally changes deploy script to stop all instances when no --target parameter found” appears to be an error on the engineer’s part. However, the timeline shows that Jeff, the engineer in question, felt compelled to rush to meet a deadline, even if it meant sacrificing code quality. That makes it a “conflicting objectives” event, and the real issue is how priorities are decided and communicated. As team members discuss the event, they discover that they are all under pressure from sales and marketing to prioritize deadlines over code quality.
On the other hand, suppose the timeline analysis showed that Jeff also misunderstood the behavior of the team’s command-line processing library. That would make it a “knowledge and mental models” event as well, but you still wouldn’t blame Jeff for it. Incident analysis always looks at the system as a whole, rather than at individuals. People are expected to make mistakes. In this example, a closer look at the event reveals that, while the team used Test-Driven Development (TDD) and pairing for production code, it did not apply that standard to its scripts. The team had no way of preventing script errors, and it was only a matter of time until one slipped through.
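To make the script example concrete, here is one way a tested deploy script could refuse to run without an explicit target. It is a sketch under stated assumptions, not the team’s actual script: `parse_args`, `instances_to_stop`, and the instance names are hypothetical, and Python’s standard `argparse` module stands in for whatever command-line library the team really used.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse deploy arguments; refuse to run without an explicit --target."""
    parser = argparse.ArgumentParser(prog="deploy")
    parser.add_argument(
        "--target",
        required=True,  # no silent "all instances" default
        help="name of the single instance to stop",
    )
    return parser.parse_args(argv)


def instances_to_stop(running: list[str], target: str) -> list[str]:
    """Return only the requested instance, or fail loudly if it is unknown."""
    if target not in running:
        raise ValueError(f"unknown target: {target!r}")
    return [target]


def test_missing_target_is_rejected() -> None:
    try:
        parse_args([])
    except SystemExit as exit_info:
        assert exit_info.code == 2  # argparse exits with status 2 on usage errors
    else:
        raise AssertionError("expected argparse to reject a missing --target")


def test_only_target_is_stopped() -> None:
    args = parse_args(["--target", "db-1"])
    assert instances_to_stop(["db-1", "db-2"], args.target) == ["db-1"]


test_missing_target_is_rejected()
test_only_target_is_stopped()
```

The point is not the specific library but that the script’s command-line handling is covered by the same kind of tests as production code.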
After the breakout groups have had a chance to discuss the events (you may wish to divide the events among the groups for efficiency, rather than having each group discuss every event), come together to discuss what you’ve learned about the system. Write each conclusion on a sticky note of a fourth color and place it next to the corresponding event on the timeline. Don’t propose solutions just yet; instead, concentrate on what you’ve learned. For example, “engineers are under pressure to sacrifice code quality,” “deploy script requires a long, error-prone command line,” and so on.
Stage 4: Decide what to do
You’re now ready to decide how to improve your development system. You’ll do this by brainstorming ideas and then choosing a few of your best options. Begin by going through the entire timeline once more. What changes could you make to your system to make it more resilient? Consider all options without regard for feasibility. Brainstorm simultaneously on a board or on a new section of your virtual whiteboard. You don’t need to connect your ideas to specific problems or events; some will address several issues at once. Consider the following questions:
- How could we avoid this kind of failure?
- How could we have identified this sort of failure earlier?
- How can we reduce the impact?
- How could we respond faster?
- Where did our safety net let us down?
- What other faults should we look into?
To continue the example, your team may come up with ideas like “stop committing to deadlines,” “apply production coding standards to scripts,” or “review existing scripts for additional coding errors.” Some of these ideas are better than others, but for now you’re generating ideas, not judging them.
Once you have a set of options, sort them into the following groups:
- Group 1: The team can implement the action.
- Group 2: The team cannot implement the action itself but can get help from another party.
- Group 3: The team has no control and little influence.
Have a brief discussion about the pros and cons of each option. Then use whatever voting method you like to decide which options your team will pursue. You can choose more than one. Remember that you don’t have to fix everything. Sometimes a change adds more risk or expense than it eliminates.
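For a remote session, one simple voting method is multi-voting, where each participant marks every option they support and the options with the most marks win. The sketch below assumes that method; the ballots are made up for illustration, and `tally` and `top_options` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical ballots: each participant lists the options they support.
ballots = [
    ["apply production coding standards to scripts", "review existing scripts"],
    ["apply production coding standards to scripts"],
    ["apply production coding standards to scripts", "review existing scripts"],
    ["stop committing to deadlines"],
]


def tally(ballots: list[list[str]]) -> Counter:
    """Count one vote per option per ballot, ignoring repeats on a ballot."""
    votes: Counter = Counter()
    for ballot in ballots:
        votes.update(set(ballot))
    return votes


def top_options(ballots: list[list[str]], limit: int) -> list[str]:
    """Return up to `limit` options, most-voted first."""
    return [option for option, _ in tally(ballots).most_common(limit)]


print(top_options(ballots, 2))
```

With these ballots, the two options the team would pursue are applying production coding standards to scripts and reviewing existing scripts.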
Stage 5: Closing
Incident analysis can be time-consuming. Give people a moment to take a breath before they return to their usual work. That breath might be symbolic, or you can literally suggest that people stand up and take a deep breath.
Begin by deciding what you want to keep. A screen capture or photo of the annotated timeline and other artifacts is likely to be useful for future reference. Before taking it, ask participants to go over the timeline and mark anything they don’t want shared outside the session, then remove the marked stickies.
Finally, show your appreciation to one another for your efforts. Explain the exercise and give an example: “(Name), I appreciate you for (reason).” Then sit down and wait. Others will speak up as well. Nobody is required to speak, but allow plenty of time at the end (a minute or two of silence), because people may take a while to speak up.
The “appreciation” activity might be uncomfortable for some people. An alternative is for each participant to say a few words about how they feel now that the analysis is complete. Anyone is free to pass.
Closing
Agile teams recognize that mistakes are unavoidable. People make mistakes, miscommunications happen, and ideas fail. Agile teams embrace failure rather than embarking on a fruitless quest to prevent it. If failure is unavoidable, then what matters is to detect it as soon as possible, fail early so there is still time to recover, contain it so the repercussions are kept to a minimum, and learn from it rather than assign blame.
Continuous deployment is a good illustration of this attitude. Teams that use continuous deployment rely on monitoring to detect failures. They deploy every half hour, which reveals problems early and limits their damage. They use canary servers to contain the effects of failure, and they use each failure to learn about their limits and improve. Surprisingly, accepting failure leads to less risk and better results.