Burnout and the Interrupt Storm

Do you just need a break or a full system reset?

It’s Friday at 6pm. How is the week over already? I worked late every day, but what did I do, exactly? I still desperately need to review that preliminary system design and schedule a design review to unblock the team; they’ve been waiting for days. The board meeting is next week, again, somehow. Putting slides together and updating our KPIs took a while. Something was odd about the most recent performance data too: the speed of the system is way up, but accuracy took a nose dive. I’ll have to investigate. We had two on-site interviews this week, and I still need to write my feedback. Maybe next week.

This is what it feels like to be caught in an interrupt storm of your own creation. Unless you’re an embedded systems engineer, you’re unlikely to have heard the term before. Surprisingly, embedded software design best practice holds many valuable insights into how we can all lead more productive, successful, and lower-burnout professional lives.

Much like the human mind [1], most simple embedded processors can’t actually do two or more things in parallel; they can only execute one thread of instructions at a time. Similarly, when humans try to multitask, evidence suggests we are merely changing the active task and its context at high speed, usually inefficiently and with poorer outcomes than if we were to do each in sequence [2]. Yet in the embedded world, just as in our professional lives, we often do need to do more than one thing at a time.

Anatomy of an Interrupt Storm

The main thread of an embedded program is generally executing our highest-level logic for achieving the overall system objective, but something may come up that demands immediate attention (e.g., while executing a motor movement profile, I might still need to receive a new communication packet containing the next set of motor instructions). Enter the interrupt service routine, or ISR. Well-designed ISRs allow the system to briefly interrupt the main thread to handle critical tasks, and they have a few key characteristics:

1.  They are as short as possible

2.  They touch as few shared resources as possible

3.  They are clearly prioritized

First, a good item to handle as an interruption is one that can be switched to, understood, addressed, then put back down very quickly. It takes time for both a human and a processor to switch context from one task to another, and constantly switching between long-running tasks is inefficient. Second, that switching cost can be further reduced by keeping the interruption’s context as simple as possible and ensuring it doesn’t depend on the status of other tasks being handled by you or your team. Last, and most importantly, we must have a clear way of prioritizing interruptions. What is our embedded application to do if, while executing its main thread, a new communication packet comes in, but then, while breaking into an ISR to process that packet, a new digital input is triggered? Do we finish processing the packet, or break into yet another ISR and handle the input trigger first (a nested interrupt)? A clear priority scheme must be established (in an embedded system, through the interrupt vector table and its associated priority settings) to determine what can interrupt what and when.
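For the non-embedded reader, here is a minimal sketch in C of that pattern. The peripheral hooks and helper functions (uart_read_byte, run_motor_profile_step, and so on) are hypothetical placeholders rather than any real vendor API; the point is simply that the ISR stays short, touches almost nothing shared, and defers the real work back to the main thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical board-support hooks, assumed for illustration only. */
extern uint8_t uart_read_byte(void);           /* grab the received byte     */
extern void    enable_uart_interrupt(void);    /* configure the peripheral   */
extern void    run_motor_profile_step(void);   /* the main thread's real job */
extern void    process_packet_byte(uint8_t b); /* deferred packet handling   */

/* Shared between the ISR and the main thread; 'volatile' keeps the
 * compiler from caching these values across the interrupt boundary. */
static volatile bool    byte_ready = false;
static volatile uint8_t rx_byte;

/* Characteristics 1 and 2: the ISR is short and touches only its own
 * flag and byte. It captures the data, signals the main thread, and
 * returns immediately. */
void uart_rx_isr(void)
{
    rx_byte    = uart_read_byte();
    byte_ready = true;   /* defer the real work */
}

int main(void)
{
    enable_uart_interrupt();   /* characteristic 3 (prioritization) would also
                                * be configured here, when the vector table and
                                * interrupt priorities are set up */
    for (;;)
    {
        run_motor_profile_step();   /* the highest-level logic keeps running */

        if (byte_ready)             /* the deferred work happens here,       */
        {                           /* outside interrupt context             */
            byte_ready = false;
            process_packet_byte(rx_byte);
        }
    }
}
```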

Now we have a framework to handle any interruption while keeping the all-important main thread running smoothly. Problem solved, right? Certainly, these strategies are necessary, but they’re often not sufficient. There is still plenty that can go wrong. Even with short ISRs, if a new higher-priority interruption keeps arriving before we’ve finished processing the last, we may almost never clear all interruptions and get back to the main thread: an interrupt storm. Or worse, if our ISRs are too complex, too frequent, or poorly prioritized (such that one interruption always pre-empts another before either can finish), we’ll trigger a full system deadlock. Either way, the main thread grinds to a halt.

The processor may be executing instructions as quickly as it can, but from the external vantage point no progress is being made – sound familiar?
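And here, sticking with the same hypothetical names, is roughly what the failure mode looks like when those rules are broken: the ISR does too much, interrupts arrive faster than they can be drained, and the main loop is starved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Anti-pattern sketch; every helper here is a hypothetical placeholder. */
extern uint8_t uart_read_byte(void);
extern void    append_to_packet_buffer(uint8_t b);
extern bool    packet_is_complete(void);
extern void    parse_and_dispatch_packet(void);
extern void    log_packet_to_flash(void);

/* If a new byte arrives every 10 microseconds but the work below takes
 * 50, the processor never leaves interrupt context: an interrupt storm. */
void uart_rx_isr(void)
{
    append_to_packet_buffer(uart_read_byte());

    if (packet_is_complete())
    {
        parse_and_dispatch_packet();   /* long-running logic inside the ISR */
        log_packet_to_flash();         /* slow I/O inside the ISR           */
    }
}

/* The deadlock variant: if a higher-priority ISR pre-empts this one and
 * then waits on something this handler holds (a buffer lock, a flag it
 * will never get to set), neither can finish, and the main thread, which
 * only runs when no interrupt is pending, never runs at all. */
```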

Dangers of Staying in the Storm

Back in 2020, as co-founder and technical team leader at Root AI, I experienced just such an interrupt storm, and it very nearly killed the business. Our company sought to create the first commercially viable autonomous robots that could physically interact with and harvest specialty crops like tomatoes, strawberries, and cucumbers. Toward the middle of our development arc, we hit a major obstacle.

The early generations of our system used an “open-loop” approach: visual data was used to plan each fruit pick, which the arm then executed as a fixed sequence of steps. This kept the architecture easy to debug and to characterize for performance. While simple, it meant that, after the planning was done and a target selected, the robot was essentially blindfolded as it moved to pick the fruit and either secured, dropped, or missed it. Still, the open-loop method took us far.

Harvesting accuracy metrics were decent, with most attempts to pick a fruit succeeding, but overall harvesting speed was half of what an MVP needed. We knew then of emerging approaches for more responsive “closed-loop” behaviors but, since we believed we only needed a 2X improvement in speed, we decided to focus instead on safer, more incremental improvements (faster vision models and imaging hardware, mechanical stabilizers, catches and chutes for collecting fruits, etc.).

Sure enough, speed quickly started to increase and, confident we were on the right track, we continued. Then the interruptions started to multiply. Pressure mounted to productionize the system for manufacturing and build larger fleets of robots to support our lofty revenue goals in customer pilots. Expanded engagement meant investing more time traveling to execute pilots and reporting progress to our early adopters. We began hiring aggressively to support the ambitious development timeline. Patent filings constantly needed updates. Technical diligence calls for prospective investors were scheduled. Everything seemed so important. Nothing was negotiable. None of it could simply be dropped. We had to press on.

Amid all that, the ground shifted beneath our feet, and two things started to happen. First, data from our field trials began painting a troubling but very fuzzy picture: the faster we went, the less accurate the system became. Second, on the business side, operational constraints and other factors discovered through trials gradually started shifting the assumptions in our techno-economic model. We didn’t need a 2X speed increase; we needed 3X, then 4X, now 5X.

Five years later, as I look back on our blindfolded robot thrashing away at the chaotically swinging tomato vines, it is painfully obvious what we needed to do. But by then, the interrupt storm had been raging for over a year. The worst part: every one of us on the team knew it, and yet I still felt helpless to stop it. The idea of placing a full stop on development to step back, deeply analyze the data, rethink our robot’s basic theory of operation, and debate what to do next was unthinkable. Frankly, it is a testament to the capability and ingenuity of both the technical and business development teams at Root AI that we were able to brute-force enough progress to secure our second round. Only with that funding secured did we take a long enough moment to see where change was needed, and we very nearly didn’t get that opportunity.

How to Break Out of the Storm

That feeling of being helpless to avoid disaster created burnout for me and the team I was leading faster than any number of hours at a screen or in the lab. We mistakenly thought that, to manage that burnout, all we needed was a short break – a long weekend or a week’s vacation. But without more drastic steps, the storm was bound to be worse when we returned. Thankfully, the same steps used to stop and correct interrupt storms and deadlocks in real embedded systems can also be used to stop them in professional life:

1. Shut Down the System

You can’t resolve the problem while the system is still running.

2. Identify the Sources of Deadlock

Determine which interruptions are occurring too often, taking too long, or are not prioritized well. Then employ common strategies to resolve each problem interruption.

2.1. Delete - It can’t interrupt you if you just don’t do it anymore.

2.2. Delegate - In embedded systems, when we have a critical function that absolutely must happen on time and with as little delay as possible (like monitoring a safety-critical input, commanding a motor, or listening for time-critical communications), we often add a second processor just to handle that function: a coprocessor. Too often, leaders corrupt delegation into a tool to offload low-priority or unimportant tasks to those with less skill. Notice that embedded systems do the exact opposite. It is your most critical, and sometimes highest-priority, recurring tasks that are most valuable to delegate, usually to someone trusted with more skill in that domain than yourself, not less. Only this type of delegation meaningfully reduces the “cognitive load” on the main thread, because it comes with the reassurance that a truly important function is in good hands and you no longer need to worry about it day to day.

2.3. Optimize - Implement new team processes or standards that reduce the frequency of the interruption or make each occurrence faster to resolve.

2.4. Re-Prioritize - Set clear guidelines for yourself on which interruptions get priority. What can interrupt the main thread? What can interrupt another interruption?

3. Reset the System State

If a particular set of conditions causes an interrupt storm, booting back up in the same state may just bring it back. First, it’s critical that you take the time to clear out outdated top-level goals, OKRs, project timelines, team org structures, KPIs, and success criteria, then redefine them under the updated plan. Second, you need to reset your own state. In an embedded system, it can sometimes be important to let the system sit unpowered for a bit before restarting; that lets any residual charge dissipate and ensures everything starts from a known electrical state when fired back up. People need this too. The break needs to be long enough to allow a full change in mental frame of reference, and it needs to be completely decoupled from work.

4. Restart the System

It’s important to resist the urge to dive back into the work before every earlier step has been completed. Starting the system back up through the proper boot sequence, ensuring all the critical team functions are clear and make sense under the new plan, is crucial to a successful restart.

Trust the Process

This type of full system reset can incur substantial downtime and lost development velocity while it is carried out. In my experience making similarly transformative course corrections in my own engineering teams, it usually takes at least two to three weeks. Less than that, and it’s possible that not enough time was spent on one of the steps above. Yes, this means team members may be without clear direction, or may keep working for a week or two on projects soon to be dramatically revised or even scrapped. In my case, that cost would have been trivial compared with the cost of the entire organization charging forward down a doomed development path for months or years without correction.

The more overwhelmed you currently feel, the more valuable it is to pause and diagnose what poorly controlled interruptions might be contributing to that feeling, and what greater problems in the business they may be obscuring.
