The Andon Cord in Software Teams
Wednesday, day six of a two-week sprint. An engineer on a self-organized team of eight is ninety minutes into a hairy migration script and no closer to a fix than at the eighty-minute mark. They know it. Nobody else on the team does yet.
What happens in the next sixty seconds is not a question of process. It is a question of whether they say the thing they already know, and whether the team responds when they do.
That is what the Andon Cord is actually about. Not the rope. Not the ticket. A single anomaly, surfaced before it has a chance to compound with three others into a release that nobody can deliver.
What the Andon Cord actually is
The story starts in 1896, in a weaving shed in central Japan. A power loom hums steadily through a run of cloth. A single thread snaps somewhere in the warp. The loom stops itself. There is no operator standing over it. There is no foreman shouting. The machine knows it is broken and refuses to keep producing flawed cloth until a human comes to look.
Sakichi Toyoda (the inventor whose company would later become Toyota) had built that loom. He called the underlying principle jidoka, usually translated as “automation with a human touch,” but more precisely “intelligent automation” or “autonomation.” A machine that knows when to stop is doing more work than a machine that just runs (Vorne, Wikipedia: Sakichi Toyoda).
Half a century later, Taiichi Ohno extended the principle from a single machine to an entire production line. The mechanism was a literal cord that ran along the assembly floor. Any worker who saw a defect could pull it. Pulling the cord did not immediately stop the line; it triggered an Andon signal (the word means “paper lantern”), brought a team leader to the station, and started a fixed-window response. If the problem could be solved before the workstation reached the end of its zone, the line kept moving. If not, the line stopped, and the entire system swarmed the defect (IT Revolution: The Andon Cord, IOSH: History of the Andon Cord).
Jeffrey Liker codified this as Principle 5 of the Toyota Way: Build a culture of stopping to fix problems, to get quality right the first time (ActioGlobal). The cord is the visible tool. The principle is the culture.
The most important and most-skipped fact about the Andon Cord at Toyota is that pulling it is expected. A commonly cited number from senior Toyota leaders is roughly a thousand pulls per shift across a single plant. When the rate dropped from ~1000 to ~700 at one Toyota factory, the plant manager called an all-hands, not to celebrate but to ask what had stopped working, because fewer pulls meant less learning (Gemba Academy). That inversion is the whole point. The cord is not a failure metric. It is a kaizen signal. High pull rate, fast resolution, no blame.
The lessons people skip
Two pieces of history get omitted when this concept gets imported into software, and both of them matter.
NUMMI. In 1984, Toyota and General Motors reopened GM’s Fremont, California plant as a joint venture (New United Motor Manufacturing, Inc.). They kept the same workforce GM had once called the worst in the company: 20% absenteeism, wildcat strikes, cars rolling off the line so broken that they had to be towed away for repairs. Toyota installed the same Andon system they used in Japan and trained the same people on it.
Within a year, those same workers were matching Toyota’s home-plant productivity and quality. Absenteeism fell below 3%. The single change that made the difference was that management kept its promise: when a worker pulled the cord, leadership swarmed to help, did not punish, and treated the pull as information (MIT Sloan: How to Change a Culture (Lessons from NUMMI), Lean Blog: Before Just Culture, There Was NUMMI). The workforce was never the problem. The system above them was.
When American auto plants in the 1980s tried to copy the cord without the culture, the cords were installed and almost never pulled. Workers had learned, correctly, that surfacing problems got them blamed. The mechanism was identical. The result was nothing.
Spear and Bowen. In 1999, Steven Spear and H. Kent Bowen published “Decoding the DNA of the Toyota Production System” in Harvard Business Review after four years studying more than 40 plants. Their finding was that the visible practices of TPS (Andon, kanban, takt time) were not the system. The system was four implicit rules: every activity is highly specified, every connection is direct and unambiguous, every pathway is simple, and every improvement is run as an experiment using the scientific method (HBR: Decoding the DNA). The Andon Cord works because the rules underneath it work. Without those rules, pulling the cord just creates noise.
These two lessons (the cord requires culture and the cord requires rigor underneath) are the load-bearing context for everything that follows. The same three failure modes kill every Andon system, software or steel: the culture punishes pulling, the rules underneath are absent, or the follow-through never happens. The mechanism is rarely what fails. The system around the mechanism is. Skip any of those three, and you end up with a Slack channel called #andon that nobody posts in.
Where the metaphor actually fits in software
A car door is a car door. A user story is not a user story. The translation is not a one-to-one map.
The first instinct (“the Andon Cord in software is the broken build”) is correct but incomplete. CI failure is one signal. There are several, and a real Andon system in a software team has a small number of clearly defined cords, not one fuzzy concept. Here is the working set, roughly ordered from cheapest to most disruptive:
| Cord | What “pulling” looks like | What stops |
|---|---|---|
| Broken main | Push that fails CI on main | New merges to main until green |
| Failed canary or rollback | Production deploy fails health check | Further deploys; on-call engages |
| Error budget freeze | SLO burned for the rolling window | Feature work; team focuses on reliability |
| Personal “I’m stuck” | Engineer 60-90 min into a wall, no progress | That engineer’s individual work; team swarms or pairs |
| Story-level signal | Story has aged past WIP age limit, scope is wrong, dependency surfaced mid-flight | Pulling the next ticket; team triages the in-flight one |
| Sprint-level Andon | Sprint goal is no longer reachable as scoped | Sprint as currently committed; replan, do not power through |
| Cross-team Andon | Upstream dependency is degraded; downstream teams blocked | Coordinated stop; incident commander or equivalent owns response |
The cheap cords (broken main, personal stuck signal) should be pulled all day, every day. The expensive cords (sprint termination, error budget freeze) are rare and serious. A team that never pulls the cheap cords is suppressing learning. A team that frequently pulls the expensive cords has a planning problem the cord is not going to fix.
The unit being protected here is the integrity of the delivery pipeline and the team’s commitments. Not “the build.” Not “the ticket.” The whole thing.
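One way to make the working set concrete is to keep it as data the team can read and review. The sketch below is a minimal illustration, not a prescribed format: the `Cord` and `Cost` names are hypothetical, and the cost tiers for the cords the text does not explicitly call cheap or expensive are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Cost(IntEnum):
    """Relative disruption of a pull; lower is cheaper."""
    CHEAP = 1
    MODERATE = 2   # assumed tier for cords the text calls neither cheap nor expensive
    EXPENSIVE = 3


@dataclass(frozen=True)
class Cord:
    name: str
    trigger: str   # what "pulling" looks like
    stops: str     # what halts when the cord is pulled
    cost: Cost


CORDS = [
    Cord("broken-main", "Push that fails CI on main",
         "New merges to main until green", Cost.CHEAP),
    Cord("failed-canary", "Production deploy fails health check",
         "Further deploys; on-call engages", Cost.MODERATE),
    Cord("error-budget-freeze", "SLO burned for the rolling window",
         "Feature work; team focuses on reliability", Cost.EXPENSIVE),
    Cord("personal-stuck", "Engineer 60-90 min into a wall, no progress",
         "That engineer's individual work; team swarms or pairs", Cost.CHEAP),
    Cord("story-level", "Story aged past WIP age limit or scope is wrong",
         "Pulling the next ticket; team triages the in-flight one", Cost.MODERATE),
    Cord("sprint-level", "Sprint goal no longer reachable as scoped",
         "Sprint as currently committed; replan", Cost.EXPENSIVE),
    Cord("cross-team", "Upstream dependency degraded; downstream blocked",
         "Coordinated stop; incident commander owns response", Cost.MODERATE),
]


def cords_by_cost(cost: Cost) -> list[str]:
    """Return the names of all cords in the given cost tier, in table order."""
    return [cord.name for cord in CORDS if cord.cost == cost]
```

Calling `cords_by_cost(Cost.CHEAP)` returns the two cords that should be pulled all day; keeping the list in version control makes changes to it a reviewed team decision rather than folklore.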
The scenario, played through
Back to the team on Wednesday morning.
Sam, the agile coach. Seven ICs and Sam. She is not a manager. She has no authority to fire anyone, deny a PTO request, or change a deadline. What she owns is the team’s system: the rituals, the visible flow, the impediments log, and the question nobody else is paid to ask. Is this still the right plan, and if not, who needs to know? In a healthy team that person is, functionally, the cord-keeper. She does not pull every cord. She makes sure cords get pulled when they need to be, and that pulls trigger a real response.
Here is how the morning unfolds in a team that has installed the system properly.
08:50. The engineer from the opening sends one Slack message: Andon - migration script - 90m no progress, want a pair. A senior engineer reacts, hops on a Zoom, and they fix it in twenty minutes. The cord here is the personal one.
The cost of the pull is one Slack message. The cost of not pulling, at this exact moment, is the start of the compound. By 14:00 it would have been one of three problems wobbling silently at once. At 08:50 it is just one.
09:30. The CI light goes red on main after a merge. Someone has already pushed a follow-up that depends on the broken commit. The team’s policy (pre-agreed, written down) is: when main breaks, no further merges, the breaking author owns the fix or the revert, the next-most-recent merger holds their PR. The revert lands in eight minutes. The team did not have a meeting about it. The rule was already there.
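A written broken-main policy is small enough to express as a merge gate. The following is a minimal sketch under stated assumptions: `MainBranch`, `record_ci_result`, and `may_merge` are hypothetical names, and the rule that only the breaking author may land the fix or revert is a simplification (many teams let anyone revert).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MainBranch:
    """Tracks whether main is green and who owns the fix when it is not."""
    green: bool = True
    breaking_author: Optional[str] = None

    def record_ci_result(self, author: str, passed: bool) -> None:
        """Update state after CI runs on a commit that landed on main."""
        if passed:
            # A green run (fix or revert) releases the hold.
            self.green = True
            self.breaking_author = None
        elif self.green:
            # First red run: remember who owns the fix or the revert.
            self.green = False
            self.breaking_author = author

    def may_merge(self, author: str, is_fix_or_revert: bool = False) -> bool:
        """While main is red, only the owner's fix or revert may land."""
        if self.green:
            return True
        return is_fix_or_revert and author == self.breaking_author
```

In practice this logic would sit behind branch protection or a merge queue; the point of the sketch is that "stop" is a boolean the tooling enforces, not a judgment call made per PR.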
10:15. A flaky test surfaces, and a senior engineer pulls up CloudWatch. 502s from an upstream service show as a thin red line on the dashboard, small but consistent, and they line up with the test failures. This is the cross-team cord. The team pages the platform on-call with one sentence, one dashboard link, and the affected endpoints. The platform team owns the response. The team that found it does not own the fix. They own the signal.
14:00. Two stories committed at 5 points each are now eight days into the sprint with no end in sight. Sam asks the team, in standup language: is this still the right plan? The team agrees it is not. They pull the sprint-level cord: the two stories are de-scoped, broken up, the committed work is reset to what is actually deliverable, and the leftover scope moves to the next sprint with the unknowns made explicit. Nobody is in trouble. The forecast was wrong. The forecast is now right.
What did not happen: the team did not power through. Nobody worked late “to make it up.” The newer engineers were not blamed for the migration time. The platform team did not get a heated message. Sam did not “manage” anyone. The system did the work, because the system existed before it was needed.
This is the model. Several cords, clearly defined, owned by the team, supported by management, used often.
The escalation pattern
Escalation in an Andon system is not the same as escalation in a ticketing system. The point is not to get the problem to someone with more authority. The point is to get the right help fast and to keep going. The right shape is short, layered, and predictable.
A working escalation pattern for a 6-10 person agile team looks roughly like this:
- Self-correct (0-15 min). The engineer notices and resolves it themselves. No cord pulled. This is the highest-leverage path. The criterion for moving on is the time-box, not “did I figure it out yet?”
- Peer help (15-90 min). Pull the personal cord. Slack the team. Pair. The criterion for stopping the time-box: real progress, not “I think I see it now.”
- Team swarm (within the day). The cord went out wider. Two or three people are now on the problem. The Scrum Master / agile coach knows. The story may be re-scoped. The work-in-progress for the rest of the team holds.
- Cross-team / on-call (within the hour, often immediate). A dependency is broken or an incident is forming. The team’s job is to raise the signal, not to fix systems they do not own. The on-call rotation, an incident commander, or the platform team owns the response.
- Leadership (rare, defined trigger). Sprint goal will miss. SLO will burn. A regulatory deadline is in jeopardy. A teammate is in distress. The agile coach surfaces it. The leader’s job is to remove organizational impediments, defer dependent commitments, and protect the team. Not to take over the problem.
- Postmortem (within the week, blameless). Every non-trivial pull that escalates beyond peer help generates a learning artifact. Not a witch hunt. A change to the system (a missing test, a missing alert, a missing runbook, a missing skill, a missing decision) that makes the next instance of this problem either impossible or cheap.
The Scrum Patterns community formalized a piece of this as the Emergency Procedure and Sprint Termination patterns: if the sprint goal is unreachable, terminate and replan rather than continue working a plan you know is wrong (Scrum Patterns: Emergency Procedure). This is the expensive cord. Pull it when the data says to. Do not let pride keep the line moving.
The two failure modes to avoid: cords that escalate but never resolve (the issue lives in a Slack channel and ages), and cords that resolve without learning (the issue is fixed, the system is unchanged, the next instance plays out identically). The pull is the start of the work, not the end.
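The first three rungs of the escalation pattern are time-boxed, which makes them easy to sketch as a lookup. This is an illustration only: the function and enum names are hypothetical, and treating exactly 15 and exactly 90 minutes as the moment to move up a level is an assumption about how to read the ranges.

```python
from enum import Enum


class Level(Enum):
    SELF_CORRECT = "self-correct"
    PEER_HELP = "peer help"
    TEAM_SWARM = "team swarm"


def escalation_level(minutes_stuck: int) -> Level:
    """Map time spent stuck to the expected escalation level.

    The decision is driven by the clock, not by whether the engineer
    feels close to a fix.
    """
    if minutes_stuck < 0:
        raise ValueError("minutes_stuck must be non-negative")
    if minutes_stuck < 15:
        return Level.SELF_CORRECT
    if minutes_stuck < 90:
        return Level.PEER_HELP
    return Level.TEAM_SWARM
```

The cross-team, leadership, and postmortem rungs are triggered by events rather than elapsed time, so they are deliberately left out of this sketch.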
What the research supports
The Andon Cord is a 130-year-old idea, and the science that makes it work in software is mostly indirect; a lot of it lives under names like psychological safety, generative culture, trunk-based development, and error budgets. Here is the evidence base, with honest strength labels.
Strong evidence (replicated across multiple studies and years):
- Generative organizational culture predicts software delivery performance. Ron Westrum’s typology (pathological, bureaucratic, generative) was originally developed for medical and aviation safety. Forsgren, Humble, and Kim showed in Accelerate and the State of DevOps reports that generative culture is one of the most durable predictors of both delivery performance and organizational outcomes. The mechanism that matters here: in generative cultures, messengers are trained, risks are shared, failure leads to inquiry. That is a cord-pull culture in the language of organizational psychology (DORA: Generative Organizational Culture, IT Revolution: Westrum’s Model in Tech Orgs).
- Trunk-based development with small frequent merges outperforms long-lived branches. DORA’s longitudinal data finds that teams with three or fewer active branches and at-least-daily integration significantly outperform on all four delivery metrics. This is the technical substrate of the broken-main cord: if you do not integrate frequently, you do not have a meaningful build to keep green (DORA: Trunk-based development).
- Speed and stability are not a tradeoff. The most cited DORA finding, replicated every year from 2014 through 2025. Elite performers achieve both simultaneously. Stop-the-line discipline is part of how they get there: you are not slowing yourself down by halting on quality signals; you are removing the rework cycle that the alternative produces (Accelerate).
- Loosely coupled architecture is the highest-impact technical capability. DORA 2017-2023. Tight coupling increases the cost of any single cord-pull because the blast radius is larger. Loosely coupled systems make local Andon cheap; tightly coupled systems make every pull expensive (DORA capabilities).
Moderate evidence (supported but with caveats):
- Psychological safety enables the speak-up behaviors that make Andon work. Amy Edmondson’s foundational 1999 ASQ paper established the construct in cross-functional teams. The 2024 mixed-methods study (Alami, Zahedi, and Krancher, Empirical Software Engineering) extended it specifically to agile software teams using twenty interviews and a 423-respondent survey, finding that psychological safety produces social enablers that advance team quality outcomes (Edmondson 1999, ASQ, Empirical Software Engineering 2024). The caveat: psychological safety is correlated with quality, not proven causal, and the construct is sometimes mismeasured (Edmondson herself has flagged this in 2024 interviews). Strong direction, soft causation.
- Error budget freezes work as a “stop the line” mechanism in SRE practice. Google’s published policy is: if a service exceeds its error budget for the preceding four-week window, halt all changes and releases except P0/security fixes until reliability is restored. The intent is explicit in the policy text: this is not punishment; it gives teams permission to focus exclusively on reliability when the data indicates that reliability matters more than features (Google SRE: Error Budget Policy). The evidence is largely Google’s own experience plus adoption by credible practitioners. Replication outside hyperscale is sparse.
- Documentation is a force multiplier. DORA 2021-2023 found high-quality docs make trunk-based development up to 12.8x more impactful and meaningfully amplify CI/CD adoption. This matters for Andon specifically because the response to a pull depends on the runbook existing. A cord without a runbook is theater (DORA 2023).
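The error budget freeze in the list above rests on simple arithmetic, sketched here for a request-based SLO. This is a minimal illustration, not Google’s implementation: the function names are hypothetical, and the request counts are assumed to cover the same four-week window the policy refers to.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the window's error budget still unspent.

    1.0 means nothing spent, 0.0 means exactly exhausted,
    and a negative value means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def should_freeze(slo_target: float,
                  total_requests: int,
                  failed_requests: int) -> bool:
    """True when the budget is overspent and feature releases should halt."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) < 0
```

With a 99.9% target and one million requests, the budget is roughly 1,000 failed requests; 1,500 failures overspends it and trips the freeze, while 400 leaves about 60% of the budget unspent.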
Industry data (practitioner-led, not formally replicated):
- Pull rate is a kaizen signal, not a failure metric. Toyota’s own internal practice (~1000 pulls per shift, with concern when the rate drops) is widely reported but not the subject of independent academic replication. Software-team practitioners have reported similar patterns: in one published Agile DC case study, a development team’s Andon pulls rose to an average of 1.27 per day while their cycle time dropped 82% (AgileDC 2019, Gemba Academy). The directional finding is consistent across reports. The number is not load-bearing.
- Amazon’s “Andon Cord” customer-service mechanism scaled the principle into a software-driven operation. Bezos cited it explicitly in shareholder letters; the practice was that any customer service rep could pull a product from the site if they detected a defect pattern, and the retail org had to fix the problem or the product stayed pulled (Lean Enterprise Institute, SixSigma.us: Customer Service Andon). It is a real implementation, and an important reminder that the cord is most powerful when the people closest to the customer hold it.
Weak evidence (claimed but thinly supported):
- There is a single right way to wire an Andon system into a Scrum team. No published study identifies one. The concrete mechanisms (broken main policy, error budget policy, sprint termination, swarm rules) are well-documented individually. The integrated pattern is mostly experience reports.
- The Andon Cord, as such, “causes” psychological safety. The relationship runs the other direction. Generative culture and psychological safety make cord-pulls possible. Installing a cord into a culture that does not have those properties produces the GM-1980s outcome: the rope hangs, nobody touches it.
Best practices
The cord is the cheap part. What follows is the rest of the system.
Define what counts as a cord, before you need one. Write it down. Broken main. Personal stuck signal at 90 minutes. Story age past the WIP age limit. Sprint goal at risk. Production deploy failed. Dependency degraded. Six items max. If your team cannot recite them in standup, you do not have a system.
Define who can pull each one. For most cords, the answer is “anyone on the team.” For sprint termination and error budget freeze, the answer is more nuanced: the agile coach surfaces, the team decides, leadership is informed. Either way, write it down.
Define what “stop” means. Stopping main means no merges, not “merge with care.” Stopping the sprint means replanning, not “let’s see how the next two days go.” Stopping for an error budget means the feature backlog freezes, not “we’ll talk about it.” Ambiguity here is the entire game. The cord that means “we should probably slow down a little” is no cord at all.
Define who responds, and how fast. Personal cord: the team, within the workday. Build cord: the merger, immediately. Cross-team cord: the on-call, paged. Sprint cord: the team, in the next ceremony. Each pull has a named response, not a hope.
Practice it before you need it. Run a build-break drill. Run a sprint-termination tabletop. Read the runbook for the error-budget freeze before you are in one. The expensive cords get pulled rarely. That is exactly why the team will fumble them when the moment comes, unless they have practiced.
Track pull frequency as a leading indicator. A team with zero pulls per sprint is not a healthy team. It is a team suppressing signal. The right shape is many cheap pulls (peer help, build holds), few expensive pulls (sprint terminations, error budget freezes), and a learning artifact for each non-trivial one. Going up on cheap pulls is good. Going up on expensive pulls means the planning system is broken, not the team.
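Tracking pull frequency needs nothing more than a count per sprint, split by cost. The sketch below is illustrative: the cord identifiers and the `summarize_pulls` helper are hypothetical, and the cheap and expensive sets simply mirror the examples named in the paragraph above.

```python
from collections import Counter

# Tiers follow the examples in the text; adjust to the team's own cord list.
CHEAP_CORDS = {"personal-stuck", "broken-main"}
EXPENSIVE_CORDS = {"sprint-termination", "error-budget-freeze"}


def summarize_pulls(pulls: list[str]) -> dict:
    """Summarize one sprint's cord pulls, given one cord name per pull."""
    counts = Counter(pulls)
    return {
        "total": len(pulls),
        "cheap": sum(counts[name] for name in CHEAP_CORDS),
        "expensive": sum(counts[name] for name in EXPENSIVE_CORDS),
        # Zero pulls is treated as suppressed signal, not as a clean sprint.
        "suppressing_signal": len(pulls) == 0,
    }
```

Comparing the summaries sprint over sprint gives the trend the text describes: rising cheap pulls are a good sign, rising expensive pulls point at the planning system.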
Run the postmortem blameless and the action items real. Blameless postmortems are not “we don’t talk about who did it.” They are “we treat the system, not the individual, as the unit of correction” (Rootly: Blameless Postmortems). Every non-trivial pull deserves one. Every postmortem deserves at least one action item that changes the system. If postmortems consistently produce no system changes, they are theater and should be killed.
Protect the cord-pullers, especially the new ones. The first time a junior engineer pulls a cord and management responds with curiosity and help, you have built the team. The first time they pull and get blamed, you have lost it, and probably permanently. Senior engineers and the agile coach should pull cords visibly and often, especially the personal one, especially in front of new hires. Modeling the behavior is faster than teaching it.
Anti-patterns to recognize
A few patterns recur across teams that have an Andon system on paper and no Andon culture in practice.
The cord is decorative. A #andon Slack channel exists. It has eleven members and three messages, all from the agile coach. The team has never pulled. This is the GM-1980s condition. The mechanism is installed. The culture isn’t.
The cord triggers a meeting, not a response. Pulling generates a calendar invite for “next Thursday’s retro discussion.” The signal goes cold. Cord pulls need synchronous response or they are not cords.
The cord punishes the puller. “Why didn’t you figure it out yourself?” “Why didn’t you escalate sooner?” Either question, asked seriously, kills the system. The next pull will be later or never.
Every problem is a cord pull. Inverse failure mode. The cord becomes a way to dodge the work of self-correction, peer help, or basic problem-solving. The cord is for things that actually warrant stopping the line, not for “I don’t feel like reading the docs.”
The cord exists, the runbook does not. Pulling for an error budget freeze with no documented response plan, no defined freeze duration, no decision-maker, no exit criteria. The pull happens. Nothing meaningful follows. The cord teaches the team the cord is fake.
Cord pulls do not change the system. The same problem surfaces, gets swarmed, gets resolved, surfaces again next sprint. The pull was a workaround, not an investment. Without learning artifacts and follow-through, the cord is a treadmill.
A morning where nobody pulled
Put the same Wednesday under a culture where the cords exist on paper and stay there. Watch the day.
08:50. The engineer is ninety minutes into the migration. They do not send the message. They keep grinding. By 11:00 the script is still broken, the engineer is quietly demoralized, and a teammate is starting to wonder, in the back of their head, why the work seems stuck.
09:30. Someone breaks main. There is no written policy, so the next merger pushes anyway because their PR is “a small change.” By noon the build has been red for two and a half hours, three other PRs are queued behind it, and nobody is sure whose turn it is to fix what.
10:15. A senior engineer spots the 502s on the upstream service. Four percent feels small. Platform is “always a pain to talk to.” It can wait. Later does not arrive. By 14:00 customer ops is paging.
14:00. Two stories committed at 5 points each are now eight days into a two-week sprint with no end in sight. Nobody says it. The team works late “to make it up.” Two engineers skip lunch the next day. The release ships short. The retrospective will call it a “rough sprint” and resolve nothing.
The mechanism was on the wall the whole time. Every individual cord-pull moment came and went without one. The compound is not a bug; it is what not pulling produces, deterministically, every time.
That is the deeper anti-pattern. Not any single missed pull, but the silent agreement that small signals are not worth surfacing. The cost is never paid at the moment of the missed pull. It is paid hours or days later, all at once, in a release that nobody can deliver.
Monday morning actions
Concrete, by role.
If you’re an IC.
- Pull the personal cord earlier than feels comfortable. Ninety minutes stuck without progress is the line, not three hours. The team’s expected behavior is help, not judgment.
- When you break main, you own the revert. Same workday, ideally same hour. Revert first, fix in a follow-up PR. Do not negotiate.
- Surface signals for systems you do not own. If you spotted the upstream 502s, the platform team’s on-call needs to know in one Slack message and one dashboard link. Your job is the signal. Their job is the fix.
If you’re a team lead or agile coach.
- Write the cord list down. Six items. Put them in the team’s working agreement. Read them in retro until everyone can recite them.
- Pull a cord visibly in front of the team this sprint. Especially the personal one. Especially if you are senior. The fastest way to install the behavior is to model it.
- When the sprint goal is wobbling, name it. Do not power through. The Scrum Patterns Emergency Procedure exists for a reason. Re-plan with the team, do not extract the work from individuals.
- Track pull frequency as a leading indicator, not a defect rate. A team that pulled 14 times this sprint and learned 14 things is healthier than a team that pulled twice and shipped silently.
If you’re a VP, director, or sponsor.
- Underwrite the cord. Out loud. Tell the team, in writing, that pulling is expected and that the response is help, not blame. Then prove it the first time it costs you something. The first pull that costs you a date is the most important moment in the system’s life.
- Fund the runbooks. The cord without runbooks is theater. Treat documentation as infrastructure, not overhead. DORA’s 12.8x multiplier on the impact of trunk-based development is documentation work in a trench coat.
- Invest in loosely coupled architecture. It is the difference between a cord-pull that affects one team for an hour and one that affects six teams for a day. The blast radius is an architectural choice.
- Adopt error budgets if you have an SLO posture, and a sprint termination policy if you do not. Pick the one that matches your operating model. Either gives the team a predictable, leadership-supported “stop” mechanism that is not a meeting.
The honest summary
The Andon Cord is one of the most-quoted ideas in lean software thinking and one of the least-installed. The reason is consistent: the cord is cheap and the culture underneath it is not.
For a 6-10 person agile team, the practical move is small. Define a handful of cords. Name who pulls and who responds. Run the postmortems blameless and the action items real. Pull a cord visibly this week. Track pull rate as a kaizen signal, not a defect rate.
A loom built in 1896 stopped itself when a thread broke. A workforce written off as the worst in GM matched Toyota’s home plant in a year. A team of eight on a Wednesday morning does or does not say the thing they all already know. The mechanism is older than software. The decision is new every time.
The cord that nobody pulls is not a sign of a team without problems. It is a sign of a team without trust. That is the harder thing to build, and the only thing that makes the rest of this work.
Sources
- Spear, Steven, and H. Kent Bowen. “Decoding the DNA of the Toyota Production System.” Harvard Business Review, September-October 1999. Four-year study of more than 40 plants. The four implicit rules underneath Andon.
- Liker, Jeffrey K. The Toyota Way. McGraw-Hill, 2004. Principle 5: build a culture of stopping to fix problems. (summary)
- Ohno, Taiichi. Toyota Production System: Beyond Large-Scale Production. Productivity Press, 1988. The original mechanism description from the architect of TPS.
- Forsgren, Nicole, Jez Humble, and Gene Kim. Accelerate. IT Revolution, 2018. The DORA research foundations: generative culture, trunk-based development, speed and stability together.
- DORA: Generative Organizational Culture. Westrum’s typology integrated into the DORA capabilities model.
- DORA: Trunk-based Development. Longitudinal evidence for short-lived branches and frequent integration.
- Edmondson, Amy. “Psychological Safety and Learning Behavior in Work Teams.” Administrative Science Quarterly, 1999. The foundational paper.
- Alami, Adam, Mansooreh Zahedi, and Oliver Krancher. “The role of psychological safety in promoting software quality in agile teams.” Empirical Software Engineering, 2024. Mixed-methods study of agile software teams: 20 interviews + 423-respondent survey.
- Google SRE Workbook: Error Budget Policy. The published policy for halting changes when reliability is in deficit, a software-native Andon.
- Scrum Patterns: Emergency Procedure. The pattern formalizing sprint termination as a stop-the-line response.
- Rootly: How to Run Effective Blameless Postmortems. Practical mechanics of the learning artifact that follows a non-trivial cord-pull.
- MIT Sloan: How to Change a Culture (Lessons from NUMMI). The Fremont turnaround: same workforce, different culture, same Andon Cord.
- Lean Enterprise Institute: How Lean is Amazon? Bezos’s customer-service Andon and its operational implementation.
- IT Revolution: The Andon Cord (John Willis). Practitioner essay tracing the line from Toyoda’s 1896 loom to modern DevOps.
- Gemba Academy: How Many Times Do You Pull the Andon Cord Each Day? On Toyota’s pull rate as a kaizen signal, not a defect rate.
- Vorne: Andon (Etymology, Origins, History). Compact history of the term and its evolution.
For related reading on the same evidence base, see DORA Metrics: What the Research Actually Says and One Piece Flow in Software Delivery.