DORA Metrics: What the Research Actually Says


In 2024, three out of four developers told researchers they were more productive because of AI tools. The DORA data from the same survey — 39,000+ respondents — showed that a 25% increase in AI adoption correlated with 1.5% lower delivery throughput, 7.2% lower delivery stability, and 2.6% less time spent on valuable work. Both findings came from the same study, the same year, the same people.

This is not a DORA tutorial. It is a research evaluation — the evidence base, the real limitations, and what to do with imperfect but genuine data on Monday morning.


What Are DORA Metrics?

DORA stands for DevOps Research and Assessment — a research program, not just a metric set. It started in 2013 as the State of DevOps reports, a collaboration between Puppet Labs, Gene Kim, and Jez Humble. In 2014, Nicole Forsgren joined and brought PhD-level survey methodology — structural equation modeling, psychometric validation, the kind of rigor that turns industry surveys into something closer to science. Forsgren, Humble, and Kim co-founded DORA as an independent company. Google acquired it in 2019.

The program’s canonical reference is Accelerate (2018), which codified the framework from four years of survey data (2014–2017) and won the Shingo Publication Award. The research now spans more than a decade, with 39,000+ respondents in 2024 alone. Whatever your opinion of the findings, the longevity is real.

The Four Metrics

Metric                 | Category    | What It Measures
Deployment Frequency   | Throughput  | How often you successfully release to production
Lead Time for Changes  | Throughput  | Time from code commit to code running in production
Mean Time to Restore   | Stability*  | Time to restore service after a production incident
Change Failure Rate    | Stability   | Percentage of deployments causing a production failure

*In 2024, MTTR was reconceptualized as a throughput metric (how fast you recover), and a fifth metric — Deployment Rework Rate — was added to measure stability directly. The framework is not static. Anyone treating it as settled is already behind.
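
For teams that do have deployment and incident records, the four metrics are straightforward to compute directly rather than estimate by survey. Here is a minimal sketch in Python; the `Deployment` and `Incident` record types are hypothetical stand-ins of my own, not part of any DORA tooling (DORA itself measures these via survey response buckets, not telemetry):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Deployment:
    committed_at: datetime   # first commit in the change
    deployed_at: datetime    # landed in production
    failed: bool             # caused a production failure

@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime

def dora_metrics(deployments: list[Deployment], incidents: list[Incident], days: int = 30):
    """Compute the four classic DORA metrics over a reporting window of `days` days."""
    return {
        "deployment_frequency_per_day": len(deployments) / days,
        "lead_time_for_changes": median(
            d.deployed_at - d.committed_at for d in deployments
        ),
        "change_failure_rate": sum(d.failed for d in deployments) / len(deployments),
        "time_to_restore": median(
            i.restored_at - i.started_at for i in incidents
        ),
    }
```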

The Original Counterintuitive Finding

Before DORA, the industry assumption was that speed and stability trade off. Move fast, break things. Pick one.

DORA’s headline finding was that high performers achieve both simultaneously. Speed and stability correlate positively. This was genuinely counterintuitive in 2015, and it remains the most important and durable single finding from the program. The 2024 numbers put a scale on it: elite performers showed 127x faster change lead time, 182x more deployments per year, 8x lower change failure rate, and 2,293x faster failed deployment recovery compared to low performers (DORA 2024). That last number is not a typo — it reflects recovery times of minutes versus weeks, which is less about being 2,000 times better and more about operating in a fundamentally different way.
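
To make that ratio concrete, here is my own back-of-envelope arithmetic (illustrative only; the report publishes the multiplier, not these specific durations):

```python
# If a low performer takes roughly two weeks to recover from a failed deployment...
low_recovery_minutes = 2 * 7 * 24 * 60                 # 20,160 minutes
# ...then a 2,293x faster recovery works out to single-digit minutes.
elite_recovery_minutes = low_recovery_minutes / 2293   # ~8.8 minutes
```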

The Performance Tiers

The four tiers — Low, Medium, High, Elite — emerged from cluster analysis of each year’s survey data. They are not fixed benchmarks. They shift annually. Elite could not even be identified in 2022. In 2024, medium performers showed lower change failure rates than high performers — the first time since 2016. And in 2025, DORA abandoned tiers entirely. We will come back to this.
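
A small sketch of what "tiers from cluster analysis" means mechanically. It uses scikit-learn's k-means on made-up data; DORA's actual analysis applies its own clustering procedure to survey responses, so this is only an illustration of why the groups are re-derived each year rather than read off fixed thresholds:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for one year's survey responses: one row per respondent,
# one column per encoded metric (deploy frequency, lead time, CFR, recovery time).
rng = np.random.default_rng(0)
responses_this_year = rng.normal(size=(1000, 4))

tiers = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(responses_this_year)

# The cluster boundaries depend entirely on this sample. Run the same code on next
# year's responses and the groups land somewhere else, which is why the thresholds
# shift annually and why no "elite" cluster emerged at all in 2022.
```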


What the Research Says

Here is what the research says, with honest strength ratings. A note on the labels: each source is rated by evidence strength — Replicated, Promising, Industry data, or Claimed. The labels mean what they say.

The Key Studies

Accelerate (Forsgren, Humble, Kim, 2018) — Strength: Promising

The book that started the movement. Cross-sectional surveys with Likert scales, cluster analysis, and structural equation modeling across 23,000+ respondents from 2,000+ organizations. Key findings: speed and stability are not a trade-off; high performers are 2x more likely to meet commercial goals (productivity, profitability, market share); 24 capabilities across five categories predict performance; and Westrum generative culture predicts delivery performance.

One important distinction: this is often cited as “replicated” research because surveys ran for multiple years. It is not replicated in the scientific sense. It is the same team, same methodology, applied to different annual samples. That is longitudinal consistency — valuable, but not the same as independent replication.

State of DevOps Reports (2013-2025) — Strength: Promising

The longest-running software delivery research program in existence. Twelve-plus annual reports, growing from 4,000 respondents to 39,000+. The methodology evolved significantly — metrics were added, removed, and reconceptualized based on statistical construct validity. Change Failure Rate did not pass validity tests in 2019 and was refined. The elite cluster vanished in 2022. The 2023 report issued an explicit Goodhart’s Law warning. The 2025 report abandoned performance tiers entirely and introduced seven team archetypes.

This evolution is a feature, not a bug. Scientific self-correction is how research programs maintain integrity. But it also means the framework you read about in Accelerate is not the framework DORA is publishing today.

External Validation Attempts — Strength: Industry Data (sparse)

Very few exist — and that is itself a finding worth naming. Kunze et al. instrumented 37 services with objective telemetry and found strong deployment-frequency correlation in only 29% of systems. Junade Ali and Survation found that engineers and the public prioritize factors beyond the four metrics, and that risk appetite varies by sector — challenging the universal tier system. Microsoft Research telemetry showed 40% disagreement between quantitative metrics and developer sentiment. Google internal research (lagged panel analysis of their own engineering telemetry) found that code quality improvements precede productivity gains — making quality, not speed, the proven throughput lever.

The research base for DORA is DORA’s own surveys. External validation with objective data is sparse and shows mixed results. Name this directly when someone tells you the science is settled.

Forsgren’s Methodology — Context, not a rating

Forsgren’s PhD is in Management Information Systems, not experimental science. Her methodology — SEM, psychometric validation, Cronbach’s alpha — is rigorous for MIS survey research. It is not experimental research and should not be evaluated as such. She has been consistently transparent in interviews that the findings are correlational. The book’s language sometimes implies causation (“drives,” “impacts”) — that is a communication gap, not scientific fraud.
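
For readers unfamiliar with the psychometric toolkit: Cronbach's alpha checks whether a set of survey items answered by the same people hang together as a single construct. A minimal sketch of the computation, using the textbook formula (the variable names are mine, not Forsgren's):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal-consistency reliability of a survey scale.

    `items` has shape (n_respondents, n_items): one row per respondent,
    one column per Likert item intended to measure the same construct.
    """
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # variance of each item, summed
    total_var = items.sum(axis=1).var(ddof=1)        # variance of respondents' total scores
    return (k / (k - 1)) * (1 - sum_item_var / total_var)
```

A high alpha means respondents answer the items consistently; as the limitations section below notes, it says nothing about whether those answers match objective reality.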

The Headline Numbers

The numbers people share in Slack:

Elite vs. Low (2024): 127x faster lead time. 182x more deployments per year. 8x lower change failure rate. 2,293x faster failed deployment recovery.

The AI Paradox (2024): 75.9% of respondents use AI daily. 75% report personal productivity gains. A 25% increase in AI adoption correlates with 1.5% decreased throughput, 7.2% decreased stability, 2.6% decreased valuable work time.

The AI Amplification Effect (2025): +21% individual task completion. +98% pull requests merged. +91% code review time. +154% pull request size. +9% bug rate.

The Industry Got Worse (2024): High performance cluster shrank from 31% to 22%. Low performance cluster grew from 17% to 25%.

Documentation Multiplier (2023): Teams with quality docs are 2x more likely to meet targets. Paired with trunk-based development: up to 12.8x organizational performance impact.

Business Outcomes (Accelerate): High performers are 2x more likely to meet commercial goals and 2x more likely to meet non-commercial goals (operating efficiency, customer satisfaction).

Where the Research Falls Short

These limitations do not make DORA useless. They make it useful in a more specific way than most implementations assume.

Everything is self-reported. No objective telemetry, no git log analysis, no deployment record validation. Three biases work against accuracy here. Response bias: the survey’s own respondents are already DevOps believers — organizations in crisis don’t self-select into DevOps surveys. Recall bias: nobody can accurately report their deployment frequency from memory. Social desirability bias — the lying-to-your-dentist effect: people over-report the practices they know they should follow. A fourth factor compounds them: high-morale teams rate everything more favorably, making it impossible to isolate which factor actually moves the needle. Forsgren’s psychometric validation confirms the survey measures something consistently. It does not confirm the survey measures reality accurately. Both things can be true. The full survey instruments and raw data have never been publicly released, making independent replication impossible (Lee, keunwoo.com).

Correlation, not causation. The cross-sectional design cannot establish causation. The book uses causal language the methodology does not support. The authors coined “inferential predictive analysis” — a term, as Lee’s review notes, that does not appear in the statistical literature they cite. They used p < 0.10 rather than the conventional p < 0.05. Practically: trunk-based development is widely cited as “proven” by DORA. The research shows correlation. It is equally plausible that high-performing teams are able to do TBD because they already have strong testing, CI/CD, and psychological safety — not that TBD caused their performance. Forsgren has acknowledged this in interviews. The book is less clear.

The tiers are not fixed targets. They emerge from cluster analysis of each year’s data — statistical artifacts, not permanent benchmarks. Elite could not be identified in 2022. Medium outperformed high on CFR in 2024. DORA abandoned tiers entirely in 2025. Anyone benchmarking against “elite tier” thresholds is using a model DORA itself retired. If your OKRs reference elite-tier numbers, your OKRs are built on sand.

Velocity without value. A team can score elite on all four metrics while building features nobody wants. DORA measures pipeline speed and reliability. It is blind to whether the output creates business value. DORA’s own 2022 data confirmed this: delivery performance alone did not predict organizational success. The 2025 archetype model added “product performance” and “valuable work” dimensions — DORA’s own response to a gap DORA’s own data revealed.

Circular reasoning in some capabilities. From Lee’s review: “Deployment automation is highly correlated with fast deployments.” Some of the 24 capabilities are near-prerequisites for the metrics themselves. CI/CD practices predicting deployment frequency is definitionally true, not empirically surprising.


The Debate

Is the sample representative? DORA’s respondents self-select. They identify as DevOps practitioners — people already invested in engineering excellence. Organizations in crisis, with dysfunctional cultures, or without a DevOps identity are systematically underrepresented. This is not unique to DORA, but it means the findings describe what works in organizations that have already bought in. The counter-argument: maybe the point is that it’s a target, not a description. Derived from top performers, useful as a north star even if it doesn’t describe your org today. Both are right.

Goodhart’s Law — including DORA’s own warning. When a measure becomes a target, it ceases to be a good measure. The 2023 State of DevOps report said this explicitly: “Creating league tables leads to unhealthy comparisons and counterproductive competition.” The slowest team might have improved the most. Comparing teams with different applications and infrastructure “often isn’t productive.” In practice, teams split deploys into trivial changes to inflate Deployment Frequency. Low performers take stability hits trying to force throughput. The irony is thick: the framework that created industry benchmarks warned its own users not to benchmark with it.

The AI productivity paradox — unresolved. Individual productivity goes up. System-level performance goes down. Both findings come from the same survey, same respondents, same year. The hypothesized mechanism is plausible: AI increases batch sizes (+154% PR size), larger changesets are riskier, code review becomes a bottleneck (+91% review time). “Code generation isn’t the bottleneck” (RedMonk) is probably the most useful single-sentence framing — the constraint was never typing speed, and AI optimizes typing speed. The 2025 “mirror/amplifier” finding adds another layer: AI degrades performance for teams lacking user-centric orientation. Not stagnation — degradation. We do not yet know whether organizations that build the right capabilities and then adopt AI will see the numbers recover. The honest answer is: this is unresolved.

Has DORA evolved into a better framework — or just a different one? The 2025 report introduced seven team archetypes that blend delivery metrics with human factors (burnout, friction, valuable work). More nuanced? Certainly. But the original simplicity — four metrics, four tiers — is what made DORA adoptable. Seven archetypes are harder to measure and harder to move. “Constrained by Process” is a diagnosis, not a metric. DORA would say the old model was insufficient, as proven by their own 2022 data failure and 2023 Goodhart warning. Skeptics would say the new model is harder to act on. Both are right.

It is worth noting that DORA is not the only framework in this space. Forsgren herself co-authored the SPACE framework (2021) — Satisfaction, Performance, Activity, Communication, Efficiency — which explicitly broadens beyond delivery metrics. DevEx (Noda, Storey, Forsgren, Greiler, 2023) takes a developer-experience-centered approach. The trend across all frameworks is the same: delivery speed alone is not enough.


So What? Monday Morning Actions

Here is what to do with imperfect but real evidence.

If You’re an IC

Reduce your batch size — especially when using AI. AI increases PR size by 154% (DORA 2025). Larger changesets are riskier changesets. Actively resist AI’s tendency to generate in bulk. Every DORA report since Accelerate (2018) reinforces this: smaller batches, faster feedback, fewer failures.

Treat AI output as draft, not done. AI adoption increased bug rates by 9% and code review time by 91% (DORA 2024-2025). AI-generated code deserves the same review rigor as human code. Not more, not less.

Know your rollback procedure. MTTR is the metric most within an IC’s direct control (Accelerate, 2018). Do not wait for an incident to discover whether you can roll back cleanly. Practice it.

Invest in documentation. Teams with quality docs are 2x more likely to meet targets; paired with trunk-based development, the multiplier reaches 12.8x (DORA 2023). Documentation is almost never discussed in DORA conversations. It should be.

If You’re a Team Lead

Use DORA metrics as diagnostics, not targets. DORA’s own 2023 report said it: “Creating league tables leads to unhealthy comparisons and counterproductive competition.” Use the metrics to identify bottlenecks, not to rank teams.

Measure burnout and friction alongside delivery. The 2025 archetype shift exists because delivery metrics alone miss team health (DORA 2025). A “Pragmatic Performer” team — high speed, low engagement — is not healthy. It is a team running on borrowed time.

Protect code review capacity before scaling AI. AI adoption increased review time by 91% (DORA 2025). Adding AI coding tools without adding review capacity creates a downstream bottleneck. The throughput gain gets absorbed by the review queue.

Establish explicit AI policies. “Clear AI stance/policies” is the first of seven capabilities that predict AI success (DORA 2025). Teams with explicit guidance outperform those that adopted ad hoc.

If You’re a VP/Director

Stop comparing teams on DORA scores. DORA explicitly warns against it (DORA 2023). Different applications, different infrastructure, different constraints. Measure improvement trajectories, not absolute scores.

Treat AI adoption as organizational transformation. AI amplifies what already exists (DORA 2025). Dysfunctional organizations with AI get more dysfunction, faster. Fix the org first, then scale AI. The seven capabilities are your checklist.

Pair delivery metrics with value metrics. DORA’s own 2022 finding: delivery performance alone does not predict organizational success. Add product performance, user satisfaction, and business outcome measures. Elite DORA scores without value metrics are vanity metrics with better branding.

Invest in internal developer platforms, not mandates. Platforms drive measurable improvements across all four metrics (DORA 2024-2025). Management directives alone do not.

Fund documentation as infrastructure. The 2x/12.8x multipliers (DORA 2023) make documentation one of the highest-ROI investments available. It is chronically underfunded because it is not glamorous. Treat it like infrastructure, not overhead.


Further Reading

Forsgren, Humble, Kim. Accelerate (2018) — The primary source. Read it, especially the methodology appendix that most people skip. You cannot cite DORA credibly without reading the original. ~200 pages, accessible to non-academics. itrevolution.com

Keunwoo Lee. “A Review of Accelerate” — The most thorough published methodological critique in existence. Essential reading if you are making organizational decisions based on DORA data. Lee is not hostile to the findings — he is precise about what they do and do not support. keunwoo.com

DORA Metrics History Page — Official timeline of how the metrics evolved — what was added, removed, and why. Proof that DORA is not a static framework, for anyone who thinks “four metrics, four tiers” is the complete picture. dora.dev

ScopeCone. “Engineering Metrics: A Pragmatic Evidence Review” — Comparative evidence review of DORA, SPACE, and DevEx with strength ratings. Identifies the strongest validated finding across all frameworks: code quality drives throughput. scopecone.io

2025 DORA Report: State of AI Assisted Software Development — The current state of the framework. The archetype model, the AI amplification findings, the seven capabilities. Understand where DORA is going before implementing where it was. cloud.google.com