The 100-Year Backtest Manifesto: Why Most Quant Strategies Die After Three Years

Research April 11, 2026 8 min read

The Thesis

The dominant failure mode of systematic trading models is not alpha decay. It is not execution slippage. It is not counterparty risk, capacity exhaustion, or any of the other threats you read about in risk management textbooks. It is something simpler and more embarrassing: most systematic strategies are backtested on three to five years of data, trained on the most homogeneous regime in modern market history, and deployed with the confidence of models that have never seen a regime they cannot recognize.

The period from 2009 through early 2022 is arguably the most anomalous market environment in documented history. It combined near-zero interest rates, compressed volatility, a nearly unbroken bull market in equities, subdued inflation, and the most generous central bank liquidity posture on record. An ML model trained on this window has learned exactly one lesson: own risk assets, rebalance on autopilot, and the drawdowns will resolve themselves.

When 2022 arrived with simultaneous declines in stocks and bonds, a pattern largely absent from the preceding decades, the models that had been optimized on the previous decade did not adapt. They could not adapt. Their training data contained no example of the environment they were being asked to trade.

The Available Data

The most common justification for short backtest windows is that long historical data does not exist. This is false. The following datasets are free or published in free summary form, have been maintained by academic and public research institutions for decades, and are used routinely in peer-reviewed finance research:

Shiller’s long-run dataset provides monthly S&P Composite index values (the S&P 500 and its predecessor indices), ten-year Treasury yields, the consumer price index, dividends, and earnings going back to January 1871. That is one hundred and fifty-five years of monthly equity, bond, and inflation data, maintained by Robert Shiller at Yale.

The Kenneth French data library provides daily factor returns. The market, size, and value factors go back to July 1926, and momentum begins shortly afterward. The profitability and investment factors start later, in July 1963. For the core factors, that is one hundred years of daily systematic factor data, hosted at Dartmouth.

FRED, the database maintained by the Federal Reserve Bank of St. Louis, provides hundreds of thousands of economic time series. Many extend decades into the past. Industrial production, unemployment, jobless claims, consumer sentiment, credit spreads, yield curves, and financial conditions indices are all available through a public API that requires only a free API key.

The National Bureau of Economic Research publishes business cycle dating — the exact start and end months of every recession — going back to December 1854. That is more than one hundred and seventy years of regime labels, ready to use as a classification target.
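As a rough illustration of how business cycle dates become a classification target, the sketch below turns peak and trough months into a monthly 0/1 recession label. It is not an official NBER tool. The function names are hypothetical, only three cycles are listed, and the labeling convention (a contraction starts the month after the peak and runs through the trough month) is an assumption that should be checked against whatever convention the rest of the pipeline uses.

```python
# Three NBER peak/trough months as ((peak_year, peak_month),
# (trough_year, trough_month)). The full chronology has many more cycles;
# this subset is for illustration only.
NBER_CYCLES = [
    ((1929, 8), (1933, 3)),   # Great Depression contraction
    ((2007, 12), (2009, 6)),  # Great Financial Crisis
    ((2020, 2), (2020, 4)),   # COVID-19 recession
]

def month_index(year, month):
    """Map (year, month) to a single integer so months compare easily."""
    return year * 12 + (month - 1)

def recession_label(year, month, cycles=NBER_CYCLES):
    """Return 1 if the month falls inside a contraction, else 0.

    Assumed convention: the contraction begins the month AFTER the peak
    and runs through the trough month, inclusive.
    """
    m = month_index(year, month)
    for peak, trough in cycles:
        if month_index(*peak) < m <= month_index(*trough):
            return 1
    return 0

def label_range(start, end, cycles=NBER_CYCLES):
    """Label every month from start to end inclusive; keys are (year, month)."""
    labels = {}
    for m in range(month_index(*start), month_index(*end) + 1):
        year, month_zero_based = divmod(m, 12)
        month = month_zero_based + 1
        labels[(year, month)] = recession_label(year, month, cycles)
    return labels
```

Under this convention, labeling January 2007 through December 2009 marks the eighteen months from January 2008 through June 2009 as contraction.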

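The Dimson-Marsh-Staunton Global Investment Returns Yearbook documents equity, bond, and bill returns for twenty-three countries from 1900 through the present. That is one hundred and twenty-six years of cross-sectional international return data. The full dataset is licensed commercially, but a summary edition of the yearbook is published free each year.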

These datasets are free, or in the case of the Dimson-Marsh-Staunton yearbook, free in summary form. They do not require a Bloomberg terminal. They do not require institutional credentials. A determined researcher can download the free sources in under an hour. The reason most quant teams do not use them is not that they are unavailable. It is that setting up a historical data pipeline requires more work than exporting the last decade from whatever vendor feed the firm already pays for, and that work is rarely prioritized.

The Regime Test

Any strategy that claims to be robust must be validated against the regimes it will eventually face. At minimum, this means measuring risk-adjusted return in each of the following market epochs, evaluated independently:

The Great Depression and the deflationary crash of 1929 through 1933. The New Deal recovery and the premature tightening of 1937. The wartime market of 1939 through 1945, with its fixed rates and capital controls. The post-war boom of 1945 through 1966. The stagflation of 1966 through 1982. The Volcker disinflation of 1979 through 1982, with short rates approaching twenty percent. The secular bull of 1982 through 2000. Black Monday of October 1987. The Japan bubble and its subsequent fourteen-year unwind. The dot-com mania and its aftermath. The Great Financial Crisis of 2007 through 2009. The Eurozone sovereign debt episode. The zero-interest-rate regime of 2009 through 2022. The 2022 simultaneous stock-bond decline. And the current rate normalization cycle.

That is fifteen environments, several of which overlap in time. A strategy that works in ten of them has a defensible claim to regime robustness. A strategy that works only in the last one is indistinguishable from a strategy that was curve-fit on its own training set.
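The regime test can be sketched as a per-epoch scorecard. The code below is a minimal illustration, not a description of any production framework. It assumes monthly excess returns keyed by (year, month), uses an annualized mean-to-volatility ratio as the risk-adjusted measure, and treats the function names, the epoch boundaries passed in, and the pass threshold as hypothetical choices.

```python
from statistics import mean, stdev

def sharpe(returns, periods_per_year=12):
    """Annualized mean/volatility ratio of excess returns; None if undefined."""
    if len(returns) < 2:
        return None
    vol = stdev(returns)
    if vol == 0:
        return None
    return mean(returns) / vol * periods_per_year ** 0.5

def regime_scorecard(returns_by_month, epochs):
    """Score each epoch independently of the others.

    returns_by_month: dict mapping (year, month) -> excess return.
    epochs: dict mapping epoch name -> ((start_year, start_month),
            (end_year, end_month)), both ends inclusive.
    Returns a dict mapping epoch name -> ratio, or None if undefined.
    """
    ordered = sorted(returns_by_month.items())
    scores = {}
    for name, (start, end) in epochs.items():
        # Tuples compare lexicographically, so (year, month) ranges work.
        window = [r for ym, r in ordered if start <= ym <= end]
        scores[name] = sharpe(window)
    return scores

def passes(scores, min_epochs):
    """True if at least min_epochs epochs show a positive ratio."""
    positive = sum(1 for s in scores.values() if s is not None and s > 0)
    return positive >= min_epochs
```

With the fifteen epochs above and the ten-of-fifteen standard, the check would be `passes(regime_scorecard(returns, epochs), 10)`. Because epochs are scored separately, overlapping windows such as the stagflation and the Volcker disinflation each get their own entry.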

Why This Is Not Done

If the data is free and the methodology is understood, why does most quant research still train on five years and call it sufficient? Three reasons.

First, short windows produce better-looking backtest plots. A strategy trained and tested on 2015 through 2024 will show a cleaner equity curve than one tested across the Great Depression, because the underlying market environment was less variable. The clean plot is more persuasive in pitch meetings and internal reviews, even when it is less informative about forward risk.

Second, short windows permit higher model complexity without triggering obvious overfitting alarms. A model with twenty features and five hyperparameters will be exposed as overfit when it is tested across thirty years of data, but it can appear to generalize on seven. A shorter window contains fewer independent observations per tunable parameter, so researchers can fit it more closely, and that closeness of fit is frequently mistaken for signal.
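One way to see how little a short window can establish is the sampling error of the Sharpe ratio itself. The sketch below uses the standard large-sample approximation for independent, identically distributed returns, SE(SR) ≈ sqrt((1 + SR²/2) / n). Real returns are not i.i.d., so the result is a rough guide only, and the function name is hypothetical.

```python
import math

def sharpe_standard_error(sharpe_annual, years, periods_per_year=12):
    """Approximate standard error of an annualized Sharpe ratio.

    Applies SE(SR) ~ sqrt((1 + SR^2 / 2) / n) at the sampling frequency,
    then rescales to annual units. Assumes i.i.d. returns, which real
    markets violate, so treat the output as a rough guide.
    """
    n = years * periods_per_year
    sr_period = sharpe_annual / math.sqrt(periods_per_year)
    se_period = math.sqrt((1 + 0.5 * sr_period ** 2) / n)
    return se_period * math.sqrt(periods_per_year)

se_5y = sharpe_standard_error(1.0, years=5)
se_100y = sharpe_standard_error(1.0, years=100)
```

For a measured annual Sharpe ratio of 1.0 on monthly data, the approximation gives a standard error of roughly 0.46 over five years and roughly 0.10 over one hundred. On five years, a ratio of 1.0 sits only about two standard errors from zero, before any allowance for the number of model variants that were tried.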

Third, long historical data requires engineering effort that most quant teams deprioritize. Loading Shiller’s spreadsheet is trivial. The multi-week engineering project is everything that follows: integrating it into a unified pipeline alongside French’s factor data, FRED macro series, and NBER regime labels, aligning their frequencies, handling forward-fills correctly, and maintaining the pipeline as sources update. It is not the kind of work that produces a weekly dashboard for the PM to look at. So it does not get done.
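The forward-fill problem is easy to get subtly wrong, because a monthly figure is usually published some time after the month it describes. The sketch below shows one hedged approach: align a low-frequency series onto daily dates using only values that were already available on each day. The function name and the `publication_lag_days` parameter are hypothetical, and the right lag differs by series.

```python
from bisect import bisect_right
from datetime import date, timedelta

def align_asof(daily_dates, observations, publication_lag_days=0):
    """Forward-fill a low-frequency series onto daily dates without look-ahead.

    daily_dates: sorted list of datetime.date.
    observations: list of (reference_date, value), sorted by reference_date.
    publication_lag_days: assumed delay between the reference date and the
        day the value became publicly known.
    Returns one value per daily date (None before the first known value).
    """
    available = [ref + timedelta(days=publication_lag_days)
                 for ref, _ in observations]
    values = [v for _, v in observations]
    aligned = []
    for d in daily_dates:
        known = bisect_right(available, d)  # observations published by day d
        aligned.append(values[known - 1] if known > 0 else None)
    return aligned

days = [date(2020, 1, 31), date(2020, 2, 15), date(2020, 3, 5), date(2020, 3, 31)]
obs = [(date(2020, 1, 31), 1.0), (date(2020, 2, 29), 2.0)]
naive = align_asof(days, obs)                            # [1.0, 1.0, 2.0, 2.0]
lagged = align_asof(days, obs, publication_lag_days=30)  # [None, None, 1.0, 2.0]
```

The naive alignment lets a backtest trade on the January figure on January 31, before it could have been published. The lagged version withholds each value until its assumed release date, which is the behavior a long-history pipeline needs to enforce for every macro series.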

The Conclusion

A systematic strategy is only as robust as the worst environment it has been tested against. If the worst environment in its training data is 2015, the strategy’s actual robustness is approximately “optimized for 2015.” If the worst environment is 1929, the strategy has been stress-tested against conditions at least as severe as most of what it is likely to face in production.

The decision to use long historical data is not a technical one. It is a methodological commitment that protects the people whose capital is at stake. A strategy that cannot survive 1973 has no business managing money in 2026, because the 2026 version of 1973 is always closer than anyone thinks.

OVRWCH’s research and validation framework uses the longest available historical window for every asset class it tracks. No strategy enters production without demonstrating positive risk-adjusted return across the majority of the regime epochs documented above. Specific minimum window requirements are proprietary.

