The Overfitting Epidemic: Why Most Quant Research Is Wrong
Volatility targeting, the subject of the previous analysis, works because it relies on a robust, simple relationship: recent volatility predicts near-term volatility. The signal is persistent, present across decades and asset classes, and grounded in a clear mechanism. Most quantitative strategies fail because they rely on complex, fragile relationships that exist only in the backtest.
The Publication Problem
McLean and Pontiff (2016), in “Does Academic Research Destroy Stock Return Predictability?” published in the Journal of Finance, conducted a systematic examination of 97 published return-predictability signals. Their method was straightforward: take the anomalies as published, verify the in-sample performance, then test what happened to the same signals after the papers were published.
The results were unambiguous. In the out-of-sample window, the period after the original sample ended but before the paper was published, anomaly portfolio returns were on average 26% lower than in-sample. In the post-publication period, they were 58% lower.
Read the first figure precisely: 26% out-of-sample decay, before publication could have alerted a single trader. The signals worked in the data the researchers analyzed. They worked noticeably less well in data the researchers had not analyzed. This is the signature of overfitting: not a statistical abstraction, but a quantified, systematic force operating across the published academic literature.
McLean and Pontiff’s interpretation was careful. The out-of-sample decline cannot be explained by investors reading the paper, so they treat the 26% as an upper-bound estimate of the effect of data mining. The additional 32 percentage points of decay after publication (58% minus 26%) they attribute to publication-informed trading: sophisticated investors reading the papers and trading away the anomaly. The original results were partly discovered in noise, and what was real was partly arbitraged away.
The Factory Behind the False Discoveries
Harvey, Liu, and Zhu (2016), in “…and the Cross-Section of Expected Returns” published in the Review of Financial Studies, catalogued more than 300 factors that had been proposed in the academic literature as predictors of stock returns. Their conclusion: given the number of tests conducted across the literature, most of the factors with modest t-statistics are likely false discoveries.
The logic is exact. Suppose you are testing whether a coin is fair. If you flip it 20 times and get 12 heads, the result is not statistically significant — you cannot reject fairness at standard thresholds. But if 100 different researchers each flip a different coin 20 times and report only the ones that show significant results, you will see several significant results in the published literature by pure chance. The coins are all fair. The researchers are all honest. The literature is still wrong.
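The coin example can be checked with exact binomial arithmetic. The sketch below assumes a two-sided test at the 5% level and 100 independent researchers, each with a fair coin. The function and variable names are illustrative and not taken from any cited paper.

```python
from math import comb

def two_sided_p_value(heads: int, flips: int) -> float:
    """Exact two-sided p-value under a fair coin: the probability of a
    head count at least as far from flips/2 as the observed one."""
    distance = abs(heads - flips / 2)
    extreme = sum(comb(flips, k) for k in range(flips + 1)
                  if abs(k - flips / 2) >= distance)
    return extreme / 2 ** flips

FLIPS = 20
RESEARCHERS = 100  # assumption: independent researchers, all coins fair

# 12 heads out of 20 is unremarkable.
p_12 = two_sided_p_value(12, FLIPS)

# Probability that one fair coin produces a "significant" result (p < 0.05).
p_significant = sum(comb(FLIPS, k) for k in range(FLIPS + 1)
                    if two_sided_p_value(k, FLIPS) < 0.05) / 2 ** FLIPS

expected_hits = RESEARCHERS * p_significant
p_at_least_one = 1 - (1 - p_significant) ** RESEARCHERS

print(f"p-value for 12/20 heads:                {p_12:.3f}")
print(f"chance one fair coin looks significant: {p_significant:.4f}")
print(f"expected significant coins out of {RESEARCHERS}:  {expected_hits:.1f}")
print(f"chance at least one looks significant:  {p_at_least_one:.3f}")
```

Under these assumptions each fair coin has roughly a 4% chance of looking significant, so about four of the 100 researchers have something to publish, and it is close to certain that at least one does.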
This is the multiple testing problem applied to financial research. If 300 factors have been tested across the academic literature, some subset will appear significant at p < 0.05 purely by chance, with no true underlying relationship. The conventional significance threshold of p < 0.05 was calibrated for a world where a single hypothesis is being tested. In a world where hundreds of hypotheses are being tested, the bar must be raised substantially: a much smaller p-value cutoff or, equivalently, a higher required t-statistic.
Harvey et al. (2016) proposed that a published factor should require a t-statistic of approximately 3.0, not the traditional 2.0, to clear the multiple testing bar given the extent of data mining in the cross-section of expected returns literature. At the adjusted threshold, many of the published factors fail to clear it. Their estimate: the majority of the 300+ factors documented in the literature are likely to be false discoveries, relationships present in specific historical samples that do not reflect any true underlying causal mechanism.
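To see what moving the hurdle from 2.0 to 3.0 does, the sketch below converts t-statistics to two-sided p-values under a normal approximation and counts the false positives expected if all 300 factors were pure noise. The Bonferroni hurdle at the end is included only as the simplest possible correction. It is an assumption for illustration and not necessarily the adjustment behind the 3.0 figure.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normal approximation to the t-distribution

def two_sided_p(t_stat: float) -> float:
    """Two-sided p-value for a given t-statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(t_stat)))

def t_hurdle(alpha: float) -> float:
    """t-statistic needed for a two-sided test at level alpha."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)

N_FACTORS = 300  # factor count quoted in the text
ALPHA = 0.05

p_at_2 = two_sided_p(2.0)  # about 0.0455
p_at_3 = two_sided_p(3.0)  # about 0.0027

# Expected false discoveries if every one of the 300 factors were noise.
expected_false_at_2 = N_FACTORS * p_at_2
expected_false_at_3 = N_FACTORS * p_at_3

# Bonferroni: divide alpha by the number of tests, then convert to a t-hurdle.
bonferroni_hurdle = t_hurdle(ALPHA / N_FACTORS)

print(f"p at t=2.0: {p_at_2:.4f}, expected false positives: {expected_false_at_2:.1f}")
print(f"p at t=3.0: {p_at_3:.4f}, expected false positives: {expected_false_at_3:.1f}")
print(f"Bonferroni t-hurdle for {N_FACTORS} tests: {bonferroni_hurdle:.2f}")
```

At a hurdle of 2.0, a literature of 300 pure-noise factors would be expected to produce more than a dozen "discoveries". At 3.0 it would be expected to produce fewer than one.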
The Probability of Backtest Overfitting
Lopez de Prado (2018), in “Advances in Financial Machine Learning,” formalized the framework for measuring the probability of backtest overfitting directly. His central measure, the Probability of Backtest Overfitting (PBO), quantifies the likelihood that a strategy selected as the best of a set of candidates on in-sample performance will underperform the median candidate out-of-sample.
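A PBO estimate can be sketched in the spirit of combinatorially symmetric cross-validation, the procedure usually associated with the measure. The version below is a simplified illustration, not the author's implementation. It assumes "underperform" means ranking at or below the median candidate out-of-sample, uses the per-period Sharpe ratio as the selection metric, and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def sharpe(returns: np.ndarray) -> np.ndarray:
    """Per-period Sharpe ratio of each column (one column per strategy)."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)

def probability_of_backtest_overfitting(returns: np.ndarray,
                                        n_blocks: int = 8) -> float:
    """returns: T x N matrix (rows = periods, columns = candidate strategies).

    Split the rows into n_blocks contiguous blocks. For every way of using
    half the blocks as in-sample data, pick the in-sample winner and record
    whether it ranks in the bottom half on the remaining blocks.
    """
    n_strategies = returns.shape[1]
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    overfit_flags = []
    for in_ids in combinations(range(n_blocks), n_blocks // 2):
        out_ids = [b for b in range(n_blocks) if b not in in_ids]
        in_rows = np.concatenate([blocks[b] for b in in_ids])
        out_rows = np.concatenate([blocks[b] for b in out_ids])
        best = np.argmax(sharpe(returns[in_rows]))
        out_sharpe = sharpe(returns[out_rows])
        # Relative out-of-sample rank of the in-sample winner, in (0, 1).
        rank = (out_sharpe < out_sharpe[best]).sum() + 1
        omega = rank / (n_strategies + 1)
        logit = np.log(omega / (1 - omega))
        overfit_flags.append(logit <= 0)  # at or below the median
    return float(np.mean(overfit_flags))

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(240, 50))  # 50 strategies with no edge
pbo_noise = probability_of_backtest_overfitting(noise)

with_edge = noise.copy()
with_edge[:, 0] += 0.05  # one strategy with an unrealistically large edge
pbo_edge = probability_of_backtest_overfitting(with_edge)

print(f"PBO, pure noise:       {pbo_noise:.2f}")
print(f"PBO, one real strategy: {pbo_edge:.2f}")
```

When one candidate has a large genuine edge, it wins in-sample and out-of-sample on every split and the estimate is zero. When every candidate is noise, the in-sample winner has no reason to rank well out-of-sample, so a large share of splits count as overfit.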
The mathematics are direct. If a researcher tests $N$ strategies on $T$ observations, the expected number of spuriously good-performing strategies grows with $N$ and shrinks with $T$. A researcher who tests 1,000 strategies on 10 years of monthly data (120 observations) and selects the best performer will select a spuriously overfit strategy with high probability — because there are enough degrees of freedom to fit noise. A researcher who tests 5 strategies on 90 years of data (1,080 observations) will select an overfit strategy with much lower probability.
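The two cases can be compared by simulation. The sketch below generates strategies whose returns are pure noise, so every true Sharpe ratio is zero, and records the annualized Sharpe ratio of the best one. The monthly volatility, trial count, and function name are assumptions chosen for illustration.

```python
import numpy as np

def best_noise_sharpe(n_strategies: int, n_months: int,
                      n_trials: int = 200, seed: int = 0) -> float:
    """Average annualized Sharpe ratio of the best of n_strategies
    pure-noise strategies (true Sharpe = 0), over n_trials repetitions."""
    rng = np.random.default_rng(seed)
    best = np.empty(n_trials)
    for i in range(n_trials):
        # Monthly returns with zero mean and an assumed 4% volatility.
        r = rng.normal(0.0, 0.04, size=(n_months, n_strategies))
        monthly_sharpe = r.mean(axis=0) / r.std(axis=0, ddof=1)
        best[i] = monthly_sharpe.max() * np.sqrt(12)
    return best.mean()

many_short = best_noise_sharpe(n_strategies=1000, n_months=120)
few_long = best_noise_sharpe(n_strategies=5, n_months=1080)

print(f"best of 1,000 on 120 months: Sharpe {many_short:.2f}")
print(f"best of 5 on 1,080 months:   Sharpe {few_long:.2f}")
```

Both numbers are entirely spurious, since no strategy has any edge. The first is several times larger than the second because the winner is drawn from many more candidates, each measured over far fewer observations.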
Lopez de Prado’s empirical finding: for realistic financial research processes — where researchers iterate through many parameter choices, signals, lookback periods, and implementations — the number of effectively tested strategies is orders of magnitude larger than the number reported. A paper that reports testing three versions of a strategy has typically gone through dozens or hundreds of variations before arriving at the specification shown. Each unreported iteration contributes to the effective multiple-testing count even when it is not disclosed.
The result is that the reported backtest performance is almost always optimistic. The degree of optimism depends on the number of implicit tests conducted and the length of the data used. For strategies tested on recent decades of data — which most published quantitative work uses — the expected out-of-sample Sharpe ratio is substantially lower than the in-sample estimate. In many cases, the true expected out-of-sample return is near zero.
The False Positive Factory: A Precise Illustration
Bailey, Borwein, Lopez de Prado, and Zhu (2014), in “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance,” provided the clearest illustration of the false positive problem in a financial context.
Their core illustration: if you test 1,000 strategies on the same data at a significance level of p < 0.05, and none of them has any real predictive power, you still expect approximately 50 false positives: strategies that appear to work but are driven purely by chance. If you then report only those 50 best performers, which is implicitly what a selection process does, you have produced what looks like compelling evidence for 50 strategies, none of which has any actual predictive relationship with future returns.
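The arithmetic is easy to reproduce. The sketch below simulates 1,000 strategies with zero true mean return, counts how many pass a two-sided test at the 5% level, and then checks how those "discoveries" fare on fresh data. The sample length, volatility, and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
N_STRATEGIES, N_OBS = 1000, 250
T_CUTOFF = 1.96  # two-sided 5% level, normal approximation

def t_stats(returns: np.ndarray) -> np.ndarray:
    """t-statistic of the mean return of each column (strategy)."""
    std_err = returns.std(axis=0, ddof=1) / np.sqrt(returns.shape[0])
    return returns.mean(axis=0) / std_err

# Every strategy is pure noise, both in the backtest and afterwards.
in_sample = rng.normal(0.0, 0.01, size=(N_OBS, N_STRATEGIES))
out_sample = rng.normal(0.0, 0.01, size=(N_OBS, N_STRATEGIES))

t_in = t_stats(in_sample)
significant = np.abs(t_in) > T_CUTOFF
n_false_positives = int(significant.sum())

# Trade each "discovery" in the direction that looked profitable in-sample.
direction = np.sign(t_in[significant])
mean_t_in = float(np.abs(t_in[significant]).mean())
mean_t_out = float((direction * t_stats(out_sample)[significant]).mean())

print(f"'significant' strategies: {n_false_positives} of {N_STRATEGIES}")
print(f"their mean t-stat, in-sample:     {mean_t_in:.2f}")
print(f"their mean t-stat, out-of-sample: {mean_t_out:.2f}")
```

The count lands near 50, the selected strategies all look convincing in-sample by construction, and their average t-statistic on the fresh data is close to zero.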
The authors coined the term “pseudo-mathematics” for the use of rigorous-looking statistical methods to support conclusions that the statistical methods, properly applied, do not support. The appearance of mathematical rigor — t-statistics, p-values, Sharpe ratios, maximum drawdown statistics — creates confidence in results that are actually artifacts of data mining.
Bailey et al. (2014) identified the structural conditions that make the false positive problem most severe:
- Short data samples relative to the complexity of the strategy being tested
- Many unreported iterations on a single dataset
- Strategy selection based on Sharpe ratio or similar in-sample metrics
- No adjustment to significance thresholds for the effective number of tests conducted
- No hold-out sample preserved for truly out-of-sample testing
These are not exotic edge cases. They describe the standard research process used by most practitioners and most academics publishing in quantitative finance.
Why Recent Data Makes Everything Worse
The data selection problem compounds the multiple testing problem in a specific way that deserves emphasis.
Most publicly accessible financial data with reasonable quality and breadth begins in the 1990s or early 2000s. A large share of any such sample, and therefore the majority of quantitative strategy research, is dominated by a single extended regime: the bull market from 2009 to 2021, bookended by the GFC and the rate tightening cycle that began in 2022.
From 2009 to 2021, U.S. equities delivered exceptional returns in a low-volatility, low-inflation, declining-rate environment. Strategies that worked during this period benefited from regime-specific tailwinds that have no reason to persist. A strategy optimized on this data may have been selecting for exposure to factors that happened to be rewarded in this specific regime — not for factors that will be rewarded in future regimes.
The 2022 environment stress-tested this directly. Many systematic strategies that had delivered impressive backtests over 2010–2021 underperformed significantly in 2022 because they were implicitly long duration, long valuation-expansion, and short volatility — all of which worked in the 2010–2021 regime and failed together in 2022 when the regime shifted. The strategies were not wrong by bad luck. They were wrong because they were developed and selected on data that did not include the regime in which they were then deployed.
The minimum data requirement for any serious strategy evaluation should include multiple full economic cycles, multiple interest rate regimes, and at minimum the major market dislocations of the 20th century. A strategy that delivers positive risk-adjusted returns across the Great Depression, the post-war expansion, the stagflation of the 1970s, the disinflationary expansion of the 1980s and 1990s, the dot-com cycle, the global financial crisis, and the post-crisis ZIRP era has provided evidence of robustness that no amount of optimization on post-2009 data can substitute for.
What Out-of-Sample Actually Means
The term “out-of-sample” is routinely misused in financial research, and the misuse matters.
Genuine out-of-sample testing requires that the data used for testing has not influenced any decision in the strategy development process — not the signal selection, not the parameter tuning, not the model architecture, not the data cleaning rules. If any part of the strategy was developed with awareness of the test period’s outcomes, the test is not truly out-of-sample. It is contaminated.
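In code, the minimum discipline looks like the sketch below: every choice is made on a development window, and the hold-out window is evaluated once, with the chosen parameter fixed. The trading rule, the 70/30 split, the candidate lookbacks, and the placeholder noise data are all hypothetical and chosen only to make the example self-contained.

```python
import numpy as np

def momentum_returns(returns: np.ndarray, lookback: int) -> np.ndarray:
    """Toy rule: hold the asset in period t only if the mean return over
    the previous `lookback` periods (excluding t) was positive."""
    signal = np.zeros(len(returns))
    for t in range(lookback, len(returns)):
        signal[t] = 1.0 if returns[t - lookback:t].mean() > 0 else 0.0
    return signal * returns

def sharpe(r: np.ndarray) -> float:
    """Per-period Sharpe ratio, defined as 0 when returns never vary."""
    s = r.std(ddof=1)
    return 0.0 if s == 0 else float(r.mean() / s)

rng = np.random.default_rng(7)
returns = rng.normal(0.0, 0.01, size=1200)  # placeholder data: pure noise

# Chronological split: the hold-out is the most recent 30% of the sample.
split = int(len(returns) * 0.7)
development, holdout = returns[:split], returns[split:]

# All tuning happens on the development window only.
candidates = [3, 6, 12, 24]
dev_scores = {lb: sharpe(momentum_returns(development, lb)) for lb in candidates}
chosen = max(dev_scores, key=dev_scores.get)

# The hold-out is touched exactly once, with the parameter already fixed.
holdout_score = sharpe(momentum_returns(holdout, chosen))

print(f"chosen lookback: {chosen}, development Sharpe: {dev_scores[chosen]:.3f}")
print(f"hold-out Sharpe: {holdout_score:.3f}")
```

The split only protects against contamination that passes through the code. If the hold-out result prompts another round of tuning, or if the researcher already knows what happened in that window, the second look is no longer out-of-sample.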
In practice, this standard is almost never met. Researchers who work with data over a specific period inevitably develop intuitions about patterns in that data, even when they formally reserve a holdout sample. The knowledge that momentum worked during one period and failed during another — even if that knowledge does not directly influence the backtest — shapes what signals the researcher thinks are worth investigating, which models they build, and which results they choose to publish.
The only test that is genuinely out-of-sample in a meaningful sense is live forward testing: deploying a strategy with real capital on data that was not in existence when the strategy was built. This test is expensive in time and capital, which is why it is rarely conducted. But it is the only test that provides clean evidence. Every other form of “out-of-sample” evaluation is a compromise that retains some degree of contamination.
The implication is that the correct prior toward any backtested financial strategy should be skepticism, not credulity. A backtest that shows impressive results should be treated as a hypothesis, not as evidence. The hypothesis is: “this strategy works.” The evidence needed to accept that hypothesis is out-of-sample performance on data the strategy has never seen. Until that evidence is accumulated, the backtest is an argument for why the strategy might work — which is entirely different from evidence that it does.
The Signal in the Noise
None of this means that quantitative research is worthless. Some signals are robust, theoretically grounded, and persistent across genuinely independent samples. Momentum, value, and carry have decades of out-of-sample evidence across multiple independent data sources and geographies. They have theoretical accounts that explain why they should persist. They have been scrutinized by thousands of researchers and have survived the scrutiny in at least partial form.
The robustness criteria that separate real signals from overfitted artifacts are clear in the literature: the signal should be economically intuitive, present across multiple asset classes, persistent across different time periods including those not used in discovery, survive transaction cost adjustments, and remain significant after applying appropriate multiple-testing corrections. Strategies that meet all of these criteria exist — they are just far fewer than the published literature would suggest.
The practical implication for any investor evaluating a quantitative strategy — whether from an academic paper, a manager pitch, or their own research — is to demand answers to the following questions: What is the out-of-sample track record, and how cleanly out-of-sample is it? How many strategies were implicitly tested before this one was selected? What is the theoretical mechanism, and why should it persist? Does the signal hold across different geographies, asset classes, and historical periods — including periods that predate modern financial markets?
A backtest is a hypothesis. Only out-of-sample performance on truly unseen data matters.
What Follows
Understanding overfitting clarifies one dimension of what makes systematic investing hard — the research and validation challenge. But even strategies that survive rigorous out-of-sample validation face a second category of challenge that is less discussed: the costs of actually implementing them.
The next analysis examines the costs investors obsess over — and the far larger costs they completely ignore.
References:
- McLean and Pontiff (2016), “Does Academic Research Destroy Stock Return Predictability?”, Journal of Finance, Vol. 71, No. 1.
- Harvey, Liu, and Zhu (2016), “…and the Cross-Section of Expected Returns,” Review of Financial Studies, Vol. 29, No. 1.
- Bailey, Borwein, Lopez de Prado, and Zhu (2014), “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance,” Notices of the American Mathematical Society, Vol. 61, No. 5.
- Lopez de Prado (2018), “Advances in Financial Machine Learning,” Wiley.