Backtesting strategies based on multiple signals — Beware of Overfitting Bias!


June 28, 2016 Research Insights

One of the dangers of being a quantitative investor is that when you see patterns in historical data you might wrongly assume they will repeat. Put another way, you might believe an effect is driven by a genuine relationship, when in reality the results are spurious and the result of luck. We wrote here about “anomaly chasing” and the risks of data mining in backtests.

A responsible researcher has tools to address this risk. The “gold standard,” of course, is to review out-of-sample results. In the absence of out-of-sample data, a skeptic might look for a high degree of statistical significance. Unfortunately, t-statistics are not a silver bullet, and sometimes should be taken with a grain of salt. This is especially true when you are optimizing on or weighting groups of signals that perform the best together, even though they may be completely unrelated, and/or the result of noise.

Dr. Denys Glushkov, a portfolio manager at Acadian Asset Management, former research director at Wharton Research Data Services, and a friend of the Alpha Architect blog, recently recommended that we read a provocative paper that addresses this issue: “Testing Strategies Based on Multiple Signals,” by Novy-Marx.

In his paper, Novy-Marx argues that many multi-signal strategies suffer from overfitting bias. Multi-signal strategies often rely on a composite measure that combines multiple signals, and Novy-Marx cites several famous examples of such composites.

Novy-Marx distinguishes overfitting bias from selection bias, which occurs when a researcher examines a variety of signals and reports only the best performer, failing to account for the extent of the search conducted. Under selection bias, the sample of signals reported may be non-random, and the results may therefore require a correction.

Here is the intuition on how “pure” overfitting and selection biases operate:

  1. Pure Selection Bias: A researcher considers many signals, and only reports the best performing signal. The researcher may not report the number of signals tested or the manner in which data were chosen for study, either of which might impart a bias to the results. (see the Harvey, Liu, and Zhu paper for more details)
  2. Pure Overfitting Bias: A researcher uses all the signals considered, but optimizes the combination of those signals. Here, the combination is “overfit” to the data examined, and therefore may have no predictive power. Additionally, Novy-Marx observes that researchers may sign the signals (i.e., assign them either a positive or negative directional sign) and/or weight them to predict returns.

Selection bias can magnify the effects of overfitting in exponential fashion, consistent with a power law. Novy-Marx measures the interaction and magnitudes of the two biases with an experiment.

How do Researchers Test Ideas?

In practice, a research process might involve a researcher considering n signals and combining the best k of these to select stocks.

If k=1, this may result in pure selection bias.

If k=n, this may result in pure overfitting bias.

If 1 < k < n, the result may be a combination of selection and overfitting.

How Severe are Overfitting and Selection Biases?

If you look at a large sample of signals, some combinations of the signals will perform strongly, just by dumb luck. Novy-Marx makes an amusing analogy: Consider a group of monkeys throwing darts at the stock pages of the Wall Street Journal. Some monkeys will have great performance, and selecting a combination of these monkeys in-sample will yield spectacular performance. Of course, this group will not perform out of sample.

Using real stock data from January 1995 through December 2014, Novy-Marx generates random signals, none of which has any power to predict returns on its own.

The figure below shows the backtested t-statistics for strategies based on combinations of between 2 and 10 signals:

[Figure: backtested t-statistics for multi-signal strategies built from random signals]

What do you notice?

  • All of these signals have t-value < 2.0 in their individual backtests. So, none of the underlying signals are themselves significant, or have any true predictive power.*
  • The bottom yellow dotted line shows the single best-performing signal selected from the n random candidate signals, representing pure selection bias.
  • The solid red line equal weights all of the n signals, but signs them (positively or negatively) to predict the strongest in-sample returns, representing pure overfitting.
  • The top blue dotted line uses all of the n signals, signs them, and also weights them to maximize in-sample returns, representing an even more extreme form of pure overfitting.
  • The figure shows that the combined signals are highly significant, with t-values rising as the number of signals considered increases.

*t-value measures statistical significance of a t-test. Generally, if a t-value is greater than 2, the estimate is statistically significant at the 5% level. The higher the t-value, the greater the confidence we have in the coefficient as a predictor. Here’s a table of critical values for the t-distribution if you are interested.

Note how in the upper two lines, when the researchers give themselves additional discretion to weight the signals, the results become even more extreme. Talk about overfitting!
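To make the intuition concrete, here is a rough Monte Carlo sketch of the signed, equal-weighted case. This is our own simplified illustration, with made-up dimensions and simple decile-sorted portfolios, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_periods, n_stocks, n_signals = 240, 500, 10   # hypothetical dimensions

# Random returns and random signals -- by construction, nothing is predictable.
returns = rng.normal(0.0, 0.05, size=(n_periods, n_stocks))
signals = rng.normal(size=(n_periods, n_stocks, n_signals))

def long_short_returns(signal, rets):
    """Per-period return of a top-decile-minus-bottom-decile portfolio sorted on `signal`."""
    out = []
    for t in range(rets.shape[0]):
        lo, hi = np.quantile(signal[t], [0.1, 0.9])
        out.append(rets[t, signal[t] >= hi].mean() - rets[t, signal[t] <= lo].mean())
    return np.array(out)

def t_stat(x):
    return x.mean() / (x.std(ddof=1) / np.sqrt(len(x)))

# Individual random signals: t-stats scatter around zero, rarely beyond 2.
individual_t = np.array([t_stat(long_short_returns(signals[:, :, j], returns))
                         for j in range(n_signals)])

# "Overfit" composite: flip each signal so it predicted positive returns
# in-sample, then equal-weight the signed signals into one composite score.
signs = np.sign(individual_t)                  # chosen with in-sample knowledge
composite = (signals * signs).mean(axis=2)
composite_t = t_stat(long_short_returns(composite, returns))

print("largest |t| among individual signals:", round(float(np.abs(individual_t).max()), 2))
print("t-stat of signed, equal-weighted composite:", round(float(composite_t), 2))
```

Even though every input is pure noise, signing each signal with in-sample knowledge and averaging them tends to push the composite's backtested t-statistic well above those of the individual signals, which is exactly the effect in the figure.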

Selection Bias Equivalence

To demonstrate how selection and overfitting biases can interact with exponential effects, Novy-Marx reports a “selection bias equivalence”: the number of candidate signals a pure selection-bias search would have to consider to produce the same bias as a given multi-signal combination.

The Table below shows single-signal equivalent sample sizes.

From the paper:

Panel A, which shows the case when multiple signals are equal weighted, shows that using just the best three signals from 20 candidates yields a bias as bad as if the investigator used the single best performing signal from 1,780 candidates. With five signals selected from 40 candidates, the bias is almost as bad as if the investigator had used the single best performing signal from half a million candidates.

[Table: single-signal equivalent sample sizes (from the paper)]

Note how in Panel B the selection bias equivalent numbers can get completely crazy. Novy-Marx derives a power law from this relationship. From the paper: “The bias resulting from combining the best k signals from a set of n candidates is almost as bad as that from using the single best signal out of n^k candidates.”
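For a sense of how quickly that power law explodes, the snippet below simply evaluates n^k for a few (n, k) pairs. Treat these as rough upper bounds on the equivalence, since the paper says the bias is "almost as bad as" picking the single best of n^k candidates, and the equal-weighted equivalences in Panel A (e.g., 1,780 for the best 3 of 20) are smaller than n^k itself:

```python
# Rough upper bounds implied by the n**k power law quoted above.
for n, k in [(20, 1), (20, 3), (40, 5), (100, 10)]:
    print(f"best {k} of {n} candidates ~ single best of {n**k:,} candidates")
```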

What Are the Implications?

All of this should serve to make you suspicious of any multi-signal strategy you may come across in the market. Why? Because the statistical bar gets very high very fast. Consider a strategy that uses 20 signals. Do you think the people who developed the strategy examined more than 20 signals before coming up with it? You bet they did.

Yet we cannot use conventional statistical tests to evaluate such multi-signal strategies. A great backtest with a strong t-statistic is not really telling you very much.

Novy-Marx suggests some advanced statistical methods to address this issue, but also suggests some simplifying rules of thumb. At the very least, you can evaluate multi-signal strategies based on each individual signal, controlling for the fact that the researchers may have investigated many signals.

Calibrating Statistical Significance Based on Sample Size

Say you want to test n signals, and determine the best performer with statistical significance at the 5% level.

If you examine 10 signals, there is a 1-(1-0.05)^10 ≈ 40% chance of observing a significant result purely by chance, even if none of the signals has any true predictive power. When you examine 20 signals, there is a 1-(1-0.05)^20 ≈ 64% chance. As n increases and you examine more and more potential signals, the chance of finding one that is significant by luck alone keeps climbing.

We need a way to adjust our measure of “statistical significance” to account for this. Novy-Marx suggests the Bonferroni correction, which imposes a higher standard when we are looking at a large number of signals (n). Novy-Marx observes that, “If one suspects that the observer considered 10 strategies, significance at the 5% level requires that the results appear significant, using standard tests, at the 0.5% level.”
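Here is a minimal sketch of both calculations above. The critical t-values use a normal approximation to the t-distribution, which is our own simplification rather than anything from the paper:

```python
from scipy import stats

alpha = 0.05
for n in (10, 20, 50):
    p_lucky = 1 - (1 - alpha) ** n       # chance of at least one false "discovery"
    bonferroni_alpha = alpha / n         # per-test level under the Bonferroni correction
    # two-sided critical value, normal approximation to the t-distribution
    t_crit = stats.norm.ppf(1 - bonferroni_alpha / 2)
    print(f"n={n:3d}: P(>=1 lucky result)={p_lucky:.0%}, "
          f"per-test alpha={bonferroni_alpha:.3%}, critical |t| ~ {t_crit:.2f}")
```

With 10 candidate signals the per-test threshold falls to 0.5% (a critical t around 2.8), and it only gets more demanding as the search widens.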

Conclusion

Novy-Marx takes pains to point out that he is not suggesting that investors should not use multi-signal strategies. Good standalone signals can be combined in effective ways to make viable composite strategies. He does want to highlight, however, that a great backtest of a combination of signals does not tell you whether the strategy will do well out of sample, or whether the individual signals embedded in the multi-signal strategy are any good. And it may be reasonable to conclude from reading this paper that a multi-signal strategy with a lot of signals should be critically evaluated in this light.

Stepping back, thinking about out of sample performance gets even hairier. Consider a situation where we correctly identify a genuine signal in the data using the latest whiz-bang statistical techniques, adjusted for selection bias and overfitting. Does this automatically imply the signals will work out of sample? Unlikely. Markets are highly dynamic and competitive. What if others find out about the signal? Will it be arbitraged away? We recommend all investors answer two basic questions when analyzing a process, be it quantitative or qualitative, and attempt to assess out of sample expected performance:

  1. What is the edge and/or risk that drives expected returns?
  2. Why isn’t everyone doing it?

Answering these two questions will help an investor assess the out-of-sample sustainability of a process, regardless of whether it is data-mined or not. This high-level framework is discussed in detail in our sustainable active investing framework.


Backtesting Strategies Based on Multiple Signals

Abstract:

Strategies selected by combining multiple signals suffer severe overfitting biases, because underlying signals are typically signed such that each predicts positive in-sample returns. “Highly significant” backtested performance is easy to generate by selecting stocks on the basis of combinations of randomly generated signals, which by construction have no true power. This paper analyzes t-statistic distributions for multi-signal strategies, both empirically and theoretically, to determine appropriate critical values, which can be several times standard levels. Overfitting bias also severely exacerbates the multiple testing bias that arises when investigators consider more results than they present. Combining the best k out of n candidate signals yields a bias almost as large as those obtained by selecting the single best of n^k candidate signals.










About the Author

David Foulke

Mr. Foulke is currently an owner/manager at Tradingfront, Inc., a white-label robo advisor platform. Previously he was a Managing Member of Alpha Architect, a quantitative asset manager. Prior to joining Alpha Architect, he was a Senior Vice President at Pardee Resources Company, a manager of natural resource assets, including investments in mineral rights, timber and renewables. He has also worked in investment banking and capital markets roles within the financial services industry, including at Houlihan Lokey, GE Capital, and Burnham Financial. He also founded two technology companies: E-lingo.com, an internet-based provider of automated translation services, and Stonelocator.com, an online wholesaler of stone and tile. Mr. Foulke received an M.B.A. from The Wharton School of the University of Pennsylvania, and an A.B. from Dartmouth College.


  • IlyaKipnis

    Good post. I’m not sure about the Bonferroni correction, however, since a lot of tests are highly correlated. IMO, Marcos Lopez De Prado’s deflated Sharpe ratio paper is a bit better.

  • I think you deflate the Sharpe ratio when trials are independent. If tests are correlated, why deflate the Sharpe ratio? Regardless, these are pseudo-scientific claims that use non-parametric analysis and ignore the underlying strategy logic. Maybe they are good tests for some random strategies developed by ML algos, but most traders use strategies that have some logic relating to market behavior and microstructure.

  • IlyaKipnis

    Not necessarily. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2460551 E.g., if you’re doing robustness testing on, say, a moving average crossover, well, an SMA 50/200 isn’t going to be much different from an SMA 51/200.

    It’s to punish methodologies that data-mine the best performance.

  • Page 7 of the paper: “More formally, consider a set of N independent backtests or track records associated with a…”

    Are SMA 50/200 and SMA 51/200 independent trials?

  • IlyaKipnis

    Ah, fair enough.

  • Hi David,

    Nice summary of the article. This is great advice you offer:

    What is the edge and/or risk that drives expected returns?

    Why isn’t everyone doing it?

    As for the article, it is one of the best I have read on this subject. However, the critical t-statistic values are based on a specific universe during a specific period of time. Although that demonstrates the effects of overfitting and selection bias, it is not a general result.

    Furthermore, if someone intends to use Bonferroni, they had better change business. The Type II error rate is extremely high, and nearly all strategies found will be rejected. There is a difference between the way academics see this and the way practitioners do. Let us not forget that academics lag one step behind practice in all fields.

  • Hey Michael,

    Agree…Backtesting can help us identify premiums, but an understanding of market participants’ incentives/fears will help us ascertain whether the premiums are real.

    Not sure it is a fair statement to suggest that academics lag behind practice–depends on your sample. When it comes to finance I’d say there are always a select few who are ahead of the Ivory Tower, but the vast majority of practitioners are a few iterations behind…primarily because they have limited mindshare they can spend on R&D and more is needed for operations/compliance/selling/etc.

  • I agree it depends on the sample. I meant the successful ones. Academia has done a great job recently in alerting traders about the risks of overfitting and selection bias. The select few keep the knowledge secret because their profits come from lack of knowledge and skill of counterparties.

    BTW, what is your opinion about the issue Ilya and I discussed below?

    Are SMA 50/200 and SMA 51/200 independent trials?