“All models are wrong, but some are useful”
-Statistician George Box, 1976
This aphorism is particularly true of predictive models.
I’m not a fan of statistical theory. It tends to mold the world into a statistical model; to a hammer, everything is a nail. It’s the same reason I don’t go to a surgeon to diagnose an illness: instead of attributing the illness to a reversible condition that diet and exercise could cure, the surgeon will tell you what needs to be cut out. Surgeons don’t mean you any harm, but their span of awareness and control is narrow. That said, George Box has it right, and I want to assure you that we’re working hard behind the scenes to make our models more useful. This post details exactly what we’re doing.
We’re on the hunt for the holy grail of trade strategy
As traders on the hunt for the holy grail of trade strategy, one of the most important elements of that hunt is the reliability of our backtests and/or the reliability of the process we use to discover new strategies. By reliability, we mean the ability of a strategy to perform within the parameters it backtested in. A reliable backtest doesn’t guarantee that the strategy will perform the same way in the future, but that is the goal.
Why do we backtest? We backtest because it tells us which strategies might have an edge in the market. This is how we know if we’re going to continue to research a strategy or not. So if the backtest isn’t accurate, our entire process is flawed. In many ways, backtesting is the foundation of our process and this post is detailing what we’re doing to ensure we’re on solid ground.
Renko Vs. Range
A few months ago we published the post Impaled By a Unicorn, in which I talked about how we thought we had found the holy grail of trade strategy, only to be greatly disappointed. At the time I attributed the discrepancy to tight stops and the use of limit orders, but the same thing happened a few weeks ago with strategies that used neither. We discovered this while answering a research question about whether Renko bars perform better than Range bars in our strategies, for which we recreated the backtests for all strategies using Renko bars instead of Range bars.
While we were ultimately pleased with the results, it became clear that some of the strategies were operating outside of their backtested parameters, which is odd given that we were comparing two months of market replay data against a full year of backtest data. We did find two strategies that made us over $2,500 last week (on 1 NQ contract), and we discuss both in the results, but we remained concerned about the accuracy of the numbers.
I’m not going to share the full response with everyone, but I will share the final takeaway: The best way to improve backtest accuracy is to improve backtest/historical data; in particular the granularity of backtest data. After much research, these are the most salient points discovered regarding the discrepancy between backtest data and the actual way in which market data plays out:
In NT7, all backtest data is calculated on bar close even if the strategy is set to calculate on each tick in the current bar. Additionally, the previous bar only contains a limited set of data — open, high, low, close (OHLC).
Our research shows that this methodology is not specific to NinjaTrader. Indeed, this is the most popular method for developers to use because it’s a smaller data set. In other words, only the OHLC of the bar is saved as historical data in the backtest and not each tick that made up the bar. We thought that comparing backtest data from three separate platforms would add validity/accuracy to our data, but have found that all three platforms are programmed to use the same methodology.
This does not mean that backtests built on limited data are wrong, but calculated bars like Renko, Range, Point and Figure, and Heikin-Ashi are more prone to large discrepancies because of how the bars themselves are formed. Likewise, larger bar settings (e.g., over 5-minute, over 18-range, over 1,000-tick) are also more prone to large discrepancies.
Testing a strategy with Market Replay data is the most accurate way to assess the performance of a strategy as Market Replay plays back the market action just as it had occurred. The issue is that Market Replay does not allow for optimization and is slower than running a backtest (at least in NT).
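The OHLC limitation in the points above is easy to demonstrate. Here is a minimal Python sketch (hypothetical prices and function names of my own, not code from any platform): when a bar’s range contains both the stop and the target, an OHLC-only backtest cannot know which was touched first, while tick data resolves it exactly.

```python
def ohlc_fill(bar, stop, target):
    """With only OHLC, we cannot tell whether the stop or the target
    was touched first when the bar's range contains both levels."""
    o, h, l, c = bar
    hit_stop = l <= stop
    hit_target = h >= target
    if hit_stop and hit_target:
        return "ambiguous"  # the intrabar path is lost
    if hit_stop:
        return "stop"
    if hit_target:
        return "target"
    return "none"

def tick_fill(ticks, stop, target):
    """With every tick preserved, the first level touched is known exactly."""
    for price in ticks:
        if price <= stop:
            return "stop"
        if price >= target:
            return "target"
    return "none"

# A long position with a stop at 100 and a target at 104.
ticks = [102, 103, 101, 100, 102, 104]          # the stop is actually hit first
bar = (ticks[0], max(ticks), min(ticks), ticks[-1])  # OHLC summary of the same bar

print(ohlc_fill(bar, 100, 104))  # "ambiguous"
print(tick_fill(ticks, 100, 104))  # "stop"
```

A platform that sees only the OHLC summary has to guess in the ambiguous case, and an optimistic guess inflates the backtest; the tick-level view shows the trade was stopped out.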
Now that we know, what can we do?
Models are inherently flawed: what can we do about it?
An automated strategy is a model. It is an attempt to model out a successful journey through a dark and scary forest. There are many reasons why one model might be more useful than another.
A few posts ago, we talked about overfitting (optimization bias), which means a strategy is over-customized to the time period it was tested on. We optimize a strategy to see if we can exploit an edge found in backtesting. As a quick note, going forward we’re going to mark optimized strategies, and we’re going to add a time period test to optimized strategies for validation. You can read more about those changes here. In truth, however, as the title of this post says, overfitting is only the beginning. There are many reasons a strategy’s backtest may not be accurate, and we’ve been doing a lot of research, both in-house and through our network, to find out what we can do about it. As a result, we’re making some drastic changes to our process:
Data latency - the speed of data transfer has always been an issue in trading. It was an issue for Munehisa Honma, the father of rice trading, and it is an issue now; so much so that it remains an area of focus for hedge funds, especially those that can gain an edge from it (click here for an interesting story of an algo scam that relied on data latency). What are we doing to combat the issue? We’re implementing new rigs: all of us have committed to upgrading to a 4-core, 4.2 GHz CPU; 32 GB of DDR4-3000 RAM; and an M.2 NVMe drive over a SATA SSD. Don’t ask me what all that means (it was a copy and paste), but it’s what our tech guy recommended, so we’re doing it.
Poor coding - we are looking at different platforms to see how each calculates backtest data. This process will take a bit longer, but it’s in the works. Meanwhile, we’re also clearing the metadata around historical market replay/SIM trades by deleting the NinjaTrader.sdf file at the end of each day.
Limited/inaccurate backtest data - this is the big one, and the one we’re doing the most about. We’re going to upgrade to NT8 over the next month because it allows us to add intra-bar granularity through a second data series (such as a 1-tick or 1-second series added with AddDataSeries), so that the strategy or indicator has the individual ticks/seconds available to the primary series in the backtest/historical data. This increases accuracy by supplying the correct price at the correct time to fill orders. We had planned to move to NT8 at the beginning of 2021, but the discrepancy between backtest results and market replay data while testing Renko bars created a greater urgency. The September update will be based on NT8. For the July 7 update, we will provide updated backtest results for comparison, along with two months of market replay data. We’re also testing several other platforms, including some with different programming languages, as it seems C++ might be better than C# in terms of accuracy.
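The intra-bar granularity idea can be sketched outside of NinjaScript. In this hypothetical Python illustration (my own names, not NT8 code), each primary bar keeps the finer series that built it, so a backtest can determine exactly when, within the bar, an order level was first touched:

```python
from dataclasses import dataclass, field

@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    ticks: list = field(default_factory=list)  # secondary 1-tick series

def bars_from_ticks(ticks, size):
    """Group a tick stream into fixed-count primary bars, keeping the
    underlying ticks attached. This mirrors the idea of adding a 1-tick
    secondary series so fills can be resolved intrabar."""
    bars = []
    for i in range(0, len(ticks), size):
        chunk = ticks[i:i + size]
        bars.append(Bar(chunk[0], max(chunk), min(chunk), chunk[-1], chunk))
    return bars

def first_touch(bar, level):
    """Return the index of the first tick at or above a buy-stop level,
    or None if the level was never reached. With only OHLC, this timing
    is unrecoverable."""
    for i, price in enumerate(bar.ticks):
        if price >= level:
            return i
    return None

ticks = [100.0, 100.5, 101.0, 100.25, 101.5, 102.0]
bars = bars_from_ticks(ticks, 3)
print(first_touch(bars[1], 101.5))  # the 2nd tick of the 2nd bar, index 1
```

The point of the sketch: the fill is placed at the price and moment the level was actually reached, rather than at whatever price the bar-close approximation assigns.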
Black swan reconciliation - this is when the exchanges change historical pricing due to an unexpected trading event that gapped the price, but it isn’t really a concern for futures, especially NQ. At any rate, stops should eliminate the problem.
Spreads/slippage - this tends to only be a problem with assets that don’t have a deep market. This is not the case with NQ, our primary contract.
Forward-looking bias - this may be an issue for statistical models, but it is not an issue for traders. If you have knowledge of a trend, you should use it, or you should encode awareness of the trend as a condition in the model. The market is dynamic and constantly adapting to new information; some may refer to this new information as noise. If we zoom out, we can see observable patterns, and these are the patterns we hope to exploit. In other words, we do not want to dismiss a strategy that only does well in markets that are trending up just because markets have been trending up for the last six months; your bias is warranted in this case. I can’t imagine the head trader on a morning call telling everyone to trade without any awareness of the larger trend. Use your bias/knowledge. The goal is then to find the strategy that trades the noise within the trend. It could be something as simple as taking only the long leg of a MACD cross. In this way, you are effectively playing god: you tell the strategy the condition in which to trade, but you give it free will to trade within that condition.
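The “long leg of a MACD cross inside a trend condition” idea can be sketched in a few lines of Python. This is a hypothetical illustration; the EMA spans and the 50-period trend filter are my own choices for the example, not a recommended setup:

```python
def ema(values, span):
    """Simple exponential moving average over a list of prices."""
    k = 2 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(out[-1] + k * (v - out[-1]))
    return out

def long_signals(closes, trend_span=50, fast=12, slow=26, sig_span=9):
    """Take only the long leg of a MACD cross, and only while price is
    above a long-term EMA: the 'condition' the strategy may trade within."""
    trend = ema(closes, trend_span)
    macd = [f - s for f, s in zip(ema(closes, fast), ema(closes, slow))]
    signal = ema(macd, sig_span)
    entries = []
    for i in range(1, len(closes)):
        crossed_up = macd[i - 1] <= signal[i - 1] and macd[i] > signal[i]
        uptrend = closes[i] > trend[i]  # the bias/condition we impose
        if crossed_up and uptrend:
            entries.append(i)  # bar index of a long entry
    return entries
```

The condition (price above the long-term EMA) is the bias we hand the strategy; the MACD cross is the noise it is free to trade within that condition.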
In other words, the reliability of backtests is about more than overfitting, especially on strategies that haven’t been optimized.
We hope to implement most of these changes within the next two weeks. Again, backtest/replay accuracy is our primary goal for now and any suggestions you have to that end will be prioritized over all other research requests.
Conclusion: conquering the illusive in hopes of finding the elusive
The goal of this post was to detail what we’re doing to focus on backtest accuracy, which is one aspect of what makes the holy grail of trade strategy so elusive. After much research, it is clear that accurate backtest data may be as elusive to the trading community as the holy grail of trade strategy itself. It is our hope that by making backtest accuracy the foundation of our process, the holy grail will become less elusive. I feel highly confident in our efforts.
Look for an update on Research Requests to be posted with the July 7 update.
Click here for links to all strategies.