Testing the Noise-Value Strategy

PDF Version: P123 Strategy Design Topic 2C – Testing the Noise-Value Strategy

Topic 2A spelled out an idea for a strategy based on the notion that in the stock market, Price = Value + Noise, or P = V+N, as well as a plan for how we articulate the strategy. The details were spelled out in Topic 2B. Now, we’ll move on to testing.

Setting Expectations 

The process of testing and simulating on Portfolio123 is well covered in the Tutorials available in the Help section and need not be reiterated here. The important issue for now is the expectations we have, what it would take to cause us to judge the model successful.

It all starts with recognition of what we are not trying to do.

  1. Unlike academicians, we are not seeking to define elements of theory. We are not looking for universal truths. Instead, we are working with already-established and accepted theory and are looking to see
    1. If we did a reasonable job in translating it to Portfolio123 language
    2. If there is some indication we might be able to use it to successfully invest money in the near future
  2. In contrast to what is needed in many areas involving quantitative research, we are extrapolating, not interpolating
    1. We are testing in Universe A (defined by the data available to us for testing), a Universe in which all characteristics are knowable with complete certainty, the only issue being which ones we work to observe. But we seek to apply our model in Universe B (the future), a completely unknowable universe the vital characteristics of which may or may not differ, even differ substantially, from what we saw in Universe A.
    2. Consider, for example, chocolate. It’s known to be edible by humans. But that does not mean we can give chocolate to dogs. Our knowledge of what is safe for humans to consume does not apply to dogs because their digestive systems have different characteristics. Our ability to transfer knowledge from one population to another must be guided by a more general sense of commonalities (i.e. we know humans and dogs must eat things that are compatible with their respective digestive systems, and we know both must be properly hydrated.)
  3. Robustness, in a statistical sense, is not our goal because all we learn from that is how effectively we have modeled the characteristics of Universe A. We cannot take our knowledge into Universe B because it tested successfully in Universe A. We do so because what we learned in Universe A supports an assumption we had, even before we started testing, that an idea already known to be sound (based on the theory) was properly translated into Portfolio123 language. Also . . .
  4. We must be aware of subsets of Universe A, mainly those that we believe will most closely resemble Universe B
    1. That raises a presumption in favor of the recent past being more relevant than earlier years (the presumption can, however, be refuted by evidence you see emerging as you watch unfolding developments).
    2. Use other subperiods as they seem relevant. For example, an income model should be particularly tested against mid- to late-2013, the time of the market’s “taper tantrum.”
  5. Do not expect to eliminate randomness. That can never happen.
    1. The theoretical framework for defining influences on return is:

R = a + b_1 x_1 + b_2 x_2 + . . . + b_N x_N + e

e is the ever-present error term that is presumed to be random and non-zero. Models built on the basis of this framework aim to minimize e but recognize that it cannot be eliminated. Diversification (multiple stocks within a single portfolio and use of multiple strategies) is the way we try to diminish e.

  1. We are not aiming for a stupendous outcome. With live money, it’s very hard to achieve any positive alpha so anything above zero is excellent. Realistically, it’s probably impossible for humans to completely banish hindsight, so alphas up to around 20% are probably feasible in simulations. But as we rise above that, we need to start worrying that we tilted the balance too far in favor of what worked in Universe A and whether we let our attention wander away from the theoretical bridge we’ll need in order to cross over into Universe B.
  2. We’re not worrying about Max Drawdown for as long as the computation would cover 2008. I’m not expecting this or any other model to protect me from a major liquidity crisis that threatens the viability of the financial markets as a whole. If I fear that (whether through judgment or a timing model in which one has confidence), I need to be out of stocks or have short exposure. Given that Ben Graham eliminated 1932 from his analysis for similar reasons, we’d be in pretty good company if we choose to overlook 2008.
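The point about diversification and the error term e can be illustrated with a small simulation. This is only a sketch under loud assumptions: each stock's e is drawn independently from the same normal distribution (real stocks are correlated, so the benefit in practice is smaller), and the function name, the 10% sigma and the trial count are all hypothetical choices made for illustration.

```python
import random
import statistics

def average_error_spread(n_stocks, trials=2000, sigma=0.10, seed=0):
    """Std. dev. of the portfolio-average error term across simulated trials.

    Assumption: each stock's error is an independent draw from N(0, sigma).
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        errors = [rng.gauss(0.0, sigma) for _ in range(n_stocks)]
        averages.append(sum(errors) / n_stocks)
    return statistics.pstdev(averages)

# Spread of the averaged error for a few portfolio sizes.
spread = {n: average_error_spread(n) for n in (1, 10, 20, 30)}
for n, s in spread.items():
    print(f"{n:>2} stocks: spread of average e = {s:.4f}")
```

Under these assumptions the spread shrinks roughly with the square root of the number of positions. It never reaches zero, which is the sense in which e can be diminished but not eliminated.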

We are trying to do this:

  1. We’d like to see some positive alpha. But don’t spend much time analyzing it. A quick glance, no more than a second or two, is more than ample.
  2. Actually, though, any result is acceptable to us if we understand why it happened, whether through specification of a model that didn’t do as good a job as it should have in eliminating data oddities that often crop up in individual companies, or circumstances in the market that give us reason to expect a particular sort of performance, etc.
  3. We’d like to be aware of volatility (ex 2008), not so much for the details but to get a sense of whether it seems too high for our real-money comfort level, or whether it’s low enough to give us reason to give this strategy more rope if we start to fear market turmoil.
  4. It’s OK to observe Sharpe and Sortino in a very general sense; i.e. to be aware of potentially exceptional numbers. But don’t put too much into them since they are, after all, not indicators of inherent risk but statistical report cards showing what just so happens to have happened during the test period. (Actually stocks with very high historical Sortinos typically get them because of characteristics that could lead to extreme losses in the future if things break badly for the companies, and vice versa.)
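To see why Sharpe and Sortino are report cards rather than measures of inherent risk, it helps to look at how they are built. The sketch below uses common textbook definitions (per-period, not annualized); how Portfolio123 computes its own figures is not stated here, so treat the formulas, function names and sample returns as assumptions.

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample std. dev. of excess returns."""
    excess = [r - risk_free for r in returns]
    avg = sum(excess) / len(excess)
    var = sum((x - avg) ** 2 for x in excess) / (len(excess) - 1)
    return avg / math.sqrt(var)

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation (shortfalls only).

    Raises ZeroDivisionError if no period falls below the target.
    """
    excess = [r - target for r in returns]
    avg = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    return avg / downside

# Hypothetical per-period returns, for illustration only.
history = [0.02, -0.01, 0.03, -0.02]
sharpe = sharpe_ratio(history)
sortino = sortino_ratio(history)
print(f"Sharpe {sharpe:.2f}, Sortino {sortino:.2f}")
```

Nothing goes into either ratio except the returns that happened to occur in the test period. A stock that has not yet had its bad stretch shows little downside deviation and therefore a flattering Sortino.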

The first test

I’ll start with a quick test under my standard default choices: Max time period, 4-week rebalancing. Here are the results:

Avg. Annl % return 15.33 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.20 % vs. 19.89% for benchmark

Maximum DrawDown 62.86 % vs. 58.06% for benchmark

Sharpe 0.67 vs. 0.35 for benchmark

Sortino 0.94 vs. 0.47 for benchmark

Beta  0.87

Annl Alpha 8.63%

That’s fine; easily good enough to move on.

Next I’ll do a quick odd-even test. I’ll add a rule that says EvenID at the end of the screen to limit results to stocks with even ID numbers. Then, I’ll repeat with EvenID=0 to limit results to odd-numbered companies.
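Conceptually, the odd-even test splits the universe into two halves on a trait that has nothing to do with the strategy, so each half is an independent sample of the same idea. Here is a minimal sketch of that split. The function name and the ID values are hypothetical; on Portfolio123 the EvenID rule does this work inside the screen.

```python
def split_by_id_parity(stock_ids):
    """Split stock IDs into even and odd halves, preserving order."""
    even_ids = [sid for sid in stock_ids if sid % 2 == 0]
    odd_ids = [sid for sid in stock_ids if sid % 2 != 0]
    return even_ids, odd_ids

# Hypothetical stock IDs, for illustration only.
universe = [1001, 1002, 1003, 1004, 1005, 1006, 1007]
even_ids, odd_ids = split_by_id_parity(universe)
print("even:", even_ids)
print("odd: ", odd_ids)
```

Because ID parity carries no investment meaning, a large gap between the two halves points to the e term rather than to the idea itself.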

Even test

Avg. Annl % return 9.15% vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.85 % vs. 19.89% for benchmark

Maximum DrawDown 70.13% vs. 58.06% for benchmark

Sharpe 0.42 vs. 0.53 for benchmark

Sortino 0.57 vs. 0.47 for benchmark

Beta  0.91

Annl Alpha 2.91%

Odd test

Avg. Annl % return 14.60 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.00 % vs. 19.89% for benchmark

Maximum DrawDown 65.52% vs. 58.06% for benchmark

Sharpe 0.65 vs. 0.53 for benchmark

Sortino 0.91 vs. 0.47 for benchmark

Beta  0.84

Annl Alpha 8.10%

Not great. While the lesser even-numbered subset generated positive alpha and might even be considered OK standing on its own, comparison with the odd group reminds us that the e term is for real here.

Is it enough to kill the model? Maybe. Maybe not. I’ll postpone the decision until the end of the process, where it will be considered along with everything else. One thing I will not do is try to revise the model to get the odd and even subsamples to be more similar. Again, this is only Universe A. I’m not looking to perfect that. I’m just looking for feedback as to whether or not the core idea (selecting low-noise small caps that show a potential catalyst for extra gains as additional noise enters into the stock price) is worth using in Universe B.

Next I’ll do a set of 4-week rolling backtests with each run starting a week apart. The time frame is 1/2/99 – 10/26/15 (873 samples). Here are the results.

Average of all periods: excess over benchmark = 0.62%

Average of up periods: excess over benchmark = 0.10%

Average of down periods: excess over benchmark = 1.40 %

This is interesting. It works on average for all periods, but is obviously better, much better, during down periods.

Does this make sense? I didn’t think about it before, but on reflection and now that I see it, I can imagine the Street being less interested in hunting out under-appreciated (i.e. low noise) situations when it’s really not necessary to put forth the extra effort (i.e., when the market as a whole is good). There’s more motivation to dig for stocks such as these when there is no rising tide to lift all boats.
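For anyone who downloads the rolling-test data and wants to reproduce the all/up/down averages, the arithmetic can be sketched as follows. The assumption here is that an "up period" is one in which the benchmark return is positive and everything else counts as down; the function name and the sample figures are hypothetical.

```python
from statistics import mean

def rolling_excess_summary(periods):
    """Average excess return over all, up and down benchmark periods.

    periods: list of (strategy_return_pct, benchmark_return_pct), one per window.
    Assumption: a window is "up" when the benchmark return is positive.
    """
    excess_all = [s - b for s, b in periods]
    excess_up = [s - b for s, b in periods if b > 0]
    excess_down = [s - b for s, b in periods if b <= 0]
    return {
        "all": mean(excess_all),
        "up": mean(excess_up) if excess_up else None,
        "down": mean(excess_down) if excess_down else None,
    }

# Hypothetical 4-week returns in percent, for illustration only.
sample = [(3.0, 2.5), (1.0, 1.2), (-1.0, -3.0), (-2.5, -3.5)]
summary = rolling_excess_summary(sample)
print(summary)
```

In this made-up sample the strategy barely keeps pace in up windows and loses much less than the benchmark in down windows, which is the same shape as the pattern described above.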

Let’s do a quick odd-and-even here too.

Even

Average of all periods: excess over benchmark = 0.47%

Average of up periods: excess over benchmark = 0.12%

Average of down periods: excess over benchmark = 1.01 %

Odd

Average of all periods: excess over benchmark = 0.72%

Average of up periods: excess over benchmark = 0.19%

Average of down periods: excess over benchmark = 1.50 %

For better and worse, it confirms what we’ve seen. There is randomness here, but it’s not blowing up either subset. Interestingly, the down-versus-up market quality shown in the first rolling test is holding up.

Most important, positive results across the board, even despite the impact of the e term, are consistent with expectations. After all, the strategy grew from financial logic. It should have worked. So far, it looks as if our execution (our definition of strategic elements and our translations to Portfolio123 language) was acceptable.

Still, even the best strategies don’t work equally well in all time periods. This is inevitable. We can’t control the market, nor can we control the way the market chooses, at various times, to react to particular factors. Over prolonged time periods, good factors should work more often than not. The most patient among us can rest on that. But if we can develop insight into whether we think a strategy is more likely to be hot or cold near term, why not use that knowledge?

Testing Some Sub-periods

I’m not going to test the early 2000s because that was the golden age of quant factors, just after the capability to invest this way became widespread but before it got sufficiently prevalent to crowd many trades. Even mediocre ideas looked great in those days.

I’m also not going to consider the 2008 drawdown. As discussed above, in connection with drawdown, that was not a fundamentally driven collapse. It was a financial-liquidity crisis. If anything, better stocks sold off more intensely because those were assets for which distressed funds could get credible bids.

Here’s a sample of some more recent periods:

5-year

Avg. Annl % return 20.15% vs. 12.10% for benchmark

Standard Deviation 16.73% vs. 15.76% for benchmark

Maximum DrawDown 22.54 % vs. 27.10% for benchmark

Sharpe 1.12 vs. 0.74 for benchmark

Sortino 1.60 vs. 1.02 for benchmark

Beta  0.93

Annl Alpha 8.22 %

That’s fine. Let’s quickly move on.

3-year

Avg. Annl % return 21.54% vs. 14.31% for benchmark

Standard Deviation 12.92% vs. 13.10% for benchmark

Maximum DrawDown 12.96 % vs. 15.47% for benchmark

Sharpe 1.40 vs. 0.94 for benchmark

Sortino 1.89 vs. 1.27 for benchmark

Beta  0.82

Annl Alpha 8.33%

That’s also good. We’ll keep going.

1-year

Avg. Annl % return 3.70 % vs. 6.10% for benchmark

Standard Deviation 11.46 % vs. 12.12% for benchmark

Maximum DrawDown 18.10 % vs. 15.47% for benchmark

Sharpe -0.77 vs. -0.44 for benchmark

Sortino -1.34 vs. -0.59 for benchmark

Beta  0.56

Annl Alpha -5.72 %

Well, if we were starting to nod off, that should have served to jolt us awake. Now, with our rejuvenated sense of alertness, we’ll make one more standard run, but with a sense of humor and an understanding of how wacky some metrics can look when data for short periods is annualized.

6-month

Avg. Annl % return 5.80 % vs. -3.82% for benchmark

Standard Deviation 0.28% vs. 1.53% for benchmark

Maximum DrawDown 9.46 % vs. 11.86% for benchmark

Sharpe -109.98 vs. -45.93 for benchmark

Sortino -155.53 vs. -64.95 for benchmark

Beta  -0.18

Annl Alpha -35.65 %
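Part of the wackiness comes from annualization itself. A minimal sketch, assuming the standard compounding convention (how Portfolio123 annualizes is not stated here, and the function name and inputs are hypothetical):

```python
def annualize(period_return, years):
    """Compound a return earned over `years` (e.g. 0.5) up to a one-year rate."""
    return (1.0 + period_return) ** (1.0 / years) - 1.0

# Hypothetical inputs, not taken from the backtests above.
six_month = annualize(0.10, 0.5)       # 10% gain over six months
one_month = annualize(-0.02, 1 / 12)   # 2% loss over one month
print(f"six-month 10% annualizes to {six_month:.2%}")
print(f"one-month -2% annualizes to {one_month:.2%}")
```

The shorter the window, the more a small move gets magnified, and ratios that divide by a tiny standard deviation get magnified further still.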

Looking at all of them as well as the graphs (which you can reproduce if you choose to copy the model into your account) we see that generally the model held up in this recent market period. We had one very notable exception when the portfolio started to fall a few months before the Russell swooned. But then, after bottoming, the portfolio bounced harder upward.

There has, of course, been talk of value not having been good lately. But we’re Portfolio123. Talk alone doesn’t cut it.

I did a one-year backtest of a random screen against three different benchmarks: the S&P 1500 Pure Growth Index, the S&P 1500 Pure Value Index and the S&P 1500 Composite. (I use the S&P 1500 because I want to look at value-versus-growth, independent of market cap effect.) I don’t care what the screen did. I’m looking only at the benchmarks.

Growth Total Return: 9.18%

Value Total Return: -2.55%

Composite Total Return: 5.20%

That’s our answer to the strategy’s rotten 1-year backtest. Value (an exposure we take on when we look for stocks whose prices reflect below-normal levels of noise) was out of favor for real; it wasn’t just talk.

Well that stinks. But there’s nothing we can do about what’s in or out of favor in the market at any point in time. All we can do is decide we like our idea well enough to ride out cold periods, or go with other ideas. What we should not even think about doing is adjusting the model such as to get a better one-year backtest.

Also interesting is a one-year comparison between the strategy, the S&P 600 (small cap) value index and the iShares Russell 2000 Value ETF (IWN):

Strategy: +3.70%

S&P 600 Value: -3.36%

iShares Russell 2000 Value: -0.06%

Yes, we had exposure during the past year to a crummy part of the market, value. But at least we were the cream of the crop.

This is an important approach to assessing odd periods in a backtest. Benchmarking is a large and complex field of activity. Experiment with as many Portfolio123 pre-set benchmarks as you find relevant. And if you can’t find what you need, create single-rule ETF screens (Ticker("XXX")) and backtest those. Or create and backtest your own screens that can serve as custom benchmarks. Until such time as Portfolio123 is able to add more on-platform benchmarking, download the return sets from these ad-hoc backtests and make your comparisons in Excel (often, though, even eyeballing, or comparing screenshots of the performance results, will be fine).

Looking at all of this, I’ll give the model a green light. I know the odd-even tests weren’t what I wish they were. But enough else that I’ve seen makes sense in combination with the most important (by far) consideration, my comfort with the theoretical ideas upon which the model is based, for me to be willing to take this model into Universe B (and hope value doesn’t stay in the doghouse forever, or hope small-caps don’t turn sour – always understand how and why a model, no matter how well conceived and tested, can still falter when applied to live money).

Actually, though, there are still some details we need to address.

The Rebalancing Period

So far, everything we did presumed my default choice of a four-week rebalancing period. I always check to see how things fare with one-week or three-month rebalancings.

  • One week has the benefit of fresher data, but may wind up being too quick to sell stocks that needed more time for their investment merits to play out in the market.
  • Longer periods sacrifice data freshness, but we can benefit by giving our ideas more time to play out.

Information travels faster than ever, but much that we use is updated quarterly and in all cases, investor interpretation of and reaction to information takes time to evolve. Not having intra-day data, we are not likely to be able to constructively use strategies that depend on speedy human or electronic response. So I usually favor the longest feasible rebalancing period. Based on the quarterly update cycle, I tend to set my maximum rebalancing choice at 3 months. (Note that academic research often uses one-year rebalancing.)

Here are MAX-period tests of 20-stock portfolios under different rebalancing scenarios:

1-Week

Avg. Annl % return 15.43 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.40 % vs. 19.89% for benchmark

Maximum DrawDown 63.80 % vs. 58.06% for benchmark

Sharpe 0.67 vs. 0.35 for benchmark

Sortino 0.94 vs. 0.47 for benchmark

Beta  0.86

Annl Alpha 8.83%

4-Weeks

Avg. Annl % return 15.33 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.20 % vs. 19.89% for benchmark

Maximum DrawDown 62.86 % vs. 58.06% for benchmark

Sharpe 0.67 vs. 0.35 for benchmark

Sortino 0.94 vs. 0.47 for benchmark

Beta  0.87

Annl Alpha 8.63%

3 months

Avg. Annl % return 10.07 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 20.91 % vs. 19.89% for benchmark

Maximum DrawDown 58.41 % vs. 58.06% for benchmark

Sharpe 0.45 vs. 0.35 for benchmark

Sortino 0.64 vs. 0.47 for benchmark

Beta  0.84

Annl Alpha 3.74%

There’s negligible difference between 1- and 4-week rebalancings, but we clearly give up something when we go to 3 months – although not much. We could live with that if need be.

Before locking in on anything, let’s look at the next issue: Number of Positions

First things first: I won’t even test an equity model with fewer than 10 stocks. (ETF models can go as far down as one.) I don’t want to see how smaller numbers of positions performed. I don’t care. Nothing such a test could show could possibly persuade me to invest live money in a portfolio with a small number of stocks. To do that, I’d need to be doing significant off-platform analysis (which I sometimes do, since everybody needs some fun investments, but really, over time, my experience with Portfolio123 models has been better).

Here’s the problem with small models.

We face huge risks of the inadvertently mis-specified model. Data doesn’t always tell us what we think it tells us. A low P/S ratio may be low because the market is missing out on something good. Or it may be low because sales in part of the TTM period are inflated by a big acquisition. As previously discussed, my models are screen-heavy as I try to create the most representative mini-universes I can. But there is just so much you can do with screening before you narrow your result set to zero.

Portfolio diversification is another approach. Simply put, I want the portfolio average metrics to reflect my idea as closely as possible. Choosing any one stock is too iffy in that regard; it may be great on one metric and terrible in others. My experience suggests that 10 positions is the smallest portfolio size I accept. Fewer than that results in excess exposure to data oddities. I’m not seriously arguing against 9 positions. We all have to draw a line somewhere. So sometimes, we humans have just as much “e” in us as does the market.

The upside limit is trickier. As many point out, the more stocks you have, the closer to average you necessarily have to be. So a 1500 stock portfolio is going to find it hard to beat the Russell 2000. We also have to consider trading costs. Trading as I do on a super-low-cost platform (Folio Investing), I find that 20 is often a comfortable starting point. But again, based on experience, and the good old random “e” factor inside me, I’ll also test 10 positions and 30 positions.

Here are the MAX-period position tests using 4-week rebalancing intervals:

10 stocks

Avg. Annl % return 15.80 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 23.18 % vs. 19.89% for benchmark

Maximum DrawDown 66.73 % vs. 58.06% for benchmark

Sharpe 0.65 vs. 0.35 for benchmark

Sortino 0.93 vs. 0.47 for benchmark

Beta  0.86

Annl Alpha 9.63%

20 stocks

Avg. Annl % return 15.33 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 21.20 % vs. 19.89% for benchmark

Maximum DrawDown 62.86 % vs. 58.06% for benchmark

Sharpe 0.67 vs. 0.35 for benchmark

Sortino 0.94 vs. 0.47 for benchmark

Beta  0.87

Annl Alpha 8.63%

30 stocks

Avg. Annl % return 12.42 % vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 20.54 % vs. 19.89% for benchmark

Maximum DrawDown 65.58 % vs. 58.06% for benchmark

Sharpe 0.57 vs. 0.35 for benchmark

Sortino 0.79 vs. 0.47 for benchmark

Beta  0.85

Annl Alpha 5.94%

Portfolios of 10 and 20 positions seem largely indistinguishable. It’s interesting to see that at 30, my idea starts to dilute (but only a little bit; the results for 30 are perfectly fine viewed on their own).

Settling in on a final model

I could run more tests with different sub-periods and different ways to mix and match rebalancings and positions. But there’s no virtue in over-testing. As noted, the dominant consideration here is the sensibility of the original idea. The more we test, the further away from that we get and the more danger we take on when we commit real money. There are no brownie points for going with the variation that produced the highest backtest result. We don’t know if the top result will translate into the top live-money performance. But we do know that the process of searching for the best result and the mindset that locks in on it is a slippery slope that leads us down into the realm of curve fitting.

So at this point, looking at what I have, I’m perfectly prepared to go forward with my default 4-week rebalancings and with 20 positions. That’s fine for me.

But just for the heck of it, I wonder about 1-week rebalancing and 10 positions.

Here’s why: Even though 4-weeks was fine in test, I can’t help thinking about the fact that I’m using analyst data in my ranking system. When working with that in such an important way, I crave fresher data – even though I don’t have tests that prove its superior efficacy. But if I drop the rebalancing down to a week, I really would like to cut my position count. Maybe I’ll want to share this with somebody who doesn’t use Folio Investing and might care more than I do about the number of trades.

I’ll run one more MAX-period test with 10 positions and 1-week rebalancing. If it’s better than or not materially worse than 4-weeks and 20 positions, I’ll go with the former. (It’s fair to use judgment this way; again, we’re projecting forward so there are no brownie points for locking in on the single best set of test results).

Here is the result:

Avg. Annl % return 16.63% vs. 7.44% for benchmark (iShares Russell 2000 ETF)

Standard Deviation 23.84 % vs. 19.89% for benchmark

Maximum DrawDown 68.06% vs. 58.06% for benchmark

Sharpe 0.68 vs. 0.35 for benchmark

Sortino 0.95 vs. 0.47 for benchmark

Beta  0.89

Annl Alpha 10.44%

How about that! If there were brownie points, I’d get them (well, maybe not, who knows how many unstudied variations might show even better results). In any case, though, as they say in Hollywood . . .

It’s a wrap (10 stocks and weekly rebalancing).

Post Script – What About Simulation/Portfolio

I always do my grunt work in the screener because the interface more easily lends itself to stating ideas and revising and quickly retesting. If I want to get to a live Portfolio, I’d have to make some changes and test them in sim.

First, I need to change every use of ShowVar to SetVar.

ShowVar is designed to show items in reports. That’s important to me when I create such models because it makes it easy for me to debug. The report-column-generating functionality is, however, irrelevant to simulation and portfolio. Hence ShowVar is not a legal function in that part of Portfolio123.

SetVar gives you what you need for simulation and portfolio. It executes the same logic, but keeps everything where you need it, behind the scenes. SetVar never shows anything in reports (so if you are confident enough in your @variables, you can even use SetVar in the screener; you just won’t be able to debug).

Also, while you can have sim/port replicate the screener’s selling protocols (use Rank<101 as a sell rule and, in general settings, allow sold positions to be repurchased at rebalance), many prefer to avoid doing that. This spares the model the need to make a bunch of mini-trades to get all positions back to equal weighting at rebalance.

So when you bring screen-based models to sim, create some sell rules (even if they are mirror images of your Buy rules; at least doing that spares you the burden of the re-weighting mini-trades). Beyond that, it can be helpful to add some tolerance on the sell rules. In this model, for example, I might want to let the stock keep running for a bit, to let noise build, perhaps to 40%-50% of market cap. And maybe I’d want to blow a stock out right away if there is an estimate cut. So put in your sell rules, and then re-test in the context of sim. If you need to revise, do so thoughtfully and stop when something looks OK. And don’t over-interpret the sim. If your first trial is OK, think about stopping. Avoid the temptation to play with little variations in an effort to juice the results. Remember . . . there are no brownie points.
