READER ADVISORY: Explicit statistical content.
On Monday, in Part 1 of this series, I outlined the recent NFL history of drafting OLs in the 1st round. In a very superficial way, I also provided some crude statistical expectations for both the 49ers' and their new OLs' performance over the next few years. The bottom line was that Anthony Davis and Mike Iupati will probably end up being pretty good players, but that probably won't translate into much-desired team success until 2011 or 2012.
Today, I'm going to get a bit more abstract, and try to show this seeming paradox in more statistically sophisticated detail.
WHAT-WORLD VS. WHY-WORLD
I've been plodding away at this whole NFL statistics thing for about 5 years now, and to say that the field remains in its infancy is a mild understatement. What I've noticed most of all is that much of what has passed for NFL statistical analysis -- including many of the things I've posted here on Niners Nation -- is really just a description of what has occurred or a description of what is going to occur based on what has occurred. In other words, there's a lot of focus on the "what" of things.
The most egregious example of this what-focus has to be the inanity of stats like "The 49ers are 18-2 when Frank Gore has 20 or more carries in a game." As Football Outsiders showed a long time ago, 20 is not some magic number that draws the win fairies from their hiding places to bestow a win upon the eagerly awaiting 49ers. Why is this so inane? Well, it's because it doesn't answer the "why?" question at all. Namely, why do 20 carries influence winning? As Football Outsiders showed way back in their infancy, the answer to that question leads to the perfectly circular statement, "The 49ers win more when Frank Gore has 20 or more carries in a game because Frank Gore has more carries when the 49ers are winning." Stated differently, an analysis that initially evokes statistical wonderment in Whatworld quickly evokes laughter in Whyworld.
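To see how that circular logic plays out, here is a minimal simulation sketch. Everything in it is a made-up assumption for illustration (the carry ranges, the win probabilities, and names like `simulate_games`); none of it comes from real 49ers data. In the simulation, carries never influence the result. Whether the team is leading late drives both the carry count and the outcome, and yet the "20 or more carries" split still looks impressive.

```python
import random


def simulate_games(n_games=2000, seed=0):
    """Simulate games where carries are a consequence of leading, not a cause of winning."""
    rng = random.Random(seed)
    games = []
    for _ in range(n_games):
        # Assumption: the team is leading late in half of its games.
        leading_late = rng.random() < 0.5
        first_half_carries = rng.randint(8, 12)
        # Assumption: a team protecting a lead runs more in the second half.
        if leading_late:
            second_half_carries = rng.randint(8, 14)
        else:
            second_half_carries = rng.randint(3, 9)
        carries = first_half_carries + second_half_carries
        # The result depends only on game state; carries are never used here.
        won = rng.random() < (0.85 if leading_late else 0.15)
        games.append((carries, won))
    return games


def win_rate(games):
    """Share of games won in a list of (carries, won) pairs."""
    return sum(won for _, won in games) / len(games)


games = simulate_games()
heavy = [g for g in games if g[0] >= 20]
light = [g for g in games if g[0] < 20]
print(f"Win rate with 20+ carries: {win_rate(heavy):.2f}")
print(f"Win rate with fewer than 20 carries: {win_rate(light):.2f}")
```

The gap between the two win rates comes entirely from game state, which is the Whyworld answer that the Whatworld stat hides.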
After the jump, we take a slow boat to Whyworld...
In contrast to the barren wastelands in What-World, there are certain bastions of good statistics that we'll call the United Sites of Why-World. One state capital in Why-World is, of course, Football Outsiders. Another is the one-man-gang over at Advanced NFL Stats. In both of these capitals, the focus on "why?" is a direct outgrowth of their reliance on theories of NFL football rather than observations of NFL football patterns. The distinction I'm trying to draw here is between deductive (aka top-down) research vs. inductive (aka bottom-up) research.
Basically, in science, you can either use a general theory of something to explain/predict a specific (type of) event or you can use specific (types of) events to develop a general theory of something. The latter is part of Whatworld, whereas the former is part of Whyworld. In other words, theories tell you why what you're trying to explain/predict has happened/will happen, whereas observations only tell you what happened, and it's up to you to detect the pattern of happenings.
In the case of Football Outsiders, their DVOA statistic is based on first down probability theory (from Palmer, Carroll, and Thorn's The Hidden Game of Football, 1988), which says that, because NFL rules award a first down for every 10 yards gained from the 1st-down spot, successful plays are defined as those that make a 1st down more likely, i.e., those that gain a larger percentage of yards remaining for a 1st down. Therefore, better teams are more likely to have a larger percentage of successful plays; and that explains why better teams win more often. In other words, the answer to, "Why did/will Team X beat Team Y?" is, "because Team X had/will have a higher play success rate." This basis in a time-tested theory of football is the reason why DVOA predicts wins so well.
Similarly, in the case of Advanced NFL Stats, Brian Burke's EPA statistic is based on point expectancy theory (originally by Virgil Carter, but adapted in The Hidden Game of Football), which says that, because NFL rules award 6 points for crossing the opponent's goalline while in possession of the ball, offenses can be expected to score more points the closer they get to said goalline. Therefore, each play's value is proportional to the amount that it increases the team's expected points. Add to the mix this other NFL rule about winning being the result of scoring more points than your opponents, and you end up with the idea that better teams win more often because more of their plays add to their expected point totals. Hence, Expected Points Added (aka EPA). So, the answer to, "Why did/will Team X beat Team Y?" is "because Team X's plays added/will add more to their expected point total." Again, a theory-based journey into Whyworld is the reason why EPA is so good at predicting wins.
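For the code-inclined, the arithmetic behind "expected points added" can be sketched in a few lines. Fair warning: the curve below (`toy_expected_points`) is a hypothetical straight line I'm using purely for illustration. Real expected-points curves are estimated from play-by-play data and also depend on things like down and distance, which this sketch ignores.

```python
def toy_expected_points(yards_to_goal):
    """Toy expected-points curve: rises linearly as the offense nears the goal line.

    Assumption: a made-up linear stand-in that only captures the idea that
    being closer to the goal line means more expected points.
    """
    if not 0 <= yards_to_goal <= 100:
        raise ValueError("yards_to_goal must be between 0 and 100")
    return 6.0 * (100 - yards_to_goal) / 100


def expected_points_added(yards_to_goal_before, yards_to_goal_after):
    """Value of a play: expected points after the play minus expected points before it."""
    return toy_expected_points(yards_to_goal_after) - toy_expected_points(yards_to_goal_before)


# A 15-yard gain from the offense's own 30 (70 yards from the goal line).
gain = expected_points_added(70, 55)
# A 5-yard loss from the same spot.
loss = expected_points_added(70, 75)
print(f"15-yard gain: {gain:+.2f} expected points")
print(f"5-yard loss: {loss:+.2f} expected points")
```

Summing those per-play values over a game or a season is the basic idea behind a team-level EPA total.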
Keep in mind, all of this is just an illustrative way of saying that, if we're going to predict something using NFL stats, we better be prepared to explain why that predicted thing is going to happen; and starting from the viewpoint of a theory, rather than a series of observations, is a sure-fire way to have that explanation ready in advance.
BACK IN NINERSWORLD
Bringing this back to a discussion about the potential impact of the 49ers' two 1st-round OLs, the point of the above journey into scientific methodology is to emphasize the importance of going beyond, "Will Davis and Iupati improve the 49ers' win total in 2010?" to "Will Davis and Iupati improve the 49ers' win total in 2010 because of [insert theory-based statistical factor here]?" That's what I'll be spending the rest of this post trying to do.
In my mind, among the myriad of potential explanations for why Davis and Iupati might make the Niners better as a team in 2010, the only ones that involve a direct causal link between these 2 players and team performance are:
- They're going to improve the Niners' run blocking.
- They're going to improve the Niners' pass protection.
To put these in the form of a larger theory, I'll refer back to a paragraph I wrote in Part 1 (my emphasis added):
...What happened to the Jets after drafting 2 OLs in the 1st round of the 2006 draft...? Well, both started in their rookie season, and had an immediate "impact." Their pass protection improved considerably (-3.7%), which presumably led to a massive improvement in pass OFF efficiency (+47.1%), which presumably led to a 6-win improvement and a playoff berth...So, if there's a poster child for the theory adopted in Santa Clara this past Spring, it's those 2006 Jets...
In a general sense, what I was hinting at there was that, back in April, the front office was operating under the theory that picking an OL in the 1st round leads to better blocking, which leads to better OFF, which leads to more wins. Therefore, if you're going to try to make the theoretical case -- and theory is really all we have right now given the uncertainty ahead -- that drafting Davis and Iupati is going to lead to more wins in 2010 (and/or beyond), then you have to answer the "why?" question. In this post, I'm answering it with, "because they're going to make the blocking better, thereby making the OFF better; and better OFF means more wins." Here's the theory in picture form:
In this picture, the arrows mean "causes" and the plus signs -- shockingly -- mean "positive change." So, just to repeat for the sake of clarity, the general theory we're discussing here is that drafting 1st-round OLs causes positive change in blocking, which causes positive change in OFF, which causes positive change in wins.
MEANWHILE, OVER IN STATS-WORLD
Just in case you forgot, one more time...READER ADVISORY: Explicit statistical content.
Of course, without knowing it, what we've really done here is engage in the statistical art of theory-based model-building. In this case, our model is predicting win change from drafting OLs in the 1st round. Model-building is at the heart of statistical prediction for phenomena ranging from elections to NFL game results. And forming the basis of most prediction models is regression analysis.
Those of you who are mildly familiar with statistics (or have followed my posts on Niners Nation; I've used it in previous posts) will know what regression analysis is. If you don't (or haven't), see here for a wiki-explanation. Basically, all prediction models seek to predict an outcome from a set of variables that theoretically impact that outcome; and the vast majority of the time, the mathematical heavy-lifting is done via a regression analysis of some kind because that's exactly what regression analyses are meant for.
One problem with simple regression analysis, however, is that it has a hard time dealing with complex phenomena that involve multiple things causing other multiple things at the same time or that involve long chains of one thing predicting another that predicts another that predicts another, and so on. In other words, if I want to predict Y from A, B, C, and D, simple regression works great. However, if, on the other hand, I want to predict Y from A, but A predicts B, B predicts C, C predicts D, and A also predicts D, things get, shall we say, very messy.
In the context of my theory-based model above, it should be apparent that it's an example of the latter, messy case, rather than the former neat one. It doesn't say that more wins is predicted separately by picking good 1st-round OLs, improvements in blocking, and improvements in OFF. Rather, it says that picking good 1st-round OLs predicts improvements in blocking, which then predicts improvements in OFF, which then predicts more wins. So, clearly, simple linear regression isn't a viable tool here.
Luckily, statistical innovation over the past 30 years or so has produced a type of analysis that handles this situation swimmingly. It's called structural equation modelling (SEM) or path analysis. Without getting bogged down in details (see here for said details), its main advantage is to test theories of complex phenomena usually involving multiple chains of simultaneous predictions. In essence, rather than testing one prediction equation (as in simple regression), SEM tests multiple prediction equations simultaneously. Indeed, in the field of economics -- where this kind of thing has been used for decades -- SEM is referred to as simultaneous equations modelling.
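To make the "chain of predictions" idea concrete, here is a rough sketch of the simplest form of path analysis. It is not the analysis I ran: real SEM software estimates all of the equations at once and reports how well the whole model fits, whereas this sketch just fits each arrow separately with ordinary least squares. The data are simulated, and the path strengths (0.5, 0.6, 0.7) and function names are arbitrary assumptions chosen for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a one-predictor least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate_chain(n=5000, seed=42):
    """Simulate made-up data for the chain talent -> blocking -> offense -> wins."""
    rng = random.Random(seed)
    talent = [rng.gauss(0, 1) for _ in range(n)]
    # Assumed path strengths (0.5, 0.6, 0.7) are arbitrary illustration values.
    blocking = [0.5 * t + rng.gauss(0, 1) for t in talent]
    offense = [0.6 * b + rng.gauss(0, 1) for b in blocking]
    wins = [0.7 * o + rng.gauss(0, 1) for o in offense]
    return talent, blocking, offense, wins


talent, blocking, offense, wins = simulate_chain()

# One regression per arrow in the chain.
path_talent_blocking = ols_slope(talent, blocking)
path_blocking_offense = ols_slope(blocking, offense)
path_offense_wins = ols_slope(offense, wins)

# Path-analysis rule: an indirect effect is the product of the paths along the chain.
indirect_effect = path_talent_blocking * path_blocking_offense * path_offense_wins
# For comparison: the total effect from regressing wins directly on talent.
total_effect = ols_slope(talent, wins)

print(f"talent -> blocking: {path_talent_blocking:.2f}")
print(f"blocking -> offense: {path_blocking_offense:.2f}")
print(f"offense -> wins: {path_offense_wins:.2f}")
print(f"indirect effect of talent on wins: {indirect_effect:.2f}")
print(f"total effect of talent on wins: {total_effect:.2f}")
```

Because talent only reaches wins through the chain in this simulated data, the product of the three paths and the direct talent-to-wins slope should land close to each other. What the chain adds is the ability to see which link is carrying (or breaking) the effect.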
BACK TO THE REAL WORLD
So I hope I've been clear enough so far in explaining what I'm trying to do here. If not, here's the bullet-point version:
- It's not enough to just say that the addition of Davis and Iupati will mean more wins for the Niners in 2010. You have to go one step further and explain why. Because they're OLs, and the job of OLs is to run block and pass protect, the only logical answer to the "why?" question is that they're going to improve the Niners' blocking, thereby improving the offense; and that's why they're going to win more games. This was more than likely the theory that the 49ers' decision-makers were operating under when they chose Davis and Iupati in the first place.
- To test such a complex theory, reliance on one simple regression model that treats OL talent, blocking, and offensive efficiency as separate predictors of winning doesn't work because having good OL talent doesn't directly lead to more wins. Rather, it leads to more wins indirectly through better blocking, and thereby better offensive efficiency. That's a much more accurate description of the theory we're trying to test. Therefore, given a test of multiple indirect theory-based predictions, SEM, not simple regression, is the appropriate statistical technique to use.
Of course, another important aspect of model-building is what's called ecological validity, which basically means that the model needs to replicate the real world as much as possible. So, in addition to the model I illustrated earlier, we need to add several things that are operating in the real world of NFL football, but were absent earlier:
- As I mentioned in Part 1, the reality is that the earlier a team's pick in the 1st round, the better OL they're going to draft. In other words, pick number predicts 1st-round OL talent.
- Per NFL rules, the worse a team's record was the previous season, the higher their pick is in the following draft. So, previous year's win total predicts pick number.
- As of yet unmentioned, the reality of the NFL is that the more wins a team has one year, the fewer they have the next year. In other words, team wins exhibit regression to the mean from year to year -- pretty substantially so, actually. Therefore, previous year's win total predicts win change from one year to the next.
- Some OLs are better at pass protection, whereas others are better at run blocking. Similarly, some offenses are better at passing, whereas others are better at running. In other words, passing and running are different animals in the real NFL world. Therefore, we should treat them differently when testing the theory by testing separate models. In other words, we'll test one model where the theory is that drafting 1st-round OLs indirectly predicts winning through improvements in pass protection and pass offense; and another where drafting 1st-round OLs indirectly predicts winning through improvements in run blocking and run offense.
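The regression-to-the-mean point in the list above is easy to demonstrate with a small simulation. This is only a sketch under made-up assumptions: every team gets a stable "true" win probability between 0.3 and 0.7, plays two 16-game seasons, and nothing about the team changes between them. The names `simulate_two_seasons` and `mean` are just for illustration.

```python
import random


def simulate_two_seasons(n_teams=3000, games=16, seed=1):
    """Simulate two seasons for teams whose true strength never changes."""
    rng = random.Random(seed)
    prev_wins, win_change = [], []
    for _ in range(n_teams):
        # Assumption: each team has a stable true win probability between 0.3 and 0.7.
        strength = rng.uniform(0.3, 0.7)
        year1 = sum(rng.random() < strength for _ in range(games))
        year2 = sum(rng.random() < strength for _ in range(games))
        prev_wins.append(year1)
        win_change.append(year2 - year1)
    return prev_wins, win_change


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


prev_wins, win_change = simulate_two_seasons()
good_teams = [c for w, c in zip(prev_wins, win_change) if w >= 11]
bad_teams = [c for w, c in zip(prev_wins, win_change) if w <= 5]
print(f"Average win change after 11+ wins: {mean(good_teams):+.2f}")
print(f"Average win change after 5 or fewer wins: {mean(bad_teams):+.2f}")
```

Even with team quality held fixed, the simulated teams that won a lot tend to win fewer games the next year and the ones that lost a lot tend to win more, simply because part of any extreme record is luck. That is why previous year's win total belongs in the model as a direct predictor of win change.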
Add these 4 things to the earlier model, and we arrive at a pretty comprehensive theory of how drafting OLs in the 1st round impacts win change:
This is the passing model. In the running model, ALY Change replaces ASR Change and Run OFF DVOA Change replaces Pass OFF DVOA Change. However, both models are based on the general theory that having a bad record means getting a better draft pick, which means drafting a better OL, which means better blocking, which means better OFF, which means more wins; with more wins also being directly influenced by the previous year's bad record. I think that's a pretty comprehensive theory-based model that resembles the real world reasonably well.
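Written out as a set of equations, the passing model described above looks roughly like this. The coefficient names (a_1 through a_6) and error terms (e_1 through e_5) are my own labels for illustration, intercepts are left out for readability, and "OLTalent" stands in for whatever measure of OL quality is used:

```latex
\begin{align}
\text{Pick} &= a_1\,\text{PrevWins} + e_1 \\
\text{OLTalent} &= a_2\,\text{Pick} + e_2 \\
\Delta\text{ASR} &= a_3\,\text{OLTalent} + e_3 \\
\Delta\text{PassDVOA} &= a_4\,\Delta\text{ASR} + e_4 \\
\Delta\text{Wins} &= a_5\,\Delta\text{PassDVOA} + a_6\,\text{PrevWins} + e_5
\end{align}
```

In this notation, the indirect effect of OL talent on win change is the product a_3 * a_4 * a_5, while a_6 captures the direct, regression-to-the-mean effect of the previous year's record. The running model has the same form with ALY Change and Run OFF DVOA Change swapped in.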
DESTINATION ANSWERSWORLD
I'm not going to bore you with any more details of statistical methodology. If you want to know more about how I got from the models above to the answers below, feel free to ask in the comments section. The important things to know are that, in SEM, what we want to find are (a) that the model fits the data well; and, if so, then (b) that the predictions in the model are statistically significant (i.e., not due to random chance). If the model fits the data well, it means the stats support the theory illustrated in the model. If it doesn't fit well, then the theory's not supported. If the theory's not supported, then we stop there. However, if the theory is supported, then we see whether the predictions in our theory held up against statistical scrutiny.
In total, I tested 6 models using data for teams who drafted 1st-round OLs who were full-time starters in a given year:
- Passing model for full-time starters in Year 1
- Running model for full-time starters in Year 1
- Passing model for full-time starters in Year 2
- Running model for full-time starters in Year 2
- Passing model for full-time starters in Year 3
- Running model for full-time starters in Year 3
First, here's what I found for the indirect impact of drafting 1st-round OLs on Year 1 win change:
- The passing model does not fit the data well, so our/the 49ers' front office's theory was not supported.
- The running model fit the data reasonably well, so the theory was somewhat supported. However, the only prediction that was not statistically significant was that drafting better 1st-round OLs leads to better run blocking.
And here's what I found for Year 2:
- The passing model fits the data well, so the theory was supported. However, the only prediction that was not statistically significant was that drafting better 1st-round OLs predicts better pass protection.
- The running model does not fit the data well, so the theory was not supported.
Finally, here's what I found for Year 3:
- The passing model fits the data well, so the theory was supported. However, the only prediction that was not statistically significant was that drafting better 1st-round OLs predicts better pass protection.
- The running model fits the data well, so the theory was supported. All predictions in the model were statistically significant.
So, to put this all together, here's how I'd summarize my findings. First, in 4 out of the 6 models, there was at least some support for the general theory that drafting a 1st-round OL indirectly increases wins, with more support in Years 2 and 3 than in Year 1. However, only win improvements in Year 3 were predicted by specifically drafting better OLs, and only with respect to run blocking. In other words, in all cases except running in Year 3, the long, winding road from drafting an OL in April to win totals in January stopped abruptly at "Better-OLs-Means-Better-Blocking Avenue." Or, going along with the theme of this post, "Drafting Davis and Iupati will make the 49ers win more in 2010," got lost on its way from Whatworld to Whyworld.
TWO DISCLAIMERS
Given that I used a statistical technique not many people know about, I've purposefully postponed details of both the methods and the actual statistical results for the comments section. However, 2 things I'll say up front are that (a) SEM usually requires larger sample sizes than the one I used here, and (b) the models I tested obviously did not exhaust the entire Whyworld of factors at work here. However, the fact that 4 of the 6 models fit well and almost all of the model predictions were statistically significant despite the small sample sizes is de facto evidence that the theory (and the analysis) was not wholesale bunk. As far as factors I didn't account for, we can discuss that in the comments section.
BOTTOM LINE
So, if you survived that, or if you've just skipped directly to this section, here are the takeaway messages of this post:
- There's a decent amount of support for the general theory that drafting an OL in the 1st round indirectly leads to more wins through improved blocking and OFF. However, there's a slight problem. Only for run blocking in Year 3 of the OL's career do improvements in blocking actually depend on the talent level of the OL himself.
- So, relating this back to the Niners...their fortunes from 2010-2011 will likely depend on their blocking and offense, but that's statistically likely whether Davis and Iupati end up being great OLs, good OLs, or just-plain OK OLs. In 2012, though, how good they are is going to matter statistically for the 49ers' run offense.
- I'm obviously a nerd with way too much time on my hands.