While it’s a little late, here’s part two of a breakdown of how predictive combine performances are of success in the NFL, from our friends at the Sports Analysis Collective.
Yesterday I made a bunch of claims about the NFL Combine. Now it’s time to back them up. For quantifying performance, I explained why I will be using the 3 Year Approximate Value (3YAV). Just think of this as a crude measure of how good a player is in his first three years in the league. So for each player1, we have 8 combine numbers (height, weight, 40 yard dash, bench press, shuttle, broad jump, vertical leap, 3 cone drill), his position according to the combine, and his 3YAV.
Our goal, then, is to predict 3YAV from the combine data. Before I get into the technical details, here’s a teaser that summarizes many of the results. The next section explains the calculations behind it. If you’re not as interested in the math, you’re missing out, but just pay most attention to the bolded stuff.
We proceed by building a model for each position that tries to best predict 3YAV using the combine. But there are a couple of problems with trying to estimate the 3YAV directly. First, it has a very skewed distribution. There are a lot of players with a low 3YAV and then a few outliers (all-pros) with very high 3YAV. A standard regression on this data would be heavily influenced by the outliers, leading to uncertain estimates of the importance of the combine drills as well as poor predictive power for future players. Second, predicting the precise 3YAV might not be what we’re looking for anyway. Going into a draft, you’re often more concerned with the ranks of players. You want to be able to say who is the best, whether this guy is better than that guy, etc. For these reasons, I will try to predict a player’s percentile in terms of 3YAV. A player’s percentile is easy to calculate: it’s just the percentage of players at his position with a lower 3YAV.
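To make the percentile definition concrete, here is a minimal sketch in Python. The function name and the toy rows are made up for illustration, and counting each player himself in the denominator is an assumption, since the text doesn't say either way.

```python
from collections import defaultdict

def av_percentiles(players):
    """Map each player name to the share (0-100) of players at the same
    position with a strictly lower 3YAV."""
    by_position = defaultdict(list)
    for _name, position, av in players:
        by_position[position].append(av)

    result = {}
    for name, position, av in players:
        peers = by_position[position]
        lower = sum(1 for other in peers if other < av)
        # Assumption: the denominator includes the player himself.
        result[name] = 100.0 * lower / len(peers)
    return result

# Made-up (name, position, 3YAV) rows, for illustration only.
sample = [
    ("A", "C", 0), ("B", "C", 0), ("C", "C", 5), ("D", "C", 20),
    ("E", "WR", 3), ("F", "WR", 12),
]
percentiles = av_percentiles(sample)
print(percentiles)
```

Notice that the two centers with a 3YAV of 0 both land at the 0th percentile, and the all-pro outlier can never be more than one rank step above the next guy, which is exactly why the percentile tames the skew.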
Alright, so for each position, we are going to build a model to predict 3YAV percentile. Given the type of problem and the amount/quality of the data, a linear regression is a solid choice. A linear model consists of a set of coefficients, one for each of the input variables (forty, bench, etc.), and the final prediction is a linear sum of the input variables times their coefficients. Mathematically, the model looks like this:
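In generic notation (the symbols are my own choice for illustration, and the intercept term is an assumption), a linear model of the kind described above can be written as:

```latex
\hat{p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
```

Here \hat{p} is the predicted 3YAV percentile, x_j is the j-th combine measurement, \beta_j is its coefficient, \beta_0 is an intercept, and k is the number of measurements used for that position (at most the 8 listed earlier).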
The key point is that each combine measurement will have a number associated with it that estimates how successful you will be based on that measurement.
Instead of doing an ordinary least-squares regression, I am going to do a ridge regression (see the “Math Stuff” section for more details). In a high-dimensional problem with limited data, regularization is absolutely necessary. If that didn’t mean anything to you, don’t worry about it. Now that we have chosen our inputs, outputs, and model, we can go ahead and calculate everything.
The estimated coefficients for each of the different positions are shown below. An asterisk above a coefficient indicates that it is statistically significant, i.e. unlikely to be explained by chance alone. For the graphs, the vertical axis is in terms of 3YAV percentile. The coefficients are in terms of standardized (z-scored) inputs. It’s easiest to explain what this means with an example. In the top graph (for centers), you see that the coefficient for weight is about 4.4. This means that if you are a center and your weight is one standard deviation above average, you can expect to be 4.4 percentile points higher in 3YAV. As a reference, the average weight for C’s at the combine in the model is 301 lbs and the standard deviation is 8 lbs (there is a table with all these numbers at the end). When you see a negative coefficient, it simply means a decrease in that measurement is associated with more success (e.g. decreasing your forty time). Each coefficient has an error bar which shows the uncertainty in its estimate.
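Here is the center-weight example as a tiny calculation. The function name is made up; the numbers (mean 301 lbs, standard deviation 8 lbs, coefficient of about 4.4) are the ones quoted above, and the result only describes the contribution of this one input with everything else held fixed.

```python
def percentile_shift(value, mean, std, coefficient):
    """Expected change in 3YAV percentile contributed by one z-scored input."""
    z = (value - mean) / std  # how many standard deviations from average
    return coefficient * z

# A 309 lb center is one SD above the 301 lb average.
shift_heavy = percentile_shift(309, mean=301, std=8, coefficient=4.4)
# A 297 lb center is half an SD below average.
shift_light = percentile_shift(297, mean=301, std=8, coefficient=4.4)

print(shift_heavy, shift_light)
```

The first call gives a bump of about 4.4 percentile points, and the second a drop of about 2.2.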
- Weight is a significant factor for all linemen. Height, on the other hand, is only statistically significant for C’s, and even there it is only about half as important as weight.
- Bench is only significant for guards. Interestingly, the coefficient for bench actually came out negative for centers, although it’s not significant and likely reflects noise/codependence with other factors.
- Forty is a significant and important factor for OG’s and OT’s, but not C’s.
Offensive Skill Positions
Note: not many QB’s participate in the bench, so it was left out of calculations.
- None of the coefficients came out as significant for WR’s. In fact, WR is the only position for which the model can’t significantly predict success. This is somewhat surprising. Maybe route-running abilities really are more important than raw athleticism. Also, there are different types of receivers in the NFL, ranging from smaller, more agile slot receivers to big, strong possession guys. So maybe there just isn’t a simple linear model that accounts for this variability in predicting success.
- Height is not more important than weight for QB’s (at least within the first three years in the league). The forty, shuttle, and vertical leap are all statistically significant. It is interesting that athleticism makes that much of a difference. But again, this may partly reflect that we’re looking at the first three years. Athletic QB’s might be better able to make up for their rookie mistakes and are better suited to be successful on day one.
Defensive Linemen and Linebackers
- Weight and the forty are again strong factors.
- The bench is only significant for DT’s.
- The forty is very significant for CB’s, as you would expect. But what is interesting is that CB is one of the few positions where the bench is actually significant.
- It’s difficult to predict the success of FS’s from the combine alone. That said, the coefficient for vertical leap is larger for FS’s than for any other position. Gotta get those jump balls.
Phew…that was a little bit of graph overload, so let’s summarize it all in one plot. In the graph below, the vertical axis corresponds to position and the horizontal axis corresponds to combine measures. The color illustrates how relatively important a measurement is for predicting success at a particular position. “Importance” here is quantified as the absolute value of the regression coefficient.2
Vertical patterns indicate the importance of a measurement across positions. For instance, the forty column is very bright, meaning that the forty is important for many positions. Horizontal patterns indicate the importance of the combine overall for a given position. The WR row is very dark, showing that performance in the combine doesn’t seem to translate to success on the field. We can quantify these notions by summing up over the columns or rows and comparing different factors/positions. These plots are shown below.3
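To show what summing over rows and columns means here, a small sketch with a made-up coefficient table. The values below are hypothetical, not the fitted coefficients, and any extra normalization behind the actual plots is not reproduced.

```python
# Hypothetical standardized coefficients (percentile points per SD).
coefs = {
    "C":  {"weight": 4.4, "forty": -1.0, "bench": -0.5},
    "OG": {"weight": 3.0, "forty": -3.5, "bench": 2.0},
    "WR": {"weight": 0.5, "forty": -1.0, "bench": 0.2},
}
measures = ["weight", "forty", "bench"]

# Importance = absolute value of the coefficient, as defined in the text.
importance = {
    pos: {m: abs(row[m]) for m in measures} for pos, row in coefs.items()
}

# Row sums: how much the combine as a whole matters for each position.
by_position = {pos: sum(row.values()) for pos, row in importance.items()}

# Column sums: how much each measurement matters across positions.
by_measure = {m: sum(importance[pos][m] for pos in importance) for m in measures}

print(by_position)
print(by_measure)
```

In this toy table the WR row sums to the smallest value (a dark row) and weight has the largest column sum (a bright column).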
We see that the forty, weight, and three cone have the highest overall importance across positions, while bench has the lowest. Note, however, that the error estimates for all factors are still relatively large because the data are noisy and limited.
Besides WR and FS, the combine has fairly constant importance across positions (relative to the amount of noise in the estimates). The differences are slight, but the positions we’d expect to see at the top are indeed there, with DE, OLB, and CB having the highest estimates of importance.
As another summary, here is a graph showing which variables are statistically significant for which positions.
Going beyond statistical significance and relative importance, we actually want to know if the combine makes a real difference. A coefficient could be statistically significant but, in the real world, not big enough to matter. So we need to look at the magnitudes of the coefficients. The plots above show that the most important variables have coefficients in the range of 3-5 percentile points per standard deviation. That might seem small, but in perspective it’s really not, and remember, that’s for only one drill. If you get faster, chances are you’ll improve in your forty, shuttle, three cone, etc. The total sum of coefficients per position is in the range of 13-17 percentile points for most positions. What can a jump of 13 percentile points get you? Well, out of the players in the study, about 1/3 had a 3YAV of 0, meaning that they didn’t make it at all. The next third or so had 3YAV’s that are consistent with backup players. The top third could be thought of as becoming starters, with the top 7% reaching the Pro Bowl within their first three years. With perhaps a slight abuse of correlation vs. causation, if you can increase your abilities such that you go from an average combine performer to a pretty good one, your expected success would bump up about half a level. If you’re projected to be an average backup, you could become a starter. If you’re a starter, the bump might take you to Pro Bowl level. Think about how much that could mean individually, as well as from a team perspective, in everything from money to actual wins and losses.
So far I have talked about the model itself, but how accurate is it in actually making predictions? First, we need to define accuracy. Typically, accuracy is measured via r² (% of explained variance) or Pearson’s r (correlation coefficient). But since we care more about the rank of players, we should evaluate our accuracy the same way. Spearman’s rank correlation, denoted as rs, is a way to measure this. Spearman’s rank correlation is the normal Pearson’s correlation coefficient between the actual rankings and predicted rankings of the individuals. Its value is somewhere between -1 and 1. If you’re used to dealing with the usual Pearson r, think of its value the same way. And if you’re not used to dealing with either, I’ll make it concrete in a second.
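The definition above translates almost word for word into code: rank both lists, then take the ordinary Pearson correlation of the ranks. This is a minimal pure-Python sketch with made-up numbers. Giving tied values the average of their ranks is a common convention that I am assuming here; it matters because many players share a 3YAV of 0.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(actual, predicted):
    """Spearman's rs: Pearson correlation between the two sets of ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))

# Made-up actual 3YAVs and model predictions for five players.
actual = [0, 0, 5, 12, 30]
predicted = [10.0, 35.0, 40.0, 55.0, 90.0]
rs = spearman(actual, predicted)
print(rs)
```

Because only the ordering matters, any prediction that puts players in exactly the right order scores 1, no matter how far off the predicted values themselves are.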
Below is a graph showing the rs for each position. The values are cross-validated, meaning different data were used to fit the model than to evaluate its accuracy. Cross-validation should always be done; otherwise you’re cheating. The positions for which the model best predicts performance are DE, CB, TE, and OLB (although, again, the error bars are large). As mentioned before, FS and WR success is hard to predict from the combine. For WR, the rs actually came out slightly negative, which can happen because of the cross-validation.
These values are still abstract. We need something to compare against. What can we use as an ‘expert’ model of predicting success? How about the actual draft itself? A team will draft a player over others because they think he will give them more value (i.e. 3YAV) than the other players left on the board (at least in the same position). Using draft pick as the ‘expert’ measure isn’t perfect4, but it’s pretty darn good.
Below is a graph comparing the rank correlation (accuracy) in predicting 3YAV by the draft compared to the combine model. Overall, the model does pretty well, considering it only looks at 8 numbers from one week in a player’s life. The average accuracy as predicted by draft pick is around 0.7, whereas the model gives ~0.35 for positions in which it is significant. It’s far from perfect but getting a value of about half as much as the experts from the combine measurements alone is pretty impressive if you ask me.
If you made it this far, give yourself a pat on the back. There was a lot packed in there, but hopefully it was interesting. Here is a brief recap:
- A linear model on combine data can significantly predict future NFL success, except for WRs.
- The forty, weight, and 3 cone drill are the overall most important measurements, although there is variation across positions. The bench press is the least important.
- A decent improvement at the combine won’t take you from 3rd string to superstar, but it could take you from 2nd string to starter, starter to Pro Bowl, etc.
Cool stuff. Tomorrow I’ll fit a model to combine data again, except instead of predicting player success, I’ll try to predict where they will be taken in the draft. This will allow us to see what factors teams are implicitly valuing when making draft choices, to which we can compare today’s results.
Math Stuff
The linear model was fit using a regularization technique known as ridge regression. The cost function for ridge regression is the normal squared-loss plus an L2 penalty on the regression coefficients. Ridge regression has a hyperparameter relating to the strength of the L2 penalty. I chose the hyperparameter separately for each position using leave-one-out cross-validation (LOOCV). Training/testing was also done with LOOCV. Error estimates throughout were done using a non-parametric bootstrap.
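For the curious, here is a rough sketch of ridge regression with the penalty strength chosen by LOOCV. It is not the code behind the results: the data are synthetic, the function names and the penalty grid are made up, and leaving the intercept unpenalized and re-standardizing within each training fold are my assumptions. It also covers only the hyperparameter search, not the outer LOOCV used for testing or the bootstrap.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge on z-scored inputs; the intercept is not penalized."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z = (X - mu) / sd
    intercept = y.mean()
    # Minimizes ||y - intercept - Z b||^2 + lam * ||b||^2.
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]),
                           Z.T @ (y - intercept))
    return mu, sd, intercept, beta

def predict(model, X):
    mu, sd, intercept, beta = model
    return intercept + ((X - mu) / sd) @ beta

def loocv_error(X, y, lam):
    """Mean squared error when each player is predicted by a model fit
    to all the other players."""
    n = len(y)
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        model = fit_ridge(X[keep], y[keep], lam)
        errors.append((predict(model, X[i:i + 1])[0] - y[i]) ** 2)
    return float(np.mean(errors))

def choose_lambda(X, y, grid):
    """Pick the penalty strength with the lowest LOOCV error."""
    return min(grid, key=lambda lam: loocv_error(X, y, lam))

# Synthetic stand-in for one position: 40 players, 8 combine measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
true_beta = np.array([4.0, -3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = 50 + X @ true_beta + rng.normal(scale=10, size=40)

grid = [0.1, 1.0, 10.0, 100.0]  # hypothetical candidate penalties
best_lam = choose_lambda(X, y, grid)
model = fit_ridge(X, y, best_lam)
print(best_lam, model[3])
```

The larger the penalty, the more the coefficients are shrunk toward zero. That shrinkage is what keeps eight noisy inputs from overfitting a few dozen players.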
Means and Standard Deviations per Position