Sunday, February 9, 2014

Moneyball, Dimples Included - Part I

As a teenager in New Jersey with a, to be charitable, limited social life, I vividly remember reading this 1981 Daniel Okrent Sports Illustrated article about a young man applying new statistical techniques to the game of baseball. The article included this picture of a scruffy-looking guy (though I didn't remember the suit and tie) sitting on the field of Kansas City's stadium with a complex mathematical formula on the scoreboard behind him: 


Well, that guy turned out to be this guy and the rest, as they say, is history. I immediately wrote to him and became a subscriber to his Baseball Abstract, then published in samizdat form by his wife in their KC basement. He revolutionized the game of baseball by using the voluminous available data in innovative ways to puncture (and sometimes confirm) the conventional myths about the game. It should also be noted that he made the exercise fan-friendly by combining the analytics with lively, amusing writing. 

As always, the reader will wonder where I'm headed. A series of recent articles in the golfing world gives the sense that we're in the early stages of a similar revolution in the world of golf. Such a period requires access to a deep reservoir of data, but unlike in my 1981 baseball example, the existence of sufficient computing power is not an issue. 

To further set the table, let's briefly review where we are in our understanding of the greatest game (though baseball remains a close second). We've for some time had a range of statistics available to us, and we've seen some glacial improvements in these. The best example is putting, where for the longest time we were limited to two statistics, Putts per Green in Regulation ("GIR") and number of putts, that were so inherently flawed as to create uncertainty as to whether they actually measured putting prowess. Similar, I think, to baseball's reliance on runs batted in as a measure of productivity. 

Longtime golf writer David Barrett, author of Miracle at Merion amongst other titles, is our tour guide for this review of the statistics revolution in his January Golf Digest piece called Crunch Time. David's brief explanation of the source of the raw data is as follows: 

It all grew out of the PGA Tour realizing in the mid-1990s that its score-reporting system would need an upgrade for the 21st century. Hand-held digital devices rather than a pencil and paper were clearly the way to go for walking scorers to update leader boards more efficiently, and while they were at it someone had the forward-thinking idea of devising a system that could report the precise result of every shot hit on tour. It took years of development and a major financial investment to map each course and provide the resources for a laser measuring system at each tour event, but ultimately ShotLink was born, giving the tour an extensive database currently managed along with technology partner CDW. Suddenly, the tour had new stats like percentage of putts made from various distances, average distance from the hole on approach shots from various yardage ranges, tendencies to miss left or right with tee shots and many more. But how to make some sense out of all those numbers? The tour's idea was to utilize the brainpower of America's university system, spreading word that customized ShotLink data would be made available to researchers.
The decision to make the information available to outside researchers should be commended, and seems quite out of character for Commissioner Ratched Finchem, but the staff at Unplayable Lies strives to give credit where credit is due. The most active user of this data has been Columbia professor Mark Broadie, who developed the Strokes Gained Putting stat unveiled as recently as 2011. In a nutshell, Strokes Gained works as follows: 
For tour pros the expected score is determined by the tour average from a given position, based on reams of ShotLink data. An example would be a putt from just inside eight feet, which a pro is expected to make half the time. A make gives him .5 strokes gained, and a miss is -.5.
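For the arithmetically inclined, here's a minimal sketch of that calculation. To be clear, this is my own illustration and not Broadie's code: the function name is made up, and I'm assuming a pro never three-putts from eight feet, so a 50 percent make rate works out to 1.5 expected putts from that spot.

```python
def strokes_gained(expected_before: float, expected_after: float,
                   strokes_taken: int = 1) -> float:
    """Strokes gained on one shot: the drop in expected strokes needed
    to hole out, minus the strokes actually used."""
    return expected_before - expected_after - strokes_taken


# Assumption: the eight-footer is holed half the time and never three-putted,
# so the expected number of putts from there is 0.5 * 1 + 0.5 * 2 = 1.5.
EXPECTED_FROM_8_FEET = 1.5
EXPECTED_HOLED = 0.0   # ball is in the hole, nothing left to do
EXPECTED_TAP_IN = 1.0  # assumption: a miss leaves a putt that is always holed

made = strokes_gained(EXPECTED_FROM_8_FEET, EXPECTED_HOLED)
missed = strokes_gained(EXPECTED_FROM_8_FEET, EXPECTED_TAP_IN)

print(made, missed)  # 0.5 -0.5
```

The same subtraction applies to any shot, provided you have a tour-average expected score for both the starting and the ending position.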
ShotLink was the necessary ingredient, identifying the location of each and every shot in a PGA Tour event. But the Strokes Gained concept is not limited to putting and can be applied to any shot, although for anything other than tee shots a multiplicity of factors comes into play, including lies and obstacles. Back to Barrett:
Broadie has tackled this problem by determining zones on tour courses from which a recovery shot is required based on historical data, and in those cases penalizing the shot that put the player in that position. Sanders mines the ShotLink data for the PGA Tour pros he now works with and is able to break down missed fairways into the categories of good miss, poor miss and no shot, developing an algorithm to determine the worst outcome.
Broadie is constantly adjusting his metrics, as you'd expect early in such a process. But his early results are intriguing, and they will rock Drive for Show, Putt for Dough traditionalists back on their heels:
Many of the results show that conventional wisdom is not to be trusted. For example, breaking down the ShotLink numbers for the top 40 players from 2004 to 2012, Broadie shows that approach shots accounted for 40 percent of their scoring advantage, driving accounted for 28 percent, the short game (shots off the green and inside 100 yards) for 17 percent and putting for 15 percent.
On the subject of driving, Broadie's analysis shows that distance is far more important than accuracy, and Bubba Watson leads his Strokes Gained Driving metric:
That's the reason long hitters like Bubba Watson populate the top of the strokes gained/driving standings, though accuracy is important enough to hurt a very wild driver like distance-leader Luke List. A 20-yard advantage in driving distance leads to a fractional advantage on every stroke, and over the long run that adds up. Strokes gained/driving also reflects the advantage gained by being able to go for the green on reachable holes more often, an edge that isn't reflected in traditional stats like greens in regulation.
Players such as Luke Donald, anxious for any edge, are embracing the analytical techniques and hiring professionals to help them cut through the vast volumes of data. Most seem to view it as a means to help allocate practice time. One point that would interest me is whether the results of such analyses surprised Luke and Zach, or just confirmed their sense of their own strengths and weaknesses.

Much more after the jump.


Brandt Snedeker has also hired a specialist in this field, though he's loath to talk about it publicly. But he's confirmed that the data is being used to drive hole-specific strategy. One can readily imagine that these advisers will be consulting with their clients next week as to the optimal strategy to play Riviera's 10th hole, perhaps the most interesting reachable par 4 on the planet.

What does the future hold? Since the Tour's laser equipment is getting long in the tooth, we're likely to see a major upgrade in the near future, which Barrett speculates could include video capture. One of the major weaknesses of the current data capture effort is that it ignores all four of the majors, none of which is under the Tour's control. The Masters would seem like the most logical place to start, since it's at the same venue each year, but the folks at ANGC march to their own drummer. It wasn't until recent years that they even let us see the front nine, and there's still no blimp aerial coverage (think Bubba's gap wedge would have looked cool from the blimp?) or on-course reporters, so don't hold your breath. 

ShotLink data has also been used by the USGA to analyze issues such as distance gains, including in combination with changes in course set-up. It's also been used extensively in the renovation of Tour golf courses, giving the architect a vast amount of real-world data to use in placing hazards. I remember that at TPC Sawgrass our caddie told us that Pete Dye used ShotLink data to decide where to put trees to the right of the 18th fairway. He analyzed a ShotLink-generated scatter diagram of birdies made from this bailout area, and every blue dot got a seedling.

There's a wealth of interesting anecdotes and insights that make Barrett's long piece well worth the time. Golf Digest also helpfully provides links to related pieces on this subject, including this earlier piece by Barrett on what the numbers say about Tiger and Rory (but a warning, it's a January 2013 piece that analyzes their 2012 season). But take a look at these 2013 Strokes Gained statistics as calculated by Mark Broadie: 


I find this listing incredibly interesting, and each time I go back to it something new jumps out. In general it seems to confirm my previous understanding of a given player's abilities (though not necessarily the magnitude), but there are always surprises (Sergio & Phil's putting prowess, as a for instance). I'd love a look at the full data set, as I'm dying to see how bad a putter Justin Rose is. Seriously, 11 strokes better than the field tee-to-green over 72 holes, how does he not win more often? As Bill James would tell you, there are assumptions built upon assumptions in creating a statistical measure of performance, and all such efforts need to be subjected to extensive real-world testing. The good news is that many others are using this data and challenging Broadie's assumptions, including Broadie himself.
