Tuesday, July 10, 2012

Introducing eWAR: Empirical (offensive) Wins Above Replacement

This is continuing a series of posts about a baseball game simulator I made.  For the introductory post, look here.  For a revised look at the 2012 SF Giants lineup, look here, and for a description of how the simulator works, look here.  Also, I've grown tired of referring to the simulator as such, so from here on out I'll refer to it as Basim (short for baseball simulator).


The short version of this article is that I've constructed a new stat, eRAA (empirical runs above average), that correlates more strongly with the number of runs a team scores than traditional RAA does, even accounting for ballpark corrections.  This leads me to believe that eRAA may be a "better" stat than RAA, in that it better measures how good a player is.  That in turn leads to the definition of eWAR, empirical (offensive) wins above replacement, as 0.1*eRAA (using the canonical value of 10 runs/win).

I originally built Basim, a Python script that simulates baseball games for a given lineup, with the intention of seeing which permutation of nine players produces the best results.  It soon occurred to me, though, that there was something else I could use it for: evaluating players.  The idea was simple: put a given player in a lineup with eight average players, run hundreds of thousands of simulations of that lineup, and record how many runs above (or below) average it scores per game.  Multiply that by the amount the player actually plays, and you get eRAA--a measure of how good a player is, relative to average.
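That idea can be sketched with a toy version of the simulation.  To be clear, this is an illustration, not Basim itself: the outcome probabilities are made-up rates (not 2011 league averages), and "every runner advances exactly as many bases as the batter" is a deliberate simplification of real baserunning.

```python
import random

# Toy sketch of the evaluation idea -- NOT Basim itself.  Each plate
# appearance is an independent draw from the batter's outcome
# probabilities, and every runner advances exactly as many bases as the
# batter gains.  The rates below are illustrative assumptions.
AVG = {"walk": 0.090, "single": 0.155, "double": 0.045,
       "triple": 0.005, "hr": 0.028}  # remaining probability = out

def plate_appearance(batter):
    """Return the number of bases the batter gains (0 = out)."""
    r = random.random()
    for bases, outcome in ((1, "walk"), (1, "single"), (2, "double"),
                           (3, "triple"), (4, "hr")):
        if r < batter[outcome]:
            return bases
        r -= batter[outcome]
    return 0

def simulate_game(lineup, innings=9):
    """Simulate one game for a nine-player lineup; return runs scored."""
    runs, spot = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            gained = plate_appearance(lineup[spot % 9])
            spot += 1
            if gained == 0:
                outs += 1
                continue
            advanced = [False, False, False]
            for i, occupied in enumerate(bases):
                if occupied:
                    if i + gained >= 3:    # runner reaches home
                        runs += 1
                    else:
                        advanced[i + gained] = True
            if gained >= 4:                # home run: batter scores too
                runs += 1
            else:
                advanced[gained - 1] = True
            bases = advanced
    return runs

def runs_per_game(lineup, n=20000):
    """Average runs over n simulated games."""
    return sum(simulate_game(lineup) for _ in range(n)) / n

# Baseline: nine identical average hitters...
base = runs_per_game([AVG] * 9)
# ...versus the same lineup with a (hypothetical) better hitter fifth.
star = dict(AVG, single=0.190, hr=0.050)
boosted = runs_per_game([AVG] * 4 + [star] + [AVG] * 4)
```

The gap between `boosted` and `base`, scaled by playing time, is exactly the eRAA idea; Basim layers real baserunning, double plays, steals, and so on on top of it.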

The way to test whether eRAA is a "better" or "worse" stat than RAA, the basis for WAR, is to see which, when aggregated for a team, better predicts the number of runs that the team scores.  I decided to run the test on the 2011 baseball season.  My procedure was pretty simple:

1) I took aggregate batting stats for 2011 to find the "average" player, avePlayer*.

2) I ran 10,000,000 simulations of a lineup of nine copies of that player to find the baseline runs per game (brpg); I found that, for 2011, brpg = 4.1267312.

3) I also recorded the number of plate appearances per game that the test player (batting fifth) would have in the baseline simulation; I got paPerGame = 4.2913713.

4) For each of the ~1,500 2011 major league baseball players, I ran 100,000 Basim simulations with them batting fifth and every other player being avePlayer.  I recorded the number of runs scored per game on average by that lineup, playerRPG.  I then computed eRAA = (playerRPG-brpg)*playerPA/paPerGame, the number of runs above the average player that they produced during the season.

5) For each team, I totaled up the eRAA of every hitter on the roster to get the team's eRAA.  I added that to the average number of runs scored by a team, aveTeamRuns = 693.36, to get eRuns, the number of runs my model would predict them to score in a season.

6) For each team, I also added up Rbat, Rbaser, and Rdp, the three offensive stats contributing to RAA.  All statistics were taken from baseball-reference.com.  However, I believe that Rbat attempts to correct for the park the hitter plays in.  In order to get an apples-to-apples comparison, I reversed that by multiplying the predicted runs (total, not above average) by the team's ballpark adjustment factor.  (It's possible I messed this step up; my understanding is that this factor should be applied multiplicatively to a player's runs created.)  This got me the version of RAA I tested eRuns against; I'll call it RAAproduced.

7) I also recorded each team's total runs scored above average that season, runsScoredAboveAverage, and looked for correlations among that number, RAAproduced, and eRAA**.
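Steps 2-5 boil down to a couple of one-liners.  Here's a sketch using the constants above (the function and variable names are mine, not Basim's):

```python
# Constants from steps 2, 3, and 5 above
BRPG = 4.1267312          # baseline runs per game (all-average lineup)
PA_PER_GAME = 4.2913713   # PAs per game for the 5th lineup spot
AVE_TEAM_RUNS = 693.36    # average team runs scored, 2011

def eRAA(player_rpg, player_pa):
    """Marginal runs per game from swapping the player into an average
    lineup, scaled by how many plate appearances he actually took."""
    return (player_rpg - BRPG) * player_pa / PA_PER_GAME

def eRuns(team_eraa_total):
    """Predicted team runs: league average plus the summed eRAA."""
    return AVE_TEAM_RUNS + team_eraa_total
```

For example, a hitter whose all-else-average lineup scores 4.5 runs per game over 650 PA gets eRAA(4.5, 650) ≈ 56.5 runs above average, or about 5.7 eWAA.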

The results: RAAproduced, the classic building block of WAR, had a correlation coefficient of 0.9459 with runsScoredAboveAverage, missing by an average absolute error of 28.6 runs per team.  eRAA had a correlation coefficient with runsScoredAboveAverage of 0.9592, missing by an average of 19.3 runs per team.

So, while both RAAproduced and eRAA correlated very well with the number of runs a team scored, eRAA correlated slightly more strongly, leading me to believe that eRAA better predicts a player's value to a team than RAA does.

In order to see what eRAA said about individual players, I found the 2011 hitters with the highest eWAA, empirical offensive wins above average (eRAA/10):

Jose Bautista
Matt Kemp
Miguel Cabrera
Ryan Braun
Prince Fielder
Jacoby Ellsbury
Lance Berkman
Joey Votto
Adrian Gonzalez
Justin Upton
Curtis Granderson
Mike Napoli
Jose Reyes
Albert Pujols
Troy Tulowitzki

There are of course a number of claims in this article with which one could take issue.  Not everything is quite an apples-to-apples comparison; I tried to turn WAR into the form most apt to compare to eWAR, but WAR still attempts to correct for managerial decisions (e.g. intentional walks) in a way mine doesn't.  WAR has also already developed adjustments for ballpark, position, and fielding that eWAR hasn't.  I didn't tweak my formula at all after looking at the 2011 data set, but I should still run it on a truly out-of-sample year; I'll run it on 2010 and see how it performs there.  Also, there is still a lot of work to be done on Basim.  In particular, it makes arbitrary decisions with runners on first and other bases at the same time, and doesn't have much granularity on taking extra bases.  It also doesn't handle pitchers particularly well, and arbitrarily bats the test player 5th in the lineup.  But those are all the more reason to believe that with more tuning eWAR could be a complement to, or even substitute for, oWAR.

*I summed up the stats of every player in the league in 2011 to create avePlayer:
BB + IBB + HBP: 15018 + 1554 + 1231
Hits: 42267
Doubles: 8399
Triples: 898
Home Runs: 4552
Stolen Bases: 3279
Caught Stealing: 1261
Strikeouts: 34488
Ground into Double Play Rate: .1
Stolen Base Opportunities: 67623
Reached on Error: 1816
Extra Bases Taken Percentage: .41

**For reference, here is the list of (eRAA, RAAproduced, runsScoredAboveAverage) triples for each team, in alphabetical order of 2011 team abbreviation (rounded to one decimal): [(45.2, 21.8, 37.3), (-44.8, -92.3, -51.8), (-5.1, -49.8, 14.6), (179.7, 160.3, 181.4), (-38.2, -60.4, -38.9), (-51.7, -91.1, -38.9), (45.2, 42.1, 42.1), (-37.4, -46.9, 11.3), (54.8, 20.5, 42.1), (75.6, 81.6, 94.0), (-40.9, -62.4, -68.0), (-110.4, -66.7, -77.8), (20.9, 25.1, 37.3), (-3.4, -30.9, -25.9), (-18.5, -57.0, -45.4), (58.7, 49.4, 27.5), (-96.6, -132.5, -74.5), (24.5, 21.4, 24.3), (118.5, 112.7, 173.3), (-103.1, -101.3, -48.6), (27.9, 26.3, 19.4), (-125.3, -118.0, -82.6), (-122.6, -137.3, -100.4), (-150.5, -156.3, -137.7), (-103.6, -132.3, -123.1), (79.9, 22.3, 68.0), (30.4, 0.7, 13.0), (175.3, 236.0, 162.0), (41.7, -8.7, 50.2), (-64.0, -77.0, -64.8)]
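Those triples are enough to check the quoted numbers directly.  A sketch (values rounded to one decimal, which barely moves the statistics):

```python
import math

# (eRAA, RAAproduced, runsScoredAboveAverage) per 2011 team, rounded
TEAMS = [
    (45.2, 21.8, 37.3), (-44.8, -92.3, -51.8), (-5.1, -49.8, 14.6),
    (179.7, 160.3, 181.4), (-38.2, -60.4, -38.9), (-51.7, -91.1, -38.9),
    (45.2, 42.1, 42.1), (-37.4, -46.9, 11.3), (54.8, 20.5, 42.1),
    (75.6, 81.6, 94.0), (-40.9, -62.4, -68.0), (-110.4, -66.7, -77.8),
    (20.9, 25.1, 37.3), (-3.4, -30.9, -25.9), (-18.5, -57.0, -45.4),
    (58.7, 49.4, 27.5), (-96.6, -132.5, -74.5), (24.5, 21.4, 24.3),
    (118.5, 112.7, 173.3), (-103.1, -101.3, -48.6), (27.9, 26.3, 19.4),
    (-125.3, -118.0, -82.6), (-122.6, -137.3, -100.4),
    (-150.5, -156.3, -137.7), (-103.6, -132.3, -123.1),
    (79.9, 22.3, 68.0), (30.4, 0.7, 13.0), (175.3, 236.0, 162.0),
    (41.7, -8.7, 50.2), (-64.0, -77.0, -64.8),
]

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_miss(preds, actual):
    """Average absolute error, in runs per team."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

eraa   = [t[0] for t in TEAMS]
raa    = [t[1] for t in TEAMS]
actual = [t[2] for t in TEAMS]

r_raa, r_eraa = pearson(raa, actual), pearson(eraa, actual)  # ~0.946, ~0.959
m_raa, m_eraa = mean_abs_miss(raa, actual), mean_abs_miss(eraa, actual)  # ~28.6, ~19.3
```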


  1. Hey Sam,
    I don't think your argument is as strong as you think and here's why. Your correlation coefficient is itself a statistical measure based on 30 teams (30 data points) and has its own uncertainty. Using this calculator http://vassarstats.net/rho.html

    Your 95% confidence intervals are
    RAAproduced correlation is in [0.889,0.974]
    eRAA correlation is in [0.916, 0.98]

    As you can see, these confidence intervals have a huge overlap so you don't have enough data to conclude that your metric is better.
    --Matt Houston
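    For reference, intervals like these come from the Fisher z-transform, the standard method for correlation confidence intervals (and presumably what the linked calculator uses).  A sketch:

    ```python
    import math

    def pearson_ci(r, n, z_crit=1.96):
        """Approximate 95% confidence interval for a correlation
        coefficient via the Fisher z-transform."""
        z = math.atanh(r)                 # transform to ~normal scale
        half = z_crit / math.sqrt(n - 3)  # standard error is 1/sqrt(n-3)
        return math.tanh(z - half), math.tanh(z + half)

    raa_ci  = pearson_ci(0.9459, 30)   # ~(0.888, 0.974)
    eraa_ci = pearson_ci(0.9592, 30)   # ~(0.915, 0.981)
    ```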

  2. Oh, I should also add that I like the blog a lot.

  3. (Thanks!) Yeah, I'm going to run it on more data points soon; the problem is just the time it takes to simulate...

  4. Fantastic read. I'm a big fan of where this is going.