Tuesday, July 10, 2012

Introducing eWAR: Empirical (offensive) Wins Above Replacement

This is continuing a series of posts about a baseball game simulator I made.  For the introductory post, look here.  For a revised look at the 2012 SF Giants lineup, look here, and for a description of how the simulator works, look here.  Also, I've grown tired of referring to the simulator as such, so from here on out I'll refer to it as Basim (short for baseball simulator).


The short version of this article is that I've constructed a new stat, eRAA (empirical runs above average), that correlates more strongly with the number of runs a team scores than traditional RAA does, even accounting for ballpark corrections.  This leads me to believe that eRAA may be a "better" stat than RAA, in that it better measures how good a player is.  That in turn leads to the definition of eWAR, empirical (offensive) wins above replacement, as 0.1*eRAA (using the canonical value of 10 runs/win).

I originally built Basim, a Python script that simulates baseball games for a given lineup, with the intention of seeing which permutation of nine players produces the best results.  It soon occurred to me, though, that there was something else I could use it for: evaluating players.  The idea was simple: put a given player in a lineup with eight average players, run hundreds of thousands of simulations of that lineup, and record how many runs above (or below) average it scores per game.  Multiply that by the amount the player actually plays, and you get eRAA--a measure of how good a player is, relative to average.
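That idea can be sketched with a toy version of the simulation.  To be clear, this is an illustration, not Basim itself: the outcome probabilities are made-up rates (not 2011 league averages), and "every runner advances exactly as many bases as the batter" is a deliberate simplification of real baserunning.

```python
import random

# Toy sketch of the evaluation idea -- NOT Basim itself.  Each plate
# appearance is an independent draw from the batter's outcome
# probabilities, and every runner advances exactly as many bases as the
# batter gains.  The rates below are illustrative assumptions.
AVG = {"walk": 0.090, "single": 0.155, "double": 0.045,
       "triple": 0.005, "hr": 0.028}  # remaining probability = out

def plate_appearance(batter):
    """Return the number of bases the batter gains (0 = out)."""
    r = random.random()
    for bases, outcome in ((1, "walk"), (1, "single"), (2, "double"),
                           (3, "triple"), (4, "hr")):
        if r < batter[outcome]:
            return bases
        r -= batter[outcome]
    return 0

def simulate_game(lineup, innings=9):
    """Simulate one game for a nine-player lineup; return runs scored."""
    runs, spot = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            gained = plate_appearance(lineup[spot % 9])
            spot += 1
            if gained == 0:
                outs += 1
                continue
            advanced = [False, False, False]
            for i, occupied in enumerate(bases):
                if occupied:
                    if i + gained >= 3:    # runner reaches home
                        runs += 1
                    else:
                        advanced[i + gained] = True
            if gained >= 4:                # home run: batter scores too
                runs += 1
            else:
                advanced[gained - 1] = True
            bases = advanced
    return runs

def runs_per_game(lineup, n=20000):
    """Average runs over n simulated games."""
    return sum(simulate_game(lineup) for _ in range(n)) / n

# Baseline: nine identical average hitters...
base = runs_per_game([AVG] * 9)
# ...versus the same lineup with a (hypothetical) better hitter fifth.
star = dict(AVG, single=0.190, hr=0.050)
boosted = runs_per_game([AVG] * 4 + [star] + [AVG] * 4)
```

The gap between `boosted` and `base`, scaled by playing time, is exactly the eRAA idea; Basim layers real baserunning, double plays, steals, and so on on top of it.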

The way to test whether eRAA is a "better" or "worse" stat than RAA, the basis for WAR, is to see which, when aggregated for a team, better predicts the number of runs that the team scores.  I decided to run the test on the 2011 baseball season.  My procedure was pretty simple:

1) I took aggregate batting stats for 2011 to find the "average" player, avePlayer*.

2) I ran 10,000,000 simulations of a lineup of nine copies of that player to find the baseline runs per game (brpg); I found that, for 2011, brpg = 4.1267312.

3) I also recorded the number of plate appearances per game that the test player (batting fifth) would have in the baseline simulation; I got paPerGame = 4.2913713.

4) For each of the ~1,500 2011 major league baseball players, I ran 100,000 Basim simulations with them batting fifth and every other player being avePlayer.  I recorded the number of runs scored per game on average by that lineup, playerRPG.  I then computed eRAA = (playerRPG-brpg)*playerPA/paPerGame, the number of runs above the average player that they produced during the season.

5) For each team, I totaled up the eRAA of every hitter on the roster to get the team's eRAA.  I added that to the average number of runs scored by a team, aveTeamRuns = 693.36, to get eRuns, the number of runs my model would predict them to score in a season.

6) For each team, I also added up Rbat, Rbaser, and Rdp, the three offensive stats contributing to RAA.  All statistics were taken from baseball-reference.com.  However, I believe that Rbat attempts to correct for the park the hitter plays in.  In order to get an apples-to-apples comparison, I reversed that by multiplying the predicted runs (total, not above average) by the team's ballpark adjustment factor.  (It's possible I messed this step up; my understanding is that this factor should be applied multiplicatively to a player's runs created.)  This got me the version of RAA I tested eRuns against; I'll call it RAAproduced.

7) I also recorded each team's total runs scored above average that season, runsScoredAboveAverage, and looked for correlations among that number, RAAproduced, and eRAA**.
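Steps 2-5 boil down to a couple of one-liners.  Here's a sketch using the constants above (the function and variable names are mine, not Basim's):

```python
# Constants from steps 2, 3, and 5 above
BRPG = 4.1267312          # baseline runs per game (all-average lineup)
PA_PER_GAME = 4.2913713   # PAs per game for the 5th lineup spot
AVE_TEAM_RUNS = 693.36    # average team runs scored, 2011

def eRAA(player_rpg, player_pa):
    """Marginal runs per game from swapping the player into an average
    lineup, scaled by how many plate appearances he actually took."""
    return (player_rpg - BRPG) * player_pa / PA_PER_GAME

def eRuns(team_eraa_total):
    """Predicted team runs: league average plus the summed eRAA."""
    return AVE_TEAM_RUNS + team_eraa_total
```

For example, a hitter whose all-else-average lineup scores 4.5 runs per game over 650 PA gets eRAA(4.5, 650) ≈ 56.5 runs above average, or about 5.7 eWAA.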

The results: RAAproduced, the classic building block of WAR, had a correlation coefficient of 0.9459 with runsScoredAboveAverage, missing by an average absolute error of 28.6 runs per team.  eRAA had a correlation coefficient with runsScoredAboveAverage of 0.9592, missing by an average of 19.3 runs per team.

So, while both RAAproduced and eRAA correlated very well with the number of runs a team scored, eRAA correlated slightly more strongly, leading me to believe that eRAA better predicts a player's value to a team than RAA does.

In order to see what eRAA said about individual players, I found the 2011 hitters with the highest eWAA, empirical offensive wins above average (eRAA/10):

Jose Bautista
Matt Kemp
Miguel Cabrera
Ryan Braun
Prince Fielder
Jacoby Ellsbury
Lance Berkman
Joey Votto
Adrian Gonzalez
Justin Upton
Curtis Granderson
Mike Napoli
Jose Reyes
Albert Pujols
Troy Tulowitzki

There are of course a number of claims in this article with which one could take issue.  Not everything is quite an apples-to-apples comparison; I tried to turn WAR into the form most apt to compare to eWAR, but WAR still attempts to correct for managerial decisions (e.g. intentional walks) in a way mine doesn't.  WAR has also already developed adjustments for ballpark, position, and fielding that eWAR hasn't.  I didn't tweak my formula at all after looking at the 2011 data set, but I should still run it on a truly out-of-sample year; I'll run it on 2010 and see how it performs there.  Also, there is still a lot of work to be done on Basim.  In particular, it makes arbitrary decisions with runners on first and other bases at the same time, and doesn't have much granularity on taking extra bases.  It also doesn't handle pitchers particularly well, and arbitrarily bats the test player 5th in the lineup.  But those are all the more reason to believe that with more tuning eWAR could be a complement to, or even substitute for, oWAR.

*I summed up the stats of every player in the league in 2011 to create avePlayer:
BB + IBB + HBP: 15018 + 1554 + 1231
Hits: 42267
Doubles: 8399
Triples: 898
Home Runs: 4552
Stolen Bases: 3279
Caught Stealing: 1261
Strikeouts: 34488
Ground into Double Play Rate: .1
Stolen Base Opportunities: 67623
Reached on Error: 1816
Extra Bases Taken Percentage: .41

**For reference, here is the list of (eRAA, RAAproduced, runsScoredAboveAverage) triples for each team, in alphabetical order of 2011 team abbreviation (rounded to one decimal): [(45.2, 21.8, 37.3), (-44.8, -92.3, -51.8), (-5.1, -49.8, 14.6), (179.7, 160.3, 181.4), (-38.2, -60.4, -38.9), (-51.7, -91.1, -38.9), (45.2, 42.1, 42.1), (-37.4, -46.9, 11.3), (54.8, 20.5, 42.1), (75.6, 81.6, 94.0), (-40.9, -62.4, -68.0), (-110.4, -66.7, -77.8), (20.9, 25.1, 37.3), (-3.4, -30.9, -25.9), (-18.5, -57.0, -45.4), (58.7, 49.4, 27.5), (-96.6, -132.5, -74.5), (24.5, 21.4, 24.3), (118.5, 112.7, 173.3), (-103.1, -101.3, -48.6), (27.9, 26.3, 19.4), (-125.3, -118.0, -82.6), (-122.6, -137.3, -100.4), (-150.5, -156.3, -137.7), (-103.6, -132.3, -123.1), (79.9, 22.3, 68.0), (30.4, 0.7, 13.0), (175.3, 236.0, 162.0), (41.7, -8.7, 50.2), (-64.0, -77.0, -64.8)]
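Those triples are enough to check the quoted numbers directly.  A sketch (values rounded to one decimal, which barely moves the statistics):

```python
import math

# (eRAA, RAAproduced, runsScoredAboveAverage) per 2011 team, rounded
TEAMS = [
    (45.2, 21.8, 37.3), (-44.8, -92.3, -51.8), (-5.1, -49.8, 14.6),
    (179.7, 160.3, 181.4), (-38.2, -60.4, -38.9), (-51.7, -91.1, -38.9),
    (45.2, 42.1, 42.1), (-37.4, -46.9, 11.3), (54.8, 20.5, 42.1),
    (75.6, 81.6, 94.0), (-40.9, -62.4, -68.0), (-110.4, -66.7, -77.8),
    (20.9, 25.1, 37.3), (-3.4, -30.9, -25.9), (-18.5, -57.0, -45.4),
    (58.7, 49.4, 27.5), (-96.6, -132.5, -74.5), (24.5, 21.4, 24.3),
    (118.5, 112.7, 173.3), (-103.1, -101.3, -48.6), (27.9, 26.3, 19.4),
    (-125.3, -118.0, -82.6), (-122.6, -137.3, -100.4),
    (-150.5, -156.3, -137.7), (-103.6, -132.3, -123.1),
    (79.9, 22.3, 68.0), (30.4, 0.7, 13.0), (175.3, 236.0, 162.0),
    (41.7, -8.7, 50.2), (-64.0, -77.0, -64.8),
]

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_miss(preds, actual):
    """Average absolute error, in runs per team."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)

eraa   = [t[0] for t in TEAMS]
raa    = [t[1] for t in TEAMS]
actual = [t[2] for t in TEAMS]

r_raa, r_eraa = pearson(raa, actual), pearson(eraa, actual)  # ~0.946, ~0.959
m_raa, m_eraa = mean_abs_miss(raa, actual), mean_abs_miss(eraa, actual)  # ~28.6, ~19.3
```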


  1. Hey Sam,
    I don't think your argument is as strong as you think and here's why. Your correlation coefficient is itself a statistical measure based on 30 teams (30 data points) and has its own uncertainty. Using this calculator http://vassarstats.net/rho.html

    Your 95% confidence intervals are
    RAAproduced correlation is in [0.889,0.974]
    eRAA correlation is in [0.916, 0.98]

    As you can see, these confidence intervals have a huge overlap so you don't have enough data to conclude that your metric is better.
    --Matt Houston
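    For reference, intervals like these come from the Fisher z-transform, the standard method for correlation confidence intervals (and presumably what the linked calculator uses).  A sketch:

    ```python
    import math

    def pearson_ci(r, n, z_crit=1.96):
        """Approximate 95% confidence interval for a correlation
        coefficient via the Fisher z-transform."""
        z = math.atanh(r)                 # transform to ~normal scale
        half = z_crit / math.sqrt(n - 3)  # standard error is 1/sqrt(n-3)
        return math.tanh(z - half), math.tanh(z + half)

    raa_ci  = pearson_ci(0.9459, 30)   # ~(0.888, 0.974)
    eraa_ci = pearson_ci(0.9592, 30)   # ~(0.915, 0.981)
    ```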

  2. Oh, I should also add that I like the blog a lot.

  3. (Thanks!) Yeah, I'm going to run it on more data points soon; the problem is just the time it takes to simulate...

  4. Fantastic read. I'm a big fan of where this is going.