Monday, January 23, 2012

Is xFIP useful to predict future performance?

One of sabermetricians' favorite metrics is xFIP. By now, most of you should have at least a rough idea of what xFIP is, but if you don't, here is the definition:

Expected Fielding Independent Pitching (xFIP) is a regressed version of FIP, developed by Dave Studeman of The Hardball Times. It's calculated exactly the same as FIP, except it replaces a pitcher's home run rate with the league-average rate (10.6% HR/FB), since pitcher home run rates have been shown to be very unstable over time. A pitcher may allow home runs on 12% of his fly balls one year, yet turn around and allow them on only 7% the next year. Home run rates can be very difficult to predict, so xFIP attempts to correct for that. Along with FIP, xFIP is one of the best metrics at predicting a pitcher's future performance.
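To make the definition concrete, here is a rough sketch of the arithmetic. The 3.10 constant is an assumption (the real FIP constant is recalculated each season to put FIP on the ERA scale), and the stat line at the bottom is a hypothetical pitcher, not anyone from the charts in this post.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """FIP = (13*HR + 3*(BB+HBP) - 2*K) / IP + constant."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

def xfip(fb, bb, hbp, k, ip, lg_hr_fb=0.106, constant=3.10):
    """Same formula, but actual HR allowed is replaced by
    (fly balls) * (league-average HR/FB rate)."""
    expected_hr = fb * lg_hr_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant

# Hypothetical line: 150 fly balls, 20 HR, 50 BB, 5 HBP, 180 K in 200 IP.
# This pitcher gave up homers on more than 10.6% of his flies, so his
# xFIP comes out lower than his FIP.
print(fip(20, 50, 5, 180, 200))
print(xfip(150, 50, 5, 180, 200))
```

The only moving part is the home run term: everything a fielder can influence is stripped out, and xFIP additionally strips out home run luck.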

Sounds great! I'm always looking for an edge in my Rotisserie League. Sign me up!

Ten Most Favorable ERA/xFIP Differentials 2010 (min 100 IP)

[Table: pitcher, 2010 ERA, 2010 xFIP, 2011 ERA — data rows not preserved]

The chart above lists the 10 pitchers who pitched at least 100 innings in both 2010 and 2011 and had the most favorable xFIP differentials compared to their ERA. The expectation would be that – generally speaking – these pitchers’ ERAs would have almost all improved in 2011. At a glance, then, xFIP looks like a rousing success. Eight out of 10 of the pitchers on this list saw their ERAs improve; only Nolasco and Morrow took a step back.

However, a closer look indicates that it isn't quite this simple. Although Francis's and Hammel's ERAs improved, the improvements were marginal. In four out of the 10 cases in the chart above, you would have been better off as a forecaster using ERA and simply ignoring xFIP. Looked at through this lens, only six of the 10 pitchers had a 2011 ERA closer to their 2010 xFIP than to their 2010 ERA.

But 10 pitchers is an admittedly paltry sample size. What happens if you expand this to include all 100+ inning pitchers in 2010 and 2011?

It doesn't help. 

One hundred and three pitchers pitched at least 100 innings in 2010 and 2011. Of these pitchers, 2010 xFIP was a better predictor of future ERA for 53 pitchers, 2010 ERA was a better predictor for 49 pitchers, and Tommy Hanson "tied" using this model. Expressed as a percentage, xFIP was better than ERA at predicting future ERA 52% of the time.

However, there are some seasons in here that are statistically close. Doug Fister posted a 4.11 ERA and a 4.10 xFIP in 2010. While saying xFIP is a better predictor in this case is technically correct, you could have used ERA or xFIP to predict future success and come up with more or less the same result.

In order to compensate for this, I took out all of the pitchers whose 2010 xFIP and ERA came within 0.3 of each other. Eliminating these seasons, xFIP was a better predictor of future ERA in 38 cases and prior ERA was better in 31 cases. This helps xFIP's cause somewhat, but it is still only a better predictor of future ERA 55% of the time.
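The test in the last few paragraphs is easy to reproduce. The sketch below uses hypothetical stat lines, not the real 103-pitcher sample (which you would pull from a site like FanGraphs); the `min_gap` parameter implements the 0.3 screen described above.

```python
def better_predictor_counts(rows, min_gap=0.0):
    """Count whether 2010 xFIP or 2010 ERA landed closer to 2011 ERA.

    rows: (name, era_2010, xfip_2010, era_2011) tuples.
    min_gap: drop pitchers whose 2010 ERA and xFIP were within this
    much of each other, since either stat "predicts" about as well.
    """
    xfip_wins = era_wins = ties = 0
    for name, era10, xfip10, era11 in rows:
        if abs(era10 - xfip10) < min_gap:
            continue  # too close to call ahead of time
        xfip_err = abs(xfip10 - era11)
        era_err = abs(era10 - era11)
        if xfip_err < era_err:
            xfip_wins += 1
        elif era_err < xfip_err:
            era_wins += 1
        else:
            ties += 1
    return xfip_wins, era_wins, ties

# Hypothetical stat lines for illustration only:
sample = [
    ("Pitcher A", 4.51, 3.80, 3.95),  # xFIP was closer
    ("Pitcher B", 3.20, 3.90, 3.10),  # ERA was closer
    ("Pitcher C", 4.11, 4.10, 4.60),  # ERA/xFIP within 0.3 of each other
]
print(better_predictor_counts(sample))               # all three counted
print(better_predictor_counts(sample, min_gap=0.3))  # close call dropped
```

Running this over the full set of 100-inning pitchers is just a matter of swapping in the real rows.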

Does this mean that xFIP is worthless? Certainly not. However, there wasn't a strong correlation last year between prior xFIP and future performance. At the very least, analysts should use caution before simply trotting out xFIP and declaring that a pitcher is certain to get better because his xFIP says so.

I don't exclude myself from this conclusion. Last year, I used xFIP a lot without performing a simple test like the one above. I sometimes take others to task for drawing broad conclusions on the strength of metrics without backing them up, so I have to offer my own mea culpa here and say I'll be more vigilant in the future.

1 comment:

Rotoman said...

Another way to test the systems for predicting future ERA is to run a correlation function on the predictive ERAs and compare them to the actual 2011 ERA. I did this last November for ERA, FIP and xFIP in 2010 compared to 2011 ERA (which is the number we want to know).

I'll collect this data and post it on Ask Rotoman at some point, but the R of the correlation (the way the data aligned for each of the predictors) was about .35 for Last Year's ERA, .43 for FIP and .46 for xFIP.

The way this works, a 1 would mean that the data sets were completely congruent. A 0 would mean that there was no relation whatsoever between the two sets. I think these scores demonstrate that xFIP is ever so slightly better than FIP, which is a good tick better than LY ERA, but all of them are random enough that none of them are what you would call Good.

I'm incorporating more of these component stats in my projection baselines this year, and I'm pretty pleased about the way the weighted averages feel. But I'm under no illusion that this sort of thing is a great leap forward. It's just a little tweak that seems to help us see a little bit better what's going on.