Monday, February 02, 2009

BABIP II: Regression to the Mean

Brett offers up some interesting thoughts on BABIP:
I'm not nearly as up on the latest research as I should be, but as far as I know, the current thought is that pitchers regress to a standard mean, where hitters tend to regress to their own mean....I forget where I saw it, but there was just a study done which ran a regression on many different variables and tried to predict a hitter's BABIP. I'll post it if I can find it...
I'm not sure whether or not this is the article Brett is talking about, but Hardball Times ran a pretty comprehensive piece on it this winter called Batters and BABIP.

The article is an excellent attempt to take a "number of factors" and distill them into an expected BABIP for hitters. The theory mirrors what Brett says: that hitters have their own unique mean for BABIP.

There are two inherent problems with using either this model or the model that the article refers to as old xBABIP.

First, the article points out that it uses seven years of data to determine an expected BABIP rate. This is problematic because there's a difference between a player who was ages 23-29 in the model and a player who was between ages 30-36 in the model. True, the model has a factor called "year" that is "A vector of year variables from 2002 through 2007, to account for potential time effects." But I believe there is too much variability in this one factor to account for it accurately in the model.

The second problem is that the linked article only attempts to use one season to gauge how accurate or inaccurate xBABIP is. We would need to see multiple seasons and success or failure rates before we knew how accurate or inaccurate this model is.

The biggest problem I have, though, is that xBABIP doesn't tell me much more than BABIP does. Players with a high BABIP are bound to slip the following year. Rotisserie values are largely anecdotal evidence for this kind of analysis (which is why I don't like to use them), but regardless of what an xBABIP rate tells me, I'm going to guess that hitters who were at the top of the heap in BABIP are going to fall in Rotisserie value in 2009, and vice versa.

But maybe not.

If you look back at the linked Hardball Times article, you'll notice that some of the hitters on the last list (2008's "unluckiest hitters") are high-strikeout hitters who also may have a poor approach at the plate. There are also a few older hitters on the list.

This doesn't surprise me. Age is a factor in how effective your at bats are going to be, as well as approach. These things aren't easy to measure, but I don't necessarily agree with the idea that Nick Swisher is going to have a big bounce back this year (because he doesn't make great contact) or that Jason Giambi is going to be a bargain (because he's reaching a point on the age curve where the decline is inevitable, BABIP or no).

I think more research is needed to prove or disprove some of these theories, which - as theories - are great. Don't get me wrong...I eat this kind of stuff up with a ladle on my own time.

But for now, I'd be careful not to put too much weight into these datasets from a Rotisserie standpoint.

1 comment:

Brett said...

Yeah, pretty sure that was it. Granted it's not perfect, but it's definitely a start.

"The biggest problem I have, though, is that xBABIP doesn't tell me much more than BABIP does. Players with a high BABIP are bound to slip the following year."

Well, the idea (and this is just stating the obvious, I think) is that what's high for one player is low for another. Assuming you do trust xBABIP, and see that both Carl Crawford (xBABIP of .330) and Blake DeWitt (xBABIP of .269) had a BABIP of .294, you'd expect Crawford to get "better" and DeWitt to get "worse".
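To make that comparison concrete, here is a minimal Python sketch that takes the four figures quoted above and reports which direction the model would expect each hitter's BABIP to move. The helper name babip_gap is made up for illustration, and the sketch assumes you trust xBABIP in the first place.

```python
# Hypothetical helper: compare an observed BABIP with a model's expected BABIP.
def babip_gap(babip, xbabip):
    """Return xBABIP minus BABIP; a positive gap means the model expects improvement."""
    return round(xbabip - babip, 3)

# Figures quoted in the comment above (2008 season).
players = {
    "Carl Crawford": {"babip": 0.294, "xbabip": 0.330},
    "Blake DeWitt": {"babip": 0.294, "xbabip": 0.269},
}

gaps = {name: babip_gap(p["babip"], p["xbabip"]) for name, p in players.items()}

for name, gap in gaps.items():
    direction = "up" if gap > 0 else "down"
    print(f"{name}: gap {gap:+.3f}, model expects BABIP to move {direction}")
```

Same observed BABIP, opposite expectations: the gap is positive for Crawford and negative for DeWitt.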

It also makes sense that the highest xBABIP is in the .340s, but the highest BABIPs are around .400, and there are several of them (same thing on the low end). The distribution of the xBABIPs is clustered much closer to the mean. Just like how you know that SOME pitchers are going to have ERAs over 7 or in the low 2s, but you wouldn't predict ANY of them to do that.
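A rough simulation shows why expected values bunch up while observed values spread out. This is only a sketch under invented assumptions: the league size, balls in play per season, and the mean and spread of "true" BABIP talent are all placeholder numbers, and true talent here stands in for whatever a model like xBABIP is trying to estimate. It is not the Hardball Times model.

```python
import random
import statistics

random.seed(2009)  # fixed seed so the sketch is repeatable

# All four constants are assumptions chosen for illustration only.
N_PLAYERS = 300        # assumed number of hitters
BALLS_IN_PLAY = 450    # assumed balls in play per hitter-season
TALENT_MEAN = 0.300    # assumed league-average true BABIP
TALENT_SD = 0.015      # assumed spread of true BABIP talent

def simulate_season(true_babip, balls_in_play):
    """Observed BABIP for one season: each ball in play falls for a hit
    with probability true_babip."""
    hits = sum(random.random() < true_babip for _ in range(balls_in_play))
    return hits / balls_in_play

talents = [random.gauss(TALENT_MEAN, TALENT_SD) for _ in range(N_PLAYERS)]
observed = [simulate_season(t, BALLS_IN_PLAY) for t in talents]

talent_spread = statistics.pstdev(talents)
observed_spread = statistics.pstdev(observed)

print(f"spread of true talent:    {talent_spread:.4f}")
print(f"spread of observed BABIP: {observed_spread:.4f}")
print(f"highest true talent: {max(talents):.3f}, highest observed: {max(observed):.3f}")
```

The observed spread comes out wider than the talent spread because one season of balls in play adds luck on top of skill. That is the same reason you can expect someone to post an extreme number without being able to say who.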

As for making contact (e.g., Swisher), intuitively I agree with the authors that contact rate and BABIP should be inversely related.

They wrote: "One might expect a higher contact rate to lead to a higher BABIP, but the opposite actually seems to be the case. This is likely caused by the correlation between strikeouts and power, since players who swing hard tend to either miss entirely or crush the ball for hits."

An analogy is that (I believe) Pedro Martinez was among the league's worst in BABIP in some years (at least one). It seems hard to believe, but his value was tied up in his strikeout rate (and low walk rate, though that's irrelevant here). His BABIP was actually high, but that mattered less because the number of balls in play was so low. Same thing with Swisher: his BABIP can be high because you're only considering the balls in play. Of course his batting average will still be low, just as Pedro's batting average against was.
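The arithmetic behind that point can be sketched with the standard definition, BABIP = (H - HR) / (AB - K - HR + SF). The two stat lines below are invented for illustration and do not belong to any real player; they differ only in strikeouts and hits.

```python
def babip(h, hr, ab, k, sf):
    """Batting average on balls in play: (H - HR) / (AB - K - HR + SF)."""
    return (h - hr) / (ab - k - hr + sf)

def batting_average(h, ab):
    """Plain batting average: H / AB."""
    return h / ab

# Hypothetical stat lines (not real players), same AB, HR, and SF.
high_k = {"h": 124, "hr": 25, "ab": 500, "k": 150, "sf": 5}
low_k = {"h": 151, "hr": 25, "ab": 500, "k": 60, "sf": 5}

for label, line in (("high-strikeout hitter", high_k), ("low-strikeout hitter", low_k)):
    b = babip(**line)
    avg = batting_average(line["h"], line["ab"])
    print(f"{label}: BABIP {b:.3f}, AVG {avg:.3f}")
```

Both lines work out to a .300 BABIP, but the high-strikeout hitter bats .248 against .302 for the other, because strikeouts are removed from the BABIP denominator but still count as at bats.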