Overtime: On Small Sample Sizes

April 25, 2018

By Jack McLoone

When Mike Trout is on top of the WAR leaderboards, the sample size is big enough (Courtesy of Twitter).

Every discussion of early-season baseball is (or at least should be) hounded by one qualifier, in one form or another: “it’s early, so don’t take everything at face value.”

This is especially true of any discussion of stats. Anyone who has polled ten people on a topic before looking at it more broadly can tell you that a small sample is not necessarily indicative of what you see in a larger one.
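To put a rough number on that, here is a minimal Python sketch, not a rigorous model. It assumes a hypothetical hitter whose true batting average is .270 and treats every at-bat as an independent coin flip, which real baseball is not. The function name and the at-bat counts (80 for an April, 600 for a season) are made up for illustration.

```python
import math


def batting_average_standard_error(true_average, at_bats):
    """Standard error of an observed batting average.

    Assumption: each at-bat is an independent trial with the same
    hit probability (a simple binomial model).
    """
    return math.sqrt(true_average * (1 - true_average) / at_bats)


# Hypothetical .270 hitter: an April's worth of at-bats vs. a full season's.
april_error = batting_average_standard_error(0.270, 80)
season_error = batting_average_standard_error(0.270, 600)

print(f"April (80 AB):   +/- {april_error:.3f}")
print(f"Season (600 AB): +/- {season_error:.3f}")
```

Under those assumptions, the April number can miss by something like 50 points of batting average in either direction through luck alone, while the full-season number is good to within about 18.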

Exempt from all rules of small sample size – except for when it makes him look bad – is Shohei Ohtani, who can do no wrong and is only capable of special things.

So, then, the question is: what do we consider a big enough sample? When can we say that the statistics we are seeing for a player are real and indicative of his true talent level? There are a few answers to this.

The first is the pure statistician’s answer: probably three to five seasons. Yes, seasons. In terms of statistical significance, player performance can vary too much year-to-year for just one season to be a fair indicator of one’s skill.

If multiple seasons is one option, then naturally “one full season” is the next option but, again, while statistically significant (and I am here for baseball statistics), it is also baseball-boring. (That hyphen is to show that I am not using “baseball” as a synonym for boring, but instead to say that it is boring in the context of baseball. Please do not twist my sentences, or I will be silently mad online.)

The more conventional thought is “a month or so,” or maybe to the All-Star Break. The “month or so” option makes sense logically, as you can assume that’s when the rust has worn off, the best players are getting consistent reps and the weather has warmed up enough to allow for more normal baseball. However, then you get a season like this year, with April snow canceling numerous games and probably setting that timeline back.

Extending that to the All-Star Break still doesn’t solve the problem, as it’s become pretty well known that half a season is not a good sample size either. Take a look at any first-half All-Star, whether it’s Jason Vargas last year (he was maybe the worst starter in baseball in the second half) or Brandon Inge when he was in the Home Run Derby in 2009 after hitting 15 homers in the first half (he hit none in the second).

If “a month or two” and “give him till the All-Star Break” might be flawed also, then where do we go? Well, Ben Lindbergh and Jeff Sullivan of the FanGraphs podcast Effectively Wild have proposed a new standard cutoff point: whenever Mike Trout is on top of both the FanGraphs and Baseball Reference WAR leaderboards. It makes sense; once enough of the noise of early-season rust and hot starts has faded, the best player in baseball (Ohtani is not there… yet) should be on top of the leaderboard.

Well, as of writing, Trout sits on top of both leaderboards, with 1.7 fWAR and around 2.0 bWAR, and has for about a week now. And while Lindbergh and Sullivan mostly use this as a joke, it makes a lot of sense to me. It just needs a little tweaking.

For one, just “whenever Trout is on top” is a little tough, because obviously he is susceptible to hot starts then fading as well. So I think it’s fair to say we have to give it at least three weeks before we can call it a large enough sample.

Second, Trout should stay on top of the leaderboard for at least a week, just to show that it is stable. While that has set the bar now at a month, there’s a little more I want to put in.

The final benchmark is that two-thirds of the rest of the top 10 from the season before, barring injuries or retirement, need to be in the top 10 as well. I think this makes sense, because the feeling of “it’s been long enough” is simply that we are comfortable with the results we are seeing. In other words, we think they make sense. Having six of the best players in the league from the season prior in the top 10 again gives us significance in terms of representation.

At the same time, it allows us to say that the new players who have entered the top 10 are for real, and that maybe it is time to worry about the players who dropped out.

With that benchmark set, it is easy to say it is too soon. Of last year’s top 10 (in both fWAR and bWAR), only Aaron Judge, who led it while Trout missed a couple of months with a thumb injury, is back in the top 10.

This is not to say that, for example, Mookie Betts and Didi Gregorius, numbers two and three on the fWAR leaderboards right now, aren’t for real. But I would be willing to bet that it is Too Soon for Jed Lowrie (fourth) and Matt Chapman (fifth) to be actual top-10 players.

So there is your new checklist for “is it still too soon?” 1) Is Mike Trout on top of both the fWAR and bWAR leaderboards? 2) Has it been at least three weeks? 3) Has he been on top for at least a week? And 4) Are two-thirds of the rest of last season’s top 10 back in the top 10 as well?
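For the programmers in the audience, the checklist can be sketched as a short Python function. This is only one possible reading of it: the class, the field names and the day counts are all my own assumptions, and “the rest of last season’s top 10” is taken to mean the nine players other than Trout, minus anyone hurt or retired.

```python
from dataclasses import dataclass


@dataclass
class LeaderboardSnapshot:
    """Hypothetical container for the four checklist inputs."""
    days_into_season: int
    trout_leads_fwar: bool
    trout_leads_bwar: bool
    days_trout_on_top: int
    returning_top10: int      # last year's other top-10 players in the top 10 now
    eligible_top10: int = 9   # last year's top 10 minus Trout, injured and retired


def still_too_soon(snapshot):
    """Return True unless all four checklist items are satisfied."""
    checks = [
        # 1) Trout leads both WAR leaderboards.
        snapshot.trout_leads_fwar and snapshot.trout_leads_bwar,
        # 2) At least three weeks have passed.
        snapshot.days_into_season >= 21,
        # 3) Trout has been on top for at least a week.
        snapshot.days_trout_on_top >= 7,
        # 4) At least two-thirds of the eligible holdovers are back
        #    (integer math avoids floating-point rounding).
        snapshot.returning_top10 * 3 >= snapshot.eligible_top10 * 2,
    ]
    return not all(checks)


# Loosely based on the situation described above: Trout has led both boards
# for about a week, but only one holdover is back. The 27-day figure is an
# illustrative assumption.
right_now = LeaderboardSnapshot(
    days_into_season=27,
    trout_leads_fwar=True,
    trout_leads_bwar=True,
    days_trout_on_top=7,
    returning_top10=1,
)
print(still_too_soon(right_now))
```

With only one holdover back in the top 10, the fourth check fails and the answer stays “too soon,” no matter how long Trout has been on top.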

Of course a stats person would have a stats-based answer to “When is it no longer too soon to pay attention to the stats?” What did you expect? And like most statistics, it’s still too soon to be sure if this makes sense or not.

The Fordham Ram
