So… What Makes an MLB All-Star?

The first “All-Star Game” in the history of professional sports was a baseball exhibition during the 1933 World’s Fair. Since then, the “Midsummer Classic” has become an annual tradition, interrupted only in 1945 because of World War II. Although ostensibly an exhibition game for fans, it has become an important negotiating tool for players. Free agents often use All-Star Game appearances as leverage to get more money from their teams. Players who make All-Star Games can also gain additional endorsement opportunities, especially at the national level. As such, making an All-Star Game is an important achievement for the financial prospects of professional baseball players. After their careers end, players often use All-Star appearances as marketing tools to secure post-playing endorsements, much like Stan Ross.

The Original 1933 MLB All-Star Team.

But with no defined guidelines for choosing All-Stars, what factors go into being selected to the game? I attempted to find the most significant variables behind All-Star Game appearances by looking at data from every position player with over 200 plate appearances in 2014, enough to qualify a player as “full-time”. This equated to 383 players, of whom 47 were selected to the All-Star Game. The independent variables used were games played, plate appearances, runs, hits, doubles, triples, home runs, runs batted in, stolen bases, times caught stealing (coded in reverse), walks, strikeouts, batting average (multiplied by 1000), on-base percentage (multiplied by 1000), slugging percentage (multiplied by 1000), double plays grounded into, times hit by a pitch and sacrifice hits.

The independent variables are all standard batting statistics, as defined by Major League Baseball. Advanced statistics were not used because they are not as commonly cited in mainstream media. With the fan vote playing heavily into selection, and with most fans gaining their baseball knowledge from mainstream media, standard batting statistics were judged to be the most appropriate measures for this study. Statistics were gathered from Baseball Reference, exported into CSV format, and filtered using Microsoft Excel.
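The filtering step was done in Excel, but the same preparation can be sketched in a few lines of Python. This is a minimal illustration only: the column names (`Name`, `PA`, `BA`, `OBP`, `SLG`) and the sample rows are assumptions made up for the example, not the actual Baseball Reference export or real 2014 figures, and the reverse coding of caught stealing is not modeled.

```python
import csv
import io

MIN_PLATE_APPEARANCES = 200          # "full-time" threshold stated in the article
RATE_COLUMNS = ("BA", "OBP", "SLG")  # assumed column names for the rate statistics


def load_full_time_batters(csv_file):
    """Keep players with more than 200 plate appearances and
    multiply their rate statistics by 1000, as described above."""
    players = []
    for row in csv.DictReader(csv_file):
        if int(row["PA"]) <= MIN_PLATE_APPEARANCES:
            continue  # not a full-time player, so drop the row
        for column in RATE_COLUMNS:
            row[column] = round(float(row[column]) * 1000)
        players.append(row)
    return players


# Made-up rows for illustration only; these are not real 2014 statistics.
sample = io.StringIO(
    "Name,PA,BA,OBP,SLG\n"
    "Player A,650,.287,.377,.561\n"
    "Player B,150,.250,.310,.400\n"
)
full_time = load_full_time_batters(sample)
print([player["Name"] for player in full_time])
```

Only the hypothetical "Player A" survives the filter, since "Player B" falls short of the plate-appearance threshold.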

After preparing the data, a binary logistic regression was run to see which independent variables were statistically significant in predicting an All-Star Game appearance. The model found games played, times caught stealing, strikeouts, on-base percentage and slugging percentage to be significant at the 90% confidence level. A model built on these variables correctly classified 91.1% of players as All-Stars or non-All-Stars.
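To put the 91.1% figure in context, it helps to compare it with the accuracy of simply predicting that nobody makes the All-Star Game. The sketch below uses only the counts reported above (383 players, 47 All-Stars); the function name is an illustrative choice and this comparison is not part of the original analysis.

```python
TOTAL_PLAYERS = 383     # full-time position players in the 2014 sample
ALL_STARS = 47          # players in the sample selected to the game
MODEL_ACCURACY = 0.911  # classification accuracy reported for the model


def majority_class_accuracy(total, positives):
    """Accuracy of always predicting the more common outcome."""
    negatives = total - positives
    return max(positives, negatives) / total


baseline = majority_class_accuracy(TOTAL_PLAYERS, ALL_STARS)
gain_in_points = (MODEL_ACCURACY - baseline) * 100

print(f"Share of All-Stars: {ALL_STARS / TOTAL_PLAYERS:.1%}")
print(f"Baseline accuracy:  {baseline:.1%}")
print(f"Model improvement:  {gain_in_points:.1f} percentage points")
```

Because only about 12% of the sample were All-Stars, the do-nothing baseline is already close to 88% accurate, so the model's improvement over it is a few percentage points.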

2014 All-Star Game MVP Mike Trout.

Heading into this study, I hypothesized that the significant factors would include home runs, runs batted in, batting average and stolen bases. Although none of these were found to be statistically significant, some of their absences are potentially explained by the factors that did make the cut. Slugging percentage is total bases divided by at bats, so players who hit a large number of home runs tend to post a higher figure. Additionally, although batting average is a traditionally important measure, on-base percentage is a more accurate measure of how often players get on base because it includes walks in the equation.
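The relationship between these three rate statistics is easier to see with a worked example. The sketch below uses the standard definitions of batting average, slugging percentage and on-base percentage; the stat line is a made-up hypothetical player, not anyone from the 2014 data.

```python
def total_bases(hits, doubles, triples, home_runs):
    """Singles count once, doubles twice, triples three times, home runs four."""
    singles = hits - doubles - triples - home_runs
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def batting_average(hits, at_bats):
    return hits / at_bats


def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    return total_bases(hits, doubles, triples, home_runs) / at_bats


def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Unlike batting average, this credits walks and times hit by a pitch."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    return times_on_base / opportunities


# Hypothetical season: 500 at bats, 150 hits (30 doubles, 5 triples,
# 25 home runs), 60 walks, 5 times hit by a pitch, 5 sacrifice flies.
avg = batting_average(150, 500)
slg = slugging_percentage(150, 30, 5, 25, 500)
obp = on_base_percentage(150, 60, 5, 500, 5)
print(f"AVG {avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}")
```

Each home run adds four total bases, which is why a power hitter's slugging percentage climbs well above his batting average, and the 60 walks are what separate this player's on-base percentage from his batting average.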

Putting statistical analysis to the side, why were these variables statistically significant? Taking the factors one by one, there are several possible explanations. Players who are able to stay healthy for a full season have more opportunities to be seen by fans, which may explain why games played is a statistically significant variable. Times caught stealing is less intuitive. Although the relationship is not perfect, players who are caught stealing more often tend to attempt more steals, and players who attempt more steals tend to be given those opportunities because of their ability to steal bases successfully. Caught stealing may therefore be acting as a proxy for base-stealing ability.

Strikeouts have a negative coefficient, meaning that more strikeouts reduce a player’s chance of making the All-Star Game. A strikeout hurts a team’s chance to win and is a highly visible out, so fans are likely to be negatively influenced by players who strike out at very high rates.

The appearance of on-base percentage as a statistically significant factor may point to a shift toward a more analytical mindset among fans. In the past, on-base percentage was often downplayed in favor of batting average, which does not take walks into account. On-base percentage builds on the traditional batting average (which I had hypothesized would be statistically significant) and also counts the walk, which puts a runner on base and therefore creates an additional opportunity to score a run. Finally, as mentioned earlier, slugging percentage could explain the absence of home runs as a statistically significant factor. Slugging percentage also takes into account doubles and triples, which likewise increase a team’s expected runs.
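The direction and size of a logistic regression coefficient can be read as a multiplier on the odds of selection. The sketch below applies that reading to the games played and strikeout values from Table 1. It assumes those values are raw log-odds coefficients, which the article does not state explicitly; if they were produced on another scale, this interpretation would not apply. The function name is illustrative.

```python
import math

# Values as reported in Table 1; assumed here to be log-odds coefficients.
COEFFICIENTS = {"G": 0.060, "SO": -0.012}


def odds_multiplier(coefficient, change=1):
    """Factor by which the odds of selection change when the variable
    moves by `change` units, holding the other variables fixed."""
    return math.exp(coefficient * change)


per_strikeout = odds_multiplier(COEFFICIENTS["SO"])
per_50_strikeouts = odds_multiplier(COEFFICIENTS["SO"], change=50)
per_game = odds_multiplier(COEFFICIENTS["G"])

print(f"One more strikeout: odds x {per_strikeout:.3f}")
print(f"Fifty more strikeouts: odds x {per_50_strikeouts:.3f}")
print(f"One more game played: odds x {per_game:.3f}")
```

A multiplier below 1 means the odds shrink, which is what a negative coefficient implies for strikeouts, while the positive coefficient on games played produces a multiplier above 1.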

Baseball players can take away several lessons from this study. According to the data, players looking to make All-Star Games, and therefore maximize their potential earnings, have a few areas of the game to focus on in training. The first is endurance training, to help increase their total games played. Additionally, they can work on speed, to help avoid getting caught stealing. When they step up to bat, it is important to have a patient approach, avoiding strikeouts while maximizing their on-base percentage not only by hitting well but also by drawing more walks. Finally, players should work on strength training to hit more home runs and achieve a higher slugging percentage.

Although this study draws on a full season of data, outliers in the 2014 season may limit how well the model fits other seasons. Future models will look at multiple seasons rather than a single one to account for potential outliers. A model that includes all seasons since 1933 (excluding 1945) would not be as accurate because fan attitudes and on-field strategy have changed over time, but looking at the past ten seasons could produce a more accurate model than the one built in this initial study.

Table 1

Statistically Significant Factors in Determining MLB All-Star Game Appearances Based on 2014 Data

Independent Variable            Beta Coefficient    Significance
G (games played)                     .060               .00
CS (caught stealing)                 .096               .10
SO (strikeouts)                     -.012               .10
OBP (on-base percentage)             .971               .06
SLG (slugging percentage)            .976               .05
Constant                          -22.249               .00