Image credit: © David Reginek-Imagn Images
In modern baseball, few measurements are more watched than a ball’s velocity off the bat. In and of itself, higher velocity does not guarantee a successful outcome. But it certainly makes a successful outcome more probable, and it is hard to repeat success without it.
Unfortunately, effectively summarizing a player’s seasonal exit velocity is tricky. Unlike many other measurements in life (and baseball), exit velocity does not follow the traditional “bell curve.” Instead, last season’s major-league exit velocity distribution looks like this, with a definite leftward skew:
You can, per usual, report the mean (a/k/a “average”) if you want, but the lopsided curve means that you will miss some of the signal. Because the most desirable contact is concentrated on the high end, many analysts look at either 90th percentile or maximum exit velocity to summarize a player’s exit velocities. Both are an improvement in some respects, but on their own, both leave you with 99 other percentiles still to explain.
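To make the gap between these summaries concrete, here is a minimal Python sketch. The exit velocities are invented for illustration (not real Statcast data), and `percentile` is a hypothetical helper using simple linear interpolation.

```python
import statistics

# Hypothetical exit velocities (mph) for one batter; invented, not real data.
exit_velos = [62.1, 71.4, 78.9, 84.0, 88.3, 91.7, 94.2, 96.8,
              99.5, 101.3, 103.0, 104.6, 106.1, 108.4, 111.9]


def percentile(values, pct):
    """Percentile by linear interpolation between order statistics (pct in 0..100)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


mean_ev = statistics.mean(exit_velos)
median_ev = statistics.median(exit_velos)
p90_ev = percentile(exit_velos, 90)
max_ev = max(exit_velos)

# With a long left tail, the mean is dragged below the median,
# while the 90th percentile and the maximum describe only the top end.
print(f"mean={mean_ev:.1f} median={median_ev:.1f} p90={p90_ev:.1f} max={max_ev:.1f}")
```

Each number answers a different question, and none of them describes how the weak contact on the left side of the curve is distributed.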
Furthermore, we don’t just want to summarize exit velocity, but to recreate it: to build a statistical machine that can estimate what 300 balls in play might look like from any given batter or pitcher. By covering the entire exit velocity distribution, we can try to reproduce the full range of nonlinear interactions with launch angle and other inputs, and move toward a concept of truly deserved exit velocity, as opposed to the velocities that happened to show up in a given plate appearance.
To do this, we must understand exit velocity as part of a phenomenon unique to physical exertion, and thus to sports: the distribution of an average maximum athletic effort. Sports are full of examples like this: throwing a football deep down the field, the first serve in tennis, or a 100-meter dash. In these and similar scenarios, each athlete typically strives for maximum performance over a series of opportunities. And for that reason, their performances combine to form a similarly skewed shape, regardless of sport.
Why the strange shape? Because while athletes could theoretically achieve their maximum with each attempt, they more likely will fall short. A collection of athletes making this same effort over time will have differing average maximums, although similar skill sets will tend to produce broadly similar results. This constant expenditure of maximum average effort is what gives league-wide exit velocity its skew, with the hump pointing toward the average of attempted player maximums, rather than the average of the averages, as is typical of other measurements. How do we model this unusual distribution, and by extension, a player’s effect on exit velocity?
I think the answer lies with the skew normal distribution, which restores invaluable qualities of the normal distribution for this application, while providing a new parameter to control for the skew created by average maximum athletic effort. Using the skew normal distribution[1], we can capture a player’s entire exit velocity distribution, distinguishing them by their “skew means,” and better project a season’s worth of exit velocities. In addition to giving us this new capability, these “skew means”—or if you prefer, “deserved exit velocities”—still measure skill comparable to 90th percentile exit velocity for batters, and substantially improve upon existing, public-facing exit velocity metrics for pitchers.
In this article, we will discuss the theoretical basis for the “skew mean” of exit velocity, demonstrate its impressive performance, and discuss some of its interesting aspects.
Current Approaches
The normal distribution, and its characteristic bell curve, drives the way we report most event rates in sports, and for that matter, most measurements we encounter anywhere — hence the moniker “normal.” The bell curve shape should be familiar:
This distribution is wonderful because normally distributed measurements can be completely described by two parameters: (1) the mean (a/k/a the average); (2) the standard deviation of a typical measurement away from that mean (a/k/a the spread around the average). The usefulness of this cannot be overstated: you can have 50, 150, or 550 measurements of a person or of a population, and yet the range of all plausible measurements, either individually or for the population as a whole, can be boiled down entirely to those two parameters, and as a practical matter, one of them (the average) is usually enough. It is a truly remarkable thing, and our statistical world is built around it, both in sports and in life.
Consequently, virtually every sports rate metric is an average: batting average, earned run average, even on base percentage (which as I’ve noted before, actually is an average, so the name is stupid). Standard deviation plays a smaller role, but an important one: the 20-80 scouting scale famously operates off a mean value of 50, with the values of 40/60, 30/70, and 20/80 corresponding to 1, 2, and 3 standard deviations away from that average. Many metrics (including our cFIP) use standard deviation to put themselves on a more familiar scale, such as being centered at 100 with a standard deviation of 15. Standard deviation (and its cousins, the variance and precision) also plays an important role in player projection, as we “shrink” outliers toward their likely deserved mean, using the entire population as a guide.
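The rescaling described above is simple arithmetic on a z-score. The Python sketch below is illustrative only: `to_scale` and `scouting_grade` are hypothetical helper names, the population mean and standard deviation are made-up inputs, and clamping grades to the 20-80 range is an assumption about how the scale is typically used.

```python
def to_scale(value, pop_mean, pop_sd, center, points_per_sd):
    """Re-express a measurement as standard deviations from the mean on a new scale."""
    z = (value - pop_mean) / pop_sd
    return center + points_per_sd * z


def scouting_grade(value, pop_mean, pop_sd):
    """20-80 style grade: 50 is average and every 10 points is one standard deviation."""
    grade = to_scale(value, pop_mean, pop_sd, center=50, points_per_sd=10)
    return max(20, min(80, grade))  # assumption: grades are capped at 20 and 80


# Hypothetical population: mean 88, standard deviation 2 (illustrative numbers only).
print(scouting_grade(92, pop_mean=88, pop_sd=2))    # two SDs above average
print(to_scale(92, 88, 2, center=100, points_per_sd=15))  # same value on a 100/15 scale
```

Both scales carry the same information; only the center and the points per standard deviation change.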
The reason we can rely on these principles is that the bell curve is symmetric, and measured values are thus equally likely to be below average as above average. But skewed data doesn’t work that way. The average MLB exit velocity is about 88 mph. We are more interested in values that exceed that number, because larger values are more likely to be productive hits. But values below that are still relevant because they can interact productively with other inputs, such as launch angle, and are necessary to fill out the complete profile of the player. That creates two problems: (1) the traditional average tells us less than it usually does; (2) we need to find an alternative way to reflect the extent to which players concentrate and distribute exit velocity, if we want to capture the available information for the player.
This is why, as noted above, many analysts turn to quantiles like the 90th percentile velocity instead of the mean. It makes sense, although only for batters: for them, the 90th percentile exit velocity is more likely to repeat itself the following season, suggesting that it better reflects batter skill. For pitchers, however, 90th percentile exit velocity is useless, as these season-to-season correlations show:
Player Position | Raw Mean | 90th percentile |
---|---|---|
Batter | .77 | .85 |
Pitcher | .42 | .31 |
The 90th percentile thus is helpful if you must boil a batter’s (not a pitcher’s) hard-hit ability down to one number, but again, we want to summarize the entire distribution. We want to know the spread of those numbers. As compared to the league, we want to know if the player’s exit velocities are skewed in a good direction or a bad one. And to paint a more complete picture of the batter that includes launch angle and even spray, we need to know the shape of the entire distribution of the player’s exit velocities, not just their hardest hit ball or even the top 10%.
The Skewed Approach
The skew normal distribution offers a solution to these challenges. It restores our ability to rely on an average exit velocity, although we distinguish our updated value as the batter’s “skew mean.” We now also gain the ability to measure the batter’s concentration of exit velocities through their “skew alpha” and “skew sigma.” (Curiously, “skew sigma” is affected by pitchers, but they do not seem to affect “skew alpha” at all.)
These two other parameters embody the concept of concentration, shown below. For variety, this time we will use the distribution of 2023 exit velocities, to show that the population distribution of exit velocity is consistent each season, but this time we’ll add arrows to emphasize the concentration factor:
Why does concentration matter? So far we have focused on skew, but look also at how diffuse the distribution can be, covering a wide range of useful (mid-80s on up) and not-so-useful exit velocities. Generally speaking, we don’t want a batter’s distribution to be more diffuse, because the broader the distribution, the more weak contact the batter (or pitcher) is causing. The “skew sigma” and “skew alpha” quantify this, and are necessary to generate a player’s exit velocity distribution. The former is strongly and negatively correlated with the skew mean, so the lower the skew sigma, the tighter the distribution. The latter is positively correlated with the skew mean, and, at its best values, tends to push the hump more “upright,” further focusing the concentration.
The skew mean largely gives us what we need for summary purposes, though, so we will focus on that here.
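For readers who want the math behind these three quantities, the standard textbook form of the skew normal density is given below. The symbols are the conventional ones, not notation from this article: a location $\xi$, a scale $\omega$, and a shape $\alpha$, with $\phi$ and $\Phi$ the standard normal density and cumulative distribution function. How the article's “skew mean,” “skew sigma,” and “skew alpha” map onto these symbols is an assumption on my part; they presumably correspond to the distribution's mean, its spread, and its shape parameter.

```latex
f(x \mid \xi, \omega, \alpha)
  = \frac{2}{\omega}\,
    \phi\!\left(\frac{x - \xi}{\omega}\right)
    \Phi\!\left(\alpha\,\frac{x - \xi}{\omega}\right),
\qquad
\delta = \frac{\alpha}{\sqrt{1 + \alpha^{2}}}

\mathbb{E}[X] = \xi + \omega\,\delta\,\sqrt{\frac{2}{\pi}},
\qquad
\operatorname{SD}[X] = \omega\,\sqrt{1 - \frac{2\delta^{2}}{\pi}}
```

Setting $\alpha = 0$ recovers the ordinary normal distribution, and a negative $\alpha$ produces the long left tail seen in exit velocity.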
The Skewed Approach, Applied
Let’s start by confirming that the skew mean is, in fact, a reliable substitute for existing exit velocity metrics, in terms of summarizing exit velocity skill for batters and pitchers:
Player Position | Raw Mean | 90th percentile | Skew Mean |
---|---|---|---|
Batter | .77 | .85 | .84 |
Pitcher | .42 | .31 | .47 |
Indeed it is. By the Spearman rank correlation, the skew mean restores reliability to the concept of average exit velocity for batters, comparable to the 90th percentile. For pitchers, the skew mean clearly beats them both, meaning we now for the first time have a summary metric that can validly be applied to both batters and pitchers.
We have, in other words, restored the power of the mean to our exit velocity distribution, which in addition to allowing us now to fit an entire distribution for each player, means we can use the skew mean from now on as our master exit velocity metric for everybody. The skew mean values are pretty close to the raw averages, but much more accurate on the whole.
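For readers unfamiliar with the reliability measure used in the table, here is a minimal Python sketch of a Spearman rank correlation between two seasons. The five players' values are invented for illustration, and the helper names are hypothetical; this is not the code used to produce the table.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical metric values for the same five batters in consecutive seasons.
year1 = [91.2, 88.4, 86.9, 93.5, 89.7]
year2 = [90.1, 88.9, 87.3, 92.8, 88.0]
print(round(spearman(year1, year2), 2))
```

Because it works on ranks, the measure asks only whether players keep their ordering from one season to the next, not whether their values stay the same.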
Of course, we want to be able to reproduce individual player distributions, not just summaries. So let’s demonstrate our ability to do this. We will highlight two extremes.
First, the actual exit velocity distribution of Aaron Judge, followed by three random draws from our skew normal “machine,” predicting his overall exit velocity distribution:
Although these estimates have been tweaked for platoon tendencies, note how closely we are able to cover the entire expected distribution for Aaron Judge’s exit velocity with our simulated draws of his 2024 output. Judge’s preeminent skew mean exit velocity operates both to minimize unproductive batted balls and to concentrate his distribution at the high end.
By contrast, consider consensus AL Cy Young winner Tarik Skubal:
Our model substantially reproduced Skubal’s 2024 season also. The clearest difference is how much lower his skew mean exit velocities are: whereas Judge adds about eight miles per hour, on average, to each batted ball, Skubal tends to actually remove one mile per hour before further platoon effects are accounted for. Although the effects are subtle, Skubal’s skew sigma is also a bit higher, meaning that opposing batter exit velocities are more diffusely distributed, and thus more likely to incorporate unproductive areas of the exit velocity spectrum.
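The idea of drawing a simulated season from a skew normal “machine” can be sketched in a few lines of Python. This is a minimal illustration, not the model used for the players above: the location, scale, and shape values are invented, the function names are hypothetical, and the sampler uses the standard representation of a skew normal variate as a mix of a half-normal and a normal draw.

```python
import math
import random
import statistics


def skew_normal_mean(xi, omega, alpha):
    """Theoretical mean of a skew normal with location xi, scale omega, shape alpha."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    return xi + omega * delta * math.sqrt(2 / math.pi)


def skew_normal_draws(xi, omega, alpha, n, rng):
    """Draw n values via z = delta*|z0| + sqrt(1 - delta^2)*z1, then shift and scale."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    draws = []
    for _ in range(n):
        z0 = rng.gauss(0, 1)
        z1 = rng.gauss(0, 1)
        z = delta * abs(z0) + math.sqrt(1 - delta ** 2) * z1
        draws.append(xi + omega * z)
    return draws


# Hypothetical parameters (mph); a negative shape gives the long left tail.
XI, OMEGA, ALPHA = 102.0, 16.0, -3.0

rng = random.Random(2024)  # fixed seed so the sketch is repeatable
season = skew_normal_draws(XI, OMEGA, ALPHA, 300, rng)  # 300 simulated balls in play

print("theoretical mean:", round(skew_normal_mean(XI, OMEGA, ALPHA), 1))
print("simulated mean:  ", round(statistics.mean(season), 1))
print("simulated median:", round(statistics.median(season), 1))
```

Repeating the draw with different seeds gives a feel for how much a 300-ball sample can wander around the same underlying distribution.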
A quick word about platoon effects on skew mean exit velocities, using our 2024 model:
Batter / Pitcher Platoon | Average Exit Velocity (mph) | SD around the Average |
---|---|---|
L / L | 85.25 | .21 |
L / R | 87.87 | .16 |
R / L | 88.19 | .15 |
R / R | 87.56 | .14 |
These values have low error rates (yes, two places of precision is appropriate), which not surprisingly correlate inversely with the size of their respective samples in the data. Interestingly, right-handed batters hit lefty pitchers harder than vice versa (I expected the opposite), and the platoon effects of righties on righties are limited, at least when they make contact. The effects of lefties on lefties, though, are truly disastrous, underscoring why left-handed relievers at least used to have guaranteed long-term employment.
Some additional observations:
- Tentative analysis shows that skew mean values in the minor leagues seem to maintain their predictive value in the majors: AAA hitters, for example, tended to lose less than one mph upon promotion. So, analysts can hunt for skew means well before players arrive in the big leagues.
- Aging effects of skew mean exit velocity (and, to be fair, exit velocity in general) tend to be very mild from year to year, so the previous season’s exit velocity distribution is quite likely to be highly predictive of the player’s distribution the following season, for projection purposes.
- Although maximum effort seems intuitively to be driven by pure bat speed, it is possible that the extent to which the pitch is “squared up” could also be part of, or an alternative to, this mechanism.
The models I describe here work well in a Bayesian format, and as usual we model them in Stan. A simplified model in R, using the brms frontend, can be found in the appendix below, and should work with the Savant data feed for readers who want to explore exit velocity modeling and learn more. The model is easily expanded to jointly model exit velocity with launch angle, including the non-linear (but very clear) correlation between them, and you can expand it further to consider or predict spray angle, park effects, or pitch location, as well as the various connections between them.
The Bottom Line
We are mulling over how best to make use of these exit velocity distributions, as well as the corresponding launch angle and spray distributions we have also developed. We welcome reader feedback on whether readers would like these metrics to be made available to them for the 2025 season, or at least to subscribers, and if so, in what form.
Appendix
The brms documentation is pretty good, so those interested should give this model a try, and also practice expanding the model to jointly model other batted ball characteristics (the skew normal distribution is not a good error distribution for most other variables, which tend not to involve the same type of maximum effort, so modelers likely will get better results with more typical choices).
I have taken the liberty of including some performance enhancements to speed things up, as well as some sensible prior distributions. As usual, starting with smaller datasets (5k to 10k batted balls) will allow you to learn and compare different specifications with manageable run times.
Finally, note that this process requires fitting a distributional model, in which you are looking to predict not just the mean, but also the skew and the spread, each with their own predictor variables. That is how we gain the ability to predict the distribution for each player, while still having reasonable defaults if we have limited information about them.
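To illustrate what “each with their own predictor variables” means, here is a conceptual Python sketch (not the R model itself). Everything in it is an assumption for illustration: the dictionaries, the effect sizes for sigma and alpha, and the function name are hypothetical. The mean effects echo the rough figures quoted earlier (a league average near 88 mph, about +8 for Judge, about -1 for Skubal). Sigma is built on the log scale to keep it positive, and pitchers contribute nothing to alpha, following the observation above that they do not seem to affect it.

```python
import math

# Hypothetical league-level baselines; not fitted values.
INTERCEPTS = {"mu": 88.0, "log_sigma": math.log(14.0), "alpha": -3.0}

# Hypothetical player effects, one set per distributional parameter.
BATTER_EFFECTS = {"judge": {"mu": 8.0, "log_sigma": -0.10, "alpha": 0.5}}
PITCHER_EFFECTS = {"skubal": {"mu": -1.0, "log_sigma": 0.05}}


def predict_params(batter, pitcher):
    """Combine baselines and player effects into one matchup's mu, sigma, and alpha."""
    b = BATTER_EFFECTS.get(batter, {})    # unknown players fall back to the league baseline
    p = PITCHER_EFFECTS.get(pitcher, {})
    mu = INTERCEPTS["mu"] + b.get("mu", 0.0) + p.get("mu", 0.0)
    log_sigma = (INTERCEPTS["log_sigma"]
                 + b.get("log_sigma", 0.0) + p.get("log_sigma", 0.0))
    alpha = INTERCEPTS["alpha"] + b.get("alpha", 0.0)  # assumption: no pitcher term
    return {"mu": mu, "sigma": math.exp(log_sigma), "alpha": alpha}


print(predict_params("judge", "skubal"))
print(predict_params("unknown_batter", "unknown_pitcher"))
```

In a real hierarchical fit, a player with little data would be shrunk toward the league baseline rather than snapped to it; the zero default here is only a stand-in for that behavior.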