How do you evaluate a rookie hitter with 50 plate appearances? It's the same problem as forecasting sales for a new store, or estimating a drug effect in a small clinic. Unbalanced grouped data is everywhere.

Here's a baseball analytics example, one of several case studies we'll build in our London Bayesian modeling workshop next week. Same 2023 MLB season, two models.

The left panel gives every player their own independent estimate. But look at the left side of the plot, where the low-AB players live. Their point estimates are all over the place. Some rookies look like future stars, others look like they should be sent back to the minors, and their intervals are wide enough that either call could easily flip. You are not looking at skill. You are looking at noise amplified by small sample sizes.

The right panel is a partial pooling model. The green dots are the new estimates, and the gray x's show where the independent model had them. Now look at the left side: the green dots cluster much more tightly than the gray x's. The extreme estimates get pulled back toward the center.

Nate Eaton: 53 AB, .105 in the independent model, .228 after partial pooling.
Korey Lee: 65 AB, .101 independent, .227 partial.

Both had a handful of bad at-bats, and the independent model treated them as if they had revealed their true talent. The partial pooling model knows better. It learns from the whole league, shrinks the noisy extremes back toward the league average, and gives you a calibrated estimate instead of a coin flip.

Everyday players on the right side barely move. A player with 600 plate appearances does not need to borrow strength from the rest of the league, and the model is smart enough to leave them alone.

This is one of the most useful methods in applied statistics, and it has nothing to do with priors or philosophy. It is a tool for making sound inferences from unbalanced, grouped data. Once you have partial pooling in your toolkit, you stop fitting independent models to nested data.

2.5 days in London, June 8–10. Hierarchical models are Session 4. We build the full workflow live: partial pooling, varying intercepts and slopes, non-centered parameterisation, and prediction for new groups.

EDIT: ONLY A FEW SEATS REMAINING!
👉 https://dub.sh/ANFk8VH
Code LONDON10 gets you 10% off. Monday is day one. Don't sit this one out!
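
If you want a feel for the shrinkage before the workshop, here is a minimal sketch in plain Python. It is not the model behind the plot: it swaps in a simple beta-binomial shortcut, where the league average is pooled from the data and the prior strength is fixed by hand (a full hierarchical model would estimate it). The player names, hit counts, and the `PRIOR_STRENGTH` value are all hypothetical, chosen only for illustration, and are not real 2023 MLB data.

```python
# Beta-binomial shrinkage sketch (a stand-in for a full hierarchical model).
# All players and numbers below are hypothetical, for illustration only.

def pooled_mean(players):
    """League-wide batting average: total hits over total at-bats."""
    total_hits = sum(hits for _, hits, _ in players)
    total_ab = sum(ab for _, _, ab in players)
    return total_hits / total_ab


def shrink(hits, ab, prior_mean, prior_strength):
    """Posterior mean under a Beta prior centered on prior_mean.

    prior_strength acts like a number of pseudo at-bats: the larger it is,
    the harder each player is pulled toward the league average.
    """
    alpha = prior_mean * prior_strength
    beta = (1.0 - prior_mean) * prior_strength
    return (hits + alpha) / (ab + alpha + beta)


# (name, hits, at-bats): two low-AB rookies and two everyday players.
players = [
    ("rookie_cold", 5, 50),
    ("rookie_hot", 20, 50),
    ("regular_a", 150, 600),
    ("regular_b", 165, 600),
]

# Assumption: fixed by hand here; a hierarchical model would learn this.
PRIOR_STRENGTH = 200.0

league = pooled_mean(players)
estimates = {}
for name, hits, ab in players:
    raw = hits / ab  # independent estimate, one player at a time
    pooled = shrink(hits, ab, league, PRIOR_STRENGTH)  # partially pooled
    estimates[name] = (raw, pooled)
    print(f"{name:12s} AB={ab:3d} independent={raw:.3f} partial={pooled:.3f}")
```

The low-AB rookies move a long way toward the league average, while the 600-AB regulars barely change, which is the same pattern the two panels show.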