**Title:** Forecast combinations: An over 50-year review

**Authors:** Xiaoqian Wang, Rob J. Hyndman, Feng Li and Yanfei Kang

**Journal:** International Journal of Forecasting

**Summary:**

The idea of combining multiple individual forecasts dates back to Francis Galton, who in 1906 visited an ox-weight-judging competition and observed that the average of 787 estimates of an ox’s weight was remarkably close to the ox’s actual weight; see Surowiecki (2005) for more details. About sixty years later, Bates and Granger (1969) popularized the idea and spawned a rich literature on forecast combinations. More than fifty years on from that seminal work, it is now well established that forecast combinations are beneficial: on average, they offer substantially improved forecasts relative to the constituent models; see Clemen (1989) and Timmermann (2006) for extensive literature reviews.

This paper aims to present an up-to-date review of the literature on forecast combinations over the past five decades. We cover various forecast combination methods for both point and probabilistic forecasts, contrasting them and highlighting how different related ideas have developed in parallel.

Combining multiple forecasts derived from numerous forecasting methods is often better than identifying a single “best forecast”. Such combined forecasts are usually called “combination forecasts” or “ensemble forecasts”, depending on the domain. Observed time series data are unlikely to be generated by a simple process with a specific functional form, because of the possibility of time-varying trends, seasonality changes, structural breaks, and the complexity of real data generating processes (Clements & Hendry, 1998). Thus, selecting a single “best model” to approximate the unknown underlying data generating process may be misleading and is subject to at least three sources of uncertainty: data uncertainty, parameter uncertainty, and model uncertainty (Kourentzes et al., 2019; Petropoulos et al., 2018a). Given these challenges, it is often better to combine multiple forecasts to incorporate multiple drivers of the data generating process and mitigate uncertainties regarding model form and parameter specification.

Potential explanations for the strong performance of forecast combinations are manifold. First, the combination is likely to improve forecasting performance when the forecasts to be combined incorporate partial (but incompletely overlapping) information. Second, structural breaks are a common motivation for combining forecasts from different models (Timmermann, 2006). In the presence of structural breaks and other instabilities, combining forecasts from models with varying degrees of misspecification and adaptability can mitigate the problem, helping to explain the empirical success of forecast combinations. See, e.g., Rossi (2013, 2021) for an extensive discussion of forecast combinations in the presence of instabilities. Third, one can consider the competing forecasts as intercept corrections relative to a baseline forecast. This provides potential gains in forecast accuracy if there are either structural breaks or deterministic misspecifications (Hendry & Clements, 2004). Finally, Hendry and Clements (2004) noted that forecast combination could be an application of Stein–James shrinkage estimation (Judge & Bock, 1978). Specifically, if the unknown future value is considered as a “meta-parameter” of which all the individual forecasts are estimates, then averaging has the potential to provide an improved estimate.
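The first explanation can be illustrated with a short variance calculation. This is only a sketch under simplifying assumptions that are not taken from the paper: two unbiased forecasts whose errors $e_1$ and $e_2$ share a common variance $\sigma^2$ and have correlation $\rho$. The error variance of their equal-weight average is then

$$
\operatorname{Var}\!\left(\tfrac{1}{2}(e_1 + e_2)\right)
= \tfrac{1}{4}\left(\sigma^2 + \sigma^2 + 2\rho\,\sigma^2\right)
= \frac{1+\rho}{2}\,\sigma^2 \;\le\; \sigma^2 ,
$$

with equality only when $\rho = 1$. Under these assumptions, the average is strictly more accurate than either forecast whenever the two errors are not perfectly correlated, that is, whenever the information they carry does not overlap completely.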

In light of their superiority, forecast combinations have appeared in a wide range of applications such as retail (Ma & Fildes, 2021), energy (Xie & Hong, 2016), economics (Aastveit et al., 2019), and epidemiology (Ray et al., 2022). Among all published forecasting papers included in the Web of Science, the proportion of papers concerning forecast combinations has been trending upward over the past 50 years, reaching 13.80% in 2021, as shown in Fig. 1. Consequently, reviewing the extant literature on this topic is timely and necessary.

The gains from forecast combinations rely not only on the quality of the individual forecasts to be combined but also on the estimation of the combination weights assigned to each forecast (Cang and Yu, 2014; Timmermann, 2006). Numerous studies have discussed critical issues concerning the constitution of the model pool and the selection of the optimal model subset, including but not limited to the accuracy, diversity, and robustness of individual models (Batchelor and Dua, 1995; Kang et al., 2021; Lichtendahl and Winkler, 2020; Mannes et al., 2014; Thomson et al., 2019). Combination schemes, for their part, vary across studies. They have evolved from simple combination methods that avoid weight estimation (e.g., Clemen and Winkler, 1986; Genre et al., 2013; Grushka-Cockayne, Jose and Lichtendahl, 2017; Palm and Zellner, 1992; Petropoulos and Svetunkov, 2020) to sophisticated methods that tailor weights for different individual models (e.g., Bates and Granger, 1969; Kang et al., 2021; Kolassa, 2011; Li et al., 2020; Montero-Manso et al., 2020; Newbold and Granger, 1974; Wang, Kang, Petropoulos and Li, 2022). Accordingly, forecast combinations can be linear or nonlinear, static or time-varying, series-specific or cross-learning, and can either ignore or account for correlations among individual forecasts. Despite the diverse set of forecast combination schemes, forecasters still have little guidance on solving the “forecast combination puzzle” (Chan and Pauwels, 2018; Claeskens et al., 2016; Smith and Wallis, 2009; Stock and Watson, 2004): simple averaging often empirically dominates sophisticated weighting schemes that should (asymptotically) be superior.
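To make the contrast between the two families of schemes concrete, the following minimal Python sketch compares a simple average with one basic estimated-weight scheme. The estimated weights are proportional to the inverse of each model's past mean squared error and ignore error correlations, in the spirit of the variance-based weights of Bates and Granger (1969); this is not presented as the method of any particular cited study. The function names and the toy numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def simple_average(forecasts):
    """Equal-weight combination; forecasts has shape (n_models, horizon)."""
    return np.mean(np.asarray(forecasts, dtype=float), axis=0)

def inverse_mse_weights(past_errors):
    """Weights proportional to 1 / MSE of each model's past errors.

    past_errors has shape (n_models, n_obs). Correlations between the
    models' errors are ignored (an assumption of this sketch).
    """
    mse = np.mean(np.asarray(past_errors, dtype=float) ** 2, axis=1)
    inv = 1.0 / mse
    return inv / inv.sum()  # normalise so the weights sum to one

def weighted_combination(forecasts, weights):
    """Linear combination of the individual forecasts."""
    return np.asarray(weights, dtype=float) @ np.asarray(forecasts, dtype=float)

# Hypothetical toy data: two models, four past errors, two-step horizon.
past_errors = np.array([[1.0, -1.0, 1.0, -1.0],   # model 1: MSE = 1
                        [2.0, -2.0, 2.0, -2.0]])  # model 2: MSE = 4
forecasts = np.array([[10.0, 11.0],
                      [14.0, 15.0]])

weights = inverse_mse_weights(past_errors)           # [0.8, 0.2]
print(simple_average(forecasts))                     # [12. 13.]
print(weighted_combination(forecasts, weights))      # [10.8 11.8]
```

In practice the weights must be estimated from a limited history of errors, and that estimation error is one commonly cited reason why the simple average is hard to beat.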

Initial work on forecast combinations after the seminal work of Bates and Granger (1969) focused on point forecasts (see, e.g., Clemen, 1989; Timmermann, 2006). In recent years, considerable attention has moved towards probabilistic forecasts (e.g., Gneiting and Ranjan, 2013; Hall and Mitchell, 2007; Kapetanios et al., 2015; Martin et al., 2021), as they enable a rich assessment of forecast uncertainties. When working with probabilistic forecasts, issues such as diversity among individual forecasts can be more complex and less well understood than when combining point forecasts (Ranjan & Gneiting, 2010). Additional properties such as calibration and sharpness need to be considered when assessing or selecting a combination scheme (Gneiting et al., 2007). Probabilistic forecasts can be elicited in different forms (e.g., density forecasts, quantiles, or prediction intervals), and the resulting combinations may differ in properties such as calibration, sharpness, and shape; see Lichtendahl et al. (2013) for further analytical details.
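As a small illustration of why sharpness matters when combining densities, the sketch below forms a linear pool (a weighted mixture) of two Gaussian density forecasts. The linear pool is only one possible combination scheme, and the function names and numbers here are hypothetical. The point it shows is that the pooled variance equals the weighted average of the component variances plus the spread of the component means, so the pool is more dispersed than its components whenever their means differ.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def linear_pool_pdf(x, means, sds, weights):
    """Density of the weighted mixture of normal forecasts at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

def linear_pool_moments(means, sds, weights):
    """Mean and variance of the mixture (weights assumed to sum to one)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second_moment = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, sds))
    return mean, second_moment - mean * mean

# Hypothetical example: two unit-variance forecasts centred at 0 and 2.
means, sds, weights = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
pool_mean, pool_var = linear_pool_moments(means, sds, weights)
print(pool_mean, pool_var)  # 1.0 2.0 (each component has variance 1.0)
```

Here each component forecast has variance 1, yet the equally weighted pool has variance 2, so the combined density is less sharp than either input.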

We should clarify that we take the individual forecasts to be combined as given and do not discuss how the forecasts are generated. We focus on combinations of multiple forecasts derived from *separate* and *non-interfering* models for a given time series. Nevertheless, the literature involves at least two other types of combinations that are not covered in the present review. The first is the case of generating multiple series from the single (target) series, forecasting each of the generated series independently, and then combining the outcomes. Such data manipulation extracts more information from the target time series, which, in turn, can be used to enhance the forecasting performance. Petropoulos and Spiliotis (2021) referred to this category of forecast combinations generally as “wisdom of the data” and provided an overview of approaches in this category. In this particular context, the combination methods reviewed in this paper can function as tools to aggregate (or combine) the forecasts computed from different perspectives of the same data. The second type of forecast combination we do not cover is forecast reconciliation for hierarchical time series, which has developed over the past ten years since the pioneering work of Hyndman et al. (2011). Forecast reconciliation involves reconciling forecasts across the hierarchy to ensure that the forecasts sum appropriately across the hierarchy levels, and hence is a type of forecast combination.

We note that forecast combination and model averaging are sometimes used without distinction in the literature. The two terms overlap, but their emphases differ. “Model averaging” is a general term for approaches that allow for model uncertainty, particularly in parameter estimation, which can lead to better estimates and more reliable forecasts and prediction intervals than model selection (selecting a single best model). Several approaches to model averaging have been developed in statistics, econometrics, and machine learning. Two main strands can be identified: frequentist approaches (e.g., Fletcher, 2018) and Bayesian approaches (e.g., Steel, 2020). “Forecast combination” is a narrower term describing the combination of forecasts to generate a better forecast; the component forecasts could be, for example, outcomes from model averaging, individual models, or expert forecasts. As with model averaging, weights can be used to combine the component forecasts. Unlike model averaging, however, forecast combination also has some underlying assumptions to ensure that the forecast combinations are unbiased or optimal.
