The darima paper is published in the International Journal of Forecasting

Authors: Xiaoqian Wang, Yanfei Kang, Rob J Hyndman and Feng Li

Providing forecasts for ultra-long time series plays a vital role in various activities, such as investment decisions, industrial production arrangements, and farm management. This paper develops a novel distributed forecasting framework to tackle challenges associated with forecasting ultra-long time series by utilizing the industry-standard MapReduce framework. The proposed model combination approach facilitates distributed time series forecasting by combining the local estimators of ARIMA (AutoRegressive Integrated Moving Average) models delivered from worker nodes and minimizing a global loss function. In this way, instead of unrealistically assuming the data generating process (DGP) of an ultra-long time series stays invariant, we make assumptions only on the DGP of subseries spanning shorter time periods. We investigate the performance of the proposed distributed ARIMA models on an electricity demand dataset. Compared to ARIMA models, our approach results in significantly improved forecasting accuracy and computational efficiency both in point forecasts and prediction intervals, especially for longer forecast horizons. Moreover, we explore some potential factors that may affect the forecasting performance of our approach.

Advances in technology have given rise to increasing demand for forecasting time series data spanning a long time interval, which is extremely challenging to achieve. Attempts to tackle the challenge with MapReduce technology typically follow two mainstream directions: combining forecasts from multiple models, and splitting the multi-step forecasting problem into H (forecast horizon) subproblems. Alternatively, statistical computation can be implemented on a distributed system by aggregating the information about local estimators transmitted from worker nodes. The resulting combined estimator has been proven to be statistically as efficient as the global estimator. Inspired by this solution, this study provides a new way to forecast ultra-long time series on a distributed system.

One highlight of our distributed forecasting framework is that it averages the DGPs of the subseries to develop a trustworthy global model for time series forecasting. To this end, instead of unrealistically assuming that the DGP of the time series remains invariant over an ultra-long time period, we customize the optimization of model parameters for each subseries by only assuming that the DGP of a subseries stays invariant over a short period, and then aggregate these local parameters to produce the combined global model. In this way, we provide a completely novel perspective on forecasting ultra-long time series, with significantly improved computational efficiency.
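As a rough illustration of the aggregation step, the sketch below combines coefficient vectors estimated on separate subseries into one global vector. It is not the paper's implementation: the function name `combine_local_estimators`, the use of diagonal (per-coefficient) variances, and the length-proportional weights are all assumptions made to keep the example small.

```python
def combine_local_estimators(estimates, variances, lengths):
    """Combine per-subseries coefficient vectors into one global vector.

    estimates[k][j]: j-th coefficient estimated on subseries k
    variances[k][j]: assumed variance of that coefficient (diagonal covariance)
    lengths[k]:      number of observations in subseries k

    Assumption: each coefficient is averaged with weights proportional to
    (subseries length share) / (coefficient variance).
    """
    total = sum(lengths)
    n_coef = len(estimates[0])
    combined = []
    for j in range(n_coef):
        num = 0.0
        den = 0.0
        for est, var, n in zip(estimates, variances, lengths):
            w = (n / total) / var[j]  # precise, long subseries count more
            num += w * est[j]
            den += w
        combined.append(num / den)
    return combined


# Two hypothetical worker nodes, each returning two coefficients.
local = [[0.5, 0.2], [0.7, 0.4]]
var = [[1.0, 1.0], [3.0, 1.0]]
print(combine_local_estimators(local, var, lengths=[100, 100]))
```

With equal variances and equal lengths this reduces to a plain average; a noisier local estimate (larger variance) is pulled toward the more precise one.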

This study focuses on implementing the distributed time series forecasting framework using general ARIMA models that allow the inclusion of additional seasonal terms to deal with strong seasonal behavior. Because a general loss function is considered, it is also possible to apply the framework with other statistical models, such as state-space models, VAR models, and ETS. However, special care should be taken in how the local estimators are converted, to avoid inefficiency in the combined estimators. In this work, we restrict our attention to ARIMA models and include a linear transformation step, in which ARIMA models trained on subseries are converted into linear representations to avoid the stationarity, causality, and invertibility problems that may be caused by directly combining the original parameters of ARIMA models. Similar to ARIMA models, ETS models share the virtue of allowing the trend and seasonal components to vary over time. We hope to shed some light on using distributed ETS models in the future.
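To make the idea of a linear representation concrete, the sketch below converts the coefficients of a plain ARMA model into the leading coefficients of its AR(∞) form, which can then be combined across subseries as ordinary regression coefficients. This is only an illustration under simplifying assumptions: the function name `arma_to_ar` and the truncation length `n_terms` are hypothetical, and differencing, seasonal terms, and intercepts, which a full transformation would need to handle, are ignored. The model is assumed to be written as y_t = Σ φ_i y_{t-i} + ε_t + Σ θ_j ε_{t-j}.

```python
def arma_to_ar(phi, theta, n_terms):
    """Leading AR(infinity) coefficients pi_1..pi_n of an invertible ARMA model.

    The representation is y_t = sum_j pi_j * y_{t-j} + eps_t, and the
    coefficients follow the recursion
        pi_j = phi_j + theta_j - sum_{i=1}^{j-1} theta_i * pi_{j-i},
    with phi_j = 0 for j > p and theta_j = 0 for j > q.
    """
    pi = []
    for j in range(1, n_terms + 1):
        phi_j = phi[j - 1] if j <= len(phi) else 0.0
        theta_j = theta[j - 1] if j <= len(theta) else 0.0
        conv = sum(
            theta[i - 1] * pi[j - i - 1]
            for i in range(1, min(j - 1, len(theta)) + 1)
        )
        pi.append(phi_j + theta_j - conv)
    return pi


# ARMA(1,1) with phi = 0.5 and theta = 0.4: pi_j = (phi + theta) * (-theta)**(j - 1)
print(arma_to_ar([0.5], [0.4], n_terms=5))
```

Because every subseries model is mapped to the same kind of coefficient vector, models with different orders become directly comparable, which is what makes averaging them meaningful.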

The forecasting performance of the distributed ARIMA models is affected by various factors. We consider two of them: the number of split subseries and the maximum values of model orders. Our results show that the number of subseries should be limited to a reasonable range to achieve improved performance in point forecasts and prediction intervals. Specifically, we recommend subseries lengths between $500$ and $1200$ for hourly time series. Moreover, compared to ARIMA models, a smaller maximum value of model order is sufficient for the distributed ARIMA models to fit models for all subseries and obtain improved forecasting results according to the combined estimators.
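The recommended length range translates directly into a range of admissible subseries counts. The sketch below shows one way to compute that range and to split a series into contiguous, nearly equal blocks. The helper names `feasible_counts` and `split_into_subseries` are hypothetical, and the even-split rule is an assumption rather than the partitioning used in the paper's code.

```python
def feasible_counts(n_obs, min_len=500, max_len=1200):
    """Numbers of subseries whose average length lies in [min_len, max_len]."""
    return [k for k in range(1, n_obs + 1) if min_len <= n_obs / k <= max_len]


def split_into_subseries(series, n_subseries):
    """Split a series into contiguous blocks whose lengths differ by at most one."""
    base, extra = divmod(len(series), n_subseries)
    parts = []
    start = 0
    for k in range(n_subseries):
        size = base + (1 if k < extra else 0)  # first `extra` blocks get one more
        parts.append(series[start:start + size])
        start += size
    return parts


# An hourly series of 12,000 observations admits between 10 and 24 subseries.
counts = feasible_counts(12000)
print(counts[0], counts[-1])

parts = split_into_subseries(list(range(12000)), counts[0])
print([len(p) for p in parts])
```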

Many other potential factors may affect the forecasting performance of our proposed approach. For example, whether to set an overlap between successive subseries may be a practical consideration when implementing the proposed distributed forecasting framework. Earlier work on repeated surveys explores how overlapping the random samples taken at each period affects the estimation of population parameters. It illustrates that overlap between samples reduces the variance, and it also discusses the optimum proportion of overlap. Therefore, we believe that a study on setting overlap between successive subseries will further improve our framework, and our framework and computer code are generally applicable to such a scenario. As another example, we may consider adding time-dependent weights for each subseries when combining the local estimators delivered from a group of worker nodes. Such weights assign higher weights to subseries closer to the forecast origin and smaller weights to subseries further away from it.
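One simple way to realize time-dependent weights is geometric decay: the most recent subseries gets the largest weight, and each earlier one is discounted by a constant factor. The sketch below is purely illustrative; the function name `time_decay_weights` and the `decay` parameter are assumptions, and the paper does not prescribe this or any other weighting scheme.

```python
def time_decay_weights(n_subseries, decay=0.9):
    """Normalized weights that grow toward the forecast origin.

    Subseries are indexed 0 (oldest) to n_subseries - 1 (most recent).
    Assumption: weight_k is proportional to decay ** (n_subseries - 1 - k).
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must lie in (0, 1]")
    raw = [decay ** (n_subseries - 1 - k) for k in range(n_subseries)]
    total = sum(raw)
    return [w / total for w in raw]


# With decay = 0.5 and three subseries, the weights are 1/7, 2/7 and 4/7.
print(time_decay_weights(3, decay=0.5))
```

Setting `decay=1.0` recovers equal weights, so the unweighted combination remains a special case.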

Links: Working Paper | Spark Implementation

