Supporting Environmental Decision Making: Application of Machine Learning Techniques to Australia’s Emissions
- Introduction
- Data and Methodology
- Data
- Methodology
- Decision Trees
- Random Forests
- Extreme Gradient Boosting
- Support Vector Regression
- Data Division and the Experimental Environment
- Optimization of Hyperparameters
- Parameter Tuning for the DT, RF, and XGBoost Algorithms
- Parameter Tuning for the SVR Algorithm
- Performance Metrics
Alex O. Acheampong and Emmanuel B. Boateng
Introduction
This chapter compares the forecasting ability of four machine learning (ML) techniques, namely decision trees, random forests, extreme gradient boosting, and support vector regression (SVR), using Australia’s carbon emissions as the case. Australia is one of the major carbon-emitting countries in the world. The Ndevr Environmental (2019)^{1} report indicates that Australia’s carbon emissions for March 2019 increased to approximately 561 million tonnes of CO2 equivalent. A continued increase in emissions of carbon dioxide, the primary greenhouse gas behind climate change, would reduce agricultural yields, damage property and infrastructure, and increase commodity prices and financial instability (Climate Council, 2019). Australia has enacted numerous policies to mitigate carbon emissions; however, it is critical to have a forward-looking understanding of the country’s carbon emissions.
Having such an understanding requires the use of advanced modeling or forecasting techniques. In the existing literature, the majority of studies have used classical statistical approaches to model or forecast carbon emissions (see Acheampong & Boateng, 2019). However, given the chaotic, non-linear, and non-stationary nature of the variables used to model carbon emissions, classical statistical approaches are not appropriate for modeling such complex behavior (Acheampong & Boateng, 2019; Gallo, Conto, & Fiore, 2014). Apart from the classical statistical approaches, structural simulation approaches, such as computable general equilibrium (CGE) models, the Atmospheric Stabilisation Framework (ASF) model, the Multiregional Approach for Resources and Industry Allocation (MARIA) model, and the National Energy Modelling System (NEMS), have been employed by various organizations to forecast carbon emissions (see Auffhammer, 2007; O’Neill & Desai, 2005; Zhao & Du, 2015). The argument against these models is that, since their parameter values are fixed using personal judgment and calibration, they may not be able to capture the behavior of the real economy (cited in Zhao & Du, 2015). Further, the Grey Model (GM) has been used as a technique for forecasting carbon emissions; however, Yin & Tang (2013) argue that this model, and especially GM(1,1), works well with limited data. Various studies have compared the predictive ability of these models with that of ML algorithms, such as the Artificial Neural Network (ANN), and have found that ML techniques forecast better than such models (Falat & Pancikova, 2015; Stamenkovic, Antanasijevic, Ristic, Peric-Grujic, & Pocajt, 2015; Valipour, Banihabib, & Behbahani, 2013).
Recently, advances in ML techniques have played a critical role in decision making. For instance, ML algorithms such as decision trees, random forests, extreme gradient boosting, and SVR have been widely used for modeling and forecasting. Although there have been studies comparing the forecasting ability of classical statistical models with that of ML techniques, research comparing the forecasting ability of different ML algorithms on carbon emissions, and more specifically on Australia’s carbon emissions, remains scarce. This warrants further empirical study. Therefore, in this chapter, we aim to evaluate the forecasting ability of the above four ML techniques by focusing on Australia’s carbon emissions.
This study also contributes to the literature by employing macroeconomic variables, such as population size, economic growth, energy consumption, trade, financial development, foreign direct investment, and urbanization, which are commonly used as inputs to model carbon emissions (see Acheampong, 2018, 2019; Acheampong, Adams, & Boateng, 2019; Acheampong, Mary, & Boateng, 2020; Adams & Acheampong, 2019). To achieve robust and efficient estimates, this study uses high-frequency data. Finally, the outcome of this study will inform environmental policymakers as to the best ML technique for forecasting carbon emissions in Australia. The rest of the chapter is organized as follows. In Section 9.2, the methodology and data are presented, while the results are discussed in Section 9.3. Conclusions and policy implications are presented in Section 9.4.
Data and Methodology
Data
To compare the predictive accuracy of the decision tree, random forest, extreme gradient boosting, and SVR algorithms, this study used a dataset covering the period 1960-2018. Following Acheampong & Boateng (2019), we converted the annual data to quarterly data using the quadratic sum approach. The analysis therefore uses quarterly data ranging from 1960Q1 to 2018Q4. The output and input variables are presented in Table 9.1. Economic growth, energy consumption, financial development, foreign direct investment, physical investment, population size, trade, and urbanization are used as the input variables for modeling carbon emissions.^{2} All the variables were obtained from World Bank (2019).^{3}
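The annual-to-quarterly conversion can be illustrated with a short sketch. The chapter does not give the formula, so the code below is only one integral-based formulation of a quadratic interpolation that preserves annual sums: a quadratic is fitted through three adjacent years so that its integral over each year equals that year's value, and the middle year is then cut into four quarter-length pieces. The function name `quadratic_match_sum` and the toy annual values are assumptions made for illustration; the authors' software may differ in detail, especially at the end points.

```python
def quadratic_match_sum(annual):
    """Split each annual value into four quarterly values that sum to it."""
    n = len(annual)
    if n < 3:
        raise ValueError("need at least three annual observations")
    quarterly = []
    for t in range(n):
        # Window of three adjacent years; the first and last year reuse
        # the nearest full window (an assumption about end-point handling).
        start = min(max(t - 1, 0), n - 3)
        y0, y1, y2 = annual[start:start + 3]
        k = t - start  # position (0, 1 or 2) of year t inside the window
        # Quadratic a + b*s + c*s**2 whose integrals over [0,1], [1,2], [2,3]
        # equal y0, y1 and y2 respectively.
        c = (y2 - 2 * y1 + y0) / 2
        b = (y1 - y0) - 2 * c
        a = y0 - b / 2 - c / 3

        def antiderivative(s):
            return a * s + b * s ** 2 / 2 + c * s ** 3 / 3

        for j in range(4):
            lower, upper = k + j / 4, k + (j + 1) / 4
            quarterly.append(antiderivative(upper) - antiderivative(lower))
    return quarterly


# Made-up annual totals, for illustration only
annual = [100.0, 120.0, 150.0, 160.0]
quarterly = quadratic_match_sum(annual)
flat = quadratic_match_sum([40.0, 40.0, 40.0])
```

Applied to 59 annual observations (1960-2018), such a routine yields 59 × 4 = 236 quarterly observations, which matches the sample size reported later in the chapter.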
Methodology
Decision Trees
Decision trees (DTs) are non-parametric supervised learning techniques applied to regression and classification problems (Das, Naik, & Behera, 2020a; Das et al., 2020). These tree-based models are popular and are better suited to small datasets than neural network models. Through a repetitive process of splitting, regression trees yield a set of rules which can be used for prediction (Das et al., 2020b; Tso & Yau, 2007). The splitting process divides the sample into two or more homogeneous sets, using the most important differentiator among the input variables. In the case of classification problems, metrics such as cross-entropy or the Gini index are used to decide the splits (Das et al., 2020; Xu et al., 2005). For regression problems, DTs normally use the mean squared error (MSE) criterion for splitting a node into two or more sub-nodes. That is, for each candidate split, the algorithm computes the MSE of the resulting subsets, and the candidate with the lowest MSE is selected as the split point. The final tree comprises decision nodes and leaf nodes (Das, Naik, & Behera, 2020a, 2020b). In contrast to black-box models such as deep learning models, DTs are easy to understand and interpret because their rules can be visualized. These algorithms have been applied successfully in many fields due to their efficiency and interpretability (Tsai & Chiou, 2009; Wu, 2009).
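The MSE splitting rule described above can be sketched for a single input variable as follows. This is a minimal illustration on made-up numbers, not the implementation used in the chapter; the helper names `mse` and `best_split` are hypothetical. Each midpoint between consecutive sorted values is tried as a threshold, and the one giving the lowest size-weighted MSE across the two sub-nodes is kept.

```python
def mse(values):
    """Mean squared error of values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, weighted_mse) of the best binary split on one feature."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        # Size-weighted average of the MSE in the two sub-nodes
        score = (len(left) * mse(left) + len(right) * mse(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (made up): the target jumps between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [10.0, 11.0, 10.5, 20.0, 21.0, 19.5]
threshold, score = best_split(x, y)  # threshold is 3.5
```

A full regression tree repeats this search over every input variable at every node until a stopping rule, such as a maximum depth or a minimum number of samples per leaf, is reached.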
TABLE 9.1
Variables Used in This Study

| Variables | Proxies | Mean | SD | Min | Max |
|---|---|---|---|---|---|
| Economic growth | GDP per capita (constant 2010 US$) | 36,690.75 | 11,635.62 | 19,245.41 | 56,919.38 |
| Energy use | Energy use (kg of oil equivalent per capita) | 4805.486 | 808.1064 | 3063.554 | 5964.666 |
| Population size | Population, total | 1.70E+07 | 4115748 | 1.03E+07 | 2.50E+07 |
| Financial development | Domestic credit to private sector (% of GDP) | 64.06122 | 41.96032 | 17.65457 | 142.2841 |
| Urbanization | Urban population growth (annual %) | 1.644195 | 0.569214 | 0.768615 | 3.571777 |
| Trade openness | Trade (% of GDP) | 34.12007 | 6.410204 | 24.79318 | 45.7979 |
| Physical capital investment | Gross capital formation (% of GDP) | 27.83713 | 2.839597 | 22.39053 | 33.68189 |
| Foreign direct investment | Foreign direct investment, net inflows (% of GDP) | 2.171749 | 1.544092 | -3.61882 | 7.005444 |
| Carbon emissions | CO2 emissions (kt) | 252,419.8 | 92,195.4 | 88,202.35 | 394,792.9 |
Random Forests
A random forest (RF) is an ML algorithm that combines several DT models to classify or predict an outcome (Breiman, 2001; Das, Naik, & Behera, 2020a, 2020c). This combination process, also termed “bootstrap aggregation” or “bagging,” involves training each DT on a distinct set of observations drawn by sampling with replacement. Observations that are not drawn into a given bootstrap sample form that tree’s “out-of-bag” subset (Rodriguez-Galiano et al., 2015). Bagging reduces the variance of the base learner; however, it has minimal influence on the bias (Rodriguez-Galiano et al., 2015). At each node, a random subset of variables is selected from all available variables, and the best split among the selected variables is determined, based on the lowest MSE. The final prediction is derived using an ensembling technique, which averages the predictions of the individual regression trees (Das et al., 2020). Because several trees are averaged, the risk of overfitting is considerably lower (Breiman, 2001). The random sampling of training observations and the random subsets of candidate variables for splitting nodes are what clearly distinguish RF from DT.
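The two sources of randomness described above, bootstrap samples of observations and random subsets of candidate variables, can be sketched as below. To stay short, the sketch uses depth-1 trees (stumps) as base learners, so sampling variables once per tree is the same as sampling them at the tree's only node. All names and the toy data are assumptions for illustration, and this is not the chapter's implementation.

```python
import random


def fit_stump(rows, targets, features):
    """Fit a depth-1 regression tree using only the given candidate features."""
    best = None  # (sum of squared errors, feature, threshold, left mean, right mean)
    for f in features:
        # Every distinct value except the smallest is a valid threshold
        for threshold in sorted({r[f] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[f] < threshold]
            right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # candidate features are constant in this sample
        mean = sum(targets) / len(targets)
        return lambda row: mean
    _, f, threshold, lm, rm = best
    return lambda row: lm if row[f] < threshold else rm


def fit_forest(rows, targets, n_trees=25, max_features=1, seed=0):
    """Bagging: one stump per bootstrap sample, each with random candidate features."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees, oob_sets = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        oob_sets.append(set(range(n)) - set(idx))   # rows this tree never saw
        features = rng.sample(range(n_features), max_features)
        trees.append(fit_stump([rows[i] for i in idx],
                               [targets[i] for i in idx], features))
    return trees, oob_sets


def predict(trees, row):
    """Average the predictions of all trees in the ensemble."""
    return sum(tree(row) for tree in trees) / len(trees)


# Toy data (made up): two input variables, one target
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50), (6, 60)]
targets = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
trees, oob_sets = fit_forest(rows, targets)
low, high = predict(trees, (1, 10)), predict(trees, (6, 60))
```

The out-of-bag sets returned by `fit_forest` are the observations each tree never saw; they can be used to estimate prediction error without a separate validation set.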
Extreme Gradient Boosting
Extreme gradient boosting (XGBoost) is a scalable end-to-end tree-boosting algorithm (Chen & Guestrin, 2016). It can be applied to regression, ranking, and classification problems. The algorithm performs parallel tree learning and uses a novel sparsity-aware approach for handling sparse data (Chen & Guestrin, 2016). XGBoost employs a second-order Taylor expansion to approximate the loss function. The model is known to outperform other ML models in many settings, and its success can be seen in numerous ML and data-driven competitions such as those hosted on Kaggle (Dey et al., 2019; Rout et al., 2020). The fundamental factor behind the success of XGBoost is its scalability in all circumstances (Chen & Guestrin, 2016). That is, using an optimal amount of resources, the algorithm yields state-of-the-art results on a wide range of problems. Its adoption is thus driven by its high speed and performance.
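For reference, the second-order approximation mentioned above can be written out as follows. The notation follows Chen & Guestrin (2016) rather than this chapter, so every symbol here is an addition: $l$ is the loss, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $f_t$ the tree added at step $t$, $T$ its number of leaves, $w$ its vector of leaf weights, $I_j$ the set of observations falling in leaf $j$, and $\gamma$, $\lambda$ the regularization constants.

```latex
\[
\begin{aligned}
\mathcal{L}^{(t)} &\simeq \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
      + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}, \\[4pt]
g_i &= \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}, \\[4pt]
w_j^{*} &= -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
\end{aligned}
\]
```

For the squared-error loss $l = \tfrac{1}{2}(y_i - \hat{y}_i)^2$, the gradient is $g_i = \hat{y}_i^{(t-1)} - y_i$ and the Hessian is $h_i = 1$, so the optimal leaf weight $w_j^{*}$ reduces to the sum of the residuals in the leaf divided by the leaf size plus $\lambda$.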
Support Vector Regression
Support vector regression (SVR) is a type of support vector machine used for regression purposes and hence handles continuous values (Das et al., 2020; Dey et al., 2019). It follows the same principles as support vector classification, with minimal modifications. The most important distinguishing factor between a simple linear regression model and SVR is that the former attempts to minimize the error while SVR tries to keep the error within a designated threshold. This kernel-based model has advantages in high-dimensional spaces since its optimization does not rely on the dimensionality of the input space (Das et al., 2018; Drucker, Burges, Kaufman, Smola, & Vapnik, 1997). Moreover, SVR is a non-parametric tool and does not rely on the distributions of the underlying input and output variables. Although less popular than support vector classification (SVC), SVR has proven to be an effective technique for solving real-world problems (Awad & Khanna, 2015).
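The "designated threshold" can be made concrete with the epsilon-insensitive loss that is standard in SVR: errors smaller than a tolerance (epsilon) cost nothing, and larger errors are penalized only by the amount they exceed it. The sketch below contrasts this with the squared loss of ordinary regression; the function names, the numbers, and the tolerance of 0.1 are made up for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-observation SVR loss: zero inside the tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


def squared_loss(y_true, y_pred):
    """Per-observation squared error, as minimized by least-squares regression."""
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]


# Made-up values: three predictions with growing errors
y_true = [10.0, 10.0, 10.0]
y_pred = [10.05, 10.5, 12.0]
svr_losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
ols_losses = squared_loss(y_true, y_pred)
```

The first prediction falls inside the tolerance band and so contributes nothing to the SVR loss, whereas it still contributes a small amount to the squared loss.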
Data Division and the Experimental Environment
We compared the accuracies and computational costs of these four ML algorithms. The data used for training and validating the models were standardized to prevent one variable from dominating another (Boateng, Pillay, & Davis, 2019), since the variables used in this study have different units (Acheampong & Boateng, 2019; Bannor & Acheampong, 2019). Eighty percent (188 quarters) of the data were used to train each model, while the remaining 20% (48 quarters) were used to validate the models. Similar data proportions were used by Morano et al. (2015); Lam et al. (2008); Acheampong & Boateng (2019); and Bannor & Acheampong (2019). For the hardware and software environment, we used an Intel i5-2520M (4) CPU at 3.2 GHz with 8 GB of memory, running Ubuntu 18.04.2 LTS. Two graphics processing units (GPUs) were used: an Intel 2nd Generation Core Proce, and an NVIDIA GeForce GTX 1050Ti with 8 GB memory. We used Spyder (Python 3.6.7) to write and execute the code.
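A minimal sketch of the 80/20 division and the standardization step is given below. Two points are assumptions, since the chapter does not state them: the split is taken in time order, and the mean and standard deviation are computed on the training portion only and then reused for the validation portion. The series itself is synthetic.

```python
import math
import statistics


def chronological_split(values, train_fraction=0.8):
    """Keep the first share of observations for training, the rest for validation."""
    n_train = int(len(values) * train_fraction)
    return values[:n_train], values[n_train:]


def standardize(values, mean, sd):
    """Convert values to z-scores using the supplied mean and standard deviation."""
    return [(v - mean) / sd for v in values]


# Synthetic quarterly series with 236 observations (1960Q1 to 2018Q4)
series = [100.0 + 0.5 * q + 5.0 * math.sin(q / 4) for q in range(236)]

train, test = chronological_split(series)  # 188 and 48 observations
mean, sd = statistics.mean(train), statistics.pstdev(train)
train_scaled = standardize(train, mean, sd)
test_scaled = standardize(test, mean, sd)  # reuses the training statistics
```

Reusing the training statistics for the validation data keeps information from the held-out quarters out of the fitted models.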
Optimization of Hyperparameters
As the goal of this study is to evaluate predictive ML models, optimal models suitable for comparison need to be developed. We performed a grid search with ten-fold cross-validation on the hyperparameters of each model. This technique shuffles and resamples the training data into ten equal folds, fits the model with one combination of hyperparameters on nine folds, and tests the model on the remaining fold, so that each fold is held out once (Bannor & Acheampong, 2019). A carefully tuned model is at lower risk of underfitting and overfitting. The grid search returns the best-scoring combination of hyperparameters and associated arguments for developing the model. For all models, we assessed their mean cross-validation scores (MCVs). The highest MCV is used as the basis for selecting the ideal combination of hyperparameters.
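The procedure can be sketched in plain Python as follows. The chapter does not name the library used for the grid search, so this is a stand-in rather than the authors' code: the model is a simple one-dimensional nearest-neighbour regressor, the only hyperparameter is the number of neighbours, and the score is the negative MSE so that a higher mean cross-validation score is better. All names and data are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / n_neighbors


def mean_cv_score(x, y, n_neighbors, folds):
    """Mean over folds of the negative MSE on the held-out fold."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        errors = [(y[i] - knn_predict(train_x, train_y, x[i], n_neighbors)) ** 2
                  for i in held_out]
        scores.append(-sum(errors) / len(errors))
    return sum(scores) / len(scores)


def grid_search(x, y, grid, k=10):
    """Score every hyperparameter value with k-fold CV and keep the best one."""
    folds = k_fold_indices(len(x), k)
    results = {value: mean_cv_score(x, y, value, folds) for value in grid}
    best = max(results, key=results.get)
    return best, results


# Synthetic data: 60 points on a smooth curve
x = [i / 10 for i in range(60)]
y = [v ** 2 for v in x]
best, results = grid_search(x, y, grid=[1, 2, 3, 5])
```

With several hyperparameters, the grid becomes the Cartesian product of their candidate values, and the same loop is run once for every combination.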
Parameter Tuning for the DT, RF, and XGBoost Algorithms
In prior experimentation, certain hyperparameters were found to influence the performance of the models significantly and hence were used in tuning the ML algorithms. In the case of the DT algorithm, five different maximum leaf nodes (none, 2, 3, 5, and 7), five maximum tree depths (1, 3, 5, 7, and 9), and five minimum samples for a leaf node (1, 3, 5, 7, and 9), evaluated over the ten folds of data, resulted in 1250 models (125 combinations × 10 folds). We used a random state of zero for the DT regressor. For the RF algorithm, five numbers of estimators/trees (100, 150, 200, 250, and 300), five maximum tree depths (1, 3, 5, 7, and 9), and five minimum samples for a leaf node (1, 3, 5, 7, and 9) were specified. This also led to 1250 models. The hyperparameters of XGBoost, namely the number of estimators (100, 150, 200, 250, and 300), the maximum tree depth (1, 3, 5, 7, and 9), and the learning rate (0.01, 0.05, 0.1, 0.2, and 0.3), were also tuned, totaling 1250 models.
Parameter Tuning for the SVR Algorithm
For the SVR algorithm, six penalty “C” parameter arguments (10.0, 50.0, 100.0, 1000.0, 1050.0, and 1100.0), seven different gamma values (0.0005, 0.0001, 0.001, 0.01, 0.1, 1.0, and 10.0), and three types of kernels (radial basis function, linear function, and polynomial function) were experimented with over ten folds of the training dataset using the grid-search framework. In all, 1260 models were developed for the SVR (126 combinations × 10 folds). After the grid search with ten-fold cross-validation, we selected the ideal hyperparameter arguments for each algorithm based on their MCV scores.
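The model counts reported for the four algorithms follow directly from the sizes of the grids. The short check below enumerates each grid and multiplies by the ten folds. The dictionary keys follow scikit-learn and XGBoost parameter naming, which is an assumption, since the chapter lists the values but not the argument names.

```python
from itertools import product

N_FOLDS = 10

# Candidate values as listed in the text; key names are assumed
grids = {
    "DT": {
        "max_leaf_nodes": [None, 2, 3, 5, 7],
        "max_depth": [1, 3, 5, 7, 9],
        "min_samples_leaf": [1, 3, 5, 7, 9],
    },
    "RF": {
        "n_estimators": [100, 150, 200, 250, 300],
        "max_depth": [1, 3, 5, 7, 9],
        "min_samples_leaf": [1, 3, 5, 7, 9],
    },
    "XGBoost": {
        "n_estimators": [100, 150, 200, 250, 300],
        "max_depth": [1, 3, 5, 7, 9],
        "learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
    },
    "SVR": {
        "C": [10.0, 50.0, 100.0, 1000.0, 1050.0, 1100.0],
        "gamma": [0.0005, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
        "kernel": ["rbf", "linear", "poly"],
    },
}


def count_fits(grid, n_folds=N_FOLDS):
    """Number of model fits: every combination is fitted once per fold."""
    n_combinations = sum(1 for _ in product(*grid.values()))
    return n_combinations * n_folds


fits = {name: count_fits(grid) for name, grid in grids.items()}
# fits == {"DT": 1250, "RF": 1250, "XGBoost": 1250, "SVR": 1260}
```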
Performance Metrics
We assessed the accuracy of each algorithm in predicting the 48 carbon emission data points in the test dataset. By evaluating the deviations between the predicted and actual emissions, models with lower errors were ranked higher in terms of accuracy. The root mean squared error (RMSE), the coefficient of determination (R^{2}), and the mean absolute percentage error (MAPE) were used to compare the levels of deviation among the four ML algorithms. We also assessed the computational efficiency of the four models. In particular, we measured the time elapsed during the grid search with ten-fold cross-validation for each algorithm.
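The three accuracy metrics can be computed as in the sketch below, which uses their standard textbook definitions; the chapter does not print the formulas, and the sample values are made up. Lower RMSE and MAPE and higher R^{2} indicate a more accurate model.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (requires non-zero actuals)."""
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up actual and predicted values
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

error_rmse = rmse(y_true, y_pred)    # 10.0
error_r2 = r_squared(y_true, y_pred)  # 0.992
error_mape = mape(y_true, y_pred)    # about 5.21
```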