# Forecasting Air Quality in India through an Ensemble Clustering Technique

**J. Anuradha, S. Vandhana, and Sasya I. Reddi**

## Introduction

Air pollution has been a major cause of various diseases in recent years. Ambient (outdoor) air pollution was associated with an estimated 4.2 million deaths worldwide in 2016. Prolonged exposure to particulate matter (PM) of diameter less than 2.5 microns can prove fatal (WHO 2018). It causes cardiovascular and respiratory diseases and cancers. It is a growing problem affecting all countries in the world.

In addition to affecting human health directly, air pollutants can cause long-term damage to the environment, which results in climate change and an indirect threat to health and well-being. Apart from PM, exposure to ozone (O₃) can also drastically affect health. Ozone is one of the major factors in asthma morbidity and mortality, and it can cause reduced lung function, lung cancer, and other breathing problems. The air pollutants NO₂ and PM₂.₅ have several correlated activities, as shown by Dai, Luo, Luo, Qin, and Peng (2014). NO₂ is one of the main sources of nitrate aerosols. Epidemiological studies have shown that NO₂ is also linked to reduced lung function. In a few cities in North America and Europe, the concentration levels of NO₂ are monitored regularly. SO₂ is known to aggravate chronic bronchitis and asthma, make people more vulnerable to respiratory tract infections, and cause respiratory tract inflammation and eye irritation. On days when the concentration levels of SO₂ are higher, the number of cases of cardiac disease and mortality is higher, as stated by the WHO (2018).

In the late 1960s, the National Wildlife Federation of the USA developed the Environmental Quality Index (EQI), of which the Air Quality Index is an integral part. Many researchers have concentrated on ambient air quality and applied methods of classification, clustering, prediction, and forecasting for the well-being of humans. Clustering or grouping the data reveals the shape of the data and the patterns it falls into, so that control measures can be taken to reduce the effect on human health in the future.

Various data mining research activities involve clustering as a key step, as it aims at separating information into clusters or classes on the basis of a certain measure of similarity. The final objective is to place data points that are similar to each other in the same cluster and points that differ in separate clusters. It is well known that different clustering methods, when applied to the same set of data, can discover different patterns. This is because each algorithm optimizes a different criterion and therefore has its own bias. Another challenge of clustering is the validation of the results when ground truth is not available.

Clustering ensembles have appeared in recent years as a strategy for solving some of the clustering problems, as shown by Ghosh and Acharya (2011) and Strehl and Ghosh (2012). An ensemble clustering method is described by its two major components:

- Generation of diverse partitions;
- Combining partitions into a final clustering using a consensus function.

A clustering ensemble can be built by running the same algorithm with different initializations, by running it on different bootstrap samples of the same data, or by applying different algorithms. A clustering ensemble offers a solution to various challenges that are inherent to clustering. By forming a consensus over several clustering runs, more reliable and robust solutions can be provided. The variance induced by different data samples and the biases of individual algorithms can both give rise to spurious structures, which the consensus function can average out.
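The first component, generation of diverse partitions, can be sketched as follows. This is a minimal illustration, assuming a plain Lloyd-style K-means on 2-D points whose only source of diversity is the random seed used for initialization; the function names and the toy data are hypothetical and not taken from the chapter.

```python
import random

def kmeans_labels(points, k, seed, iters=20):
    """Plain Lloyd's k-means on 2-D points; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # the initialization depends on the seed
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels

def generate_partitions(points, k, n_members):
    """One base partition per ensemble member, each from a different seed."""
    return [kmeans_labels(points, k, seed) for seed in range(n_members)]

# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
partitions = generate_partitions(toy_points, k=2, n_members=5)
```

Label values are arbitrary across members (cluster 0 in one partition may correspond to cluster 1 in another), which is why a consensus function is needed rather than a direct vote on labels.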

A subspace clustering can also be viewed as a weighted cluster array, in which each cluster represents the importance of its features using a weight vector. In the subspace clustering ensemble, the clusters and the weight vectors given by the base subspace clusters are used in the consensus function. The final consensus clustering can be improved by evaluating the relevance of the base clustering (and the assignments of weights accordingly) as shown by Li, Ding, and Jordan (2007). To the best of our knowledge, Nock and Nielsen (2006) were the first to explore how to use object related weights in the ensemble clustering method.

Researchers analyzed the advantages of weighting objects in combination with advanced iterative clustering approaches, such as a probabilistic model, a K-harmonic mean, and a K-harmonic mean for clustering with a boosting technique, as shown by Hamerly and Elkan (2012), Topchy, Jain, and Punch (2014), and Zhang, Hsu, and Dayal (2000). Empirical observations suggest that objects that are difficult to cluster should be given more weight.

Weighted centroid-based clustering methods move the centers of the clusters toward regions where it is difficult to identify the cluster membership of the objects. In boosting, the areas around the difficult points are densified because the distribution of data is biased by the weights. Thus, the centers of the clusters are moved towards the modes of the weight-modified distribution. Using Bregman divergence, clustering was formulated as a constrained minimization by Nock and Nielsen (2016). Objects that are hard to cluster are recommended to be given large weights. A few algorithms based on weighted versions were introduced and the advantages of using boosting clustering techniques were analyzed.
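The effect of object weights on a cluster center can be illustrated with a weighted mean. The sketch below is only an illustration of the idea, not the algorithm of any cited paper; the points and weights are hypothetical.

```python
def weighted_centroid(points, weights):
    """Weighted mean of the points; larger weights pull the center closer."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    )

# Hypothetical cluster: the third point is far from the others ("hard to cluster").
cluster_points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]

uniform_center = weighted_centroid(cluster_points, [1, 1, 1])  # (4.0, 0.0)
boosted_center = weighted_centroid(cluster_points, [1, 1, 4])  # (7.0, 0.0)
```

Raising the weight of the difficult point from 1 to 4 moves the center from x = 4 to x = 7, that is, toward the region where membership is uncertain.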

The rest of the chapter is organized as follows. Related work on air quality prediction, ensemble modeling, and ensemble clustering is reviewed first. This is followed by a description of the dataset, an explanation of the methodology, including the ensemble consensus function and the METIS function, and finally the experimental results that support the proposed work.

## Related Works

### Air Quality Prediction

Asgari, Farnaghi, and Ghaemi (2017) analyzed and mapped the urban emissions causing air pollution by geographical area. Apache Spark was used to study Tehran data from 2009 to 2013. In addition, Naive Bayes and logistic regression algorithms were used for prediction and their accuracy was assessed. The authors concluded that Naive Bayes forecast the data more accurately than the other machine learning algorithms used to identify unknown air quality groups. Strong results were thus obtained in this paper with Apache Spark.

Zhu, Cai, Yang, and Zhou (2018) used optimization and regularization techniques to predict the air pollutant values for the next day. The pollutants predicted in the paper were sulfur dioxide (SO₂), particulate matter (PM₂.₅), and ozone (O₃). Data from two stations were used for prediction: the values of O₃ and SO₂ were predicted for one station, and O₃ and PM₂.₅ for the other. The data was modeled based on similarity and, for grouping, linear regression was used. The evaluation criterion was the root-mean-squared error (RMSE); however, the linear regression model failed to forecast or handle unforeseen events.
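For reference, the RMSE criterion mentioned above is the square root of the mean squared difference between observed and predicted values. A minimal sketch follows; the function name and the numbers in the usage lines are illustrative only and are not taken from the cited study.

```python
import math

def rmse(actual, predicted):
    """Root-mean-squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed and forecast concentrations.
perfect = rmse([40.0, 55.0, 61.0], [40.0, 55.0, 61.0])  # no error at all
off = rmse([0.0, 0.0], [3.0, 4.0])                      # sqrt((9 + 16) / 2)
```

Because the errors are squared before averaging, a few large misses (such as unforeseen pollution events) dominate the score.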

Gore and Deshpande (2017) studied the effects of air quality on health and the classification of the air quality index. For classification, they implemented a decision tree, J48, and Naive Bayes, and the results indicated that the decision tree algorithm performs better than the rest, with an accuracy of 91.9978%. However, this research has certain shortcomings: the dataset is limited, and the decision tree cannot perform well on continuous variables and has issues with overfitting. A classification of the air quality index using the K-means algorithm was proposed by Kingsy, Manimegalai, Geetha, Rajathi, Usha, and Raabiathul (2016). Again the dataset was limited, and the K-means technique was unfit for predicting future values.

The limitation of computational models for air quality is discussed by NASA Goddard Space Flight (2018). They proposed different machine learning techniques to forecast the concentration levels of O₃ in various countries. The dimensionality of the data is reduced in the pre-processing step (sparse sampling and randomized matrix decompositions). A random forest regression technique is used for forecasting the next ten days. The researchers used only one pollutant (O₃) for future prediction and the data subsample size is small. Air pollution prediction using the dynamic neural network (DNN) approach was carried out on data generated by low cost sensors by Esposito, De Vito, Salvato, Bright, Jones, and Popoola (2016).

Deep learning techniques to predict the concentration levels of ozone in smart cities were proposed by Ghoneim et al. (2017). Deep learning using a feed forward neural network was applied to the Aarhus city dataset. The model was compared with neural networks (NN) and a support vector machine (SVM). The comparison showed that the deep neural network predicted the pollution values accurately.

Ozone concentration was studied in Tunisia by Ishak, Daoud, and Trabelsi (2017). Ozone concentration recorded at three stations was considered for prediction. A random forest and support vector regression were applied for prediction. Random forest was found to be more accurate than SVM.

### Ensemble Modeling

Ensemble modeling is a machine learning technique in which several diverse base learning models are used to predict the output. The objective of this approach is to decrease the variance and bias of the model, which reduces the generalization error and also provides stability to the model. In this way, the performance of the ensemble model is improved. As long as the base models are independent and diverse, the performance is generally better. This technique uses the wisdom of collective results from base learners by applying any one of the aggregation methods mentioned in Figure 6.1. With this approach one can construct a strong learner out of weak learners by using one of the following approaches:

- 1. Bagging (reduce variance);
- 2. Boosting (reduce bias);
- 3. Stacking (improves prediction).

FIGURE 6.1 List of aggregation functions applied on ensemble model.

FIGURE 6.2 Construction of ensemble model.

Figure 6.2 shows the construction of an ensemble model from various base learners. Here, the training samples are divided into multiple subsets of samples by random selection with replacement. This procedure is called *bootstrap aggregation* or *bagging* and involves repeated random resampling of the training data (Breiman 1996). The sub-sample data may have complete features or a subset of features, and each of these samples is used to train a base learner. Finally, the results from all the learners are aggregated. Decision trees and neural networks are unstable learners because even a slight change in the training data may change the output. Bagging decreases this variance in the results and the error it may cause.
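A minimal sketch of bagging is given here, assuming a deliberately simple base learner (a nearest-class-mean rule on a single numeric feature) and majority voting as the aggregation function. The learner, the function names, and the toy data are all hypothetical choices made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_mean_rule(sample):
    """Base learner: predict the class whose mean feature value is nearest."""
    means = {}
    for label in {y for _, y in sample}:
        values = [x for x, y in sample if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda lab: abs(x - means[lab]))

def bagging_fit(data, n_models, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [train_mean_rule(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-D training data: (feature value, class label).
train = [(1.0, 0), (1.5, 0), (2.0, 0), (8.0, 1), (8.5, 1), (9.0, 1)]
models = bagging_fit(train, n_models=25)
low_prediction = bagging_predict(models, 1.2)
high_prediction = bagging_predict(models, 8.8)
```

An individual bootstrap sample may happen to contain only one class, in which case its learner always predicts that class; the vote over many learners absorbs such unstable members.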

The *boosting* method in ensembling uses a re-weighting approach for the samples during the training phase. This approach is capable of boosting the performance of the weak learners, several of which are involved iteratively during the training of a sample. Results from the different hypotheses are combined to form a strong learner. This was first introduced and later revised into the AdaBoost algorithm by Schapire (1999).

Initially, all samples are assigned uniform weights. Weak learners are trained on the weighted samples. Based on the misclassifications in the results of the model, the weights of the samples are reassigned: the weight of each misclassified sample is increased and that of each correctly classified sample is reduced. Weak learners are trained iteratively in this manner, and the results from all the base learners are aggregated. This is depicted in Figure 6.3. The boxes in the figure are proportionate to the weights of the samples. The tick and cross symbols represent correct and incorrect classifications. Every base learner produces a hypothesis. Thus $h_1$, $h_2$, $h_3$, and $h_4$ are generated by the four weak learners trained on weighted samples. Finally, the hypotheses are aggregated to generate a single hypothesis. Samples are tested on this hypothesis.

FIGURE 6.3 Re-weightage in the boosting model.
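One round of the re-weighting can be sketched as follows. The chapter does not spell out the update formula, so this sketch assumes the standard AdaBoost rule for two classes: the learner's vote is $\alpha = \frac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$, where $\varepsilon$ is the weighted error, and each weight is multiplied by $e^{\alpha}$ if the sample was misclassified and by $e^{-\alpha}$ otherwise. The function name and the example are illustrative.

```python
import math

def adaboost_reweight(weights, correct):
    """One AdaBoost round: returns (new normalized weights, learner vote alpha).

    weights -- current sample weights, summing to 1
    correct -- booleans, True where the weak learner classified the sample correctly
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("a weak learner needs a weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones scaled down.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], alpha

# Five samples with uniform weights; only the last one is misclassified.
start = [0.2] * 5
new_weights, alpha = adaboost_reweight(start, [True, True, True, True, False])
```

In this example the single misclassified sample ends up carrying half of the total weight, so the next weak learner is pushed to get it right.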

*Stacking* combines several base learners using a meta-classifier or regressor. Base learners are trained on a subset of samples. A meta-classifier/regressor works on top of the base learner. The procedure for stacking is given below.

The training set is split into two sets that are disjoint:

- 1. In one part, several base learners are trained;
- 2. In the other part, the base learners are tested;
- 3. Higher level learners are then trained using input as the predictions from the previous step and the output as the correct responses.
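The three steps above can be sketched as follows, assuming two simple base learners (a nearest-class-mean rule on each of two features) and a lookup-table meta-classifier that maps each combination of base predictions to the most frequent true label. These learners, the function names, and the toy data are hypothetical choices for illustration only.

```python
from collections import Counter, defaultdict

def train_mean_rule(rows, feature):
    """Base learner: nearest class mean on a single feature."""
    means = {}
    for label in {y for _, y in rows}:
        values = [x[feature] for x, y in rows if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda lab: abs(x[feature] - means[lab]))

def train_meta(base_models, rows):
    """Meta-learner: majority true label for each tuple of base predictions."""
    table = defaultdict(Counter)
    for x, y in rows:
        table[tuple(m(x) for m in base_models)][y] += 1
    fallback = Counter(y for _, y in rows).most_common(1)[0][0]

    def predict(x):
        key = tuple(m(x) for m in base_models)
        return table[key].most_common(1)[0][0] if key in table else fallback

    return predict

# Hypothetical data: ((feature 0, feature 1), label), split into two disjoint parts.
rows = [((1.0, 5.0), 0), ((1.2, 4.8), 0), ((3.0, 1.0), 1), ((3.2, 1.1), 1),
        ((0.9, 5.2), 0), ((1.1, 4.9), 0), ((2.9, 0.8), 1), ((3.1, 1.2), 1)]
part_a, part_b = rows[:4], rows[4:]

base_models = [train_mean_rule(part_a, 0), train_mean_rule(part_a, 1)]  # step 1
stacked = train_meta(base_models, part_b)                               # steps 2 and 3
```

Training the meta-learner on a part of the data the base learners have not seen keeps it from simply learning how well the base learners memorized their own training samples.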

#### Variants of Ensemble Models

Variants of ensemble methods have been proposed using fuzzy, neural network, and statistical approaches. Fuzzy instance weight assignment for ensemble classification on data streams was proposed to identify concept drift. This is an adaptive approach that uses a dynamic voting method. The method, proposed by Dong et al., has the advantage of lower computational cost with better accuracy. It is adaptable and can recognize concept drift.

A dynamic weighted neural network ensemble model uses a bagging scheme. The dynamic weighting method integrates the classifiers of a neural network ensemble. This method outperforms traditional integration methods (Li et al. 2007). A composite prediction output can also be obtained by combining various Long Short Term Memory (LSTM) models and dynamically adjusting the combining weights. Using a forgetting factor and past prediction errors, the combining weights can be updated in a recursive and adaptive way (Choi and Lee 2018).
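The idea of recursively updated combining weights can be sketched as follows. The exact update rule of Choi and Lee (2018) is not given in this chapter, so the sketch assumes a generic rule: each model keeps an exponentially discounted sum of its squared errors, and the combining weights are proportional to the inverse of that sum. The function name, the forgetting factor of 0.9, and the error values are assumptions.

```python
def update_combining_weights(discounted_errors, new_errors, forgetting=0.9, eps=1e-12):
    """Recursive update of per-model error memory and combining weights.

    discounted_errors -- running discounted squared error of each model
    new_errors        -- latest prediction error of each model
    forgetting        -- factor in (0, 1]; smaller values forget the past faster
    """
    updated = [forgetting * s + e * e for s, e in zip(discounted_errors, new_errors)]
    inverse = [1.0 / (s + eps) for s in updated]  # eps avoids division by zero
    total = sum(inverse)
    return updated, [v / total for v in inverse]

# Two hypothetical models; the second one made the larger error in the last step.
memory = [0.0, 0.0]
memory, weights = update_combining_weights(memory, [1.0, 2.0])
```

Here the first model receives a weight of about 0.8 and the second about 0.2, and because old errors are multiplied by the forgetting factor at every step, a model that improves can regain weight over time.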

### Ensemble Clustering

Turning to clustering techniques, many domains have shown that a classifier ensemble is, in most cases, more accurate than individual classification techniques. This has initiated research in the area of ensemble methods for clustering. Fred and Jain (2002) generated diverse base clustering results with different initializations of the centroids in a K-means clustering algorithm. The K-means results are mapped onto a co-association matrix, which measures the similarity between samples. The work was extended by Kuncheva and Hadjitodorov (2014), in which a random number of clusters was chosen for each member of the clustering ensemble. A two-step procedure for meta-clustering was introduced by Zeng, Tang, Garcia-Frias, and Gao (2012). Initially, the clustering results are converted into distance matrices. Next, the distance matrices from the various clusterings are combined using a hierarchical clustering method to compute a consensus clustering.

Hu (2008) generated combined clustering results by using a graph-based partitioning approach. Ayad and Kamel (2013) introduced a graph approach with vertices and edges. Data points are represented by vertices, and an edge exists between two vertices when the corresponding data points share a certain number of nearest neighbors. A random projection of data points is combined with a cluster ensemble by Fred and Jain (2002). Clustering is carried out using the expectation-maximization approach, and an agglomerative hierarchical clustering is applied to obtain the final results. Greene, Tsymbal, Bolshakova, and Cunningham (2014) focused on different integration techniques for the input clusterings. The dataset used was medical diagnostic data. Base clusterings were generated using fast, weak K-medoids and K-means clustering. The aggregation of results is given by a co-occurrence matrix, upon which hierarchical clustering schemes are applied for consensus ensemble clustering. Different hierarchical methods, such as single linkage, complete linkage, mean, median, and Ward, were used to cluster the Dengue data (Vandhana and Anuradha 2018). The base clustering results can also be produced by various hierarchical methods.
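The co-association (co-occurrence) idea can be sketched as follows. Entry (i, j) of the matrix is the fraction of base partitions that place objects i and j in the same cluster. For the consensus step, this sketch assumes the simplest hierarchical rule: objects are merged whenever their co-association exceeds a threshold, which corresponds to cutting a single-linkage dendrogram at that level. The function names, the threshold of 0.5, and the toy partitions are illustrative.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]

def consensus_single_link(co, threshold=0.5):
    """Connected components of the graph joining pairs with co > threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Three hypothetical base partitions of five objects (label values are arbitrary).
base = [[0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1]]
co = co_association(base)
consensus = consensus_single_link(co)
```

Object 2 is grouped with objects 3 and 4 in two of the three partitions, so the consensus places it in their cluster even though one base partition disagrees.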

Strehl and Ghosh (2012) combine the clustering results using a novel consensus function that maximizes the normalized mutual information between the clusterings. Three heuristics are used, each of which represents the clustering ensemble as a hyper-graph in which each cluster of the input partitionings is a hyper-edge. The three heuristics are: a meta-clustering algorithm (MCLA), a hyper-graph partitioning algorithm (HGPA), and a cluster-based similarity partitioning algorithm (CSPA). In CSPA, each input clustering is converted into a binary similarity matrix. For each pair of points, the value is 1 if they belong to the same cluster and 0 otherwise. A similarity matrix S is generated as the average of all these matrices. The objects are then re-clustered from S with a graph-based partitioning approach. In the generated similarity graph, the vertices correspond to the data objects and the edge weights represent the similarity between the vertices. METIS (Karypis and Kumar 1998) is used for the final partitioning.
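The normalized mutual information (NMI) between two labelings can be sketched as follows, assuming the normalization by the geometric mean of the two entropies, $\mathrm{NMI}(a, b) = I(a; b) / \sqrt{H(a)\,H(b)}$. The function names are illustrative, and returning 0.0 when one labeling has a single cluster is a choice made for this sketch.

```python
import math
from collections import Counter

def nmi(a, b):
    """Normalized mutual information between two label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mutual_info = sum(
        (nij / n) * math.log(nij * n / (count_a[i] * count_b[j]))
        for (i, j), nij in count_ab.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        return 0.0  # undefined for a single-cluster labeling; 0.0 chosen here
    return mutual_info / math.sqrt(entropy_a * entropy_b)

def average_nmi(candidate, partitions):
    """Objective of the consensus: mean NMI between a candidate and the ensemble."""
    return sum(nmi(candidate, p) for p in partitions) / len(partitions)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

NMI ignores how the clusters are named, so two partitions that differ only by a relabeling score 1, while two partitions that carry no information about each other score 0.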

In HGPA, the hyper-graph is partitioned by cutting a minimal number of hyper-edges. Each cluster from the input clusterings represents a hyper-edge. Initially, the same weight is assigned to all the hyper-edges. The algorithm chooses the hyper-edges to cut such that the hyper-graph is separated into K components of approximately the same size, using the HMETIS package.
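The hyper-graph representation and the quantity being minimized can be sketched as follows. This sketch only builds the hyper-edges and counts how many of them a given candidate partition cuts; it does not reproduce the HMETIS partitioner or its balance constraint, and the function names and toy partitions are hypothetical.

```python
def hyper_edges(partitions):
    """One hyper-edge (set of object indices) per cluster of every partition."""
    edges = []
    for labels in partitions:
        for cluster in sorted(set(labels)):
            edges.append(frozenset(i for i, lab in enumerate(labels) if lab == cluster))
    return edges

def cut_size(edges, components):
    """Number of hyper-edges whose objects fall into more than one component."""
    return sum(len({components[i] for i in edge}) > 1 for edge in edges)

# Two hypothetical base partitions of three objects.
edges = hyper_edges([[0, 0, 1], [0, 1, 1]])

cut_a = cut_size(edges, [0, 0, 1])  # separates object 2 from objects 0 and 1
cut_b = cut_size(edges, [0, 1, 0])  # separates object 1 from objects 0 and 2
```

With equal hyper-edge weights, the first candidate cuts one hyper-edge and the second cuts two, so a minimal-cut criterion prefers the first.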

In MCLA, meta-clusters are formed which hold clusterings of clusters. Each object receives an object-wise weight that describes its membership in each meta-cluster. The hyper-edges are grouped using the graph, and each data object is assigned to the meta-cluster in which its participation is the strongest.
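The final assignment step can be sketched as follows, assuming the grouping of hyper-edges into meta-clusters has already been done. The association of an object with a meta-cluster is taken here as the fraction of that meta-cluster's hyper-edges containing the object, and ties go to the first meta-cluster. The function name and the meta-clusters in the example are hypothetical.

```python
def assign_to_meta_clusters(meta_clusters, n_objects):
    """Assign each object to the meta-cluster in which it participates most."""
    labels = []
    for obj in range(n_objects):
        # Association: share of the meta-cluster's hyper-edges that contain obj.
        scores = [
            sum(obj in edge for edge in meta) / len(meta)
            for meta in meta_clusters
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels

# Hypothetical meta-clusters, each a group of hyper-edges from three base partitions.
meta_clusters = [
    [{0, 1}, {0, 1, 2}, {0, 1}],
    [{2, 3}, {3}, {2, 3}],
]
final_labels = assign_to_meta_clusters(meta_clusters, n_objects=4)
```

Object 2 appears in one of three hyper-edges of the first meta-cluster and two of three in the second, so it is assigned to the second.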

Fuzzy theory has been incorporated into the consensus clustering framework to improve the final results. A consensus clustering approach based on fuzzy C-means is explored by Punera and Ghosh (2018). The underlying structure of various datasets was discovered by Mok, Huang, Kwok, and Au (2012) based on an ensemble framework using a fuzzy C-means algorithm. A fuzzy consensus function for ensemble clustering was studied by Sevillano, Alias, and Socoro (2012). The biological interpretation of clusters was provided by Avogadri and Valentini (2009) using a random projection technique based on a fuzzy clustering ensemble framework. A hybrid fuzzy cluster ensemble framework was proposed by Yu, Chen, You, Han, and Li (2013). In this framework, a set of associated fuzzy membership functions, the fuzzy C-means algorithm, and the affinity propagation algorithm are integrated.
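For readers unfamiliar with fuzzy C-means, the membership computation at its core can be sketched as follows. This assumes the standard membership formula $u_i = 1 / \sum_k (d_i / d_k)^{2/(m-1)}$ with fuzzifier $m > 1$, applied to one-dimensional data for brevity; the function name and the values are illustrative and not taken from the cited frameworks.

```python
def fuzzy_memberships(point, centers, m=2.0):
    """Degree of membership of a 1-D point in each cluster (values sum to 1)."""
    distances = [abs(point - c) for c in centers]
    if 0.0 in distances:
        # The point coincides with one or more centers: full membership there.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]

# A point at 2.0 with hypothetical cluster centers at 0.0 and 10.0.
memberships = fuzzy_memberships(2.0, [0.0, 10.0])  # [16/17, 1/17]
```

Unlike a hard K-means assignment, the point keeps a small membership (1/17) in the distant cluster, and this soft information is what a fuzzy consensus function can combine across ensemble members.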