 # Sampling from Data Streams

Due to the unbounded size of the data streams and low memory, we need some efficient ways to reduce the data footprint. In the literature, researchers usually use the following techniques to reduce the data load.

• • Sampling.
• • Sliding window.
• • Histogram.
• • Wavelets and Fourier transforms.

We have already seen some examples of sliding windows and they find application at several places such as the DGIM method (Rajaraman and Ullman, 2011) for counting Is in a bit stream and so on. In this section, we are going to discuss a sampling method from data streams.

There are many ways to sample from a data stream. For example, we can use random uniform sampling which is the simplest algorithm. However, doing random uniform sampling in a data stream is not straightforward since all items in the stream are not available at once. Hence, there is another algorithm called reservoir sampling used to sample items from a data stream such that the probability of selecting an item is 1 /n where n is the stream size.

The intuition behind reservoir sampling is as follows. Imagine a swimming pool which can have at most к people in it at any one time. Swimmers are coming continuously and any new swimmer can only enter the pool after a previous swimmer has come out. But, how do you decide which new swimmer to let in? One solution could be to toss a coin (biased) with probability p = k/n where n is the number of swimmers who have arrived so far. If it comes up heads, let the swimmer in, otherwise deny entry. A reservoir sampling algorithm is given in Algorithm 1.2.

Algorithm 1.2: Reservoir Sampling Algorithm input S: stream of values, k: size of the reservoir

/* Create uniform sample of fixed size */

• 1 Insert first к elements of S to the reservoir.
• 2 for v e S do
• 3 Let i be the index of v in S
• 4 Generate a random int j e [ 1, /]
• s if j < к then
• 6 Insert v into the reservoir
• 7 Delete an item from the reservoir at random
• s end
• 9 end

The algorithm follows the intuition as given above. At the beginning, к items in the stream are inserted into the reservoir. For the other incoming items, a random number is generated between 1 and /' w'here / is the index of the items just arrived. If the random number thus generated lies in [1,&], we insert the new item in the reservoir by deleting an item uniformly at random. Otherwise, discard the new' item. The question that comes up is: What is the probability that the /th item will replace an old item?

Theorem 1.1

The probability that the new incoming item will replace the old one is given by P(a, G R) = k/n, where R denotes the reservoir, к the reservoir size, and n the stream size.

Proof: Note that the item / has probability k/i of being inserted into the reservoir R. (Why?) After that no other item should replace it otherwise it will be gone.

Now, the next (/ + 1 )th item will not replace it, and has probability ^l-т—yj; similarly, the (/ + 2)th item will not replace it and has probability ^1 and so on.

Since all of these probabilities are independent (?), we multiply all of them to get

the desired result. P(a, e/?) = — x—— xiilx-—- = — □

v ' i i +1 / + 2 n n

Thus, we have proved that by using reservoir sampling we can obtain samples w'hich are unbiased. Thus samples can be used for downstream tasks.

# Concept Drift Detection in Data Streams

Real-life data is continuously changing due to the dynamics of the complex environment. As a result, the assumption which most traditional learning algorithms make, namely that the data is independent and identically distributed, does not hold.

The problem is more severe in the case of data streams because of the high data rate and the non-availability of all the data points.

More formally, in data mining and machine learning, we try to find the function/ that maps input x to output y. Such a function is assumed to be stationary’, i.e., the distribution generating the data is fixed (but unknown). But, real-life data is:

• • Non-stationary.
• • Evolving.

Therefore, we need methods that can detect and adapt to changes in the underlying function. The underlying function is called a concept and changes in the function is called concept-drift, also known as change detection, covariate shift, dataset shift, etc.

We should make a difference between noise and concept-drift here. A change in the underlying function is due to the noise that occurs when there are not sufficient examples present in the change region or they cannot form a new concept of their own. The main difficulty lies in determining the threshold when changes in the data form a new concept as opposed to remaining as noise. An example of concept-drift is shown in Figure 1.6. As we can see in the figure, there is a change in the variance of the signal; so it is a concept drift of signals.

## Causes of Concept Drift

Often, the causes of the drift are unknown. However, most researchers try to find causes based on the changes in the model. An explanation of the causes is given by [Kelly et al., 1999]. Let us look at the Bayesian decision formula used in naive Bayes. FIGURE 1.6 Concept drift in signals. where D and c denote the data and class labels. A change can then be attributed to:

• • Class prior p(c) may change over time;
• • Likelihood p(Dlc) might change;
• • Posterior p(c/D) can change.

The first two changes are called virtual concept drift whereas the last one is called a real concept drift. A further illustration of the three causes is given in Figure 1.7. We can see that the original decision boundary shifts its location due to changes in the class prior to (b) (distribution of the red class changes). In Figure 1.7(c), the decision FIGURE 1.7 Effect on (a) original decision boundary due to change in (b) prior class, (c) likelihood, and (d) posterior class (Wang et al., 2018).

boundary remains unaltered. In Figure 1.7(d), the decision boundary changes drastically due to the change in the posterior probability. As a result, the previously learned decision function becomes useless or not effective. That is why handling concept drift due to a change in posterior probability is more difficult to tackle.

## Handling Concept Drift

In the literature, there are many ways to combat concept drift. Mostly, they are characterized in the following ways (Figure 1.8) (Gama, 2010):

In the data management way of handling concept drift, the learning mechanism makes use of memory to keep the data. In a full-memory mechanism, sufficient statistics are maintained over the entire data set by different kinds of weighting, and a model is trained on that. In a partial-memory mechanism, data is stored in the window' and a model is built over it. In detection methods, the techniques focus on the learning mechanism such as how to detect the change and rate of change. These methods usually detect changes by monitoring them in the performance metrics or maintaining the time w'indow'. An example of the first approach is given by Klinkenberg and Joachims (2000), and the second approach is given in Kifer et al. (2004).

In the adaptation method, the learner is adapted to track changes in the data evolution either in an informed way or blindly. In the former case, the learner updates itself only w'hen a change has indeed occurred, whereas in the latter case, the learner updates itself no matter w'hether a change has occurred or not. In decision model management, the technique characterizes the number of decision models kept in memory. It assumes that the data comes from several distributions and tries to learn a new' model whenever a change is detected. The dynamic weighted majority algorithm (Kolter and Maloof, 2007) is an example of decision model management.

In the next sections, we will look at several examples of the change detection algorithms.

### CUSUM Algorithm

Let us look at a classic change detection algorithm called CUSUM (Cumulative SUM) (Basseville and Nikiforov, 1993). This monitors the cumulative sum of the log-likelihood ratio for detecting a change. Consider a sequence of independent random variables x, with pdf pO(x). Our parameter is в which before a change is 00 and after a change is 0,. Assume 0O is known. The log-likelihood ratio is defined by Formally, let S, be the current cumsum of the log-likelihood and m, the current minimum value of Sr The CUSUM compares this difference with a threshold S as follows: where So the change point is: Note that the detection rule mentioned above is a comparison between the cumulative sum S, and an adaptive threshold in, + S. Due to the online update of in,, the threshold is not only modified online, but also keeps a complete memory of the entire information contained in the historical observations. An example of applying CUSUM to detect changes in the cosine signal is shown in Figure 1.9. As we can see, at every negative and positive change, CUSUM demarcates the changes. However, since the change is slow, the actual alarm triggers after a delay (marked with a red dot). The bottom figure shows the cumulative changes in the positive and negative direction with color-coding. Note that the figures generated use a variant of the CUSUM presented above. They have been plotted using the cumulative sum of the difference of the values with a correction factor and then comparing with the threshold.

### The Higia Algorithm

Higia (Garcia et al„ 2019) is an online clustering algorithm for concept drift and novelty detection. The main idea of the algorithm is based on the clustering of evolving data streams using micro-clusters proposed in Aggarwal et al. (2003). Let us first understand the intuition behind the algorithm, then we will present its formal description. The Higia algorithm proceeds in two steps: an offline phase and an online phase. In the offline phase, it creates micro-clusters using an initial set of labeled data. Then, for each incoming data point, it calculates its distance from the centroid of the microclusters. If the distance is more than a user-specified threshold, it puts it in a buffer and, after enough points, declares it to be a novelty. The concept drift is declared when a new incoming data point increases the radius of the cluster. FIGURE 1.9 (Top) Time series data from cosine function along with marked beginning and end of changes. (Bottom) Cumulative sum plot of positive changes. FIGURE 1.10 Higia Offline step.

Now we will look at the algorithm more formally. As shown in Figure 1.10, a set of initial labeled data is clustered following the same strategy as in Aggarwal et al. (2003) using the CluStream algorithm. The next step is an online phase as shown in Figure 1.11. In the online phase, the new incoming data point is fit to the model. Two things can happen here. If the new data point fits the model, two more situations arise. Either the new point is within the radius of a micro-cluster or outside. If the point is within the nearest micro-cluster, the data point is labeled with that of the micro-cluster. However, if it is outside but still within some factor (user-specified parameter) of the radius, the radius of the micro-cluster is updated and a new data point is labeled with that of this micro-cluster. The second case here is an example of concept drift. On the other side, if the data point does not fit the model well, it is put FIGURE 1.11 Higia Online.

into a buffer. When the buffer has received enough data points, the CluStream algorithm is run and the new cluster is put into the trained model.

The Higia is given in Algorithm 1.1. We will illustrate the algorithm more formally. Input to the algorithm is Xlr which is the new test point. T and к are user parameters. In the second step, щ is a list of к nearest micro-clusters. If the majority of the clusters have the same label, we look for the centroid c, of the nearest micro-cluster Cj and denote its radius by radius(Cj). If the Euclidean distance of the point from the centroid is within the radius, update the cluster with the new point and classify the new point with the label of this cluster (lines 7-10). Otherwise, if the Euclidean distance is larger than T times that of the radius of the cluster, then create an extension of the cluster and classify the new data point with the same label (lines 11-13). If the above two scenarios have failed, the new data point in the buffer needs to classify the new data point as an unknown class level (lines 15-16).

Below, we show some empirical results of running the Higia algorithm on some benchmark data sets (Figure 1.12). Accuracy versus time is plotted in Figures 1.13 FIGURE 1.12 Higia algorithm. FIGURE 1.13 Accuracy over time plot for Higia and baselines. Image source: Garcia et al. (2019). FIGURE 1.14 Accuracy over time plot for Higia and baselines. Image source: Garcia et al. (2019).

and 1.14. MINAS and kNN serve as a baseline algorithm. The performance of the Higia algorithm is shown in green. As we can see, Higia outperforms other algorithms over time in three out of four cases. kNN performs the worst because it does not learn over time.

### Dynamic Weighted Majority Algorithm

Dynamic weighted majority (Kolter and Maloof, 2007) abbreviated as DWM is an algorithm for concept-drift tracking. It is based on the majority voting of experts. The fundamental idea behind DWM is that it dynamically creates and removes weighted experts based on their performance. The algorithm works according to four ideas: a train ensemble of learners; weighting them based on their performance; removing experts based on the performance; and adding new' learners based on the global performance.

The DWM algorithm is presented in Algorithm 1.3.1 will explain each step of the algorithm. There are m experts. The training pair is denoted by and there are c classes. The {a which denotes the sum of weighted predictions to 0 (line 5). In the second for loop, we iterate over the experts, making a prediction for each data point and decreasing its weight if it makes a mistake (lines 7-10). In line 11, we update the weighted prediction vector. Line 13 sets the global prediction of // weight. In the if part (lines 14-22), we normalize the weights. The reason for doing this is that some experts might be always right and their weights can go really high (due to the addition of weights in line 11). If the weight of an expert falls below the threshold «, it is removed from the set of experts (line 16). If the global prediction weight differs from the true label, that means the whole set of an expert is not good and this is the time to create a new expert. Finally, in lines 23-25, we train the new experts and return the global prediction weight in line 26. DWM is a general algorithm for coping with concept drift. One can use any online learning algorithm as the base learner, such as a decision tree, naive Bayes, or a neural network, within the framework.

Algorithm 1.3: Dynamic Weighted Majority input {x, y}j,: training data, c e N, C > 2: number of classes, «^threshold for deleting

expert, /?:factor for decreasing weights, q. time interval for removing, adding and weight update, {e, w}j„: set of experts and their weights, rjg, гц : global and local predictions, cr 6 R<:: sum of weighted predictions for each class

• 1 m <— 1
• 2 em <— Create_New_Expert()
• 3 wm «- 1
• 4 for i 6 (1.....n} do
• s cr <— 0
• 6 for j 6 {1.....m} do
• 7 4l *— Classify(ej, Xi)
• s if г/, Ф yi and i mod q==0 then
• 9 | Wj «— pWj

to end

• 11 crm «- crm + Wj
• 12 end
• 13 T)g <— argmaxj
• 14 if i mod q == 0 then

is w <— Normalize_Weights(w)

• 16 {e, w} «— Remove - Expert{{e, w}, a)
• 17 if гц Ф y, then

is m «— m + 1

• 19 em «— Create_New_Expert()
• 20 wm «— 1
• 21 end
• 22 end
• 23 for j e {1.....m} do
• 24 | ej <-Train(ej,Xi,y()
• 25 end
• 26 reutrn rjg
• 27 end

# Discussion

Data stream mining is a challenging task because of the complex dynamics involved. In this chapter, I have presented some algorithms for some basic problems in streaming data such as filtering and counting. We also discussed how to sample from a data stream efficiently as well as handling concept drift. We can see that sampling and concept drift are some of the fundamental problems affecting other stream mining tasks such as classification, clustering, and novelty detection. Therefore, having a good sampler for streaming data off-the-shelf makes downstream tasks easier without worrying about devising a new algorithm for handling them separately. On the other hand, concept drift is another problem which often comes with class imbalance or novelty detection. DWM and CUSUM are classic algorithms that can be used to track concept drifts efficiently. Higia is a state-of-the-art model for such problems. There are still unsolved problems in big streaming data such as privacy preservation, handling incomplete and delayed information, and analysis of complex data to name a few (Krempl et al., 2014).

# References

Aggarwal, С. C. (2007). Data Streams: Models and Algorithms, volume 31. Springer Science & Business Media, Boston. MA.

Aggarwal, С. C.. Watson, T. J.. Ctr, R.. Han. J.. Wang, J.. and Yu. R S. (2003). A framework for clustering evolving data streams. In VLDB, Riva del Garda, Italy, pp. 81-92. Basseville. M. and Nikiforov, I. V. (1993). Detection of Abrupt Changes: Theory and Application, volume 104. Prentice Hall, Englewood Cliffs, NJ.

Bhat. A. (2016). Use the bloom filter, luke! https://www.semantics3.com/blog/use-the-bloom- filter-luke-b59fd0839fc4/. Accessed September 30, 2019.

Chaudhuri, S.. Ding, B.. and Kandula, S. (2017). Approximate query processing: No silver bullet. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 511-519. ACM, New York.

Cormode, G. and Muthukrishnan. S. (2004). Improved data stream summaries: The count-min sketch and its applications. Journal of Algorithms, 55( 1 ):58—75.

Das, H„ Dey. N.. and Balas, V. E. (2019). Real-Time Data Analytics for Large Scale Sensor Data. Academic Press, London. UK.

Dey. N.. Das. H., Naik, B., and Behera. H. S. (2019). Big Data Analytics for Intelligent Healthcare Management. Academic Press, Boston, MA.

Flajolet, P, Fusy, Ё.. Gandouet, O., and Meunier, F. (2007). Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pp. 137-156. DMTS, Nancy.

Flajolet, P. and Martin, G. N. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31 (2): 182-209.

Gama, J. (2010). Knowledge Discovery from Data Streams. Chapman and Hall/CRC. Boca Raton. FL.

Garcia, K. D.. Poel. M.. Kok. J. N.. and de Carvalho, A. C. (2019). Online clustering for novelty detection and concept drift in data streams. In EPIA Conference on Artificial Intelligence, pp. 448-459. Springer, Cham. Switzerland.

Kelly, M. G.. Hand, D. J., and Adams, N. M. (1999). The impact of changing populations on classifier performance. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 367-371. ACM. New York.

Kifer, D., Ben-David, S., and Gehrke. J. (2004). Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB '04), volume 30, pp. 180-191. VLDB Endowment.

Klinkenberg, R. and Joachims. T. (2000). Detecting concept drift with support vector machines. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pp. 487-494. Morgan Kaufmann, San Francisco, CA.

Kolter. J. Z.. and Maloof. M. A. (2007). Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755-2790.

Krempl, G., Zliobaite. I„ Brzeziriski, D., Htillermeier, E., Last, M.. Lemaire. V., Noack. T., Shaker. A.. Sievi, S., Spiliopoulou, M. et al. (2014). Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter, 16(1): 1-10.

Li. K. and Li. G. (2018). Approximate query processing: What is new and where to go? Data Science and Engineering, 3(4):379-397.

Rajaraman, A. and Ullman. J. D. (2011). Mining of Massive Datasets. Cambridge University Press, Cambridge, UK.

Wang, S., Minku, L. L., and Yao, X. (2018). A systematic study of online class imbalance learning with concept drift. IEEE Transactions on Neural Networks and Learning Systems, 29( 10):4802-4821.

2 Decoding Common Machine Learning Methods