# Data Stream Mining for Big Data

**Chandresh Kumar Maurya**

## Introduction

There are two kinds of learning, based on how the data is processed. In *batch* learning, the data is processed in chunks and often *offline.* Another type of learning is called *online* learning, usually performed on *streaming* data. Another name for online learning is *incremental* learning. Our focus in this chapter will be incremental learning on streaming data.

A data stream is characterized by certain properties according to Gama (2010):

- Unbounded size:
  - transient (lasts for only a few seconds or minutes);
  - single pass over the data;
  - only summaries can be stored;
  - real-time processing (in-memory).
- Data streams are not static:
  - incremental/decremental updates;
  - concept drifts.
- Temporal order may be important.

FIGURE 1.1 General stream processing model.

Traditional algorithms developed for offline processing of the data are not suitable for streaming data due to these issues. There are a few models that are naturally incremental, such as K-nearest neighbors (KNN) and naive Bayes. But, still, these models cannot cope with all the issues present in streaming data. For example, if there is a concept drift present in the data, a separate concept-drift detection and alleviation technique needs to be developed. Many researchers have proposed various models for streaming data that tackle some of the issues above. We will look at some of these techniques in detail in the coming sections.

As shown in Figure 1.1, a stream processor takes as input a stream of data such as 0, 1, 1, 0, 1, .... The processor can sample from the stream and process it in main memory, such as when answering a query, and the result of the processing can be dumped to the back-end, such as the hard drive, for downstream tasks if required. Most of the algorithms presented in the subsequent sections will be based on this generic model.
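The generic model above can be sketched in a few lines of Python. This is only an illustration of the figure, not an algorithm from the chapter; the sampling probability and the returned summary are assumptions made for the example.

```python
import random

def process_stream(stream, sample_prob=0.5, seed=42):
    """Generic stream-processor sketch: keep a small in-memory sample,
    maintain a running summary, and return results for the back-end."""
    rng = random.Random(seed)
    sampled = []          # in-memory working storage (bounded in practice)
    ones = total = 0      # running summary of the bit stream
    for bit in stream:
        total += 1
        ones += bit
        if rng.random() < sample_prob:   # sample from the stream
            sampled.append(bit)
    return {"seen": total, "ones": ones, "sample_size": len(sampled)}

result = process_stream([0, 1, 1, 0, 1])
print(result["seen"], result["ones"])  # 5 3
```

The summary dictionary plays the role of the result "dumped to the back-end" in Figure 1.1.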

Data stream mining finds application in several domains such as answering user queries over the web, sensor data (Das et al., 2019; Dey et al., 2019), analyzing network packets, patient health monitoring, and surveillance systems, to name a few. Because of the vast application area of the field, I will present some selected case studies in the field at the end of the chapter.

The rest of the chapter is as follows. Research issues related to stream mining are discussed in Section 1.2. Section 1.3 elucidates the simple problems in streaming data such as filtering and counting. Sampling from a data stream is explained in Section 1.4, followed by concept drift detection in Section 1.5. Finally, the chapter ends with a discussion and an open problem in Section 1.6.

## Research Issues in Data Stream Mining

Knowledge discovery from a data stream is a challenging task for the reasons mentioned in the previous section. The most challenging are the unbounded size of the stream, limited memory, and the fast arrival rate. As a result, techniques proposed in the literature

FIGURE 1.2 (Left) Sequence-based windows. The top figure is a landmark window and the bottom figure is a sliding window (used in packet transmission). (Right) Timestamp-based windows. The top figure is a natural-tilted time window and the bottom figure is a logarithmic time window.

rely on ways to summarize the data stream in some way. The following are the techniques often employed to handle data streams.

1. **Approximate query processing** (AQP) is described according to the following rules (Chaudhuri et al., 2017): (a) the query language should be generic enough to retrieve all the relevant information; (b) it should have a low error rate and high accuracy; (c) work should be saved at runtime; and (d) extra resources should be made available for pre-computation. These dimensions of an AQP scheme are not independent, and much of the previous literature makes specific choices along the above four dimensions. AQP can be broadly classified into two main categories: (a) online aggregation, which samples in an online manner and uses the samples to answer queries, and (b) offline synopsis generation, which builds statistics of the data and then answers queries from them (Li and Li, 2018).
2. **Sliding-window query processing** is a model of handling a data stream with the assumption that recent items are more important than the older ones. There are two types of sliding window, based on how they are defined.
   - **Sequence-based sliding window**, which is defined according to the number of observations that determine the window size. These are further divided into the *landmark sliding window* and the *moving sliding window*, as shown in Figure 1.2 (left).
   - **Timestamp-based sliding window**: In this case, the sliding window is determined by the observations made in a given time interval, and the window width is determined by time. They are further categorized into the *natural-tilted time window* and the *logarithmic time window*, as shown in Figure 1.2 (right).

As we shall see, some algorithms based on the sliding window guarantee an approximation error within any desired accuracy ε. One of the most popular algorithms in this category is the DGIM method (Rajaraman and Ullman, 2011) for counting 1s in a bit stream.
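To make the sequence-based window concrete, here is a minimal *exact* sliding-window counter of 1s over the last *N* observations. It uses O(*N*) memory, which is precisely what DGIM avoids by keeping an approximate summary instead; the class and parameter names are illustrative.

```python
from collections import deque

class SlidingWindowCounter:
    """Exact count of 1s over the last `size` observations
    (a sequence-based sliding window). DGIM approximates this
    count using far less memory than the O(N) stored here."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # old bits fall off automatically
        self.ones = 0

    def add(self, bit):
        if len(self.window) == self.window.maxlen:
            self.ones -= self.window[0]   # bit about to be evicted
        self.window.append(bit)
        self.ones += bit

w = SlidingWindowCounter(size=4)
for b in [1, 0, 1, 1, 0, 1]:
    w.add(b)
print(w.ones)  # last 4 bits are [1, 1, 0, 1] -> 3
```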

3. **Sampling** is another popular approach for approximating streaming data. Some issues in sampling-based approaches are: (a) when to sample and (b) how to make sure that the sample is an unbiased estimator of the original stream. There are many ways of sampling; for example, random uniform sampling and reservoir sampling are two popular approaches used on data streams.
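Reservoir sampling, mentioned above, keeps a uniform random sample of *k* items without knowing the stream length in advance. A minimal sketch (this is the classic Algorithm R; the function name and seed are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of k items from a stream
    of unknown length (reservoir sampling, Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # fill the reservoir first
        else:
            j = rng.randint(0, i)     # item i replaces a slot w.p. k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), k=5))
```

Each item ends up in the reservoir with probability *k*/*n*, which makes the sample an unbiased snapshot of the stream, addressing issue (b) above.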

In the forthcoming sections, I present algorithms for some basic tasks in data stream mining. These are filtering and counting in the data stream, sampling from the data stream, and concept-drift detection in data streams. For other tasks, such as classification, clustering, frequent pattern mining, and novelty detection in data stream mining, the interested reader is referred to Gama (2010) and Aggarwal (2007).

## Filtering and Counting in a Data Stream

In this section, we discuss algorithms that filter items satisfying a given property and algorithms that count items in a data stream.

### Bloom Filters

Bloom filters are one of the most popular approaches for filtering an item from a data stream. Applications of bloom filters include checking if a user-id is available for sign-up, keeping track of web-pages visited by a user, and recent email addresses used to send messages. A bloom filter consists of three things (Rajaraman and Ullman, 2011).

- A bit array of length *n* initialized to 0.
- A number of hash functions *h _{1}*, *h _{2}*, ..., *h _{k}*. Each hash function maps "keys" to positions in the bit array by setting the corresponding bit.
- A set of *m* key values.

We will see an application of the bloom filter to checking user-id availability at sign-up on various platforms. A bloom filter gives guarantees of the form: if a user-id is available, it is 100% sure of its answer; on the other hand, if a user-id is not available, it gives a probabilistic answer to the query. Let us see this in action. A bloom filter is shown in Figure 1.3. In this figure, there are two hash functions *h _{1}* and *h _{2}* which can map a given string to the bit-array. Suppose the strings "cow" and "chicken" are mapped to locations (6,3) and (2,1). Bits at these locations are set to 1 in the bit-array. Now suppose a query for the string "pigeon" is issued. If the hash functions generate hashes 0 and 3, then we see that the bit at location 0 is not set. This means that "pigeon" can be used as a user-id. However, if the hashes generated were, say, (2,3), then "pigeon" may already be taken as a user-id, which the bloom filter is not sure

FIGURE 1.3 Hashing items in a bit-array; *h _{1}* and *h _{2}* are hash functions.

about. The reason is that some other strings might have set the same bit locations as "pigeon" in this scenario, and there is no way to be sure which string set those bits. For this reason, it is impossible to delete a user-id using a bit-array. To solve this problem, another variation of the bloom filter, which keeps *counts* of the "strings" hashed to each location, is used. Such bloom filters are called *count-bloom filters.* Some important points to note while designing bloom filters are the following:

1. The size of the bloom filter is very important: the smaller the bloom filter, the larger the false positive (FP) rate, and vice versa.
2. The larger the number of hash functions, the quicker the bloom filter fills up and the slower it becomes; too few hash functions result in too many FPs, unless all are *good* hash functions.

A visualization of the FP rate with the number of hash functions is given in Figure 1.4. As we can see, as the number of hash functions grows, the FP rate declines sharply. For a more in-depth analysis of the bloom filter, the reader is referred to Rajaraman and Ullman (2011).

FIGURE 1.4 Effect of the number of hash functions on the false positive rate. Image source: Bhat (2016).
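The behavior described above can be sketched in a short Python implementation. The salted-MD5 hash functions, array length *n*, and number of hashes *k* below are illustrative choices, not those used in Figure 1.3.

```python
import hashlib

class BloomFilter:
    """Bit array of length n with k salted hash functions.
    `might_contain` can return false positives but never false negatives."""
    def __init__(self, n=1000, k=3):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, key):
        # k hash functions simulated by salting one hash with 0..k-1
        for salt in range(self.k):
            h = hashlib.md5(f"{salt}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("cow")
bf.add("chicken")
print(bf.might_contain("cow"))     # True: taken (or a rare false positive)
print(bf.might_contain("pigeon"))  # if False, "pigeon" is definitely free
```

Note the asymmetry from the user-id example: a `False` answer is always certain, while a `True` answer may be a false positive.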

### Counting the Frequency of Items in a Stream

Frequency counting in a stream is the basis of many tasks. For example, if we know the frequency of items, we can compute their means and variances, which can be further utilized in downstream tasks such as classification, clustering, and computing similarity. Here, I will present a classic algorithm for counting the frequency of items in a data stream, called the *count-min* sketch (Cormode and Muthukrishnan, 2004). The count-min algorithm can be seen as an extension of the bloom filter discussed above. The key idea of the algorithm is to map the domain of the stream from

{1, ..., *n*} to {1, ..., *w*} using hash functions. Hash values are stored in a two-dimensional array called a *sketch* of width *w* = ⌈*e*/ε⌉ and height *h* = ⌈log(1/δ)⌉, where (ε, δ) denote the accuracy in the approximation and the confidence level, respectively.

An example of the count-min algorithm is given in Figure 1.5. To understand it, imagine the data stream looks like:

Initially, the sketch is filled with 0s. The algorithm takes one letter at a time and applies the hash functions. For example, the letter "A" hashes to columns (0,1,3,4,2) of the sketch rows. The corresponding entries in the sketch are incremented by 1. Similarly, all other letters are hashed. At query time, the same hash functions are applied to locate the entries in the sketch, and the frequency of the letter is given by taking the minimum of these entries. The count-min sketch is guaranteed to never give false negatives: the estimate is either the true value or an over-count. Over-counts happen when two or more hash functions map at least two letters to exactly the same locations. That is why the choice of good random hash functions is required for the algorithm to work well. For a more in-depth discussion, the reader can consult the original article (Cormode and Muthukrishnan, 2004).

FIGURE 1.5 Count-min sketch. Hash functions map the domain of the stream to a limited range (left). An example of the hash functions mapping letters to integers.
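The update and point-query steps described above can be sketched as follows. The width, depth, and salted-MD5 row hashes are illustrative assumptions, not the parameters of Figure 1.5.

```python
import hashlib

class CountMinSketch:
    """A depth x width array of counters. A point query takes the
    minimum over rows, so estimates never undercount."""
    def __init__(self, width=20, depth=5):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # one salted hash per row plays the role of d hash functions
        for row in range(self.d):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.w

    def add(self, item):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for letter in "ABACA":
    cms.add(letter)
print(cms.estimate("A"))  # at least 3; exactly 3 unless rows collide
```

With width ⌈*e*/ε⌉ and depth ⌈log(1/δ)⌉, the over-count stays within ε times the stream length with probability at least 1 − δ.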

### Count Unique Items in a Data Stream

Counting the number of unique items in a stream presents an interesting problem. Formally, it can be posed as follows. Suppose the domain of the random variable is {0, 1, ..., *n*} and the stream looks like:

Here *n* = 3. A trivial solution is to store all new items in a hash map, whose size grows like *O*(*n*). However, we want a solution which uses only *O*(log *n*) space. One elegant solution is given by Flajolet and Martin (FM) (1985) and shown in Algorithm 1.1. Let us work through the algorithm step-by-step.

- We first initialize a BITMAP to all 0s.
- Then, for each item *x* in the stream, compute a hash value for it. Call it *h(x)*.
- The binary representation of the hash value is bin(*h*(*x*)).
- The function *ρ* gives the position of the least-significant set bit; by convention, *ρ* returns 0 when the hash value is 0.
- Finally, set BITMAP[*i*] = 1, where *i* = *ρ*(bin(*h*(*x*))).

**Algorithm 1.1:** Flajolet and Martin (FM) Algorithm

Input: stream *M*. Output: cardinality of *M*.

1. Initialization: BITMAP of *L* bits initialized to 0.
2. For each *x* in *M* do:
   - Calculate the hash function *h(x)*.
   - Get the binary representation of the hash output; call it bin(*h*(*x*)).
   - Calculate the index *i* = *ρ*(bin(*h*(*x*))), where *ρ*() outputs the position of the least-significant set bit, i.e., the position of the first 1.
   - Set BITMAP[*i*] = 1.
3. End for.
4. Let *R* denote the largest index *i* such that BITMAP[*i*] = 1.
5. The cardinality of *M* is 2^{*R*}/*φ*, where *φ* ≈ 0.77351.

The above steps are repeated until the end of the stream. Let *R* denote the index of the most-significant set bit. Then, the number of unique items is given by 2^{*R*}/*φ*, where *φ* ≈ 0.77351 is a correction factor.

The intuition behind the FM algorithm is as follows. The FM algorithm uses a hash function that maps the strings in the stream to uniformly distributed random integers in the range [0, ..., 2^{*L*} − 1]. Thus, we expect that:

- 1/2 of the numbers will have their binary representation end in 0 (divisible by 2);
- 1/4 of the numbers will have their binary representation end in 00 (divisible by 4);
- 1/8 of the numbers will have their binary representation end in 000 (divisible by 8);
- in general, 1/2^{*R*} of the numbers will have their binary representation end in *R* zeros.

Then, the number of unique strings will be approximately 2^{*R*} (because using *n* bits, we can represent 2^{*n*} integers). A worked-out example of the FM algorithm is as follows.

Assume stream *M* = {..., 1, 1, 2, 3, 1, 4, 2, ...} and the hash function *h*(*x*) = (2*x* + 1) % 5.

Then, for the distinct items shown,

*h*(1) = 3, *h*(2) = 0, *h*(3) = 2, *h*(4) = 4,

and their binary representations give set-bit positions

bin(3) = 11, bin(0) = 0, bin(2) = 10, bin(4) = 100, so *ρ* = 0, 0, 1, 2,

and BITMAP looks like

BITMAP = [1, 1, 1, 0, ...] (bits 0, 1, and 2 set),

so that we see the largest set index is *i* = 2. So *R* = 2 and the number of unique integers is 2^{2}/*φ* ≈ 5 (the exact answer is 5).

**Exercise:** Try finding the unique numbers in *M* = {5, 10, 15, 20, ...} with any hash function. **Extensions:** The LogLog and HyperLogLog algorithms are improvements over the classic FM algorithm. The interested reader may consult Flajolet et al. (2007).
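Algorithm 1.1 can be sketched directly in Python. The salted-MD5 hash and the bitmap length *L* below are illustrative assumptions (the chapter's worked example uses *h*(*x*) = (2*x* + 1) % 5 instead), and a single-hash estimate like this one can be noisy in practice.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction factor

def rho(y):
    """Position of the least-significant set bit of y (rho(0) = 0)."""
    if y == 0:
        return 0
    return (y & -y).bit_length() - 1

def fm_estimate(stream, L=32):
    """Flajolet-Martin cardinality estimate with one hash function."""
    bitmap = [0] * L
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (2 ** L)
        bitmap[rho(h)] = 1
    # R = largest index with BITMAP[R] = 1 (step 4 of Algorithm 1.1)
    R = max((i for i, b in enumerate(bitmap) if b), default=0)
    return (2 ** R) / PHI

print(round(fm_estimate([1, 1, 2, 3, 1, 4, 2])))
```

Averaging the estimates from several independent hash functions, as the LogLog family does, sharply reduces the variance of this estimator.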