BIG DATA

Evolution of Data

We have seen how technology has evolved over the last decade or so. Earlier we used wired or landline phones, but as technology advanced we moved on to mobile phones. While working on a computer, we used to store data on CDs or floppy disks whose capacity ranged from a few MBs to a few hundred MBs. Now that the volume of data has grown so much, we have cloud storage for it. Take the use of mobile phones as an example. It is hard to imagine how much data we produce every second: every action a user performs generates data, even sending a single picture through the phone. Much of this data is not in a format that a relational database can store, and its volume has increased exponentially.

Other platforms also generate large amounts of data, such as the IoT (Internet of Things) and social media. The IoT connects physical devices to the internet and makes them smarter, for example smart TVs and smart ACs. IoT devices include various sensors that collect data and can act on it. So, we can see how much data is generated when a great many sensors are connected. According to Fingerman (2019), the IoT market will grow to about $520 billion by 2021, and the total number of IoT devices is projected to reach 75 billion by 2025 (Statista Research Department, 2016).

Similarly, social media is another important factor in the evolution of Big Data. Social media platforms such as Facebook and Twitter are very widely used. All the pictures posted, the information shared, and the user profiles in these social networks are data. These sites also show that data is generated in various formats; most of what is generated and stored is not structured data, and the volume is quite large.

WHAT IS BIG DATA

Big Data is a term used for data sets, generated and collected in both structured and unstructured form, whose volume is so high that they become very difficult to manage and process using traditional database tools and applications.

THE 5 Vs OF BIG DATA

Big Data is very beneficial to any organization, as the organization is able to gather, store, and modify an enormous amount of data when the data arrives at the right speed and at the right time (Figure 9.4). The five Vs are volume, variety, velocity, value, and veracity. These are examined in the following sections.


FIGURE 9.4 5 Vs of big data.

Volume

The volume of Big Data represents the amount of data being generated. The volume of data is increasing exponentially day by day. According to Forbes, 175 zettabytes of data will be generated by 2025, as data is coming from many different sources. Organizations working in data collection have to deal with this huge amount of data. Sometimes this large amount of data helps improve the quality of an organization's work, for example by predicting how its products will perform.

Variety

Since the large volume of data comes from different sources, it can arrive in different formats. The different types of data can be broadly classified as unstructured, structured, and semi-structured.

Unstructured data includes content such as log files and audio, video, or image files.

Structured data is arranged in a tabular format with proper rows and columns and a known table schema.

Semi-structured data is data generated and received in formats such as JSON, XML, CSV, or TSV, or from emails, where the schema is not properly defined.
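The three categories can be illustrated with a minimal Python sketch. The sample table, the JSON records, and all field names below are hypothetical and chosen only for illustration; they are not taken from any real data set.

```python
import json
import sqlite3

# Structured: a table with a fixed, known schema (hypothetical columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 34), (2, 27)])
ages = [row[0] for row in conn.execute("SELECT age FROM users ORDER BY user_id")]
conn.close()

# Semi-structured: self-describing records whose fields can differ
# from one record to the next, so there is no single fixed schema.
raw_json = '[{"user": "a", "tags": ["x"]}, {"user": "b", "city": "Pune"}]'
records = json.loads(raw_json)
fields_seen = sorted({key for record in records for key in record})

# Unstructured: raw bytes with no rows, columns, or named fields to query.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # first four bytes of a PNG file

print(ages)
print(fields_seen)
print(len(image_bytes))
```

The structured data can be queried by column name straight away, the semi-structured records have to be inspected to discover which fields each one carries, and the unstructured bytes need a format-specific decoder before anything can be extracted from them.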

Velocity

Earlier, when users worked on computers without internet access, they worked only on stand-alone systems that were not connected to the world. When the internet arrived, user data started to be generated, but the quantity of data was small and the speed of processing was very low. Then, slowly and steadily, as the number of users increased, data generation also increased. It was when people started using smartphones that data generation reached a huge volume, with more users sharing and posting data.

Some activities are important to users and need immediate action on the data, so the data must be processed quickly. For Big Data it is therefore important that data is processed at the same rate at which it is received.
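A rough way to see why the processing rate has to match the arrival rate is to track the backlog of unprocessed records. The sketch below assumes constant rates; the function name `backlog_after` and the example rates are hypothetical.

```python
def backlog_after(seconds, arrival_rate, processing_rate, initial_backlog=0):
    """Return the number of unprocessed records after `seconds`.

    Both rates are in records per second and are assumed to be constant.
    A backlog cannot drop below zero.
    """
    growth = (arrival_rate - processing_rate) * seconds
    return max(0, initial_backlog + growth)


# Hypothetical rates: 1000 records arrive per second.
falling_behind = backlog_after(60, arrival_rate=1000, processing_rate=800)
keeping_up = backlog_after(60, arrival_rate=1000, processing_rate=1000)

print(falling_behind)  # 12000 records are waiting after one minute
print(keeping_up)      # 0
```

Whenever processing is slower than arrival, the backlog grows without bound, so results are delivered later and later; this is why velocity concerns processing speed as much as arrival speed.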

Value

Once the volume, variety, and velocity of data have been dealt with, we need to find the useful data within the data collected. After data is collected from various sources in different formats, it has to be analyzed to extract the useful data, depending on the type of data required for a hypothesis or a particular process, and this can help the growth of any business.

Veracity

Veracity can be defined as the degree to which a user can trust the information for making a decision. Thus, it is important that whatever value is extracted from Big Data is accurate. As we can see in Table 9.2, there are many inconsistencies and some values are missing. This happens because, when a huge amount of data is ingested, some data packets might get lost in the process. These missing values need to be filled in again, and the process of finding value starts all over again. Hence finding the value of data is sometimes a challenge, as some business organizations do not readily trust the extracted value enough to base a decision on it.

TABLE 9.2

Sample Table Showing Inconsistency and Missing Values

Min     Max     Mean      SD
1.2     4.25    0.87      2.5
4.1     3.52    400.000   1.200
7.8     1.6     0.78      0.2
5.2     _       4.2
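The kind of check that exposes these problems can be sketched in Python. This is an illustrative sketch only: it assumes the cells of Table 9.2 are read row by row as four (Min, Max, Mean, SD) rows, that "400.000" and "1.200" are decimal numbers, and that "_" and the empty cell are missing values. The function name `check_row` is hypothetical.

```python
# Rows of Table 9.2 as (min, max, mean, sd); None marks a missing value.
# The row grouping and the decimal reading of 400.000 are assumptions.
rows = [
    (1.2, 4.25, 0.87, 2.5),
    (4.1, 3.52, 400.0, 1.2),
    (7.8, 1.6, 0.78, 0.2),
    (5.2, None, 4.2, None),
]


def check_row(row):
    """Return a list of problems found in one (min, max, mean, sd) row."""
    names = ("min", "max", "mean", "sd")
    problems = [f"missing {name}" for name, value in zip(names, row) if value is None]
    low, high, mean, sd = row
    # A valid summary must satisfy min <= mean <= max and sd >= 0.
    if low is not None and high is not None and low > high:
        problems.append("min > max")
    if low is not None and mean is not None and mean < low:
        problems.append("mean < min")
    if high is not None and mean is not None and mean > high:
        problems.append("mean > max")
    if sd is not None and sd < 0:
        problems.append("negative sd")
    return problems


report = [check_row(row) for row in rows]
for number, problems in enumerate(report, start=1):
    print(number, problems)
```

Under these assumptions every row fails at least one check, which is the situation the text describes: the data cannot be trusted for decision making until it has been cleaned.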

 