Big data is a phenomenon defined by the very rapid expansion of raw data. It refers to data volumes that exceed the storage capacity and processing power of traditional systems. We now live in a world where data is among the most valuable assets, so the key question is how to store, process and analyse data in order to extract knowledge from it. This data comes from many sources, such as sensors, social networks, online shopping portals and government agencies. Storing and processing it is a challenging task.
Big data is distributed across multiple machines. It is a massive collection not only of large quantities of data but also of various kinds of complex data that previously would never have been considered together, and it exceeds the capacity of conventional database systems to capture, store, manage and analyse. Figure 1 shows the Big Data framework, with two data sources (real-time streaming data and batch data) and three kinds of data analysts (data owners, technical analysts and business analysts), along with the data storage infrastructure.
There are three main categories of data: structured, semi-structured and unstructured (Bill Vohries, 2013). Structured data are highly organized and have a pre-defined schema, as in a relational database management system. Semi-structured data cannot be stored in the rows and tables of a typical database; they have an inconsistent structure, as in logs, tweets and sensor feeds.
Figure 1. Big Data architecture
Unstructured data lack any pre-defined structure, as in free-form text, reports and customer feedback forms. Big data is a combination of all three types of data. It faces three important challenges (B. Gerhardt et al., 2012):
- Volume: The volume of data is very large and cannot be processed on a single system. Its size may be measured in terabytes, petabytes and beyond.
- Velocity: Data must be fetched and processed repeatedly, so it is accessed many times. Velocity covers both the speed at which data stored on a particular node can be fetched and the speed at which data arrives from various sources.
- Variety: Big data consists of structured, semi-structured and unstructured data, so managing these different types of data together is the main challenge.
In addition to the 3 V’s, big data poses some further challenges, presented below:
- Veracity: This is the quality of the captured data, which can change dynamically. The veracity of data affects the accuracy of data analysis results.
- Value: This is the knowledge that can be extracted from a huge amount of data by performing data analysis. Value is a very important aspect of data from a business point of view.
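As a small illustration of the variety challenge, the following Python sketch handles one sample of each data category described earlier. The sample inputs, field names and function names are hypothetical and invented for this illustration; they are not taken from any particular system.

```python
import csv
import io
import json

# Hypothetical sample inputs, invented for illustration only.
STRUCTURED_CSV = "order_id,customer,amount\n1001,alice,25.50\n1002,bob,12.00\n"
SEMI_STRUCTURED_LOGS = [
    '{"event": "login", "user": "alice"}',
    '{"event": "purchase", "user": "bob", "items": ["book", "pen"]}',
]
UNSTRUCTURED_TEXT = "Delivery was late but the support team was very helpful."


def parse_structured(text):
    """Structured data: every row follows the same pre-defined schema."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_semi_structured(lines):
    """Semi-structured data: self-describing records whose fields may vary."""
    return [json.loads(line) for line in lines]


def parse_unstructured(text):
    """Unstructured data: no schema, so only generic processing applies.

    Here that processing is a naive whitespace tokenization.
    """
    return text.lower().replace(".", "").split()


orders = parse_structured(STRUCTURED_CSV)
events = parse_semi_structured(SEMI_STRUCTURED_LOGS)
tokens = parse_unstructured(UNSTRUCTURED_TEXT)

# All structured rows share one set of columns; the log records do not.
order_schemas = {tuple(row.keys()) for row in orders}
event_schemas = {tuple(sorted(rec.keys())) for rec in events}
```

Each category needs its own parsing path before the data can be analysed together, which is why variety adds to the cost of managing big data.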
Advances in digital devices and sensors, communication, computation and storage have generated vast collections of data. Collecting and managing this data for industry and enterprise, scientific research, government and society has been the main challenge. Big data is an emerging technology that can bring great profit to business enterprises, but it also raises many issues: data management and sharing, understanding the data, addressing data quality and handling errors, displaying meaningful results and, most importantly, the privacy, security and trust of data. Appropriate solutions to these challenges are necessary to help commercial organizations adopt this technology and increase its value to enterprises, science and society.
Big data typically covers a huge volume of information related to personal identities, so the security and privacy of users become a major concern. Without the right security in place, Big data can be at high risk. Business organizations must ensure that they strike the right balance between the utility and the privacy of data when they gather it. Before data is stored, it should be appropriately anonymized by removing any information that identifies a specific user. This is itself a security challenge, as removing identity-related information might not be sufficient to ensure that the data remains anonymous.
From a security point of view, two things should be considered: securing the information of an enterprise and its employees in a Big data context, and using Big data technology to analyse, and even predict, security incidents.
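The anonymization caveat discussed in this section, that removing identity-related information may not be sufficient, can be illustrated with a minimal Python sketch. The records are hypothetical, and the split into direct identifiers (`name`, `email`) and so-called quasi-identifiers (`zip`, `birth_year`) is an assumption made for this example rather than something defined in the text.

```python
from collections import Counter

# Hypothetical records, invented for illustration only.
RECORDS = [
    {"name": "Alice", "email": "alice@example.com", "zip": "12345",
     "birth_year": 1980, "purchase": "book"},
    {"name": "Bob", "email": "bob@example.com", "zip": "12345",
     "birth_year": 1980, "purchase": "pen"},
    {"name": "Carol", "email": "carol@example.com", "zip": "54321",
     "birth_year": 1975, "purchase": "lamp"},
]

# Assumed classification of fields; a real dataset needs its own analysis.
DIRECT_IDENTIFIERS = {"name", "email"}
QUASI_IDENTIFIERS = ("zip", "birth_year")


def strip_direct_identifiers(records):
    """Drop the fields that identify a person directly."""
    return [
        {key: value for key, value in record.items()
         if key not in DIRECT_IDENTIFIERS}
        for record in records
    ]


def unique_on_quasi_identifiers(records):
    """Return records whose quasi-identifier combination occurs only once."""
    def key_of(record):
        return tuple(record[field] for field in QUASI_IDENTIFIERS)

    counts = Counter(key_of(record) for record in records)
    return [record for record in records if counts[key_of(record)] == 1]


anonymized = strip_direct_identifiers(RECORDS)
at_risk = unique_on_quasi_identifiers(anonymized)
```

In this sample, one record remains unique on the combination of `zip` and `birth_year` even after `name` and `email` are removed, so it could in principle be linked back to a person through another dataset that contains the same fields.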