Big Data Management

The complex nature of Big Data resources requires not only the use of various processing techniques but also the organization of the entire data management process. Agrawal et al. (2011) identify five phases in a Big Data analysis system: (a) acquisition/recording; (b) extraction/cleaning/annotation; (c) integration/aggregation/representation; (d) analysis/modeling; and (e) interpretation.

The data acquisition stage is particularly important because the quality of the analysis results depends on how well it is performed. The main difficulty at this stage is reaching relevant and reliable data sources. NoSQL databases are frequently used to acquire and store Big Data. Such systems simply ingest all the data without categorizing or parsing them according to a predefined schema. A major challenge is generating the right metadata to describe what data are recorded and how they are recorded and measured.
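As a minimal sketch of schema-less acquisition with attached metadata, the Python fragment below stores each incoming record as received and wraps it in a small descriptive envelope. The function name acquire, the metadata fields (source, recorded_at, unit), and the in-memory list standing in for a NoSQL collection are all assumptions made for illustration; they do not come from any particular database product.

```python
import json
from datetime import datetime, timezone


def acquire(raw_payload, source, unit=None):
    """Wrap a raw record with descriptive metadata without forcing it into a schema."""
    return {
        "payload": raw_payload,  # kept exactly as received, whatever its shape
        "meta": {
            "source": source,  # where the record came from
            "recorded_at": datetime.now(timezone.utc).isoformat(),  # when it was recorded
            "unit": unit,  # how the value was measured, if known
        },
    }


# Hypothetical stand-in for a NoSQL collection: a plain list of documents.
store = []
store.append(acquire({"temp": 21.5}, source="sensor-A", unit="celsius"))
store.append(acquire("free-text log line", source="app-log"))

# Records of different shapes coexist and remain serializable as documents.
serialized = json.dumps(store)
```

The two stored records have different shapes, yet both carry the same metadata envelope, which is what later makes it possible to judge where a value came from and how it was measured.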

The second phase covers cleaning the data that have been acquired and extracting information from them. The format of the distributed data has to be changed so that they are ready for further analysis. The information that can be extracted from the data depends on their quality: poor-quality data will almost always lead to poor results (“garbage in, garbage out”). Data cleaning (or scrubbing) is therefore highlighted as one of the most important steps that should be taken before data analysis is conducted. This often involves significant costs, as the whole process can take from 50 to 80% of a data analyst’s time, together with the actual data collection costs (Reimsbach-Kounatze, 2015).
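The cleaning step can be illustrated with a short Python sketch. It assumes a hypothetical record layout with a name field and a value field, and it applies three common cleaning operations: normalizing text, discarding records whose values are missing or cannot be parsed, and removing duplicates. Real cleaning pipelines are usually far more elaborate; this only shows the principle.

```python
def clean(records):
    """Normalize names, drop unusable records, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize the text field: trim whitespace and lowercase it.
        name = (rec.get("name") or "").strip().lower()
        # Convert the value to a number; skip the record if that is impossible.
        try:
            value = float(rec.get("value"))
        except (TypeError, ValueError):
            continue
        # Skip records with an empty name and exact duplicates.
        if not name or (name, value) in seen:
            continue
        seen.add((name, value))
        cleaned.append({"name": name, "value": value})
    return cleaned


# Hypothetical raw input containing typical quality problems.
raw = [
    {"name": " Alice ", "value": "10"},   # stray whitespace, number stored as text
    {"name": "alice", "value": 10.0},     # duplicate of the first record
    {"name": "Bob", "value": None},       # missing value
    {"name": "", "value": "3"},           # missing name
    {"name": "Carol", "value": "n/a"},    # unparseable value
    {"name": "Dave", "value": "7.5"},
]
cleaned = clean(raw)
```

Of the six raw records, only two survive cleaning, which gives a sense of why this phase takes up so much of an analyst's effort.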

The next step involves preparing and organizing the data with specific programs and programming languages so that all data are in a form that computers can process. There is more than one way to store information, so the same data can be represented differently depending on the purpose, and some representations serve a given purpose more effectively than others.
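To show how the same data can be represented differently depending on the purpose, the following Python sketch takes a small, made-up set of sales records and organizes it in two ways: column by column, which suits aggregation over a single attribute, and keyed by region, which suits lookups. The field names and values are hypothetical.

```python
# Hypothetical records in a row-oriented form, one dictionary per observation.
rows = [
    {"region": "A", "year": 2015, "sales": 120},
    {"region": "A", "year": 2016, "sales": 150},
    {"region": "B", "year": 2015, "sales": 90},
]

# Column-oriented form: one list per attribute, convenient for aggregation.
columns = {key: [r[key] for r in rows] for key in rows[0]}
total_sales = sum(columns["sales"])

# Keyed form: records grouped by region, convenient for lookups.
by_region = {}
for r in rows:
    by_region.setdefault(r["region"], []).append(r)
```

Neither form contains more information than the other; the choice depends on whether the analysis mostly aggregates attributes or mostly retrieves groups of records.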

The analysis/modeling step refers to the use of various data mining techniques (Schmarzo, 2013; Zhao, 2015). These mainly include clustering, classification and prediction, outlier detection, association rules, sequence analysis, time series analysis, and text mining, as well as some newer techniques such as social network analysis and sentiment analysis (Chen et al., 2012). Every data mining model relies on machine learning, either supervised or unsupervised.
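The difference between unsupervised and supervised learning can be sketched with two very small examples in Python. The first is an unsupervised outlier detector based on z-scores, which needs no labels; the second is a supervised nearest-centroid classifier, which learns from labeled samples. Both are deliberately simplified, one-dimensional illustrations chosen for this sketch, with hypothetical function names, data, and threshold; they are not the specific methods used by the cited authors.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Unsupervised: flag values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


def nearest_centroid_fit(samples, labels):
    """Supervised: compute one centroid (mean) per label from labeled samples."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in groups.items()}


def nearest_centroid_predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Unsupervised example: no labels are given, the method finds the unusual value.
outliers = zscore_outliers([10, 11, 9, 10, 12, 10, 11, 50])

# Supervised example: labels are given for training, then new values are classified.
centroids = nearest_centroid_fit(
    [1.0, 1.2, 0.8, 5.0, 5.5, 4.5],
    ["low", "low", "low", "high", "high", "high"],
)
prediction_a = nearest_centroid_predict(centroids, 1.4)
prediction_b = nearest_centroid_predict(centroids, 4.0)
```

In the first case the value 50 is flagged purely from the distribution of the data, while in the second case the model can only assign the labels it has seen during training.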

At the last stage, the results obtained should be critically assessed. First, it should be decided whether the results can be considered reliable, taking into account the scope of the sources analyzed. If the results raise no doubts, they can be described and conclusions can be drawn, which is the basic goal of the whole process.

The management of Big Data would not be successful without an appropriate environment that supports storage, analytics, reporting, and applications. The environment must take into account hardware, infrastructure software, operational software, management software, well-defined application programming interfaces, and even software developer tools (Hurwitz et al., 2013). Appropriately applying Big Data algorithms to data of sufficient quality can provide numerous opportunities for improvement across society. In addition to market-wide benefits such as more effective matching of products and services to consumers, Big Data can also create opportunities for low-income and underserved communities (Ramirez et al., 2016).


