What Is Big Data?
The term Big Data was popularised by John Mashey in the 1990s (Lohr, 2013). Big Data refers to datasets that generally cannot be captured, curated, managed or processed within a tolerable time frame using commonly available software tools (Snijders et al., 2012). As the size of Big Data keeps increasing over the years, techniques and technologies are required with which to integrate data and to reveal insights from complex and diverse datasets (Hashem et al., 2015). No standardised definition of Big Data is available, and hence we present some of its notable definitions. In 2011, the McKinsey Global Institute pointed out that ‘Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze’. Gantz and Reinsel (2011) described Big Data as ‘a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis’. Beyer and Laney (2012) defined the concept, insisting that regular technologies and tools cannot store or process Big Data, as Big Data comprises ‘high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization’. Big Data usually encompasses three types of data: structured, semi-structured and unstructured, among which the main focus is unstructured data (Dedic & Stanier, 2016). Figure 3.1 presents the word cloud formed from the key terms appearing in the abstracts of Big Data-related journal articles.
Big Data is generally structured multi-dimensionally and can be characterised by the concept of various ‘Vs’ which grow exponentially (Shao et al., 2014). IBM scientists have identified four key aspects of Big Data: Volume, Velocity, Variety and Veracity (Taylor-Sakyi, 2016). Figure 3.2 presents IBM’s 4Vs of Big Data, and the descriptions are detailed below:
i) Volume. This represents the quantity of data, which is huge and growing every day. IBM predicts that around 35 zettabytes of data will be stored by 2020. Companies may store enormous amounts of data: a mix of customer data from web logs stored in databases, alongside real-time information acquired through sensors covering production, inventory and shipments (Chen & Zhang, 2014).
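To give the 35-zettabyte figure some scale, the arithmetic below converts it into more familiar units. This is an illustrative sketch only, using the SI definitions of the units; the figures are not taken from the cited sources.

```python
# SI unit sizes in bytes (illustrative arithmetic only)
ZETTABYTE = 10**21
TERABYTE = 10**12

total_bytes = 35 * ZETTABYTE          # IBM's projected 35 ZB
equivalent_tb = total_bytes // TERABYTE

print(equivalent_tb)                  # 35 billion terabytes
```

Even spread across a billion commodity disks, each disk would need to hold 35 TB, which illustrates why conventional single-machine storage and processing break down at this scale.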
FIGURE 3.1 Cloud visualisation of key words appearing in abstracts of Big Data-related articles (adapted from http://www.professorwidom.org/bigdata, accessed online on: 20 March 2019).
FIGURE 3.2 4Vs of Big Data (adapted from http://www.vijayraghunathan.com/big-data-analytics/, accessed online on: 20 March 2019).
ii) Velocity. This represents the speed of data generation and transmission. Data can be processed in batches, in real time or as streams (Assunção et al., 2015). In today’s world, with the invention and increased usage of computers and smartphones, data is generated and transmitted at an ever faster rate. In manufacturing industries, data velocity, i.e., the frequency with which data occurs, must be noted carefully. For example, in production facility control, various log data, sensor information and in-plant and out-of-plant environment information need to be collected and processed in a timely manner.
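The contrast between batch and stream processing mentioned above can be sketched as follows: a stream-style processor consumes readings one at a time and emits a result per arrival, rather than waiting for the whole dataset. The sensor readings and window size here are hypothetical, for illustration only.

```python
from collections import deque

def rolling_average(stream, window=3):
    """Process readings one at a time (stream-style) and yield a
    moving average over the most recent `window` values."""
    buf = deque(maxlen=window)        # automatically drops the oldest value
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)     # a result per arriving reading

# Hypothetical plant-sensor temperature readings arriving in sequence
sensor_stream = [20.0, 22.0, 21.0, 25.0]
averages = list(rolling_average(sensor_stream))
```

A batch job would instead compute one average over the complete list after collection; the streaming form produces timely intermediate results, which is what production facility control requires.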
iii) Variety. This refers to the different sources from which data is acquired and the range of data types involved. Because data can be generated from different internal and external sources, it may arrive in different formats: some data are structured, some are semi-structured and some are unstructured (Tan et al., 2015).
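The three format categories above can be illustrated with a short sketch. The sample records are invented for illustration; the point is that each category needs a different handling strategy.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same fields (CSV)
csv_text = "order_id,qty\n1001,5\n1002,3\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing but with a flexible schema (JSON)
json_text = '{"order_id": 1003, "notes": {"priority": "high"}}'
record = json.loads(json_text)

# Unstructured: free text; any structure must be inferred downstream
log_line = "2019-03-20 WARN conveyor-7 temperature above threshold"
tokens = log_line.split()
```

Structured data maps directly onto database tables; semi-structured data carries its own field names but varies from record to record; unstructured data, the main focus of Big Data work, needs parsing or text analysis before it yields anything tabular.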
iv) Veracity. Structured and unstructured data generally come from different sources, so their quality varies. The reliability and accuracy of the data are therefore difficult to control, and questions of relevance and credibility arise. Veracity thus refers to checking the quality of data.
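A minimal sketch of such a quality check is given below: readings that are missing or outside a plausible physical range are flagged as suspect rather than trusted. The function name, the feed and the range bounds are hypothetical assumptions, not a prescribed method.

```python
def veracity_check(readings, low, high):
    """Split raw readings into trusted values and suspect ones
    (missing, or outside a plausible physical range)."""
    trusted, suspect = [], []
    for r in readings:
        if r is None or not (low <= r <= high):
            suspect.append(r)     # missing or implausible: quality doubtful
        else:
            trusted.append(r)
    return trusted, suspect

# Hypothetical temperature feed mixing good values with sensor noise
trusted, suspect = veracity_check([21.5, None, 19.8, 480.0], low=-40, high=120)
```

Even this simple filter shows why veracity matters: downstream analysis on the raw feed would silently absorb the 480.0 outlier, whereas separating trusted from suspect values makes the quality of the data explicit.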
Other than these four major ‘Vs’, researchers have identified further characteristics of Big Data. Pence (2014) stressed the need for analysing an important characteristic called complexity. Complexity refers to the degree of interdependency between the structures in the acquired data; studying it reveals how small or large changes in a few components can lead to drastic system-level behavioural changes (Satyanarayana, 2015). Visualisation is another characteristic, referring to the effective and meaningful representation of complex and large data so that users can understand it more easily. Graphs, maps and figures are visual representations that the human brain can interpret readily, and hence they make decision-making easier (Tang et al., 2016). The accuracy and correctness of data that can be readily used is considered another characteristic of Big Data. This characteristic is called ‘Validity’, and it refers to how relevant the data is to the organisation’s current strategy and whether it can provide useful insights. Volatility is another characteristic of Big Data and deals with the shelf-life of the data, i.e., the extent to which data can be stored and remain usable. This matters because huge quantities of real-time data are sensitive and subject to frequent change. The final characteristic identified is the Value of Big Data. It is already established that Big Data is a mix of structured, unstructured and semi-structured data, from which the case-specific data needs to be extracted. The extent to which the extracted data matches what we require is the measure of its usefulness, i.e., its value. The value of Big Data can be mathematically expressed as a combination of all the other characteristics as below: