Role and Support of Image Processing in Big Data
- Big Data Mathematical Analysis Theories
- Independent and Identical Distribution Theory (IID)
- Set Theory
- Characteristics of Big Data
- Different Techniques of Big Data Analytics
- Ensemble Analysis
- Association Analysis
- High-Dimensional Analysis
- Deep Analysis
- Precision Analysis
- Divide and Conquer Analysis
- Perspective Analysis
Images can be defined as a set of signals sensed by the human eyes and processed by the visual cortex in the brain, creating an intensely rich experience of a scene that is closely associated with objects and concepts perceived previously and recorded in the memory of the human brain. For computers, images can be raster images or vector images: raster images are a sequence of pixels with discrete numerical values for color, whereas vector images are a sequence of color-annotated polygons. In order to analyze an image or video, this geometric encoding needs to be transformed into constructs that depict the physical features, objects, and movement represented by the image or video. These constructs are then analyzed in a logical manner by a computer through various image-processing techniques.
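The two encodings described above can be illustrated with a minimal sketch; the pixel values and polygon data here are hypothetical, chosen only to show the difference in structure.

```python
# A raster image: a grid of pixels, each an (R, G, B) tuple of 0-255 values.
raster = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red, green
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue, white
]

# A vector image: a sequence of color-annotated polygons,
# each polygon given by its (x, y) vertices.
vector = [
    {"color": (255, 0, 0), "vertices": [(0, 0), (10, 0), (5, 8)]},  # red triangle
]

def pixel_at(img, row, col):
    """Raster access is direct indexing into the pixel grid."""
    return img[row][col]

print(pixel_at(raster, 0, 1))  # (0, 255, 0)
```

The raster form supports direct per-pixel computation, which is why most image-processing techniques operate on it; vector data must first be rasterized.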
Big data analytics can be used to identify underlying patterns and intrinsic inter-relationships among large volumes of data sets. Big data analytics is governed by two factors, that is, the quantity of data and the quality of data: the amount of data helps in uncovering recurring data patterns, whereas the quality of data determines the reliability of those patterns. Thus, it can be concluded that the quality of data defines the fitness of the data to be used in any form or as an application for modeling or predictive purposes (Agrawal and Srikant, 1994).
The role of big data analytics has been increasing in various fields of image processing, from medical image processing to the processing of satellite images. Because medical, satellite, and agricultural data come in huge amounts and need fast processing to deliver timely results, big data analytics becomes necessary (Sudhir, 2020).
Image processing needs innovative techniques to process images. With its advancement, image processing is being used in many areas such as industry, organizations, social organizations, administrative divisions, healthcare, defense, etc. In image processing, images are taken as the input and then processed using different techniques to obtain a modified image or a video as the output. Large collections of images and videos can be processed faster and more efficiently using big data applications, which store the results of processing as structured or unstructured data. These processed data, integrated with image processing, are used in fields where the data in the form of images and video are large, such as education, healthcare, manufacturing, retail business, banking, finance, etc., thus showing great potential in various fields.
In terms of the latest updates, Google is working on an approach that integrates image processing with big data applications by simulating the ability of the human brain to compute, evaluate, and choose a course of action using massive neural networks. This makes image and video analytics scalable using machine vision, rule-based decision engines, and multi-lingual speech recognition. Thus, image analytics can be considered a potential solution for economic, social, industrial, and political issues.
Big Data Mathematical Analysis Theories
Data can be collected in a collective manner or an individual manner. Examples of collective data collection are smart city data, national geographic condition monitoring data, and earth observation data (Li et al., 2014). Such collective data are gathered using sampling strategies, and hence their quality is high. Examples of individual data collection are data from social media on the Internet, electronic business data, etc. (Shaw and Fang, 2014).
With this advancement, most data are digital in nature, and with the increase in storage capacity and methods of data collection, huge amounts of data are easily available. More data are created every second, and in order to extract the information of concern from this huge amount of data, good analytical methods are needed.
Big data can be defined as data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures, analytics, and tools to enable insights that unlock new sources of business value.
Data are a set of values of quality and quantity variables which need some mathematical approaches for the analysis of big data. There are two mathematical theories for the analysis of big data as given in Figure 3.1.
Independent and Identical Distribution Theory (IID)
Big data science is an extension of statistics and machine learning that is used for data sampling and inference. Thus, data science can be considered to provide an optimal decision about sample estimation of a population in asymptotic theory, along with function approximation under specific domain criteria. IID is used to simplify the underlying mathematics of many statistical inferences. The central limit theorem states that the probability distribution of the sum (or average) of IID variables with finite variance approaches a normal, or Gaussian, distribution, an elementary probability distribution that can be extended into the mixed Gaussian distribution and generalized distributions. Mixed Gaussian and generalized distributions are used to solve complex problems, which correspond to non-linear models in algebra.

FIGURE 3.1 Mathematical theories for the analysis of big data.
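The IID/central-limit behaviour described above can be checked with a short simulation, a sketch only: averages of n IID uniform(0, 1) variables cluster around the population mean 0.5, with spread shrinking roughly like 1/sqrt(n).

```python
import random
import statistics

random.seed(0)

def sample_mean(n):
    """Average of n IID uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

# Distribution of the average of 100 IID variables, estimated from 2000 trials.
averages = [sample_mean(100) for _ in range(2000)]

print(round(statistics.mean(averages), 2))  # close to the true mean 0.5
print(statistics.stdev(averages))           # small: roughly sigma/sqrt(100) ~ 0.029
```

A histogram of `averages` would show the bell shape the theorem predicts, even though each underlying variable is uniform, not Gaussian.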
In a relational database, a relation is a set of tuples with the same attributes or fields. A tuple defines an object and its information, whereas a domain is the set of possible values for a given attribute and can hence be considered a constraint on the values of that attribute. Relational algebra operates on sets of tuples with five operations, namely union, intersection, join, projection, and selection. Union combines the tuples of two relations, removing redundant tuples, while intersection produces the tuples the two relations share in common. The join operation is the Cartesian product of two relations restricted by a join criterion, whereas the projection operation extracts useful information or attributes from tuples. The selection operation selects tuples from a relation or table by limiting the results to those that fulfill a criterion. A mathematical space can be defined as a set endowed with added structure.
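The five operations can be sketched directly on relations modelled as Python sets of tuples; the schemas and data here are hypothetical, chosen only to make each operation visible.

```python
# Relations r and s share the schema (id, name).
r = {(1, "ann"), (2, "bob")}
s = {(2, "bob"), (3, "carol")}

union = r | s         # all tuples from both relations, redundant tuples removed
intersection = r & s  # tuples the two relations have in common

# Selection: keep only tuples that satisfy a predicate (here, id > 1).
selected = {t for t in r if t[0] > 1}

# Projection: keep only some attributes (here, the name column).
projected = {(t[1],) for t in r}

# Join: Cartesian product restricted by a join criterion (matching ids).
t_rel = {(1, "sales"), (2, "hr")}  # schema: (id, dept)
joined = {a + (dept,) for (a_id, dept) in t_rel for a in r if a[0] == a_id}

print(sorted(union))   # [(1, 'ann'), (2, 'bob'), (3, 'carol')]
print(sorted(joined))  # [(1, 'ann', 'sales'), (2, 'bob', 'hr')]
```

Because relations are sets, duplicate elimination in union and intersection comes for free, mirroring the set-theoretic definition above.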
Characteristics of Big Data
The characteristics of big data can be defined by 3 Vs, namely volume, variety, and velocity. Volume is the size of the data, measured in units such as gigabytes or petabytes. Variety refers to the different formats and types of data, as well as their different uses and different ways of analyzing them, whereas velocity refers to the rate at which data change and the rate at which they are created. Some researchers also add a fourth V, veracity, which defines the quality of data and is used to characterize big data as good, bad, or undefined based on data inconsistency, latency, incompleteness, deception, ambiguity, and approximations.
The data volume can be quantified in terms of size, such as terabytes and petabytes, as well as in terms of the number of records, tables, transactions, and files. The large amount of data is explained by the number of sources it comes from, such as logs, clickstreams, social media, satellite data, etc. Data can be divided into unstructured, semi-structured, and structured data. Unstructured data include text, human languages, etc., whereas semi-structured data include eXtensible Markup Language (XML) or Rich Site Summary feeds; when these unstructured and semi-structured data are converted into a form that can be processed by a machine, they are termed structured data. Some data, such as audio or video, are hard to categorize. There are also streaming data, which are available on a real-time basis, and multi-dimensional data, which can be drawn from a data warehouse to add historic context to big data. Thus, with big data, variety is just as big as volume. The velocity of data is the speed or frequency of generation or delivery of data.
Different Techniques of Big Data Analytics
Raw or unprocessed data are a collection of numbers and characters. Data sampling and analytics are methods of extracting decision-support information, whereas knowledge is derived from long experience with a subject. Thus, data can be considered observations about real phenomena. There are basically seven techniques used for big data analytics, which are given in Figure 3.2 and discussed below.
Ensemble Analysis

Ensemble data analysis can be applied to a whole data set or a large volume of data and hence can be termed multi-dataset or multi-algorithm analysis. The whole dataset can comprise resampled data, labeled or unlabeled data, and prior and posterior data. The word ensemble comes from machine learning, where supervised and unsupervised data are used for processing. Thus, the machine learning approach of ensemble analysis can be used to analyze data sets extracted from rough data sets using a certain algorithm or a group of algorithms, providing an efficient analysis of the dataset. Ensemble analysis uses bootstrapping, boosting, stacking, bagging, and random forest learning approaches.
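One of the approaches named above, bagging (bootstrap aggregating), can be sketched in a few lines: several weak "threshold" classifiers are trained on bootstrap resamples of the data and combined by majority vote. The toy data and the stump learner are made up for illustration.

```python
import random

random.seed(1)

# Toy labelled data: value x with label 1 if x >= 5, else 0.
data = [(x, int(x >= 5)) for x in range(10)]

def train_stump(sample):
    """Weak learner: pick the threshold that best separates the sample."""
    best_t, best_acc = 0, -1.0
    for t in range(11):
        acc = sum(int(x >= t) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def bagging_predict(x, thresholds):
    """Combine the ensemble members by majority vote."""
    votes = [int(x >= t) for t in thresholds]
    return int(sum(votes) * 2 >= len(votes))

# Bootstrap: resample the data with replacement for each ensemble member.
thresholds = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

print(bagging_predict(7, thresholds))  # 1: the stumps agree on class 1
print(bagging_predict(2, thresholds))  # 0: the stumps agree on class 0
```

The point of the resampling is variance reduction: each stump sees a slightly different sample, and the vote smooths out their individual errors.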
Association Analysis

This analysis is based on the logic that relationships among set members correspond to associations in big data. Thus, association analysis can be used for multi-type, multi-source, and multi-domain analysis of data. Association analysis is exemplified by association rule algorithms in data mining (Agrawal and Srikant, 1994), data association in target tracking, and link analysis in networks.
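The support and confidence measures underlying the association-rule algorithms cited above can be sketched directly; the basket transactions here are hypothetical.

```python
# Each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent): strength of the rule A -> C."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))       # 2 of 4 transactions -> 0.5
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets -> 0.666...
```

Apriori-style algorithms scale this idea up by pruning itemsets whose support already falls below a threshold, since no superset can have higher support.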
High-Dimensional Analysis

Mathematically, the dimension of an object is defined as the minimum number of coordinates needed to specify any point within it, whereas the dimension of a vector space is the number of coordinates needed to specify a vector. Thus, the dimension of an object is an intrinsic property, independent of the space in which the object is embedded. Dimensions can therefore be defined as the number of perspectives from which the real world can be recognized. The issue with dimensionality in big data is that, as the dimensionality increases, the volume of the space increases in such a manner that the available data become sparse; this sparsity leads to statistically significant error and dissimilarity between objects in high-dimensional space.

FIGURE 3.2 Different Big Data analysis techniques.
In order to preserve the variability of the original variables, measured by metrics of distance or variance, dimension reduction reduces the number of random variables, either by finding a subset of the original variables or by applying a linear or non-linear dimension-reducing transformation. High-dimensional analysis uses this concept for the analysis of high-dimensional data.
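The sparsity effect described above can be demonstrated with a small experiment, a sketch only: as the dimension grows, distances between random points concentrate, so the nearest and farthest neighbours of a point become hard to tell apart.

```python
import math
import random

random.seed(0)

def relative_contrast(dim, n_points=200):
    """(max - min) / min over distances from a reference point to random points."""
    pts = [[random.random() for _ in range(dim)] for _ in range(n_points)]
    ref = [0.5] * dim  # centre of the unit hypercube
    dists = [math.dist(ref, p) for p in pts]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # large: near and far points differ clearly
high_dim = relative_contrast(500)  # small: all points look roughly equidistant
print(round(low_dim, 2), round(high_dim, 2))
```

This loss of contrast is why distance-based methods (nearest neighbours, clustering) degrade in high dimensions, motivating the dimension-reduction step.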
Deep Analysis

The deep analysis technique uses deep learning for the analysis of large amounts of data and is thus used for exploring the complex structural properties of big data. Deep analysis can be used for analyzing unobservable variables, hidden hierarchies, local correlations, hidden parameters, and the complex distributions of random variables. Deep analysis is thus defined as a deterministic or stochastic transform function from input to output.
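The "deterministic transform from input to output" view can be made concrete with a minimal two-layer feed-forward pass in plain Python; the weights here are fixed for illustration, not learned.

```python
def relu(x):
    """Rectifier non-linearity used between layers."""
    return max(0.0, x)

def layer(inputs, weights, biases):
    """One dense layer: each output is a weighted sum of inputs plus a bias."""
    return [relu(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    """Deterministic transform: input vector -> hidden layer -> scalar output."""
    hidden = layer(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.1])
    # Output layer: linear combination of the hidden units (no activation).
    return sum(w * h for w, h in zip([1.0, 2.0], hidden))

print(forward([2.0, 1.0]))
```

Stacking more such layers is what lets deep models capture the hidden hierarchies and local correlations mentioned above; training would adjust the weights, which this sketch omits.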
Precision Analysis

Precision is defined as the resolution of a representation, given by the number of decimal or binary digits. It is related to, but distinct from, accuracy, which is defined as the nearness of the calculated value to the true value. A measurement system can be accurate but not precise, precise but not accurate, neither, or both. When an experiment contains a systematic error, increasing the sample size generally increases precision but does not improve accuracy; eliminating the systematic error improves accuracy but does not change precision. Thus, the two are independent aspects of measurement quality.
Precision analysis is used for evaluating the veracity of data from the perspective of data utility and data quality. Relating big data to linguistics, veracity is analogous to semantics, and utility is analogous to pragmatics.
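The accuracy/precision distinction above can be sketched numerically: readings with a systematic offset are precise (small spread) but inaccurate (biased mean). The instrument readings are hypothetical.

```python
import statistics

true_value = 10.0
readings = [10.48, 10.51, 10.50, 10.49, 10.52]  # hypothetical instrument output

bias = statistics.mean(readings) - true_value   # accuracy error (systematic)
spread = statistics.stdev(readings)             # precision (random scatter)

print(round(bias, 2))    # large offset: poor accuracy
print(round(spread, 3))  # tiny scatter: high precision

# Averaging more readings shrinks the spread but cannot remove the bias;
# only correcting the systematic error fixes accuracy.
corrected = [r - bias for r in readings]
print(round(statistics.mean(corrected) - true_value, 6))
```

This is exactly the behaviour described above: larger samples improve precision, while eliminating the systematic error improves accuracy.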
Divide and Conquer Analysis
Divide and conquer analysis is a computational strategy used to increase the efficiency of problem solving and the velocity of big data computation. This analysis recursively breaks a problem into two or more sub-problems until each becomes simple enough to be solved directly; in the conquer stage, the solutions to the sub-problems are combined into a solution to the original problem. Divide and conquer analysis performs better on multiprocessor machines, where distinct sub-problems can be executed on different processors and information is exchanged by passing messages between them.
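The strategy above can be sketched with the classic merge sort: split the problem, solve the halves recursively, then combine the partial results.

```python
def merge_sort(items):
    if len(items) <= 1:             # simple enough: solve directly
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])   # divide: independent sub-problems,
    right = merge_sort(items[mid:])  # which could run on separate processors
    return merge(left, right)        # conquer: combine the sub-solutions

def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

print(merge_sort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```

Because the two recursive calls never touch each other's data, they map naturally onto separate processors, with only the merge step requiring the message passing mentioned above.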
Perspective Analysis

Perspective analysis is defined as a set of operations that process structured and unstructured data, by means of intermediate representations, to create a set of prescriptions. These operations produce essential changes to variables, and the variables influence target metrics over a specified time frame. Thus, perspective analysis can be used to represent time series, detected patterns, and relationships between different sets of variables and metrics. A predictive model can be used to predict the future time series of a metric by forecasting its influencing variables. The analysis first transforms the unstructured and structured data into analytically prepared data; this image analytics is then used to provide real-time analysis and future prediction in different sectors.
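The predictive step described above can be sketched with the simplest possible time-series model, a moving-average forecast; the metric values and window size are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Predict the next point as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# A metric's time series after transformation into analytically prepared data.
daily_metric = [100, 104, 103, 107, 110, 112]
print(moving_average_forecast(daily_metric))  # (107 + 110 + 112) / 3
```

A real perspective-analysis pipeline would replace this forecaster with a richer model of the influencing variables, but the shape is the same: prepared data in, a forward-looking prescription out.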