Vision problems generally require many complex phases before a final solution is reached in any particular research work. Table 10.1 presents the basic classification strategies that are referred to as core machine learning. The structure of the basic neural network topology is represented in Figure 10.3. Kumar [27,28] presented the detailed architecture of the important deep networks and the issues of large-scale data analytics, which are described in the next subsections.


In this section, the specific components of the deep network structure are described. The basic features of these components distinguish the conventional neural network from the deep learning neural network in how they exploit the complex feature space of visual objects and resolve the severe computational issues that arise with such a feature space.

The main component of a deep neural network is the CNN, which focuses on convolution as the major operation over a large number of matrices. An intelligent system built on the CNN can process complex features efficiently. Apart from the CNN, the vital technologies include the RNN, long short-term memory, autoencoders, and the restricted Boltzmann machine. In this chapter, we primarily focus on a brief architecture of the CNN and describe its performance in solving video processing tasks. In general, every CNN model contains four basic building blocks: convolution, sampling, the nonlinearity unit, and the fully connected layers used for feature classification.

TABLE 10.1 Machine Learning Algorithms and Their Feature Processing

Machine Learning Scheme | Feature Description
Supervised (label-based) learning | A model is developed with labeled training datasets and tested against samples, e.g., classification and regression.
Unsupervised (unlabeled) learning | Labeled training samples are missing, e.g., clustering and noise reduction.
Reinforcement learning | Based on penalty and reward functions for the input sample data, e.g., Markov decision process.
Active machine learning | Sampling is performed based on queries on selected data.
Representation-based learning | Includes feature engineering: feature selection, extraction, and reduction.
Transfer-based learning | Domain-invariant-based transfer learning.
Kernel-based machine learning | Nonlinear multimedia processing to reduce high dimensionality, e.g., LDA, PCA, support vector machine.


This refers to the introduction of a class of thresholding into the layers of a traditional neural network. The nonlinearity layers are introduced by utilizing an activation function such as the sigmoid. In real-life scenarios, almost all problems are nonlinear, and this is accommodated by introducing an activation function of the same nonlinear nature to provide the threshold for the CNN. The most basic thresholding functions are the rectified linear unit (ReLU) [8,9], the logistic function, and the exponential linear unit [10]. In general, all the activation functions operate pixelwise. The deep learning literature provides evidence that several advanced versions of ReLU are more powerful. Examples include modified ReLU, leaky ReLU, parametric ReLU, and randomized leaky ReLU, represented in (10.2) and (10.3), which outperformed the state of the art.
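
The thresholding functions named above can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's own formulation: the slope `alpha=0.01` for leaky ReLU and `alpha=1.0` for the exponential linear unit are assumed defaults, and `apply_pixelwise` is a hypothetical helper name. Parametric ReLU has the same form as leaky ReLU with a learned slope, and randomized leaky ReLU draws the slope at random during training.

```python
import math

def relu(x):
    # Rectified linear unit: passes positive values, clamps the rest to zero.
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: small slope alpha for negative inputs (alpha is an assumed default).
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # Exponential linear unit: smooth saturation toward -alpha for negative inputs.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def logistic(x):
    # Logistic (sigmoid) function, mapping any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def apply_pixelwise(fn, image):
    # Apply an activation independently to every pixel of a 2D feature map.
    return [[fn(p) for p in row] for row in image]

feature_map = [[-2.0, 0.0], [0.5, 3.0]]
print(apply_pixelwise(relu, feature_map))
print(apply_pixelwise(leaky_relu, feature_map))
```

Because each function looks only at a single value, the same helper works for any feature-map size.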


The activation of the neural network from the convolution layers mitigates fine-tuning issues and maps the higher dimensional data space to a lower dimensional data space. The convolution of an image with a 3×3 kernel is basically characterized by three parameters: depth, stride, and zero padding. The number of convolution filters used in the CNN determines the depth of the CNN. Stride refers to the number of pixels that the kernel jumps during one convolution step. Zero padding is the mechanism that allows the boundary pixels to be involved in the convolution process. The basic convolution filters, which serve as building blocks, include the Gaussian, Laplacian, Sobel, and box filters. Selecting a larger number of filters customarily gives better training of the deep network, provided that the computational tools are sufficient to handle the processing overhead. Apart from convolution, many more conceptual layers are introduced when developing a deep network. These may be pooling layers, batch normalization layers, dropout, and fully connected layers. Each layer corresponds to a different operation in the deep network, and introducing these layers depends on the particular objectives.
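
To make stride and zero padding concrete, the following is a minimal pure-Python sketch of a single-channel convolution with a 3×3 box filter. The function name `conv2d` and the 5×5 test image are illustrative assumptions, not taken from the chapter. As is usual in CNNs, the kernel is applied without flipping (cross-correlation).

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a 2D kernel over a 2D image (no kernel flip, as in CNN layers)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero padding: surround the image with a border of zeros so that
    # boundary pixels also take part in the convolution.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for i in range(h):
        for j in range(w):
            padded[i + padding][j + padding] = image[i][j]
    # Output size follows from input size, kernel size, padding, and stride.
    out_h = (h + 2 * padding - kh) // stride + 1
    out_w = (w + 2 * padding - kw) // stride + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    # Stride controls how far the window jumps between outputs.
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

box = [[1.0 / 9.0] * 3 for _ in range(3)]          # 3x3 box (averaging) filter
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]

same = conv2d(image, box, stride=1, padding=1)     # padding keeps the 5x5 size
strided = conv2d(image, box, stride=2, padding=0)  # stride 2 shrinks it to 2x2
print(len(same), len(same[0]))
print(len(strided), len(strided[0]))
```

Stacking several such kernels, one output map per filter, gives the depth dimension described above.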


In this section, we discuss some standard deep networks that underpin the success of deep learning methods across many engineering and science disciplines.

10.5.1 LENET (1990)

In 1990, when deep learning drew little attention in the research community, Yann LeCun developed the very first convolution neural network. It marked the breakthrough of deep learning in the specific domain of optical character recognition. The basic LeNet architecture is given in Figure 10.4, which presents the discrimination of visual objects at the prediction layer. This network provides the basis for all modern deep neural networks, which need a cascade of convolution, pooling, and nonlinearity layers. ReLU [15,16] is generally applied before the pooling and fully connected layers of the network.
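
The cascade of convolution and pooling layers can be followed by tracing how the feature-map size changes from layer to layer. The sketch below assumes the commonly cited LeNet-5 configuration (32×32 input, 5×5 convolutions with 6 and 16 feature maps, 2×2 pooling); these numbers are an assumption for illustration and do not come from the chapter, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    # Spatial size after a convolution with the given kernel, stride, and padding.
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, window):
    # Spatial size after non-overlapping pooling with a square window.
    return size // window

# (layer type, kernel or window size, number of output feature maps)
# Assumed LeNet-5-style cascade; not taken from the chapter.
LAYERS = [("conv", 5, 6), ("pool", 2, 6), ("conv", 5, 16), ("pool", 2, 16)]

def trace_shapes(input_size, layers):
    size, shapes = input_size, []
    for kind, k, maps in layers:
        if kind == "conv":
            size = conv_output_size(size, k)
        else:
            size = pool_output_size(size, k)
        shapes.append((kind, maps, size, size))
    return shapes

shapes = trace_shapes(32, LAYERS)
for kind, maps, h, w in shapes:
    print(f"{kind}: {maps} x {h} x {w}")

# Number of features handed to the fully connected layers.
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
print("features fed to the fully connected layers:", flattened)
```

Each convolution trims the border and each pooling step halves the resolution, so the cascade reduces a 32×32 image to a compact feature vector before classification.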

10.5.2 ALEXNET (2012)

The architecture of AlexNet builds on the convolution neural network blocks developed in LeNet. Notably, the only difference is the number of filters used for reducing dimensionality between the various pooling layers. The details of the network are presented in [3]. Evidence from the ImageNet challenge shows that the network was trained on two NVIDIA GTX 580 graphics cards with over 1.2 million sample images of the large dataset. For pure classification, training on such a data sample takes five to six days. The network uses five convolution layers and trains 60 million parameters and 650,000 neurons.

10.5.3 ZFNET (2013)

Zeiler and Fergus developed a deep network named ZFNet in 2013 that exposes the intermediate functionality of the classification methodology inside the deep network layers. They tweaked the complete architecture proposed in the AlexNet convolution neural network. ZFNet demonstrated state-of-the-art results on the Caltech-101 and Caltech-256 benchmarks. By training with ImageNet [14] for classification on a GTX 580 for 12 days, they mapped features back to pixel space, the opposite of what the convolution layers do. In this network, error computation and activation are performed by the cross-entropy loss and ReLU, respectively. During the classification process on the ImageNet benchmark, the dropout approach is utilized to achieve regularization.
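
The roles of ReLU, the cross-entropy loss, and dropout can be illustrated with a small sketch. It assumes that the cross-entropy is computed on softmax outputs and that dropout is the "inverted" variant that rescales the surviving activations; both are common choices assumed here, not details confirmed by the chapter. All values are made up for illustration.

```python
import math
import random

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class index.
    return -math.log(softmax(logits)[target])

def dropout(activations, rate, rng, training=True):
    # Inverted dropout (assumed variant): zero each unit with probability
    # `rate` and rescale the survivors so the expected value is unchanged.
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [max(0.0, v) for v in [-1.0, 2.0, 0.5, -0.3]]   # ReLU activation
dropped = dropout(hidden, rate=0.5, rng=rng)             # regularization
loss = cross_entropy([2.0, 1.0, 0.1], target=0)          # error computation
print(dropped, loss)
```

At test time `training=False` leaves the activations untouched, which is why the rescaling is applied during training.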

10.5.4 GOOGLENET (2014)

The massive data of real-world scenarios entail a huge number of parameters, which was a black spot on the success of earlier deep network models. The number of parameters used in GoogleNet was only 4 million, whereas AlexNet used 60 million parameters. The reduction of such a large number of parameters was made possible by introducing the inception model [35]. The architecture of the inception model is represented in Figure 10.4.

FIGURE 10.4 Inception module of the GoogleNet architecture [35].

The inception model of GoogleNet set the benchmark in the detection and visual recognition literature. From the evidence of several experiments on ImageNet14, it is concluded that the inception model with its successive versions, that is, Inception-v3 and Inception-v4, is a fast and suitable model for resolving the issues of complex visual analytics. In this model, the Hebbian principle is adopted to obtain better optimization control, resulting in a model with only 22 layers. The mechanism of the inception model is motivated by selecting convolution filters of several sizes and concatenating their outputs. The convolution mechanism used in the inception model helps keep the CNN free from overfitting. The fully connected layers, which are prone to overfitting, are replaced by global average pooling layers.
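
Two of the ideas above, concatenating the outputs of filters of different sizes and global average pooling, can be sketched as follows. The branch outputs are made-up 2×2 feature maps standing in for the results of differently sized filters; the function names and values are illustrative assumptions only.

```python
def concat_channels(branches):
    # Stack the feature maps of all branches along the channel axis.
    # Every branch must produce maps with the same height and width.
    h, w = len(branches[0][0]), len(branches[0][0][0])
    out = []
    for maps in branches:
        for fmap in maps:
            if len(fmap) != h or len(fmap[0]) != w:
                raise ValueError("branches must share the same spatial size")
            out.append(fmap)
    return out

def global_average_pool(feature_maps):
    # Reduce every feature map to a single number: its mean value.
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

# Hypothetical outputs of three parallel branches (e.g. small, medium, large filters).
branch_small = [[[1.0, 1.0], [1.0, 1.0]]]
branch_medium = [[[0.0, 2.0], [4.0, 6.0]], [[1.0, 3.0], [5.0, 7.0]]]
branch_large = [[[2.0, 2.0], [2.0, 6.0]]]

stacked = concat_channels([branch_small, branch_medium, branch_large])
print(len(stacked))                  # 4 channels after concatenation
print(global_average_pool(stacked))  # [1.0, 3.0, 4.0, 3.0]
```

Global average pooling has no trainable weights, which is one reason it is less prone to overfitting than a fully connected layer over the same feature maps.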

10.5.5 VGGNET (2014)

The large number of layers required for handling a huge number of parameters in large-scale data models was considered a severe problem until 2014. This model is therefore well suited to dealing with large-scale data statistics without concern for the number of layers used in the network.

The experimental setup for training VGGNet was built with four Titan Black GPUs and took three to four weeks (21-28 days) to complete the training phase. The model used the Caffe toolbox as its backend, and training was optimized with the stochastic gradient descent (SGD) scheme. The experimental observations [11,13] presented in Table 10.2 show only 7.5% error on the validation set for the model with 19 layers and 7.3% error on the test set for top-five classification.
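
The SGD scheme mentioned above updates the parameters after every individual sample instead of after a full pass over the data. The sketch below applies it to a toy one-dimensional linear model; the learning rate, epoch count, and data are illustrative assumptions and have nothing to do with the actual VGGNet training setup.

```python
import random

def sgd_linear_fit(samples, lr=0.05, epochs=200, seed=0):
    # Fit y = w * x + b by stochastic gradient descent on the squared error.
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)              # visit the samples in random order
        for x, y in data:
            err = (w * x + b) - y      # prediction error for this one sample
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b

# Noise-free toy data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear_fit(samples)
print(round(w, 3), round(b, 3))
```

In a deep network the same update rule is applied to millions of weights, usually on small batches of images rather than single samples.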

10.5.6 RESNET (2015)

After the classical performance of VGGNet on large-scale data, it was assumed that developing a bigger network results in higher performance. Such deeper neural networks perform better with extensive data, but training such a network is a highly tedious job. The main credit for resolving the processing issue with deeper networks goes to Kaiming He, in the ILSVRC 2015 challenge.

The layers used in the network were modified to learn residual functions [11]. This function optimizes the computations and achieves higher accuracy. Experiments show that even batch normalization fails to reduce the validation and training error when extra layers are introduced into the network. In the ResNet model presented in Figure 10.5, this problem is solved by introducing a bypass whose signal is summed with the output of the CNN layers. The model took 21 days to train 152 layers on the ImageNet benchmark using eight GPUs.
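
The bypass can be written as y = ReLU(F(x) + x), where F is the residual function learned by the stacked layers. The sketch below uses two small fully connected layers for F as a simplifying assumption (the network described above uses convolution layers), and all weights are made up for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    # Dense layer: one weighted sum plus bias per output unit.
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(x, w1, b1, w2, b2):
    # F(x): two layers with a ReLU in between (simplified stand-in for conv layers).
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    # Bypass: add the unchanged input to F(x) before the final activation.
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all weights at zero, F(x) = 0 and the block simply passes x through.
print(residual_block(x, zeros_w, zeros_b, zeros_w, zeros_b))  # [1.0, 2.0]
```

Because an extra block can fall back to the identity in this way, adding layers need not increase the training error, which is the problem the bypass addresses.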

TABLE 10.2 The Comparative Classification Performance of the ResNet Model

Deep Network Model | Error Analysis (%) | Computation Time (ms)
VGG Model-A | |
ResNet Model-34 | |
BN-Inception Model | |
ResNet Model-50 | |
ResNet Model-101 | |
Inception-v3 Model | |

FIGURE 10.5 Inception module of the ResNet model.


In this chapter, the prime concern is the processing issues that arise with heterogeneous content in real-life visual media. To resolve these issues, deep data analytics and graphical processing algorithms need to be developed for extracting and evaluating information from real-life visual media. Training an optimized deep network with graph-based pixel-level processing is a better choice than handcrafted feature engineering for local and global feature extraction. A visual domain generally consists of real-time multimedia activities, and the semantic understanding of poorly retrieved information from a complex environment is still an open research problem. Another issue with the processing of complex information is the big data that evolves from live streaming of video sequences. In such a case, without big data analytics, the experimental issues can degrade the performance of conventional machine learning algorithms. Furthermore, the five components of big data issues can cause every conventional machine learning technique to fail to retrieve 100% exact information. Therefore, to exploit deep network architectures, it is necessary to develop optimized deep data analytics systems and high-performance algorithms. In brief, this work raises the future scope of developing deep neural networks with optimized algorithms that can give better performance in computing unstructured visual data with large-scale analytics.


The CNN has performed successfully in solving image classification problems. However, in the case of video-based experiments on big data, developing a big network does not guarantee higher performance. This suggests that processing large-scale visual media requires building up the mechanism for every layer in the network. Such high-level processing cannot be expected without a rich supply of processing devices. Therefore, to balance costly computational resources against time economy, it is necessary to develop an optimized deep network for extracting information from unconstrained visual media. The concluding remark for future work is the absence of an adaptively fast and small deep network. Although stacking more layers and effective subsampling can increase the size of the receptive field, the central receptive field of each neuron does not participate equally. The thorough study in this work concludes that many topics, such as SGD, graphs, and Riemannian-manifold-based deep learning, still need to be pursued for better solutions to open challenging problems in social media communication, sensor networks of neurons for analysis of the human brain, the stock market for financial evaluations, and geographical study based on a network of satellites. In the review presented in [24], the main problem for machine learning with graphs is highlighted as the information association between the nodes. The encoding and decoding scheme with the embedding of graphs is helpful for informatics. The recent state of the art presents scalability and interpretability in temporal graphs as open issues [26].


  • big data analytics
  • convolution neural network
  • long short-term memory
  • deep learning
  • graphs and manifolds
  • unstructured complex datasets
  • visual media


  • 1. Jordan, M. I. (Ed.) (1998). Learning in Graphical Models, Vol. 89. New York, NY, USA: Springer Science & Business Media.
  • 2. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, M. M. (2017, July). Geometric deep learning on graphs and manifolds using mixture model CNNs. In Conference on Computer Vision and Pattern Recognition, p. 3.
  • 3. Riba, P., Dutta, A., Llados, J., & Fornes, A. (2017, November). Graph-based deep learning for graphics classification. In 14th IAPR International Conference on Document Analysis and Recognition (pp. 29-30).
  • 4. Tombre, K., & Lamiroy, B. (2003). Graphics recognition - from re-engineering to retrieval. In International Conference on Document Analysis and Recognition (pp. 148-155).
  • 5. Nayef, N., & Breuel, T. M. (2010). A branch and bound algorithm for graphical symbol recognition in document images. In International Workshop on Document Analysis Systems (pp. 543-545).
  • 6. Gantz, J., & Reinsel, D. (2011, June). Extracting Value from Chaos. Hopkinton, MA, USA: EMC.
  • 7. Gantz, J., & Reinsel, D. (2010, May). The Digital Universe Decade - Are You Ready. Hopkinton, MA, USA: EMC.
  • 8. Naresh Babu, K. V., & Edla, D. R. (2017). New algebraic activation function for multilayered feed forward neural networks. IETE Journal of Research, 63(1), 71-79.
  • 9. Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
  • 10. Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
  • 11. Huang, F., Ash, J., Langford, J., & Schapire, R. (2017). Learning Deep ResNet Blocks Sequentially using Boosting Theory. arXiv preprint arXiv:1706.04964.
  • 12. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37).
  • 13. Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2016). Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993.
  • 14. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (pp. 1097-1105).
  • 15. Xu, B., Wang, N., Chen, T., & Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
  • 16. Jin, X., Xu, C., Feng, J., Wei, Y., Xiong, J., & Yan, S. (2016). Deep learning with S-shaped rectified linear activation units. In 30th AAAI Conference on Artificial Intelligence (pp. 1737-1743).
  • 17. Tian, F., Gao, B., Cui, Q., Chen, E., & Liu, T. Y. (2014, July). Learning deep representations for graph clustering. In 28th AAAI Conference on Artificial Intelligence (pp. 1293-1299).
  • 18. Yan, S., Xiong, Y., & Lin, D. (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455.
  • 19. Jain, A., Zamir, A. R., Savarese, S., & Saxena, A. (2016). Structural-RNN: Deep learning on spatio-temporal graphs. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 5308-5317).
  • 20. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., & Monfardini, G. (2009). The graph neural network model. IEEE Transactions on Neural Networks, 20(1), 61-80.
  • 21. Kipf, T. N., & Welling, M. (2016). Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
  • 22. Niepert, M., Ahmed, M., & Kutzkov, K. (2016, June). Learning convolutional neural networks for graphs. In International Conference on Machine Learning (pp. 2014-2023).
  • 23. Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • 24. Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.
  • 25. Baradel, F., Neverova, N., Wolf, C., Mille, J., & Mori, G. (2018). Object Level Visual Reasoning in Videos. arXiv preprint arXiv:1806.06157.
  • 26. Bui, T. D., Ravi, S., & Ramavajjala, V. (2018, February). Neural Graph Learning: Training Neural Networks Using Graphs. In 11th ACM International Conference on Web Search and Data Mining (pp. 64-71).
  • 27. Kumar, N. (2017, December). Large scale deep network architecture of CNN for unconstraint visual activity analytics. In International Conference on Intelligent Systems Design and Applications (pp. 251-261).
  • 28. Kumar, N. (2017, December). Machine Intelligence Prospective for Large Scale Video based Visual Activities Analysis. In Ninth International Conference on Advanced Computing (pp. 29-34).
  • 29. Gallagher, B. (2006). Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS, 6, 45-53.
  • 30. Jargalsaikhan, I., Little, S., & O'Connor, N. E. (2017, August). Action localization in video using a graph-based feature representation. In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (pp. 1-6).
  • 31. Zanfir, A., & Sminchisescu, C. (2018). Deep Learning of Graph Matching. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 2684-2693).
  • 32. Zhang, M., Cui, Z., Neumann, M., & Chen, Y. (2018). An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence.
  • 33. Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., & Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In International Conference on Neural Information Processing Systems (pp. 2224-2232).
  • 34. Henaff, M., Bruna, J., & LeCun, Y. (2015). Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
  • 35. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
  • 36. Wang, L., Guo, S., Huang, W., & Qiao, Y. (2015). Places205-VGGNet models for scene recognition. arXiv preprint arXiv:1508.01667.
  • 37. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems (pp. 1097-1105).
  • 38. Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).
  • 39. Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, Faster, Stronger. arXiv preprint.
  • 40. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In International Conference on Neural Information Processing Systems (pp. 91-99).
  • 41. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., & Bronstein, M. M. (2017, July). Geometric deep learning on graphs and manifolds using mixture model CNNs. In IEEE Conference on Computer Vision and Pattern Recognition, p. 3.
  • 42. Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., & Vandergheynst, P. (2017). Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4), 18-42.
  • 43. Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905.
  • 44. Huang, Z., Wan, C., Probst, T., & Van Gool, L. (2017). Deep learning on Lie groups for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 1243-1252).
  • 45. [Online]. Available: examples_datasets/imagenet.html