Big Data Analytics Tools

Analytics tools are widely used to extract meaningful information from large volumes of data. New tools appear on the market every day to analyze data in all its forms and extract value from it. Analytics reduces a company's cost and time, because past data can inform many decisions in critical situations. It is becoming a necessity for all companies because of the productivity it delivers at minimal cost. As most of the tools are open source, users can download them for free and modify the modules to suit the requirements of their organization.

1.10.1 Hadoop

Hadoop is a software framework for the distributed processing of data on clusters of computers (Vavilapalli et al. 2013). It can scale from one machine to many machines in any environment. It is one of the tools most commonly used by companies for big data processing; the hardware requirements are modest because the processing is spread across a cluster of inexpensive machines, often in the cloud.

Features of Hadoop

Many features are associated with Hadoop; a few major ones are listed here.

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system that runs on commodity hardware and provides high-throughput access to data, which MapReduce jobs then process. HDFS stores data across multiple machines and replicates each block of data on several servers (three by default), so that the data remain available if a server fails. It contains two types of nodes:

■ Name node

■ Data node.
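The division of labor between the name node and the data nodes can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the function names, the tiny block size, and the round-robin placement are hypothetical choices for illustration, whereas real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, data_nodes, replication=3):
    """Build name-node style metadata: block id -> data nodes holding a copy.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one live replica."""
    return {
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
survivors = readable_blocks(placement, failed_nodes={"dn1", "dn2"})
```

With three replicas spread over four data nodes, every block in this sketch stays readable after two nodes fail, which is the point of replication.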

Features of HDFS

■ Fault tolerance

■ High reliability

■ High replication

■ Scalability

■ Distributed storage.

MapReduce

It contains two tasks: map and reduce. The map task takes a set of data and converts it into another set of data in which the individual elements are broken down into key-value pairs (tuples). The reduce task then takes the output of the map task as its input and combines those tuples into a smaller set of tuples (Bhandarkar 2010).

■ Map: splitting and mapping

■ Reduce: shuffling and reducing
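The phases named above can be illustrated with a word count, the customary introductory example. The sketch below runs in a single Python process and does not use Hadoop itself; the function names are hypothetical and chosen only to mirror the map, shuffle, and reduce steps.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group together all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single pair."""
    return key, sum(values)


def word_count(lines):
    # Splitting: each line is treated as an independent input split.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data tools", "big data analytics"])
```

Here the map step emits six pairs and the reduce step returns four, one per distinct word, which shows how reducing yields a smaller set of tuples.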

Features of MapReduce

■ Local data processing

■ Built-in redundancy

■ Language independence

■ MapReduce execution framework

■ Inter-process communication.

1.10.2 Apache Spark

Apache Spark is an open-source cluster computing framework designed for fast computation. It builds on the MapReduce concept, and its in-memory cluster computing increases the processing speed of applications.
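Why keeping data in memory speeds up processing can be shown with a small stand-in written in plain Python. The `Dataset` class below is hypothetical and is not Spark's API; it only imitates the idea of lazy transformations whose result is cached in memory, so that repeated actions do not re-read the source.

```python
class Dataset:
    """Toy dataset with lazy map transformations and optional caching."""

    def __init__(self, loader, transforms=()):
        self._loader = loader              # function that reads the source data
        self._transforms = tuple(transforms)
        self._use_cache = False
        self._cached = None

    def map(self, fn):
        # Lazy: the function is recorded, nothing is computed yet.
        return Dataset(self._loader, self._transforms + (fn,))

    def cache(self):
        # Ask for the computed rows to be kept in memory after the first action.
        self._use_cache = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)      # served from memory
        rows = self._loader()
        for fn in self._transforms:
            rows = [fn(row) for row in rows]
        if self._use_cache:
            self._cached = rows
        return list(rows)


reads = {"count": 0}


def load():
    reads["count"] += 1                    # counts how often the source is read
    return [1, 2, 3, 4]


squares = Dataset(load).map(lambda x: x * x).cache()
total = sum(squares.collect())             # first action reads the source
largest = max(squares.collect())           # second action reuses cached rows
```

In this sketch the source is read once even though two results are computed; without the call to `cache()`, each action would read it again, which is the cost that iterative jobs pay when every step goes back to disk.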

Features of Spark

■ Speed

■ Advanced analytics

■ Swift processing

■ Dynamic in nature

■ In-memory computation

■ Reusability

■ Fault tolerance

■ Supports multiple languages like Java, R, Scala, and Python.

1.10.3 Apache Storm

Apache Storm is a free and open-source distributed real-time computation system written in Java and Clojure. It is used for real-time data analytics.
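Real-time stream processing means that each record is handled as it arrives rather than after a whole batch has been collected. The following single-process sketch uses Python generators to imitate that flow. The terms spout (a stream source) and bolt (a processing step) come from Storm's vocabulary, but the functions here are hypothetical and do not use Storm's API.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one sentence at a time as the source of the stream."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Bolt: splits each incoming sentence into lower-case words."""
    for sentence in stream:
        for word in sentence.split():
            yield word.lower()


def count_bolt(stream):
    """Bolt: keeps running counts and emits an update for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(
    ["storm streams data", "data never stops"]
)))
updates = list(topology)
```

Each element of `updates` is produced as soon as its word passes through the pipeline, so a consumer could act on the running count of a word without waiting for the stream to end.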

Features of Storm

■ Robust and user friendly

■ Real-time stream processing

■ Fault tolerance

■ Flexible

■ Reliable

■ Operational intelligence.

1.10.4 NoSQL Databases

In the 1970s, flat file systems were used to store data, but they offered no standardized way of storing it. A NoSQL database is a non-relational (non-SQL) database: it stores data using mechanisms other than the tabular relations of the relational model. It is mainly used for real-time web applications and big data (Han et al. 2011).

Databases can be classified into three types:

  1. RDBMS
  2. Online analytical processing (OLAP)
  3. NoSQL
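The non-tabular storage that distinguishes NoSQL from an RDBMS can be illustrated with a minimal document store. The `DocumentStore` class is a hypothetical in-memory sketch, not the interface of any particular NoSQL product; it shows only that records in one collection need not share the same fields.

```python
class DocumentStore:
    """Toy key-document store: each record is a free-form dictionary."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        # No schema is enforced; any set of fields is accepted.
        self._docs[key] = dict(document)

    def get(self, key):
        return self._docs.get(key)

    def find(self, **criteria):
        """Return the documents whose fields match all given values."""
        return [
            doc for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = DocumentStore()
store.put("sensor-1", {"type": "temperature", "value": 21.5, "unit": "C"})
store.put("sensor-2", {"type": "gps", "lat": 12.97, "lon": 77.59})
matches = store.find(type="gps")
```

A relational table would need every row to fit one set of columns, whereas here the temperature reading and the GPS reading carry different fields side by side.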

Features of NoSQL

■ Performance is high.

■ Uses a familiar query language.

■ Less downtime.

■ Scalability is easy.

■ Flexible.

1.10.5 Cassandra

Cassandra is a distributed NoSQL database in which the data are stored on many servers, with a replication factor greater than one, so that the data remain available at all times without any downtime (Lakshman and Malik 2010).
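How a replication factor keeps data available can be sketched with a small hash ring. The code below is a simplified illustration and not Cassandra's implementation: the `Ring` class, the node names, and the use of MD5 as the hash are assumptions, and replicas are simply placed on the next nodes clockwise around the ring.

```python
import hashlib
from bisect import bisect_right


def token(value: str) -> int:
    """Hash a string to a position on the ring (MD5 is used for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring that stores each key on several consecutive nodes."""

    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds the number of nodes")
        self.replication_factor = replication_factor
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [position for position, _ in self._ring]

    def replicas(self, key):
        """Return the nodes that hold a copy of the given key."""
        start = bisect_right(self._tokens, token(key)) % len(self._ring)
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replication_factor)
        ]

    def available(self, key, down_nodes):
        """A key stays readable while at least one replica node is up."""
        return any(node not in down_nodes for node in self.replicas(key))


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
still_up = ring.available("user:42", down_nodes=set(owners[:2]))
```

With a replication factor of three, the key in this sketch survives the loss of any two of its replica nodes and becomes unavailable only when all three are down.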

Features of Cassandra

■ Fast writing

■ Replication

■ Schema-free

■ Transaction support

■ Application programming interface

■ Flexible data storage.

1.10.6 RapidMiner

RapidMiner is an environment for data mining and machine learning (ML). It can be applied to both research and real-world data mining tasks (Hofmann and Klinkenberg 2013).

Features of RapidMiner

■ Multimedia mining

■ Text mining

■ Feature engineering

■ Datastream mining

■ Distributed data mining.


This chapter has discussed the components of IoT systems, such as sensors, cloud servers, IoT gateways, and physical devices, which transmit data from the environment to the network. It has also discussed big data analytics, where data are generated continuously. The challenges in generating big data were classified, and the different patterns of IoT data were categorized. Media, business, and IoT are the major sources of big data. The components of a big data system, namely data acquisition, data retention, data processing, data transport, and data leverage, were explained. Predictive, prescriptive, diagnostic, and descriptive analytics are used to reach optimal decisions. Tools are also used to store the data according to its format.


Al-Fuqaha, Ala, Mohsen Guizani, Mehdi Mohammadi, Mohammed Aledhari, and Moussa Ayyash. “Internet of Things: A survey on enabling technologies, protocols, and applications.” IEEE Communications Surveys & Tutorials 17, no. 4 (2015): 2347-2376.

Atzori, Luigi, Antonio Iera, and Giacomo Morabito. “The Internet of Things: A survey.” Computer Networks 54, no. 15 (2010): 2787-2805.

Bertsimas, Dimitris, and Nathan Kallus. “From predictive to prescriptive analytics.” arXiv preprint arXiv:1402.5481 (2014).

Bhandarkar, Milind. “MapReduce programming with Apache Hadoop.” In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-1. IEEE, Atlanta, GA, 2010.

Boyd, Danah, and Kate Crawford. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon.” Information, Communication & Society 15, no. 5 (2012): 662-679.

Cagan, Christopher L. “Method and apparatus for advanced mortgage diagnostic analytics.” U.S. Patent 7,853,518, issued December 14, 2010.

Chen, Shanzhi, Hui Xu, Dake Liu, Bo Hu, and Hucheng Wang. “A vision of IoT: Applications, challenges, and opportunities with China perspective.” IEEE Internet of Things Journal 1, no. 4 (2014): 349-359.

Deshpande, Amol, Carlos Guestrin, Samuel R. Madden, Joseph M. Hellerstein, and Wei Hong. “Model-driven data acquisition in sensor networks.” In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Vol. 30, pp. 588-599. VLDB Endowment, Toronto, Canada, 2004.

Duan, Yanqing, John S. Edwards, and Yogesh K. Dwivedi. “Artificial intelligence for decision making in the era of big data - Evolution, challenges and research agenda.” International Journal of Information Management 48 (2019): 63-71.

Gandomi, Amir, and Murtaza Haider. “Beyond the hype: Big data concepts, methods, and analytics.” International Journal of Information Management 35, no. 2 (2015): 137-144.

Govindan, Kannan, T. C. Edwin Cheng, Nishikant Mishra, and Nagesh Shukla. “Big data analytics and application for logistics and supply chain management.” Transportation Research Part E: Logistics and Transportation Review 114 (2018): 343-349.

Gröger, Christoph, Holger Schwarz, and Bernhard Mitschang. “Prescriptive analytics for recommendation-based business process optimization.” In International Conference on Business Information Systems, pp. 25-37. Springer, Cham, 2014.

Han, Jing, E. Haihong, Guan Le, and Jian Du. “Survey on NoSQL database.” In 2011 6th International Conference on Pervasive Computing and Applications, pp. 363-366. IEEE, Port Elizabeth, South Africa, 2011.

Hazen, Benjamin T., Christopher A. Boone, Jeremy D. Ezell, and L. Allison Jones-Farmer. “Data quality for data science, predictive analytics, and big data in supply chain management: An introduction to the problem and suggestions for research and applications.” International Journal of Production Economics 154 (2014): 72-80.

Hofmann, Markus, and Ralf Klinkenberg, eds. RapidMiner: Data Mining Use Cases and Business Analytics Applications. Boca Raton, CRC Press, 2013.

Jagadish, Hosagrahar V., Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. “Big data and its technical challenges.” Communications of the ACM 57, no. 7 (2014): 86-94.

Kim, Ryan Yong, and Venkata Subba Rao Pathuri. “Setup of multiple IOT devices.” U.S. Patent 9,210,192, issued December 8, 2015.

Labrinidis, Alexandros, and Hosagrahar V. Jagadish. “Challenges and opportunities with big data.” Proceedings of the VLDB Endowment 5, no. 12 (2012): 2032-2033.

Lakshman, Avinash, and Prashant Malik. “Cassandra: A decentralized structured storage system.” ACM SIGOPS Operating Systems Review 44, no. 2 (2010): 35-40.

Lee, In, and Kyoochun Lee. “The Internet of Things (IoT): Applications, investments, and challenges for enterprises.” Business Horizons 58, no. 4 (2015): 431-440.

Mahmoud, Rwan, Tasneem Yousuf, Fadi Aloul, and Imran Zualkernan. “Internet of Things (IoT) security: Current status, challenges and prospective measures.” In 2015 10th International Conference for Internet Technology and Secured Transactions (ICITST), pp. 336-341. IEEE, 2015.

Marx, Vivien. “Biology: The big challenges of big data.” Nature 498 (2013): 255-260.

Mikalef, Patrick, Ilias O. Pappas, John Krogstie, and Michail Giannakos. “Big data analytics capabilities: A systematic literature review and research agenda.” Information Systems and e-Business Management 16, no. 3 (2018): 547-578.

Rialti, Riccardo, Giacomo Marzi, Cristiano Ciappei, and Donatella Busso. “Big data and dynamic capabilities: A bibliometric analysis and systematic literature review.” Management Decision 57, no. 8 (2019): 2052-2068.

Russom, Philip. “Big data analytics.” TDWI Best Practices Report, Fourth Quarter 19, no. 4 (2011): 1-34.

Stankovic, John A. “Research directions for the Internet of Things.” IEEE Internet of Things Journal 1, no. 1 (2014): 3-9.

Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S. and Saha, B. “Apache Hadoop YARN: Yet another resource negotiator.” In Proceedings of the 4th Annual Symposium on Cloud Computing, p. 5. ACM, Santa Clara, California, 2013.

Waller, Matthew A., and Stanley E. Fawcett. “Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management.” Journal of Business Logistics 34, no. 2 (2013): 77-84.

Zhou, Jun, Zhenfu Cao, Xiaolei Dong, and Athanasios V. Vasilakos. “Security and privacy for cloud-based IoT: Challenges.” IEEE Communications Magazine 55, no. 1 (2017): 26-33.

Zikopoulos, Paul, and Chris Eaton. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill Osborne Media, Emeryville, 2011.
