INTRODUCTION TO BIG DATA
Since the birth of computers, improving computing performance and processing speed has been an omnipresent topic, and since the advent of big data it has gained even more voice. Because of the limitations of physical devices, parallel computing was conceptualized and brought to life; it has gained considerable fame since its advent and has now become a comprehensive field of computer science. The discipline includes concurrent hardware design and concurrent software design. From the standpoint of concurrent hardware design, computer architectures can be divided into single instruction stream, single data stream (SISD); single instruction stream, multiple data streams (SIMD); multiple instruction streams, single data stream (MISD); and multiple instruction streams, multiple data streams (MIMD). Concurrent software design, on the other hand, covers parallel algorithm design, high-speed network cluster frameworks, and high-efficiency parallel computation models. Concurrent hardware design excels at numerical analysis and floating-point precision, but it hits a technical bottleneck in the case of big data, especially when the data is unstructured, whereas concurrent software design, which includes parallel computing models and cluster system design, has been very successful in making up for these defects.

Figure 1. Research roadmap
For example, a cluster system is built from workstations or PCs linked via a high-speed network with a suitably optimized structure and a unified dispatching, visual human-computer interface, which makes it an efficient parallel processing system. Such cluster systems have good scalability, and the system development cycle is quite short. With a good programming model, the system can be very successful in coping with big data.
Hadoop is one of the most popular cluster systems; it includes the high-performance computation model MapReduce and, in later releases, integrates with engines such as Spark. Hadoop is an excellent and robust analytics platform for big data that can process huge data sets at high speed while providing scalability. It can manage all aspects of big data, such as volume, velocity, and variety, by storing and processing the data over a cluster of nodes whose number can run into the thousands. The best thing about it is that the nodes in the cluster need not possess complex and costly hardware but simply commodity hardware, which is neither costly nor difficult to procure. Hadoop has a well-defined architecture consisting of two core components: MapReduce and the Hadoop Distributed File System (HDFS).
MapReduce is a programming paradigm that runs parallel processing tasks on the nodes of a cluster simultaneously and is able to both process and generate very large data sets with ease. It takes its input and produces its output in the form of key-value pairs. MapReduce achieves all this by simply dividing each problem into two broad phases, the map phase and the reduce phase. The map phase processes each record sequentially and independently on each node and generates intermediate key-value pairs. The reduce phase takes the output of the map phase, processes and merges all the intermediate values, and gives the final output, again in the form of key-value pairs. The output is sorted after each phase, providing the user with the aggregated output from all nodes in an orderly fashion. This can easily be used for clustering large data sets. A MapReduce program has to be written for each query, and the processing takes place on a cluster of interconnected computers whose number may run into the thousands; a complete word-count example is given below.
The Hadoop Distributed File System (HDFS) is a highly scalable, distributed file system for Hadoop. HDFS is able to store files big in size across multiple nodes in a cluster. It is reliable because it replicates the data across the nodes, and hence theoretically does not require Redundant Array of Independent Disks (RAID) storage. The default replication factor is 3; that is, the data is stored on three nodes, two on the same rack and one on a different rack. Jobs are handled by two different services, the JobTracker (master) and the TaskTracker (slave). The JobTracker schedules MapReduce jobs to the TaskTrackers, which know the data location. The intercommunication among tasks and nodes is done via periodic messages called heartbeats, which reduces the amount of traffic flowing through the network. Hadoop can also be used with other file systems, but then these advantages are not available.
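As a brief illustration, the following is a minimal sketch of writing a file to HDFS through the Java client API; the file path and contents are hypothetical, and the replication factor is set explicitly only to mirror the default described above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor; 3 is already the HDFS default
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical path, used for illustration only
        Path file = new Path("/data/sample.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            // Each block of this file is replicated to 3 nodes
            out.writeBytes("hello hdfs\n");
        }
    }
}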
With the introduction of YARN (Yet Another Resource Negotiator) in later releases, Hadoop was integrated with a number of excellent projects that can be used for storing, processing, and analyzing data far more efficiently, thus aiding the exploration of data for undiscovered facts at a smooth pace. Hadoop is an open-source project written in Java, currently maintained under the Apache Software Foundation, and free to use. The initial design was mainly intended for deployment on cheap hardware. The framework implements a distributed file system, referred to as HDFS, which has high fault tolerance, high speed, and the scalability to store large amounts of data, and it implements a computation model for the parallel processing of large data sets; this model delivers high-speed computing in the big data processing field. In addition to these advantages, the distributed system framework is open-source software. With the advent of big data, it has become a good and reliable solution that can store and process big data with ease. Today it is a famous processing framework in the field of big data analytics, and the Hadoop ecosystem, with continuous improvement and optimization, keeps getting better. With the advent of the second generation, many more projects have been introduced into the Hadoop ecosystem.
After collecting the data, it has to be processed so that meaningful information can be extracted from it to serve a decision support system; the analysts therefore need a good technique for doing so. One way of achieving this is MapReduce, which permits the filtering and aggregation of data stored in HDFS so as to gain knowledge from the data. However, writing MapReduce requires basic knowledge of Java along with sound programming skills. Assuming one does possess these skills, even after writing the code, which is itself a labor-intensive task, additional time is required for code review and quality assessment. But now analysts have the additional options of using Pig Latin or HiveQL. These are the respective scripting languages used to construct MapReduce programs for two Apache projects that run on Hadoop: Pig and Hive. The benefit of using them is that far fewer lines of code need to be written, which reduces overall development and testing time. Such scripts take only about 5% of the time needed to write the equivalent MapReduce programs. Although Pig and Hive scripts are about 50% slower in execution than MapReduce programs, they are still very effective in increasing the productivity of data engineers and analysts by saving a great deal of time. Pig is a Hadoop project that was initially developed by Yahoo! and is now open source and free to use under Apache; its scripting language is called Pig Latin. A Pig Latin script consists of data-flow instructions that the framework converts into MapReduce instructions used to process the data. Hive is a Hadoop project that was initially developed by Facebook and is now an open-source project under Apache. It has its own UI called Beeswax, and its scripting language, HiveQL, is a declarative dialect very similar to SQL; a HiveQL script is likewise converted into MapReduce instructions before the data is processed. Hive helps in querying and managing large data sets. Moreover, to improve the performance of the distributed system framework, many companies and supporters provide first-class components and high-performance code for Hadoop, such as YARN, HCatalog, Oozie, and Cassandra, which make the performance of Hadoop ever stronger and widen its field of application.
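As an illustration of how concise such a script is compared with a hand-written MapReduce program, the following is a minimal sketch of submitting a HiveQL query from Java over JDBC. It assumes the Apache Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath and a HiveServer2 instance is reachable; the connection URL and the sales table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (older versions need this explicitly)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 address; adjust host, port, and database
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // One line of declarative HiveQL replaces an entire MapReduce job
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, SUM(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}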
Example
Finding the count of each word in a text file through MapReduce programming
Input: text file
Output: the count of each word
Mapper phase: (see Figure 2)
Reducer phase: (see Figure 3)
Driver class: (see Figure 4)
Output: (see Figure 5)
The R language, as a data analysis and visualization tool, is widely used for statistical analysis, plotting, data mining, machine learning, and visual analytics. In particular, R has a variety of built-in statistical and numerical analysis functions as well as plotting functions. R is open-source software, and very useful third-party packages written by developers in its worldwide community can be downloaded and used for free. Its field of application is therefore wide, ranging from statistical analysis to applied mathematics, econometrics, financial analysis, the human sciences, data mining, artificial intelligence, bioinformatics, biomedicine, data visualization, and many more.
Figure 2. Mapper class
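The code in the figure is not reproduced in this text, so the following is a minimal reconstruction of a typical word-count Mapper using the standard org.apache.hadoop.mapreduce API; the class and variable names are assumptions.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the input line into tokens and emit (word, 1) for each token
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}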

Figure 3. Reducer class
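Again as a reconstruction rather than the original figure, a matching word-count Reducer might look as follows; it sums the intermediate values emitted by the mappers for each key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the per-word counts produced in the map phase
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Emit the final (word, total count) pair
        context.write(key, new IntWritable(sum));
    }
}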

Figure 4. Driver class
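The driver class wires the two phases together; the sketch below assumes the input and output paths are passed as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer also serves as a combiner for local pre-aggregation
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}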

Figure 5. Output
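The original output figure is likewise unavailable; for a hypothetical input file containing the single line "the quick brown fox jumps over the lazy dog the", the reconstructed program above would produce the following sorted word counts:

brown	1
dog	1
fox	1
jumps	1
lazy	1
over	1
quick	1
the	3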
