ANALYTICS AND STORAGE ENGINES ON MASSIVELY PARALLEL PLATFORMS FOR HIGH-VOLUME PROCESSING
Big data usually arrives as a data tsunami that can easily overwhelm a traditional analytics platform designed to ingest, analyze, and report on typical customer and product data from structured internal sources. To meet the volume challenge, we must understand the size of the data streams, the level of processing required, and the related storage issues. The entire analytics environment must be able to deal with this data tsunami and should be prepared to scale up as the data streams grow.
The use of massively parallel processing (MPP) for tackling data scalability is showing up everywhere. In each case, the underlying principle is the distribution of workload across many processors, along with the storage and transportation of the underlying data across a set of parallel storage units and streams. In each case, manipulating the parallel platform requires a programming environment and an architecture, which may or may not be transparent to the applications.
Let me offer a metaphor to introduce MPP. During Thanksgiving, my son was in town, and we decided to cook together. We were in a hurry as we wrote down the grocery list. When my wife, my son, and I went to the grocery store to purchase the Thanksgiving dinner items, we decided to distribute the task, each person covering a couple of aisles to collect part of the groceries. We agreed to meet back at the checkout counter, where the first one to return would get in line. By working in parallel, we cut our overall time down by a factor of three. We zigzagged through the aisles and sometimes overlapped our paths, but we were still far faster than one person going through the entire store with the grocery list. As we split the task, we divided the grocery list by zones: one person took care of the vegetables, the second took care of the frozen items, and the third dealt with the wines. What I am demonstrating here is the basic principle of MPP: there is more than one agent working in parallel, we used zones to divide the task, and we combined our work at the checkout counter. Now, imagine the same work with tens, maybe hundreds, of us working together in the same fashion, as in the sketch below.
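For readers who prefer code to metaphor, here is a toy rendering of the same scatter-gather pattern in Python. Everything in it (the list, the zones, the function names) is invented for illustration; it shows only the shape of partition, parallel work, and combination.

    # A toy rendering of the grocery-store metaphor: the list is
    # partitioned into zones, each "shopper" works a zone in parallel,
    # and the results are combined at the checkout. All names here are
    # invented for illustration.
    from multiprocessing import Pool

    GROCERY_LIST = {
        "vegetables": ["carrots", "potatoes", "green beans"],
        "frozen":     ["pie crust", "ice cream"],
        "wines":      ["riesling", "pinot noir"],
    }

    def shop_zone(zone_and_items):
        zone, items = zone_and_items
        # Each shopper walks only his or her own aisles.
        return [(zone, item) for item in items]

    if __name__ == "__main__":
        with Pool(processes=3) as shoppers:        # three of us in parallel
            carts = shoppers.map(shop_zone, GROCERY_LIST.items())
        checkout = [item for cart in carts for item in cart]  # combine
        print(checkout)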
MPP is being applied to a number of areas. In the last section, I discussed stream computing. A data engineer may set a large number of processors in parallel to count, filter, or score streams of data; the parallel operation enables a much larger throughput. Once the data is directed to a data storage device, the device may use MPP to write the data in parallel and to build its transformations using a set of parallel processes. If the data requires sophisticated predictive modeling, the statistical engine may convert the data crunching into a set of queries performed inside the data storage device using a set of parallel processes. For true big data performance, I may design an architecture in which each element in the data flow runs in an MPP environment. Each element may choose a different strategy for implementing MPP and yet contribute to an overall integrated MPP architecture.
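To make the in-database idea concrete, here is a minimal sketch of scoring pushed down as a query, with SQLite standing in for an MPP warehouse. The table, columns, and model coefficients are all assumptions invented for illustration; in a real MPP engine, each node would score its own rows in parallel.

    # In-database scoring: instead of pulling rows out to a modeling tool,
    # the scoring formula is pushed down as SQL and executed where the
    # data lives. SQLite stands in for an MPP warehouse here; the table,
    # columns, and coefficients are invented for illustration.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, tenure REAL, usage REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, 24.0, 310.5), (2, 3.0, 80.0), (3, 60.0, 120.0)])

    # A fitted score, 0.02*tenure + 0.001*usage, expressed as SQL so the
    # engine (each node, in a real warehouse) scores its own rows locally.
    rows = conn.execute(
        "SELECT id, 0.02 * tenure + 0.001 * usage AS churn_score "
        "FROM customers ORDER BY churn_score DESC"
    ).fetchall()
    print(rows)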
Let me start with the platform for large-scale data integration. Any environment facing massive data volumes should seriously consider the advantages of MPP computing as a means to host its data integration infrastructure. MPP technologies for data integration increasingly provide ease of setup and use, linear scalability to thousands of nodes, fully dynamic node/pod load balancing and execution, automatic high availability/disaster recovery (HA/DR), and performance comparable to traditional symmetric multiprocessing (SMP) shared-memory server configurations at much lower price points.
In stream computing implementations, continuous applications are composed of individual operators, which interconnect and operate on one or more data streams. Data streams normally come from outside the system or can be produced internally as part of an application. Operators may filter, classify, transform, correlate, and/or fuse the data to make decisions using business rules. Depending on the need, the streams can be subdivided and processed by a large number of nodes, thereby reducing latency and increasing processing volumes.
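Here is a minimal sketch of the operator idea, assuming a toy stream of sensor readings (the data, fields, and thresholds are invented). Each operator consumes one stream and produces another, and the stream is subdivided by key so several workers can run the same operator chain on their own partition.

    # Operators in stream computing: each consumes one stream and produces
    # another, and the stream is subdivided so several workers run the same
    # operator chain on their own partition. Readings and thresholds are
    # invented for illustration.
    from multiprocessing import Pool

    def filter_op(stream):                     # drop out-of-range readings
        return [r for r in stream if 0 <= r["value"] <= 100]

    def transform_op(stream):                  # annotate each reading
        return [dict(r, alert=r["value"] > 90) for r in stream]

    def operator_chain(substream):             # one node's share of the stream
        return transform_op(filter_op(substream))

    if __name__ == "__main__":
        stream = [{"sensor": i % 4, "value": v}
                  for i, v in enumerate([12, 95, -3, 47, 101, 91, 60, 88])]
        # Subdivide the stream by sensor so each worker gets one partition.
        partitions = [[r for r in stream if r["sensor"] == k] for k in range(4)]
        with Pool(processes=4) as nodes:
            results = nodes.map(operator_chain, partitions)
        print([r for part in results for r in part])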
An MPP data warehouse (in the analytics engine) can also run advanced queries so that all the predictive modeling and visualization functions in the engine can be performed. The stored data is typically too large to ship to external tools for predictive modeling or visualization. Instead, the engine performs these functions based on commands given by the predictive modeling and visualization tools. These commands are typically translated into native functions (for example, structured query language [SQL] commands), which are executed in a specialized MPP hardware environment to deal with high-volume data. Analytics engines carry typical functions for ELT (the organization of ingested data using transformations), the execution of predictive models and reports, and any other data-crunching jobs (for example, geospatial analysis).
The data storage architecture can be built as a two-tiered system designed to handle very large queries from multiple users. The first tier is a high-performance symmetric multiprocessing host. The host compiles queries received from business intelligence applications and generates query execution plans. It then divides a query into a sequence of subtasks that can be executed in parallel and distributes the subtasks to the second tier for execution. The host returns the final results to the requesting application, thus providing the programming advantages of, and appearing to be, a traditional database server. The second tier consists of dozens to hundreds or even thousands of processing units operating in parallel. Each processing unit is an intelligent query-processing and storage node and consists of a powerful commodity processor, dedicated memory, a disk drive, and a field-programmable disk controller with hard-wired logic to manage data flows and process queries at the disk level.
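The host-and-nodes division of labor can be sketched in a few lines of Python. This is only a toy scatter-gather, assuming invented data and an invented query (an average over one region); a real second tier would also push the predicate into hardware at the disk level.

    # A sketch of the two-tier pattern: an SMP "host" splits an aggregate
    # query into per-partition subtasks, the second-tier "nodes" each scan
    # their own slice in parallel, and the host merges the partial results.
    # The data and the query are invented for illustration.
    from multiprocessing import Pool

    def node_subtask(partition):
        # Each processing unit scans only its local rows.
        total, count = 0.0, 0
        for row in partition:
            if row["region"] == "west":        # predicate evaluated at the node
                total += row["sales"]
                count += 1
        return total, count

    if __name__ == "__main__":
        import random
        random.seed(0)
        rows = [{"region": random.choice(["east", "west"]),
                 "sales": random.uniform(10, 500)} for _ in range(100_000)]
        partitions = [rows[i::8] for i in range(8)]   # eight "nodes"
        with Pool(processes=8) as tier2:
            partials = tier2.map(node_subtask, partitions)
        total = sum(t for t, _ in partials)           # host merges partials
        count = sum(c for _, c in partials)
        print(f"AVG(sales) WHERE region = 'west' is {total / count:.2f}")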
Hadoop owes its genesis to the search engines: Google and Yahoo! required massive search capabilities across the Internet, and they addressed the challenge by searching in parallel across data stored on a large number of storage devices. Hadoop offers the Hadoop Distributed File System (HDFS) for setting up a replicated data storage environment and MapReduce, a programming model that abstracts the problem away from disk reads and writes and transforms it into a computation over sets of keys and values. Thanks to its open-source availability, Hadoop has rapidly gained popularity.
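A single-machine rendering of the MapReduce model shows the shape of the computation: map emits (key, value) pairs, a shuffle groups the values by key, and reduce folds each group. The word-count example below is the classic illustration; on a real cluster, the map and reduce tasks run in parallel over HDFS blocks rather than in one process.

    # A minimal, single-process rendering of the MapReduce model: map emits
    # (key, value) pairs, shuffle groups values by key, reduce folds each
    # group. The sample lines are invented for illustration.
    from collections import defaultdict

    def map_phase(line):
        return [(word.lower(), 1) for word in line.split()]

    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return key, sum(values)

    lines = ["Hadoop owes its genesis to search",
             "search in parallel across many devices"]
    pairs = [pair for line in lines for pair in map_phase(line)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts["search"])   # -> 2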
When dealing with high volumes and velocity, we cannot afford any bottlenecks. All the processes, from data ingestion through data storage to analytics and its use, must meet the velocity and volume requirements. Some of these systems are designed to be massively parallel and do not require configuration or programming to enable massively parallel activities. In other cases, such as Hadoop, the parallel processing requires programming with special tools that exploit the parallel nature of the underlying environment (in this case, HDFS). The Hadoop development environment includes Oozie, an open-source workflow/coordination service to manage data processing jobs; HBase for random, real-time read/write access to big data; Apache Pig for analyzing large data sets; Apache Lucene for search; and Jaql for querying with JavaScript® Object Notation (JSON). Each component leverages Hadoop's MapReduce for parallelism; however, this raises the skill level required for building applications. To make the environment more user-friendly, big data vendors are introducing a series of tools, such as BigSheets from IBM, which help visualize unstructured data.