Big Data Technologies
Nowadays, a growing number of technologies are used to aggregate, manipulate, manage, and analyze Big Data. The most notable include NoSQL, Big Table, Apache Cassandra, Google File System (GFS), Apache Hadoop, MapReduce, and Mashup (Table 3.2).
NoSQL
The acronym NoSQL comes from the words “non SQL,” although it is often read as “not only SQL.” NoSQL is a technology that enables the collection and use of unstructured data. Data in such databases are modeled differently than in the tabular form used in relational databases. Non-relational databases fit perfectly into the Big Data trend. Unlike classic databases, they allow quick analysis of unstructured data and the study of correlations between them. In a traditional database, the schema and relationships are imposed from above; we can use structured SQL queries to obtain structured answers within that framework. The latest trends show that it is worth collecting a variety of data, often unstructured, which may initially seem irrelevant but ultimately provide valuable business information. NoSQL enables users to load any data for later analysis without first preparing a data schema. These data can then be used for various analyses and to discover potential correlations.
Table 3.2 Examples of Big Data Technologies

Technology | Description
NoSQL | NoSQL is a technology that, unlike classic databases, allows quick analysis of unstructured data and the study of correlations between them. NoSQL facilitates the transfer of any data for later analysis without first preparing a data schema.
Big Table | Big Table is a compressed, high-performance, proprietary data storage system built on Google File System (GFS), Chubby Lock Service, SSTable (log-structured storage like LevelDB), and a few other Google technologies. The database was designed to be deployed on clustered systems and uses a simple data model that Google has described as "a sparse, distributed, persistent multidimensional sorted map." Data are assembled in order by row key, and indexing of the map is arranged according to row key, column keys, and timestamps. Different compression algorithms help achieve high capacity.
Apache Cassandra | Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. At the moment, Apache Cassandra is the most efficient NoSQL database of the "wide-row" class, while maintaining full scalability on commodity hardware of any class.
Google File System | GFS is a scalable distributed file system created by Google Inc. and developed to accommodate Google's expanding data processing requirements. GFS provides fault tolerance, reliability, scalability, availability, and performance to large networks and connected nodes. A GFS node cluster consists of a single master and multiple chunk servers that are continuously accessed by different client systems. Chunk servers store data as Linux files on local disks. Stored data are divided into large chunks which are replicated across the network a minimum of three times. The large chunk size reduces network overhead.
Apache Hadoop | An open-source software framework for processing huge data sets on certain kinds of problems on a distributed system, developed within the Apache™ Hadoop® project. It enables the processing of large, distributed data sets in computer clusters using simple programming models. The project includes various modules such as Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.
MapReduce | A software framework introduced by Google for processing huge data sets on certain kinds of problems on a distributed system. It allows parallel processing. The basic assumption of this model is to divide the problem into two main stages called mapping and reduction. The distributed file system used with MapReduce enables data to be processed at the place of storage.
Mashup | An application that uses and combines data presentation or functionality from two or more sources to create new services. The mashup approach allows users to build ad hoc applications by combining several different data sources and services from across the web.
However, the high data availability in NoSQL databases is obtained at the expense of data consistency; information gathered in different clusters may differ. MapReduce, Hadoop, Cassandra, and Hypertable are examples of platforms that provide mechanisms for ad hoc and on-time extraction, parsing, processing, indexing, and analytics in a scalable and distributed environment (Chen, Chiang, & Storey, 2012).
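As a brief illustration of schema-less storage, the sketch below uses a document-oriented NoSQL database (MongoDB) through the pymongo driver; the connection string, database, collection, and field names are hypothetical examples, not part of any particular system described above.

# A minimal sketch of schema-less NoSQL storage, assuming a local MongoDB
# instance and the pymongo driver; names below are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical local server
events = client["analytics"]["events"]              # database and collection created on first use

# Two documents with different shapes go into the same collection:
# no schema has to be declared up front.
events.insert_one({"user": "u1", "action": "click", "page": "/home"})
events.insert_one({"user": "u2", "sensor": {"temp_c": 21.5, "ts": "2023-01-01T12:00:00Z"}})

# Ad hoc querying over whatever structure happens to be present.
for doc in events.find({"action": "click"}):
    print(doc)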
Big Table
Big Table is a compressed, high-performance, proprietary data storage system built on GFS, Chubby Lock Service, SSTable (log-structured storage like LevelDB), and a few other Google technologies. Big Table development began in 2004, and it is now used by a number of Google applications such as web indexing, Google Maps, Google Book Search, Google Earth, Blogger.com, Google Code hosting, YouTube, and Gmail, as well as by MapReduce, which is often used for generating and modifying data stored in Big Table. Big Table was designed to support applications requiring massive scalability; from its first iteration, the technology was intended to be used with petabytes of data. The database was designed to be deployed on clustered systems and uses a simple data model that Google has described as “a sparse, distributed, persistent multidimensional sorted map.” Data are assembled in order by row key, and indexing of the map is arranged according to row key, column keys, and timestamps. Different compression algorithms help achieve high capacity.
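The “sorted map” model can be pictured with a small Python sketch; this is purely illustrative (not Google’s implementation), showing how values are addressed by row key, column key, and timestamp, and how rows are kept in sorted key order.

# Illustrative sketch of Big Table's data model as a three-level map:
# (row key, column key, timestamp) -> value. Plain Python, not Google's code.
table = {}

def put(row_key, column_key, timestamp, value):
    table.setdefault(row_key, {}).setdefault(column_key, {})[timestamp] = value

def get_latest(row_key, column_key):
    versions = table.get(row_key, {}).get(column_key, {})
    return versions[max(versions)] if versions else None

# Row keys are kept in lexicographic order, e.g. reversed URLs for web pages.
put("com.example.www/index.html", "contents:html", 1, "<html>v1</html>")
put("com.example.www/index.html", "contents:html", 2, "<html>v2</html>")

for row in sorted(table):                # scans return rows in key order
    print(row, get_latest(row, "contents:html"))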
Big Table has had a large impact on NoSQL database design. Google software developers publicly disclosed Big Table details in a technical paper presented at the USENIX Symposium on Operating Systems Design and Implementation in 2006. Google’s thorough description of Big Table’s inner workings has allowed other organizations and open-source development teams to create Big Table derivatives, including the Apache HBase database, which is built to run on top of the Hadoop Distributed File System (HDFS). Other examples include Cassandra, which originated at Facebook Inc., and Hypertable, an open-source technology that is marketed in a commercial version as an alternative to HBase (Google BigTable, retrieved from https://searchdatamanagement.techtarget.com/definition/Google-BigTable).
Apache Cassandra
Apache Cassandra is a highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It is a type of NoSQL database. Apache Cassandra was created in 2008 by Facebook engineers; the rapidly growing number of users showed that traditional relational database solutions were not able to provide adequate performance in data processing. Apache Cassandra is now the most efficient NoSQL database of the “wide-row” class, while maintaining full scalability on commodity hardware of any class.
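A minimal sketch of a “wide-row” Cassandra table is shown below, using the DataStax Python driver (cassandra-driver); the contact point, keyspace, table, and column names are hypothetical. One partition key plus a clustering column lets a single partition hold many ordered rows.

# Hedged sketch of a Cassandra wide-row table; names are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)      -- partition key + clustering column
    )
""")
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)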
Apart from Cassandra, there are other NoSQL databases as well. The most well-known are Apache HBase and MongoDB. Apache HBase is an open-source, non-relational, distributed database modeled after Google’s Big Table and written in Java. It is developed as part of the Apache Hadoop project and runs on top of HDFS, providing Big Table-like capabilities for Hadoop. MongoDB, in turn, is a cross-platform, document-oriented database system that avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster (a brief sketch of such document storage was shown earlier in the NoSQL subsection).
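For comparison, the sketch below accesses HBase through the happybase Python library, which talks to HBase’s Thrift gateway; the table name, column family, and row key are illustrative assumptions, not part of the sources cited above. The row-key plus “columnfamily:qualifier” addressing mirrors the Big Table model described earlier.

# Hedged sketch of HBase access via happybase; names are illustrative only.
import happybase

connection = happybase.Connection("localhost")   # assumes a local Thrift server
table = connection.table("web_pages")            # assumes the table already exists

# Row key + "columnfamily:qualifier" addressing, as in Big Table.
table.put(b"com.example.www/index.html", {b"contents:html": b"<html>...</html>"})
print(table.row(b"com.example.www/index.html"))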
Google File System
GFS is a scalable distributed file system created by Google Inc. and developed to accommodate Google’s expanding data processing requirements. GFS provides fault tolerance, reliability, scalability, availability, and performance to large networks and connected nodes (Ghemawat, Gobioff, & Leung, 2003). GFS grew out of an earlier Google effort, “BigFiles,” developed by Larry Page and Sergey Brin in the early days of Google. Files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely overwritten or shrunk. GFS shares many of the same goals as previous distributed file systems: performance, scalability, reliability, and availability.
GFS is made up of several storage systems built from low-cost commodity hardware components. It is optimized to accommodate Google’s different data use and storage needs, such as its search engine, which generates huge amounts of data that must be stored.
GFS capitalizes on the strength of off-the-shelf servers while minimizing hardware weaknesses. A GFS node cluster consists of a single master and multiple chunk servers that are continuously accessed by different client systems. Chunk servers store data as Linux files on local disks. Stored data are divided into large chunks which are replicated across the network a minimum of three times; the large chunk size reduces network overhead. GFS is designed to accommodate Google’s large cluster requirements without burdening applications. Files are stored in hierarchical directories identified by path names. Metadata such as namespace, access control data, and mapping information are controlled by the master, which interacts with and monitors the status of each chunk server through timed heartbeat messages (retrieved from https://www.techopedia.com/definition/26906/google-file-system-gfs).
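The chunking and replication idea can be sketched in a few lines of Python; the chunk size matches the 64 MB figure above, but the placement policy and server names are simplified illustrations, not the actual GFS design.

# Simplified illustration of GFS-style chunking and replication; not GFS code.
CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB chunks
REPLICAS = 3                            # each chunk stored at least three times

def split_into_chunks(path):
    # Split a file into fixed-size chunks, as a GFS client/master would.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk

def place_replicas(num_chunks, chunk_servers):
    # A (hypothetical) master records which chunk servers hold each chunk.
    placement = {}
    for i in range(num_chunks):
        placement[i] = [chunk_servers[(i + r) % len(chunk_servers)] for r in range(REPLICAS)]
    return placement

servers = ["cs-1", "cs-2", "cs-3", "cs-4"]
print(place_replicas(3, servers))   # e.g. chunk 0 -> ['cs-1', 'cs-2', 'cs-3']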
Apache Hadoop
Apache Hadoop is an open-source software framework for processing huge data sets on certain kinds of problems on a distributed system. The open-source software developed within the Apache™ Hadoop® project enables the processing of large, distributed data sets in computer clusters using simple programming models. It has been designed to scale from a single server to thousands of machines, each offering local computation and data storage. The Apache Hadoop software library is fault tolerant, designed to detect and handle failures in the application layer. The Apache Hadoop project includes various modules: (a) Hadoop Common, common utilities that support the other Hadoop modules; (b) HDFS, a distributed file system that provides high-throughput access to application data; (c) Hadoop YARN, a framework for job scheduling and cluster resource management; and (d) Hadoop MapReduce, a YARN-based system for parallel processing of large data sets.
MapReduce
MapReduce is a software framework introduced by Google for processing huge data sets on certain kinds of problems on a distributed system. It enables parallel processing. The basic assumption of this model is to divide the problem into two main stages, called mapping and reduction. The distributed file system used with MapReduce allows data to be processed at the place of storage. Thanks to this solution, there is no need to send data from the computers that store it to separate processing servers; instead of moving large amounts of data, only the MapReduce program, a few kilobytes in size, is sent to the nodes where the data reside. In this way, users save the time that would otherwise be lost transferring the data.
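The division into mapping and reduction can be illustrated with a classic word-count sketch in plain Python; this is a local, single-machine analogue of what a MapReduce framework distributes across many nodes, with the example documents and output made up for illustration.

# Local, single-process illustration of the two MapReduce stages; a real
# framework (e.g. Hadoop MapReduce) runs map and reduce tasks in parallel
# on the nodes where the data chunks are stored.
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Mapping: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduction: combine all values emitted for the same key.
    return (word, sum(counts))

documents = ["Big Data needs new tools", "big data tools scale out"]

intermediate = sorted(pair for doc in documents for pair in map_phase(doc))
results = [reduce_phase(word, (c for _, c in group))
           for word, group in groupby(intermediate, key=itemgetter(0))]
print(results)   # e.g. [('big', 2), ('data', 2), ...]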
Mashup
Mashup is an application that uses and combines data presentation or functionality from two or more sources to create new services. The mashup approach allows users to build ad hoc applications by combining several different data sources and services from across the web.
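A mashup can be sketched in a few lines of Python; the URLs and JSON fields below are hypothetical placeholders rather than real services, and simply show how data fetched from two web sources can be merged into a new combined view.

# Hedged sketch of a mashup: combine two web data sources into one new view.
# The URLs and JSON fields are hypothetical placeholders, not real services.
import json
from urllib.request import urlopen

def fetch_json(url):
    with urlopen(url) as response:
        return json.load(response)

# Hypothetical sources: a weather feed and an events feed for the same city.
weather = fetch_json("https://api.example.com/weather?city=Warsaw")
events = fetch_json("https://api.example.org/events?city=Warsaw")

# The mashup itself: a combined service built from both sources.
combined = [
    {"event": e["name"], "starts": e["start_time"], "forecast": weather["summary"]}
    for e in events["items"]
]
print(json.dumps(combined, indent=2))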