Steps of Big Data Processing

The various processes involved in data analytics, from data collection to visualization, are described phase by phase below (Figure 3.3):

Data Collection

The collection process depends on the type of data to be collected and on the type of data source; social media sites, for example, usually produce unstructured data. Different tools are available for collecting data from various sources, and some of the common ones are as follows:

  • 1. Semantria
  • 2. Opinion Crawl
  • 3. OpenText
  • 4. Trackur

Semantria is used for text and sentiment analysis. It is an NLP-based analysis engine (Natural Language Processing enables a computer system to understand human language as it is spoken or written, using techniques such as tokenization, parsing, part-of-speech tagging, and identification of sentiments or semantic relationships) that can be deployed on the web, in the cloud, or through an API (an Application Programming Interface is a set of programming code that enables data transmission between one software product and another). It is a proprietary tool. Opinion Crawl is used for sentiment analysis with a SenseBot-based analysis engine; it can be deployed on the web only and is a readily available open-source tool. The OpenText tool is used for content management and analysis and uses the Red Dot and Captive analysis engines; it runs as a Windows-based server application and is an enterprise-level, not open-source, product.

FIGURE 3.3 Various steps for data processing.

Trackur is used for influence and sentiment analysis and uses Trackur as its analysis engine. It can be used for web- or social media-based applications and is a proprietary tool (Kornal, 2018).
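To make the collection step more concrete, the short sketch below shows the kind of NLP processing these engines perform (tokenization, part-of-speech tagging, and sentiment scoring), using the open-source NLTK library rather than any of the commercial tools above; the sample sentence and resource downloads are illustrative only.

```python
# Minimal sketch of NLP-based text analysis (tokenization, POS tagging,
# sentiment scoring) using the open-source NLTK library; the commercial
# tools above expose similar capabilities through their own interfaces.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("vader_lexicon", quiet=True)

text = "The new phone camera is brilliant, but the battery life is disappointing."

tokens = nltk.word_tokenize(text)                               # tokenization
pos_tags = nltk.pos_tag(tokens)                                 # part-of-speech tagging
sentiment = SentimentIntensityAnalyzer().polarity_scores(text)  # sentiment scores

print(pos_tags)
print(sentiment)   # e.g. {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
```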

Data Storage and Management

Storing such large volumes of data is also a challenging task; some of the traditional methods of data storage are relational databases, data marts, and data warehouses. The data are uploaded to the storage from operational data stores using Extract, Transform, Load (ETL) or Extract, Load, Transform tools, which extract the data from outside sources, transform the data to fit operational needs, and finally load the data into the database or data warehouse. Thus, any data that are stored are first cleaned, transformed, and cataloged before they are made available for mining or analytical purposes. Some of the common data storage and management tools available are as follows:

  • 1. Apache HBase (Hadoop database)
  • 2. CouchDB
  • 3. MongoDB
  • 4. Apache Cassandra
  • 5. Apache Ignite
  • 6. Oracle NoSQL Database

Apache HBase (Hadoop database) uses a column-oriented data model that provides zero downtime during node failure and thus good redundancy; it handles concurrency by means of optimistic concurrency control. CouchDB uses a document-oriented data model that also provides optimistic concurrency control as well as secondary indexes. MongoDB is likewise a document-oriented data model offering nearly the same features as CouchDB. Apache Cassandra uses a column-oriented data model that provides zero downtime on node failure, giving the system good redundancy, and it also supports concurrency. Apache Ignite is a multi-model database that provides nearly all of these features (zero downtime on node failure, concurrency, and secondary indexes) and is therefore widely used. Oracle NoSQL Database uses a key-value data model that provides concurrency and secondary indexes.
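As a concrete illustration of the document-oriented model described above, the following minimal sketch stores a record in MongoDB and builds a secondary index. It assumes a locally running MongoDB instance and the pymongo driver; the database, collection, and field names are made up for the example.

```python
# Sketch of document-oriented storage with MongoDB via pymongo.
# Assumes MongoDB is running on localhost:27017; all names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["social_media"]     # hypothetical database name
posts = db["posts"]             # hypothetical collection name

# Load one transformed record (the "load" step of ETL) as a JSON-like document.
posts.insert_one({
    "user": "alice",
    "text": "Loving the new release!",
    "sentiment": 0.83,
    "timestamp": "2020-01-15T10:32:00Z",
})

# Secondary index on a non-key field, as mentioned for MongoDB/CouchDB.
posts.create_index("sentiment")

# Simple lookup using the indexed field.
for doc in posts.find({"sentiment": {"$gt": 0.5}}):
    print(doc["user"], doc["sentiment"])
```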

Data Filtering and Extraction

The data filtering and extraction process is used for creating structured data from the unstructured data collected in the previous steps. Various tools used for data filtering and extraction are as follows:

  • 1. Pentaho
  • 2. OctoParse
  • 3. ParseHub
  • 4. Mozenda
  • 5. Content Grabber

Pentaho is used to produce structured output from unstructured data. The tool has ETL and data-mining capabilities and is available in free and enterprise versions, depending on the number of functionalities in use. OctoParse produces structured output in the form of spreadsheets and has a web-scraping feature; it is also available in free and paid versions. The ParseHub tool can be used to prepare structured data as Excel, CSV (comma-separated values) files, and Google Sheets; it is a cloud-based desktop application. Mozenda is used for producing structured data as JSON (JavaScript Object Notation, a lightweight data-interchange format), XML, and CSV files, and it also offers web scraping. Content Grabber, in turn, is used for preparing structured data as CSV, XML, and databases, and adds web scraping with debugging and error handling.
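The sketch below illustrates, in a minimal way, what these extraction tools automate: scraping unstructured HTML and writing it out as a structured CSV file. It assumes the requests and BeautifulSoup libraries; the URL and the HTML structure (an article tag with h2 and p children) are hypothetical.

```python
# Sketch of web scraping unstructured HTML into a structured CSV file.
# The URL and the assumed HTML layout are placeholders for this example.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/reviews"        # hypothetical source page

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for article in soup.find_all("article"):
    title = article.find("h2")
    body = article.find("p")
    if title and body:
        rows.append([title.get_text(strip=True), body.get_text(strip=True)])

# Write the structured result to CSV, one of the output formats listed above.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "body"])
    writer.writerows(rows)
```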

Data Cleaning and Validation

The next step after data filtering and extraction is data cleaning and validation. Data cleaning reduces processing time and improves the computational speed of analytical tools and engines. This step is not mandatory. Some of the latest and most widely used cleaning tools are as follows:

  • 1. DataCleaner
  • 2. MapReduce
  • 3. RapidMiner
  • 4. OpenRefine
  • 5. Talend

DataCleaner is used for record and field processing, with additional features for data transformation, validation, and reporting; the tool integrates with the Hadoop database. MapReduce is a parallel data-processing model with additional features for searching, sorting, clustering, and translation; it is also part of the Hadoop ecosystem. RapidMiner provides a graphical user interface and a batch-processing model with additional features for filtering, aggregation, and merging, and it supports internal database integration. OpenRefine is a batch-processing model with additional features for transforming data from one form to another; it can be used with web services and external data. Talend is a streaming and batch-processing model with an additional data-integration feature and can be used with numerous databases.
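As a minimal illustration of what such cleaning tools do, the following sketch uses the pandas library to de-duplicate, validate, and normalize a small data set before analysis; the file and column names are illustrative and assume the output of the earlier extraction sketch.

```python
# Sketch of basic cleaning and validation with pandas: de-duplication,
# type validation, and dropping unusable records. Names are illustrative.
import pandas as pd

df = pd.read_csv("reviews.csv")            # output of the extraction step

df = df.drop_duplicates()                  # remove duplicate records

# Validate a numeric field: non-numeric values become NaN and are then dropped.
if "rating" in df.columns:
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

df = df.dropna()                           # discard incomplete rows

# Normalize a text field before it reaches the analytics engine.
df["title"] = df["title"].str.strip().str.lower()

df.to_csv("reviews_clean.csv", index=False)
```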

Data Analytics

Collecting large amounts of data is not, by itself, enough for making efficient decisions at the right time; there is also a need for faster and more efficient methods of analyzing such data. Since traditional methods are not efficient enough for analyzing data at this scale, developing new tools and techniques for big data analytics, along with advanced architectures for storing and managing such data, is the need of the hour. Elgendy et al. (2016) proposed a big data analytics and decision framework, which integrates big data analytics tools with a decision-making process to provide efficient results in an optimized time period. In recent years, a large number of tools have been developed that provide many additional capabilities apart from analytics. Some of these tools are as follows:

  • 1. Hive
  • 2. Apache Spark
  • 3. Apache Storm
  • 4. MapReduce
  • 5. Qubole
  • 6. Flink

Hive is a batch-processing model that supports Structured Query Language (SQL) and has high latency. Apache Spark is a mini/micro-batch streaming model that uses the Scala, Java, and Python languages for operation and integration. Apache Storm is another Apache project; it is a record-at-a-time processing model that can use any language for integration and operation and provides better latency than Apache Spark. MapReduce is a parallel-processing model that uses languages such as Java, Ruby, Python, and C++ for its operation. Qubole is a stream-processing and ad-hoc query-based processing model that supports languages such as Python, Scala, R, and Go. Flink is a batch- and stream-processing model that supports Scala, Java, and Python for its operation.
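As an illustration of one of these engines, the following minimal sketch runs a simple aggregation with Apache Spark through its Python interface (PySpark). It assumes pyspark is installed; the input file and column names are illustrative and follow on from the cleaning sketch above.

```python
# Sketch of a simple analytics job with Apache Spark (PySpark): read the
# cleaned data and aggregate it. File and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ReviewAnalytics").getOrCreate()

df = spark.read.csv("reviews_clean.csv", header=True, inferSchema=True)

# Count how often each title appears: the kind of batch/micro-batch
# aggregation that Spark parallelizes across the cluster.
summary = (
    df.groupBy("title")
      .agg(F.count("*").alias("mentions"))
      .orderBy(F.desc("mentions"))
)

summary.show(10)
spark.stop()
```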

Data Visualization

Once data analytics has been performed, one of the last operations is visualizing the analyzed data in a readable format. The data visualization process presents the data in a form that can be read and interpreted easily. Various tools are used for data visualization, but most of them are integrated versions of data extraction, analysis, and visualization. Some of the common tools used for data visualization are as follows:

  • 1. Datawrapper
  • 2. Tableau
  • 3. Orange
  • 4. Qlik"
  • 5. Google Fusion Tables
  • 6. CartoDB
  • 7. Chartio
  • 8. Gephi

Datawrapper is an open-source tool compatible with CSV, PDF, Excel, and CMS data sources. It has ready-to-use code that produces output in the form of bar charts, line charts, maps, and graphs. Tableau is also an open-source tool; it is compatible with databases and APIs and produces output in the form of maps, bar charts, and scatter plots. Orange is an open-source tool compatible with Apache Cassandra files, SQL tables, and data tables, or it can paint random data with no programming required, and it produces output as scatter plots, bar charts, trees, dendrograms, networks, and heat maps. Qlik is licensed software compatible with databases, spreadsheets, and websites, producing its output as dashboards and apps. Google Fusion Tables is a Google web service that supports comma-separated value file formats and produces output in the form of pie charts, bar charts, line plots, scatter plots, and timelines. CartoDB is an open-source tool compatible with location data and many other data types; it uses the CartoCSS language and produces results in the form of maps. Chartio is again an open-source tool that accepts multiple data sources, uses its own visual query language, and produces output as line, bar, and pie charts, with dashboards shared as PDF reports. Gephi is an open-source tool that uses CSV, GraphML, GML, and GDF spreadsheets as data sources and produces output as graphs and networks.
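As a minimal illustration of this final step, the sketch below draws a simple bar chart with the open-source matplotlib library rather than any of the tools listed above; the sentiment categories and counts are made-up values.

```python
# Sketch of a basic visualization step using matplotlib: a bar chart of the
# kind these tools generate interactively. The data values are illustrative.
import matplotlib.pyplot as plt

categories = ["positive", "neutral", "negative"]   # hypothetical sentiment buckets
counts = [120, 45, 30]                             # made-up counts

plt.bar(categories, counts)
plt.title("Sentiment distribution of collected posts")
plt.xlabel("Sentiment")
plt.ylabel("Number of posts")
plt.savefig("sentiment_distribution.png", dpi=150)
```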

 