A big data operation management system is a virtualized ecosystem of cloud and operational applications that helps to identify the growing concerns across IT systems.

Big data operation management = clustering analysis of big data + collaborative filtering of big data

For example, a recommendation system can help a user choose an item to purchase, subscribe to, or invest in, or support any other venture for which the user may need suggestions. Such systems may be personalized so that recommendations are based on data specific to the user. Collaborative filtering is a well-established way of building a reliable recommender system. It is defined as a technique that filters the information and patterns sought by the user by combining multiple data sets, such as viewpoints, multiple agents, and pre-existing data about users’ behavior stored in matrices. Collaborative filtering is especially valuable when a huge data set is present. Its methods are used to create recommender systems in a wide variety of fields with large volumes of data in varied formats, including: sensing and monitoring of data in battlefields, lines of control, and mineral exploration; financial data of institutions that provide financial services, such as banks and stock markets; sensing of large geographical areas, where data is received from all kinds of sensors and activities; and e-commerce and other websites, where the focus is on recommending products to users to increase sales.

A newer and somewhat narrower definition of collaborative filtering states that it is a way of automating predictions about the preferences and dislikes of a user (the filtering) by collecting data from as many users as possible (the collaborating), hence the name collaborative filtering. The underlying assumption of the approach is that if a person A has the same opinion as a person B on one issue, A is more likely to share B’s opinion on a related but different issue than the opinion of a randomly chosen person. It is noteworthy that such predictions are specific to the user, even though they are formed using data from many users. Personal information about the user, such as age, gender, and location, is generally not used in collaborative filtering (CF); instead, a partially observed matrix of ratings is used. The rating matrix may be binary or ordinal. In the binary matrix, each row corresponds to a user (identified by name or ID) and the columns hold that user’s ratings in the form of likes or dislikes. The ordinal matrix contains ratings on a graded scale of responses from the user, such as excellent, very good, good, average, or poor, or simply a number of stars out of five or ten, a system that is widely used today. A website’s server can easily gather the rating matrix implicitly, for example by using click stream logging: clicks on links to pages of goods or services can be treated as positive reviews by the user. While rating matrices can prove very handy, one major drawback is that they are extremely sparse, which makes it very difficult to group similar users together into classes. This is because no user reviews each and every product. Collaborative filtering thus consists of storing this sparse data and analyzing it to create a recommendation system.
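The idea of predicting one user's rating from the ratings of similar users can be sketched in a few lines of Python. This is only a minimal illustration of user-based collaborative filtering, not the method of any particular system: the `ratings` data, the user and item names, and the helper functions `cosine_similarity` and `predict_rating` are all hypothetical, and cosine similarity over co-rated items is just one of several possible similarity measures.

```python
from math import sqrt

# Hypothetical partially observed rating matrix: user -> {item: stars out of 5}.
# A missing entry means the user has not rated that item (the matrix is sparse).
ratings = {
    "A": {"book": 5, "film": 4, "game": 1},
    "B": {"book": 5, "film": 5, "game": 1, "album": 4},
    "C": {"book": 1, "film": 2, "game": 5, "album": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity computed only over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`.

    Returns None when no other user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


# User A has not rated "album"; estimate it from users B and C.
prediction = predict_rating(ratings, "A", "album")
print(prediction)
```

In this toy data, A's ratings resemble B's far more than C's, so B's rating of the album carries more weight in the prediction. With real, very sparse matrices, many user pairs share few or no rated items, which is exactly the difficulty described above.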

Cluster analysis, or clustering, is the exercise of taking a set of objects and dividing them into groups in such a way that the objects in the same group are more similar to each other, according to a certain set of parameters, than to those in other groups. These groups are known as clusters. Cluster analysis is one of the main tasks in the field of data mining and is a commonly used technique for the statistical analysis of data. Cluster analysis does not refer to a specific algorithm but to a task to be carried out on a given data set, and various algorithms can be used for it. These algorithms fall into several categories and differ significantly in their notion of what constitutes a cluster and how clusters are identified. The most popular notions on which clusters are defined and identified include groups with small distances among their members, dense areas of the data space, and intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. A clustering algorithm is configured through parameter settings such as the distance function, a density threshold, or the number of clusters expected to be formed. An apt clustering algorithm is chosen based on the available data set and the intended use of the results. For example, a search engine continually has to group similar objects into distinct clusters and keep dissimilar objects out of those clusters. Clustering algorithms are thus an essential constituent of a well-performing search engine. Clustering is used to serve data to the users of the search engine as they post queries on various topics. The engine returns results according to the similar objects in the particular user’s cluster, using previously gathered data about the preferences of similar users.
The better the clustering algorithm performs, the more likely users are to find what they are looking for on the very first page of results, without spending time looking through further pages. The definition by which the algorithm forms clusters and assigns objects to them must therefore be precise to give the best results. The better this performance is, the more users the search engine attracts (see Figure 6).
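To make the role of the parameter settings concrete, the following is a minimal sketch of one distance-based clustering algorithm, k-means, written in plain Python. It is an illustrative example only: the text does not prescribe a particular algorithm, and the sample `points`, the starting centroids, and the helper names `squared_distance`, `mean_point`, and `k_means` are assumptions made for this sketch. Here the distance function is squared Euclidean distance, and the number of expected clusters is fixed by the number of initial centroids.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

A density-based algorithm would instead take a density threshold as its main parameter, which is why the choice of algorithm depends on the data set and on how the results will be used.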

A big data operation system focuses on the collection of the following data (Harzog, 2015):

  1. Real-time data collection
  2. Real-time event processing
  3. Comprehensive data collection
  4. Deterministic data collection