III: Further Considerations on Making Sense of Data

Unfocused Analytics (A Big Data Analysis) vs. Focused Analytics (Beginning with a Hypothesis)

How can the data be described in terms of Volume, Variety, Velocity, and Veracity? An open-ended exploration through big data techniques is appropriate when it is still not clear which hypothesis should be tested, or when there is no explicit knowledge about the data and it is therefore unclear where to start or what kind of data could be discarded. Working with big and complex data sets is now viable thanks to advances in computing: from a large amount of data, it is possible to uncover hidden patterns, extract correlations, and deliver many other insights that orient the conduct of the study.

Usually, traditional programming languages and data mining software are not prepared to handle and process very large data sets. Transferring and extracting these huge data sets can also be difficult and time-consuming. In the mining industry, data historians and other sources collect data at a very high frequency, generating terabytes of data. Tools and frameworks worth mentioning for this kind of problem include Apache Spark and Hadoop. Both are designed for elastic scalability, and all the data exploration and preparation steps can be conducted directly in the data storage. Usually, a data engineer, responsible for the architecture and environment, works together with data scientists and data analysts to develop a big data project. Data engineers tend to focus on software engineering, database design, and production code, ensuring that data flow from the source, where they are collected, to the destination, where they are processed by data scientists.
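The core idea that makes such frameworks scale, processing data in pieces rather than loading everything at once, can be illustrated without any cluster. The sketch below, in plain Python, accumulates a running statistic chunk by chunk; Spark and Hadoop generalize this pattern across many machines. The chunk size and the generated "readings" are illustrative, not tied to any real historian export.

```python
# Minimal sketch of out-of-core (chunked) processing in plain Python.
# Frameworks such as Apache Spark generalize this idea across a cluster;
# here a running sum and count are accumulated chunk by chunk, so the
# whole data set never has to fit in memory at once.
from typing import Iterable, Iterator, List

def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield successive fixed-size chunks from a stream of readings."""
    chunk: List[float] = []
    for v in values:
        chunk.append(v)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def streaming_mean(values: Iterable[float], size: int = 1000) -> float:
    """Compute a mean one chunk at a time, in constant memory."""
    total, count = 0.0, 0
    for chunk in chunks(values, size):
        total += sum(chunk)
        count += len(chunk)
    return total / count

# Hypothetical sensor stream: one million readings produced lazily,
# so the full series is never materialized in memory.
readings = (float(i) for i in range(1_000_000))
print(streaming_mean(readings))  # prints 499999.5
```

The same accumulate-per-partition, combine-at-the-end structure is what a distributed aggregation performs; only the orchestration differs.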

However, when there is an initial hypothesis with which to attack the problem, or when the partner business team can provide some of the hypotheses they want to validate, the work becomes a focused analysis. Knowing which question to ask is the best strategy for effective analytics: it is essential to collect the questions and the answers one wants to find, because the quality of the questions will be crucial to the quality of the resulting insights.

Many data transformations may be needed to enable tests of the established hypothesis. The critical point is to investigate whether a statement about the data can be checked with the available information and, if not, how to reach that goal by applying techniques that augment knowledge about the data, such as using a different scale for a numeric field or generating new variables from the current ones.
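The two transformations mentioned above can be sketched concretely. In this illustrative example the field names (`ore_tons`, `energy_kwh`) and values are assumptions, not data from the text: one field is rescaled with a logarithm, and a new variable is derived as a ratio of two existing ones.

```python
# Sketch: two simple transformations that augment what can be tested
# against a hypothesis -- rescaling a numeric field and deriving a new
# variable from the current ones. Field names and values are illustrative.
import math

records = [
    {"ore_tons": 120.0, "energy_kwh": 3400.0},
    {"ore_tons": 80.0,  "energy_kwh": 2600.0},
    {"ore_tons": 200.0, "energy_kwh": 5200.0},
]

for r in records:
    # A log scale tames wide-ranging magnitudes for comparison.
    r["log_energy"] = math.log10(r["energy_kwh"])
    # A derived variable: energy intensity per ton, built from two fields.
    r["kwh_per_ton"] = r["energy_kwh"] / r["ore_tons"]

print([round(r["kwh_per_ton"], 2) for r in records])  # [28.33, 32.5, 26.0]
```

A hypothesis such as "energy intensity is roughly constant across batches" can now be checked against `kwh_per_ton`, which did not exist in the raw data.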

The big step in this phase is to keep a goal in mind and analyze what kind of data is needed to validate the hypotheses; this can be done in many ways, for example, by looking for a correlation between numerical variables.
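As a minimal illustration of the correlation check mentioned above, the snippet below computes a Pearson correlation coefficient from scratch; the two series (`throughput`, `power`) are invented for the example and stand in for any pair of numerical fields one hypothesizes to move together.

```python
# Sketch: testing a hypothesis of the form "X and Y move together"
# with a Pearson correlation computed from first principles.
# The two series below are illustrative, not real measurements.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

throughput = [10.0, 20.0, 30.0, 40.0]   # e.g., tons processed per hour
power      = [105.0, 198.0, 300.0, 405.0]  # e.g., kWh consumed

r = pearson(throughput, power)
print(round(r, 3))  # close to 1, supporting a strong linear relationship
```

A value near +1 or -1 supports a linear relationship hypothesis; a value near 0 suggests the hypothesis needs different data or a nonlinear formulation.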
