VELOCITY, VOLUME, VARIETY, AND VERACITY OF DATA
In a 2001 article, Doug Laney of META Group (now part of Gartner) forecast a trend in data growth and information management opportunities. He used three V’s (velocity, volume, and variety) to identify the changes in information management that would give rise to big data technologies.1 While these V’s represent many characteristics, “volume” is the dominant force behind big data (see figure 6.1). A fourth V, veracity, was introduced later to represent the fact that external data could have data integrity challenges.2 Other V’s have since been added, but I will focus the discussion here on volume, velocity, variety, and veracity. It is important to understand how information management and analytics requirements are radically altering technological choices, as the older tools cannot scale to the data tsunami described in chapter 3. However, IT organizations have invested hundreds of billions of dollars in analytics solutions in the past. It is equally important to examine how these investments can be integrated and leveraged in the new marketing analytics solutions.
Let me start with the marketer’s requirements for a large volume of big data. Most organizations were already struggling with the growing size of their databases when the big data tsunami hit their data stores. According to Fortune magazine, we created five exabytes of digital data in all of recorded time until 2003. In 2011, the same amount of data was created in two days. By 2013, that time period was just 10 minutes.3 While the census data volumes were hidden from most marketers, who did not have access to the individual data, the other big data sources identified in chapter 3 challenged the limited IT capabilities. More than storage, the real issue at hand is the throughput requirement. Most of the data must be carried from its source to an analytics store and then used by statistical and unstructured analytics tools. The massive fire hose required to deal with the throughput sets the IT organizations on fire.
Last year, when my basement was flooded with 80 gallons of water, I tried initially to remove the water using towels. After removing about a gallon in approximately 30 minutes, I realized it would take me about 40 hours to clean up the rest and would cause irreparable harm to my back. When I called the professionals, they brought industrial-quality pumps to remove the water in less than one hour. IT managers are in the same predicament. The extract, transform, and load (ETL) tools designed for business intelligence using structured data were not intended for these massive volumes. IT managers can either bring in a set of massively parallel pumps to handle the throughput or ignore the big data pools around them. Just as my towels were no match for the flooded basement, the older data integration and storage tools are no match for these volumes; a new set of tooling is needed.
Figure 6.1 What is big data?
A decade ago, organizations typically counted their data storage for analytics infrastructure in terabytes. They have now graduated to applications requiring storage in petabytes. This data is straining the analytics infrastructure in a number of industries. For a telco with 100 million customers, the daily location data could amount to about 50 terabytes, which, if stored for 100 days, would occupy about 5 petabytes. In my discussions with one cable company, I learned that they discard most of their network data at the end of the day because they lack the capacity to store it. However, regulators have asked most telcos and cable operators to store call detail records (CDRs) and associated usage data. For a 100-million-subscriber telco, the CDRs could easily exceed 5 billion records a day. As of 2010, AT&T had 193 trillion CDRs in its database.4 For every CDR, there are ten registration records in their radio access networks (RAN), and for every RAN record, the deep packet inspection (DPI) data is about ten times larger. The good news is this data is like having hundreds of cameras tracking all consumer activities. The bad news is that it requires massive throughputs.
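The sizing arithmetic in this example can be checked with a short calculation. The sketch below uses only the round figures quoted above and assumes decimal units (1 petabyte = 1,000 terabytes); the variable names are illustrative, and the CDR figure is the lower bound of "could easily exceed 5 billion."

```python
# Back-of-the-envelope sizing for the 100-million-subscriber telco example.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB = 10**12
PB = 10**15

subscribers = 100_000_000
daily_location_bytes = 50 * TB   # daily location data quoted in the text
retention_days = 100
cdrs_per_day = 5_000_000_000     # lower bound quoted in the text
ran_records_per_cdr = 10         # registration records per CDR

# Storage needed to retain 100 days of location data.
retained_pb = daily_location_bytes * retention_days / PB

# Average location data generated per subscriber per day, in kilobytes.
location_kb_per_subscriber = daily_location_bytes / subscribers / 1_000

# Record counts implied by the quoted ratios.
cdrs_per_subscriber = cdrs_per_day / subscribers
ran_records_per_day = cdrs_per_day * ran_records_per_cdr

print(f"Retained location data: {retained_pb:g} PB")
print(f"Location data per subscriber per day: {location_kb_per_subscriber:g} KB")
print(f"CDRs per subscriber per day: {cdrs_per_subscriber:g}")
print(f"RAN registration records per day: {ran_records_per_day:,}")
```

Under these figures, each subscriber generates about 500 KB of location data and at least 50 CDRs a day, and the radio access network adds at least 50 billion registration records daily. The per-subscriber numbers look modest; it is the multiplication by 100 million subscribers that creates the throughput problem.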
Some of this data can be aggregated or filtered at source. Let me take the example of web usage data. The data collected may include a web link, web page content, and other details that carry nuggets of information that can be pulled out, while the rest of the data can be discarded at source. However, a marketer must specify the nuggets of data that must be kept; a high-velocity engine then rapidly filters the data at source, directs the meaningful data to a large data store, and discards the rest. In addition, if this engine is examining the data in real time, it can also be used for providing real-time analytics, which would be useful for conversations with the customer.
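To make the idea of filtering at source concrete, here is a minimal sketch. It is not a description of any particular product: the field names (customer_id, url, timestamp, page_html) and the function filter_at_source are hypothetical, and a production engine would run in parallel on streaming infrastructure rather than over a Python list.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional

# Hypothetical "nuggets" a marketer might ask to keep from each web event.
KEEP_FIELDS = ("customer_id", "url", "timestamp")

def filter_at_source(events: Iterable[dict],
                     keep_fields: tuple = KEEP_FIELDS,
                     live_counts: Optional[Counter] = None) -> Iterator[dict]:
    """Yield only the requested fields of each event and drop everything else."""
    for event in events:
        # Discard events that lack any of the requested fields.
        if any(field not in event for field in keep_fields):
            continue
        if live_counts is not None:
            # Real-time by-product: running count of page views per customer.
            live_counts[event.get("customer_id")] += 1
        # Keep the nuggets; page content and other details are dropped here.
        yield {field: event[field] for field in keep_fields}

# Made-up sample events for illustration only.
raw_events = [
    {"customer_id": "c1", "url": "/phones", "timestamp": 1, "page_html": "<html>...</html>"},
    {"customer_id": "c2", "url": "/plans", "timestamp": 2, "page_html": "<html>...</html>"},
    {"url": "/home", "timestamp": 3},  # no customer id, so it is discarded
    {"customer_id": "c1", "url": "/checkout", "timestamp": 4, "page_html": "<html>...</html>"},
]

views = Counter()
kept = list(filter_at_source(raw_events, live_counts=views))
print(len(kept), "events kept;", dict(views))
```

The same pass that trims the data also updates the running counts, which is the sense in which a filtering engine can double as a source of real-time analytics.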
There are two aspects to velocity, one representing the throughput of data and the other representing latency. Let me start with throughput, which represents the data moving in the pipes. The amount of global mobile data is growing at a 78 percent compound annual growth rate and is expected to reach 10.8 exabytes per month in 2016, as consumers share more pictures and videos.5 To analyze this data, the corporate analytics infrastructure needs bigger pipes and massively parallel processing. Latency is the other measure of velocity. Analytics used to be a “store and report” environment in which reporting typically contained data as of yesterday, popularly represented as “D-1.” Now, analytics is increasingly being embedded in business processes using data-in-motion with reduced latency. For example, Turn (www.turn.com) is conducting its analytics in ten milliseconds to place advertisements in online advertising platforms.6
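Both measures of velocity can be put into perspective with simple arithmetic. The sketch below converts the monthly traffic forecast into an average rate and compares a "D-1" reporting delay with a ten-millisecond decision window. It assumes decimal units, a 30-day month, and a full 24 hours of staleness for D-1 reporting; none of these assumptions comes from the cited forecast.

```python
# Throughput: average rate implied by 10.8 exabytes per month.
# Assumptions: 1 EB = 10**18 bytes, 1 TB = 10**12 bytes, 30-day month.
EB = 10**18
TB = 10**12
SECONDS_PER_DAY = 86_400

monthly_bytes = 10.8 * EB
avg_bytes_per_second = monthly_bytes / (30 * SECONDS_PER_DAY)
avg_tb_per_second = avg_bytes_per_second / TB
avg_tbit_per_second = avg_tb_per_second * 8

# Latency: a "D-1" report can be up to a day old (assumed: 24 hours),
# compared with a ten-millisecond window for placing an advertisement.
d_minus_1_ms = SECONDS_PER_DAY * 1_000
ad_decision_ms = 10
latency_ratio = d_minus_1_ms / ad_decision_ms

print(f"Average throughput: {avg_tb_per_second:.2f} TB/s "
      f"({avg_tbit_per_second:.1f} Tbit/s)")
print(f"D-1 latency is {latency_ratio:,.0f} times the ad decision window")
```

On these assumptions the forecast works out to a little over 4 terabytes per second on average, and moving from D-1 reporting to a ten-millisecond decision shrinks the latency budget by a factor of several million.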
These flows no longer represent structured data. Conversations, documents, and web pages are good examples of unstructured data. Some of the data, such as that coming from telecom networks, is somewhat structured but carries such a large variety of formats that it is almost unstructured. All this leads to a requirement for dealing with high variety. In the 1990s, as data warehouse technology was introduced, the initial push was to create meta-models to represent all the data in one standard format. The data was compiled from a variety of sources and transformed using ETL (extract, transform, load) or ELT (extract the data and load it in the warehouse, then transform it inside the warehouse). The basic premise was a narrow variety and structured content. Big data has significantly expanded our horizons, enabled by new data integration and analytics technologies. A number of call center analytics solutions seek to analyze call center conversations and correlate them with emails, trouble tickets, and social media blogs. The source data includes unstructured text, sound, and video in addition to structured data. A number of applications are gathering data from emails, documents, or blogs. For example, Slice provides order analytics for online orders (see www.slice.com for details). Its raw data comes from parsing emails and looking for information from a variety of organizations: airline tickets, online bookstore purchases, music download receipts, city parking tickets, or anything a consumer can purchase and pay for that hits his or her inbox. How do we normalize this information into a product catalogue and analyze purchases?
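One simple way to approach the normalization question is to map each source's fields onto a common purchase record and then analyze the common records. The sketch below illustrates that idea only; it is not how Slice or any other vendor works. The Purchase record, the SOURCE_RULES table, the field names, and the sample data are all hypothetical, and it assumes the emails have already been parsed into dictionaries.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Purchase:
    """Common record that every source is normalized into."""
    source: str
    category: str
    item: str
    amount: float

# Hypothetical per-source rules: which raw field holds the item and the
# amount, and which catalogue category the source maps to.
SOURCE_RULES = {
    "airline":   {"item": "route",    "amount": "fare",  "category": "travel"},
    "bookstore": {"item": "title",    "amount": "price", "category": "books"},
    "parking":   {"item": "citation", "amount": "fine",  "category": "municipal"},
}

def normalize(source: str, raw: dict) -> Purchase:
    """Translate one parsed record into the common Purchase format."""
    rule = SOURCE_RULES[source]
    return Purchase(
        source=source,
        category=rule["category"],
        item=str(raw[rule["item"]]),
        amount=float(raw[rule["amount"]]),  # amounts may arrive as text
    )

def spend_by_category(purchases) -> dict:
    """Total spend per catalogue category."""
    totals = defaultdict(float)
    for purchase in purchases:
        totals[purchase.category] += purchase.amount
    return dict(totals)

# Made-up records standing in for already-parsed emails.
parsed_emails = [
    ("airline", {"route": "JFK-SFO", "fare": "320.00"}),
    ("bookstore", {"title": "Book A", "price": 12.5}),
    ("bookstore", {"title": "Book B", "price": 7.5}),
    ("parking", {"citation": "P-1001", "fine": 45}),
]

purchases = [normalize(source, raw) for source, raw in parsed_emails]
totals = spend_by_category(purchases)
print(totals)
```

The hard part in practice is the step this sketch assumes away: extracting reliable fields from free-form emails and maintaining the mapping rules as the variety of sources grows.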
Unlike carefully governed internal data, most big data comes from sources outside our control and therefore suffers from significant data integrity problems. Veracity represents the credibility of the data source. If an organization were to collect product information from third parties and offer it to its contact center employees to support customer queries, the data would have to be screened for source accuracy and credibility. Otherwise, the contact centers could end up recommending competitive offers that might marginalize the organization’s own offerings and reduce revenue opportunities. Many social media responses to campaigns could be coming from a small number of disgruntled past employees or from persons employed by the competition to post negative comments. For example, we assume that a “like” on a product signifies a satisfied customer. But what if a third party placed the “like”?7 Marketers as well as customers can find almost any information publicly. However, filtering for trustworthy information is an important task. Earlier, I discussed how simple rules such as the number of reviews on a Yelp page are indicators of veracity. When the marketer uses automated means to gauge market sentiment, these veracity filters are an important part of the solution to make the information trustworthy.
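Two simple veracity filters of the kind described here can be sketched in a few lines: a minimum review count before a page is trusted, and a sentiment average that counts each author once so that a handful of prolific posters cannot dominate. The threshold, field names, and sample posts below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

MIN_REVIEWS = 20  # hypothetical threshold; a marketer would tune this

def passes_review_count(page: dict, min_reviews: int = MIN_REVIEWS) -> bool:
    """Trust a review page only if it has enough reviews behind it."""
    return page.get("review_count", 0) >= min_reviews

def raw_sentiment(posts) -> float:
    """Average score over all posts, however many each author wrote."""
    return mean(post["score"] for post in posts)

def per_author_sentiment(posts) -> float:
    """Average score with each author counted once.

    Each author's posts are averaged first, so one person posting
    repeatedly carries the same weight as anyone else.
    """
    by_author = defaultdict(list)
    for post in posts:
        by_author[post["author"]].append(post["score"])
    return mean(mean(scores) for scores in by_author.values())

# Made-up posts: one author posts four negative comments (-1),
# two other authors post one positive comment (+1) each.
posts = [
    {"author": "a", "score": -1},
    {"author": "a", "score": -1},
    {"author": "a", "score": -1},
    {"author": "a", "score": -1},
    {"author": "b", "score": 1},
    {"author": "c", "score": 1},
]

print("Raw sentiment:", round(raw_sentiment(posts), 2))
print("Per-author sentiment:", round(per_author_sentiment(posts), 2))
print("Page with 3 reviews trusted?", passes_review_count({"review_count": 3}))
```

In this made-up sample the raw average is negative while the per-author average is positive, which shows how a small number of repeat posters can tilt an automated sentiment reading. Neither rule establishes that a source is truthful; they only reduce the weight given to thin or concentrated evidence.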
In the next sections, I will introduce big data technologies and discuss how they deal with the large volume, velocity, variety, and veracity of data.