Big Data – external and internal

As I mentioned in Chapter 1, “Big Data” is a buzz phrase widely bandied about with, in many instances, little definition of what is “big.” My discussion put some content behind a definition in terms of Volume, Velocity, and Variety. That discussion was an almost standard one.

Although there is much discussion about the variety component, at the risk of oversimplification, I will categorize it into three classes as illustrated in Figure 2.1: numerical data, imaging data, and textual data. Imaging data are photos, videos, and multimedia. Numerical data are any numbers as well as text that may logically be associated with the numbers. This includes, for instance, a product stock keeping unit (SKU), an order number, or a customer ID. These three are usually alphanumeric codes. The text class is strictly text, although numbers could certainly be embedded in the text strings. Customer reviews, call center logs, warranty claim notes, and product return notes are a few examples. The numeric and text data are particularly important for new product development, but at different points in the overall process: text in the ideation stage (the focus of this chapter) and numerical data in the pricing and tracking stages. See Mudambi and Schuff [2010] for an analysis of customer reviews and of what aspects of reviews make them useful and informative.

FIGURE 2.1 There are three major components of Big Data: numerical data, imaging data, and textual data. The imaging data could consist of videos and pictures. Textual data are any kind of text ranging from a word or two to whole books.

The data in these three classes are a combination of internal and external origination (another artificial distinction). Internal data are generated by the order processing system when a customer places an order. Price point, discount rates, SKU code, dates, payment options, and so forth are automatically attached to the order by well-established internal protocols of the ordering, tracking, accounting, and financial systems. Most, if not all, of these are hidden from the customers. External data, however, are generated solely by the customers and may not even be maintained by the business. The external data could be product reviews on a product review web site that is divorced from the business, or they could be comments on social media. Certainly, they could be on the business’s own web site but nonetheless be generated external to the company. The particular use of these pieces for ideation is little discussed, although some research is beginning in this area. See Liu and Lu [2016] for an example. They focus on crowdsourcing, which has several definitions, all fundamentally referring to online reviews by a “crowd” of customers. Liu and Lu [2016], for example, refer to it as “the process of soliciting inputs from a large group of online users via the internet-based platforms.” Also see Evans et al. [2016] for a review of the crowdsourcing literature.

For what part of a business’s operations are numerical data to be used? Videos? Text? There seems to be little guidance and discussion that I am aware of. The three pieces are illustrated in Figure 2.1.

The use of the text component is discussed in the next section and then again in Chapter 7. The use of the numeric component is discussed in Chapter 7.

Text data and text analysis

Text data are usually characterized as unstructured. To understand why, it is necessary to first understand the structured nature of non-text data; i.e., numerical data. Numerical data are structured in that they are stored in well-defined fields with well-defined formats based on well-defined rules and protocols. For example, a transactions database would have the total number of units sold and the total invoice amount. The former would be in one field without a decimal format (of course, depending on the product) while the latter would be in another field with a two-decimal place format reflecting dollars and cents. The data would be maintained in a data table consisting of rows and columns. For transactions data, the rows would be individual orders while the columns would be the units sold and the invoice amount. Other columns would be included to indicate an order number and transaction date. Another data table would have a column that represents the order number and an additional column for a customer ID. Yet another data table would repeat this customer ID plus have columns for the customer name and address. All the data tables are linked on order number and customer ID. These data are structured in that each datum goes in a specific table, in a specific place in that table, with a specific format, and for a specific purpose. See Lemahieu et al. [2018] for some discussion about structured and unstructured data for database design.
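One way to picture "a specific place, with a specific format" is a typed record. The following is only a minimal sketch: the class name `Transaction`, its field names, and the sample values are hypothetical and are not taken from any particular system. It assumes units are whole numbers and that invoice amounts are kept to two decimal places.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """One row of a hypothetical transactions table."""
    order_number: str        # alphanumeric code
    transaction_date: date
    units_sold: int          # whole units, no decimal format
    invoice_amount: Decimal  # dollars and cents, two decimal places

    def __post_init__(self):
        # Enforce the field's rule: units must be a non-negative whole number.
        if not isinstance(self.units_sold, int) or self.units_sold < 0:
            raise ValueError("units_sold must be a non-negative whole number")
        # Enforce the field's format: store the amount with exactly two decimals.
        cents = Decimal(self.invoice_amount).quantize(Decimal("0.01"))
        object.__setattr__(self, "invoice_amount", cents)


row = Transaction("O100", date(2019, 3, 1), 3, Decimal("25.5"))
print(row.units_sold, row.invoice_amount)
```

A value that breaks a rule, such as 2.5 units, is rejected when the record is built, which is the sense in which structured data follow rules and protocols.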

Not all non-text data are numerical. Corporate databases of transactions and customers have always contained text in addition to numerical data. These text data, however, like the numerical data, are structured. Names and addresses are stored in well-defined fields and tables with well-defined formats based on well-defined rules and protocols. Addresses, for example, would be:

  • address 1 for street;
  • address 2 for apartment or suite number;
  • address 3 for possible third part of address such as a post office box number;
  • city;
  • state or province;
  • country; and
  • ZIP or postal code.

The same holds for product descriptions and email addresses. This text may best be labeled as character strings or character data to distinguish it from the text data computer scientists, IT managers, and data analysts worry about when they deal with Big Data.

These numeric and character data follow rules for storage and format. There are exceptions to the rules, but most data processing proceeds with few glitches or concerns. True text data are different. They do not follow any rules because the text can contain any type of characters, numeric or strings in any order, and be of any length. In addition, they may make intelligent sense or not. By intelligent sense, I mean that what is stored may be something that a normal person could read and immediately understand: text that basically follows proper grammar and literary rules with complete sentences and correct punctuation. In short, they make sense. They also may not follow grammar rules, may be short abbreviated statements with strange letterings (e.g., “bff” for “best friends forever”; “lol” for “laughing out loud” or maybe “lots of luck”; or “IMHO” for “in my humble opinion”), may contain foreign words or Latin words and expressions such as “ergo” for “therefore” or “post hoc ergo propter hoc” for “after this, therefore because of this,” and could be any length from a single word to several pages without punctuation or paragraph indication. In short, they are free-form. These types of data are unstructured. This makes storage and formatting difficult, to say the least.

Although text data are unstructured in form and format, there is actually a structure in their arrangement that helps analysis. The arrangement is documents within a corpus, which may itself be one of several corpora in a larger collection. I discuss these arrangements in the next subsection.

Documents, corpus, and corpora

Text data per se are unstructured at one level, as I just mentioned, yet at another level they are structured. It is from this higher level of structure that even more structure is imposed on text data so that traditional statistical analysis, primarily multivariate analysis, can be used to extract Rich Information useful for product ideas. The higher-level structure involves documents, a corpus of documents, and perhaps corpora (the plural of corpus). This is a hierarchical structure illustrated in Figure 2.2.

FIGURE 2.2 The raw, unstructured text data are contained in documents, which are contained in a corpus. If there are multiple such containers, then together they form corpora. Two sets of raw text in two sets of documents are shown here. Each set of documents forms a corpus, and the two together form the corpora.

In most instances, there is one corpus composed of many documents. As an example, a consumer survey could ask respondents to state why they like a particular product. The responses would be free-form text, the unstructured text in the bottom row in Figure 2.2. Each response is a document. Each document could be just a word (e.g., “Wonderful”) or many words depending on what the respondent said about the product. The entire survey is the corpus. If the survey is done multiple times (e.g., once per year), then all the surveys together are the corpora. For product reviews on a website, one review by one person is a document while all the reviews on that site are a corpus. Reviews on several sites for that same product are corpora. These product review documents could also be collected into one data table with, say, two columns: one for the source and the other for the text.
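The hierarchy and the two-column data table can be sketched in a few lines of Python. This is only an illustration: the site names `site_a` and `site_b` and the review text are invented, and a plain dictionary stands in for whatever storage a real project would use.

```python
# Hypothetical reviews; the site names and wording are invented for illustration.
# Each string is a document, each list is a corpus, the whole dict is the corpora.
corpora = {
    "site_a": [
        "Wonderful",
        "Works well but the battery drains too fast",
    ],
    "site_b": [
        "love it!!! bff approved",
    ],
}

# Flatten the hierarchy into a two-column data table: (source, text).
# Each row is one document; rows sharing a source make up one corpus.
review_table = [
    (source, text)
    for source, documents in corpora.items()
    for text in documents
]

# Number of documents in each corpus.
corpus_sizes = {source: len(documents) for source, documents in corpora.items()}

for source, text in review_table:
    print(f"{source}\t{text}")
```

Keeping the source in its own column preserves the corpus membership of each document even after everything is stacked into a single table.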

Organizing text data

Since text data, as unstructured data, are fundamentally different from numeric data and other structured data, it would seem that the organization of text data should be different from other types of data. This is, in fact, the case. Consider structured data. By the definition of being structured, these data are maintained in a well-defined relational database. The database is relational in the sense that it consists of tables, each table containing a specific set of information. The tables are linked or “related” based on keys. There are primary keys and foreign keys that allow you to merge or join two or more tables to answer a query. The primary key is an identifier that is attached to, and is characteristic of, the table involved. It identifies a specific element in the table. For example, for a customer table in a transaction database, a customer ID (CID) is a primary key for this table because it is directly associated with the data in that table. It identifies a specific customer. An order table in the same transaction database would have an order number (ONUM) that identifies a specific order, so it is a primary key. But the CID is also in the order table because the customer placing the specific order must be identified. In the order table, the customer ID is a foreign key because it identifies a record in another table. A transaction database, several tables in that database, and primary and foreign keys are displayed in Figure 2.3.
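The customer and order tables can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions: the column names `cid` and `onum` follow the CID and ONUM abbreviations in the text, while the table layouts, the `amount` column, and all sample rows are invented. The order table is named `orders` here because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Customer table: cid is the primary key.
conn.execute("""
    CREATE TABLE customer (
        cid  TEXT PRIMARY KEY,
        name TEXT NOT NULL
    )
""")

# Order table: onum is the primary key, cid is a foreign key into customer.
conn.execute("""
    CREATE TABLE orders (
        onum   TEXT PRIMARY KEY,
        cid    TEXT NOT NULL REFERENCES customer (cid),
        amount REAL NOT NULL
    )
""")

conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [("C001", "Ann Lee"), ("C002", "Raj Patel")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("O100", "C001", 25.50), ("O101", "C002", 12.00), ("O102", "C001", 8.25)],
)

# Join the two tables on the shared key to answer "who placed each order?"
rows = conn.execute("""
    SELECT o.onum, c.name, o.amount
    FROM orders AS o
    JOIN customer AS c ON c.cid = o.cid
    ORDER BY o.onum
""").fetchall()

# An order that points at an unknown customer violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES ('O103', 'C999', 5.00)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(rows, orphan_rejected)
```

The join works because every `cid` in the order table is guaranteed to match exactly one row in the customer table, which is what the primary and foreign key pairing provides.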

A useful analogy for a relational database is a building. The building can be expanded in one of two ways: add more floors or add more wings. Adding floors is vertical scaling while adding wings is horizontal scaling. A relational database is vertically scalable to handle more data because it can be expanded by adding more capacity to a server: larger disk storage, more RAM, a faster processor. A query language does not care how large the database is. Horizontal scaling is different, however, in that new data structure and organization may have to be added. See Lemahieu et al. [2018] on horizontal and vertical scalability.

Not all databases are relational. By the nature of the data they store, they cannot be relational. Text data, videos, and images are examples of data types that do not fit the relational database model because the data do not fit into well-defined structures nor can keys, primary or foreign, be defined for the entries. How do you define a primary key for a customer review for a product? As a result, these databases are unsearchable based on primary keys and can only be scaled horizontally by adding more servers to handle the increased load. These databases are called document databases, where the documents could be text, videos, or images. For our purposes, documents are just text.

FIGURE 2.3 This shows three tables in a transaction database. Normally, more than three tables would describe an order. The keys are labeled Primary Key and Foreign Key.

Some document databases do have keys in the sense that the data are stored as key-value pairs. The key is a unique identifier for the record containing the text, and so is record specific, while the value is the document itself. The keys could be date-time stamps indicating when the text was either created or entered in the database. They allow for faster accessing of the values and are thus a more efficient way to organize the data.
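A toy version of this key-value arrangement is shown here. It is a sketch only: a Python dictionary stands in for a real document database, the helper name `put` is hypothetical, and the time stamps and review text are invented.

```python
from datetime import datetime

# A toy key-value document store: the key is a date-time stamp,
# the value is the document (free-form text).
document_store = {}


def put(created_at: datetime, text: str) -> str:
    """Store a document under its ISO-formatted time stamp and return the key."""
    key = created_at.isoformat()
    document_store[key] = text
    return key


key1 = put(datetime(2019, 3, 1, 9, 30, 0), "Wonderful")
key2 = put(datetime(2019, 3, 1, 9, 45, 12), "stopped working after a week, IMHO not worth it")

# Access by key is a direct lookup; the store never inspects the text itself.
fetched = document_store[key1]
print(key1, "->", fetched)
```

Note that a time stamp is unique only if no two documents arrive at the same instant, so a real system would likely need a finer-grained or composite key.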

The query language most commonly used to access relational tables, joining them where necessary based on the keys, is the Structured Query Language (SQL). SQL has a simple, human language-like syntax that allows a user to request data from a table or combination of tables in an intuitive manner. The primary component of a SQL statement is the Select clause. As an example, you would select a variable “X” from a data table named “dbname” based on a condition that another variable “Y” has the value “2” using the statement “Select X from dbname where Y = 2”. More complicated statements are certainly possible.
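The Select statement can be tried directly with Python's built-in `sqlite3` module. This is a small sketch: the table name `dbname` and the variables `X` and `Y` come from the text, while the column types and the three sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A data table named "dbname" with the two variables from the text.
conn.execute("CREATE TABLE dbname (X TEXT, Y INTEGER)")
conn.executemany(
    "INSERT INTO dbname VALUES (?, ?)",
    [("a", 1), ("b", 2), ("c", 2)],  # invented sample rows
)

# The statement from the text: keep X only for rows where Y equals 2.
selected = [row[0] for row in conn.execute("SELECT X FROM dbname WHERE Y = 2")]
print(selected)
```

SQL keywords are not case sensitive, so “Select” and “SELECT” are interchangeable; uppercase is a common convention for readability.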

SQL is the primary language for querying a structured relational database. This is due to the structured nature of the database. A relational database is sometimes referred to as a SQL database. A nonstructured database would use NoSQL and the database is sometimes referred to as a NoSQL database. See Lemahieu et al. [2018] for discussion of SQL and NoSQL databases.
