Creating a searchable database
Processing text data and identifying key words, phrases, and topics is only the beginning. What do you do with this processed text data? The objective of the processing methods I outlined above is to enable you to take a large quantity of text data from online customer reviews and distill essential messages useful for developing new product ideas. Unfortunately, someone must still go through the distilled data since ideas will not just “pop out” of them. In addition, the design team may be, and probably will be, large and diverse, so there is not just one designer or even one design department. A way is needed to disseminate the distilled text data so that a wide, collaborative group of people can access it and use it for their own purposes, such as identifying new ideas and setting design parameters.
FIGURE 2.12 This set of graphs highlights specific points for the SVD of DTM using TFIDF weights.
A database structure is needed to link the original text data to the distilled key words, phrases, and topics so that users (i.e., designers) can develop queries whose results link back to the original text. JMP software, for example, allows you to analyze text data as described above, then pick a key word or phrase and jump to the original document or documents containing it. This allows you to scan those documents to better understand the context of what customers said and to interpret their messages for ideas and design parameters. A general system such as this is needed so that diverse groups can handle large volumes of text data. A possible flow chart schema is illustrated in Figure 2.15.
FIGURE 2.13 This set of graphs highlights specific points for the SVD of DTM using TFIDF weights.
FIGURE 2.14 Five topics with term loadings.
FIGURE 2.15 The results of a text analysis can be made searchable with a system that links words back to their originating documents.
A central feature of the schema in Figure 2.15 is the link from the words and phrases extracted from the corpus back to the corpus itself. The extracted words and phrases provide insight and Rich Information about customer needs, problems, and issues. Their real value, however, can only come from the context they are drawn from, which is the documents in the corpus. There must be a link between the corpus and the key words and phrases. Lemahieu et al. [2018] outline an approach for indexing documents that they call an inverted index. Basically, once a word (the same holds for phrases) is extracted using the methods discussed above, an index entry is created that consists of a key-value pair: the key is the word and the value is a list of document pointers. The document pointer list identifies all the documents that contain the word or term. This can be created from the DTM, which itself contains markers for all the documents each word comes from. Lemahieu et al. [2018] also note that the list pointers are of the form $(d_{tj}, w_{tj})$ for term $t$, where $d_{tj}$ is the $j$th document containing term $t$ and $w_{tj}$ is the importance weight associated with that term. This weight would be the tfidf. I discussed the DTM and tfidf weights in Section 2.3.3.
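The inverted index described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three reviews are invented, the `tokenize` helper is a stand-in for the preprocessing discussed earlier, and the weight uses the plain form tf × log(N/df), which may differ from the tfidf variant used elsewhere in this chapter. The function and variable names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus of customer reviews (invented for illustration).
corpus = {
    0: "The handle broke after one week",
    1: "Great handle but the lid leaks",
    2: "The lid leaks and the lid is hard to clean",
}

def tokenize(text):
    """Lowercase and keep alphabetic runs; a stand-in for fuller preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Return {term: [(doc_id, weight), ...]}, heaviest documents first."""
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    index = defaultdict(list)
    for doc_id, counts in term_counts.items():
        for term, tf in counts.items():
            # Assumed weighting: raw term frequency times log(N / df).
            weight = tf * math.log(n_docs / doc_freq[term])
            index[term].append((doc_id, weight))

    # Sort each pointer list so the most important documents come first.
    for postings in index.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(index)

index = build_inverted_index(corpus)
print(index["lid"])    # pointer list for the term "lid"
```

Each dictionary key plays the role of the word, and each value is its list of (document, weight) pointers. A term that occurs in every document, such as "the" here, receives a weight of zero under this weighting, which is one reason stop words are usually removed before indexing.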
Product designers could query this full-text search engine for key words (and phrases) without accessing the documents per se until they feel they have found some problems and suggestions in the reviews based on the key words. Those problems and suggestions, of course, must satisfy design constraints such as what is currently technologically feasible. Once they feel they have some ideas, they could then link back to the corpus to investigate the context of the words they searched for and clarify their meaning. This process is not linear: the search for words and the linking back to the corpus would probably happen at the same time. The issue is whether the full-text search engine will allow them to do this.
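One way to picture the query-then-link-back step is a small relational store in which a table of term pointers references a table of original reviews. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The choice of SQLite, the table and column names, the sample reviews, and the rounded weights are all assumptions, not something prescribed by the schema in Figure 2.15.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        doc_id INTEGER PRIMARY KEY,
        review TEXT NOT NULL
    );
    CREATE TABLE postings (
        term   TEXT NOT NULL,
        doc_id INTEGER NOT NULL REFERENCES documents(doc_id),
        weight REAL NOT NULL,
        PRIMARY KEY (term, doc_id)
    );
""")

# Hypothetical reviews and illustrative (rounded) importance weights.
conn.executemany("INSERT INTO documents VALUES (?, ?)", [
    (0, "The handle broke after one week"),
    (1, "Great handle but the lid leaks"),
    (2, "The lid leaks and the lid is hard to clean"),
])
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", [
    ("handle", 0, 0.41), ("handle", 1, 0.41),
    ("lid", 1, 0.41), ("lid", 2, 0.81),
    ("leaks", 1, 0.41), ("leaks", 2, 0.41),
])

def documents_for(term):
    """Return (doc_id, review, weight) rows for a term, heaviest first."""
    rows = conn.execute(
        """SELECT d.doc_id, d.review, p.weight
           FROM postings AS p
           JOIN documents AS d ON p.doc_id = d.doc_id
           WHERE p.term = ?
           ORDER BY p.weight DESC, d.doc_id""",
        (term.lower(),),
    )
    return rows.fetchall()

# A designer searches for a key word and reads the reviews behind it.
for doc_id, review, weight in documents_for("lid"):
    print(doc_id, weight, review)
```

Because the key word lookup and the original text are joined in a single query, a designer can move between searching and reading context in either order, which suits the non-linear process described above.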
Lemahieu et al. [2018] note that many existing full-text search engines provide this capability plus other features. Some they cite are:
- a thesaurus;
- a proximity measure so that documents are returned in which the terms are close together;
- fuzzy logic to allow for misspellings and different ways to spell the same word; and
- advanced text mining techniques.
See Lemahieu et al. [2018] for further discussion of these points.
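Two of the features listed above, tolerance for misspellings and a proximity measure, can be approximated with the Python standard library. The sketch below is a rough stand-in and not how any particular search engine implements them: it uses `difflib.get_close_matches` for approximate string matching and a simple token-distance check for proximity. The function names, the cutoff of 0.8, and the default gap of three tokens are hypothetical choices.

```python
import difflib
import re

def tokenize(text):
    """Lowercase and keep alphabetic runs; a stand-in for fuller preprocessing."""
    return re.findall(r"[a-z]+", text.lower())

def fuzzy_terms(query, vocabulary, cutoff=0.8):
    """Return indexed terms that closely match a possibly misspelled query."""
    return difflib.get_close_matches(
        query.lower(), sorted(vocabulary), n=5, cutoff=cutoff
    )

def within_proximity(text, term_a, term_b, max_gap=3):
    """True if the two terms occur within max_gap tokens of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)

# Hypothetical vocabulary of indexed terms and a sample review.
vocabulary = {"handle", "lid", "leaks"}
review = "Great handle but the lid leaks"

print(fuzzy_terms("handel", vocabulary))           # misspelled query
print(within_proximity(review, "lid", "leaks", max_gap=1))
print(within_proximity(review, "handle", "leaks", max_gap=2))
```

In practice a misspelled query would first be mapped to an indexed term with something like `fuzzy_terms`, and the proximity check would then be applied only to the documents that the term's pointer list returns.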