Message testing

Once the initial messages are developed, they must then be tested with customers to uncover what works and what does not work. There are two quantitative ways to do this:

  • 1. MaxDiff message testing; and
  • 2. A/B message testing.

The messages could be qualitatively tested through focus groups, but this has drawbacks. Focus groups require that key marketing personnel attend sessions to hear first-hand what customers have to say. They do not have to be physically present because modern communication technology allows them to view sessions remotely. Regardless of whether they view the sessions onsite or remotely, their time and attention must still be devoted to them, and this is time taken away from other business activities. In addition to this time cost, there are sample size and sample representation issues. Sample sizes are necessarily small, making them unrepresentative of the market. As a result, it is difficult, if not impossible, to generalize findings to the entire market. See Vicsek [2010] for some discussion.

The two quantitative methods are discussed in the next subsections. The two subsections assume an initial set of messages, but this is for convenience. As I noted, creation and analysis are iterative.

MaxDiff analysis of messages/claims

Typically, the team responsible for message creation will develop several candidate messages. There is no rule or rule-of-thumb (ROT) about the size of the candidate set, but my experience has shown that an initial set of a minimum of five is sufficient. I have worked with clients, especially in the pharmaceutical industry, where 50-60 messages would be developed, each varying by a simple word. The reason for this large number is the sensitivity of customers (as well as physicians, regulators, and lawyers) to what is said about a drug. So the team has to be cautious regarding what is claimed.

The message analysis discussed here deals with the problem of identifying the best promotional message, claim, or slogan for a new product, although the methodology could also be applied to messages at the corporate level. I restrict myself, however, to only the marketing function. The “best” is determined by customer ratings of proposed messages. There are many ways or dimensions to specify what a customer has to rate. I will confine myself to a call-to-action (CTA) at first but then expand to other dimensions later.

A CTA statement depends on the product and the type of action required. For instance, any of the following are viable candidate CTA statements: [1]

  • Likelihood to prescribe;
  • Likelihood to use; and
  • Likelihood to ask or learn about.

CTA statements are usually rated on a Likert Scale, typically five points. A problem with a Likert Scale is that it is not clear whether customers use the whole scale (i.e., all five points) to evaluate the CTA or only a portion (e.g., they confine themselves to the top portion only) for all items they evaluate, regardless of whether their evaluation is relevant. This means results could be biased.

Suppose a marketing team developed seven potential advertising messages for a new product. It is certainly not practical to use all seven. Which is best? One way to determine the best is to survey customers and ask them to rate how likely they are to buy the product based on each message. This is a CTA statement. The issue right now is how to present the messages for evaluation. One way is to present them sequentially. This is inefficient and suboptimal since customers may respond differently to one message if they know a prior message may be better or worse. Alternatively, all seven messages could be presented at once and the customers could be asked to rank or sort them for their effectiveness in motivating them to buy. This could work if the number of messages is small, such as the seven in this example. Suppose, however, there are 60 as in the pharmaceutical example. A ranking task is clearly impractical. A compromise is to present a set of messages consisting of a minimum of two but fewer than the full array of messages. This use of a set, called a choice set, is the basis for the message analysis I will describe. The use of a choice set is fundamental to an approach called Maximum Differential Choice Analysis, or MaxDiff for short. The concept of a choice set is the same as the one I described in Chapter 4 for discrete choice analysis. In fact, MaxDiff is in the same family of choice models as conjoint and discrete choice. It differs from conjoint and discrete choice analysis in that the items (messages in this case) are not varied by changing a level of an attribute; there are no attributes, per se. I will next describe the construction of the choice set and the approach. See Paczkowski [2018] for a discussion of the choice family.

Choice set construction

Each customer in a survey is shown a set of messages, not a single message, one set at a time. The sets are called choice sets because each customer is asked to select messages from each set. The sets are created using statistical design principles and procedures that lead to a design matrix. The design matrix is an array of rows and columns where the rows are the choice sets and the columns are the design elements (i.e., messages) comprising the choice sets. There are many design procedures that lead to a design matrix, some of which are discussed in Paczkowski [2018].

For most design procedures, the full set of design elements is represented in each row of the matrix but only in different arrangements. The design is how they are arranged. The elements could be discrete factors with levels, the minimum being two, such as “High” or “Low”, “Present” or “Absent”, “Red” or “Green”.

TABLE 5.1 Full factorial design matrix for two discrete factors, each at two levels. The first column is the first factor and the second is the second factor. With only two factors at two levels each, there are only 4 (= 2 × 2) possible arrangements so there are only four rows to the matrix. Each row is an arrangement of the levels.

Factor 1    Factor 2
Low         Low
Low         High
High        Low
High        High

Suppose a simple case of two discrete factors, each at two levels: “High” and “Low”. A design matrix showing all arrangements is shown in Table 5.1.
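The enumeration behind a full factorial design can be sketched in a few lines of Python. This is only an illustration; the names `levels`, `n_factors`, and `design_matrix` are my own and not part of any design package.

```python
from itertools import product

# Two discrete factors, each at two levels, as in Table 5.1.
levels = ["Low", "High"]
n_factors = 2

# Every arrangement of levels across the factors: 2 x 2 = 4 rows.
design_matrix = list(product(levels, repeat=n_factors))

for row in design_matrix:
    print(row)
```

Adding a factor or a level multiplies the number of rows, which is why full factorial designs quickly become too large to show to customers.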

Small design matrices such as this are easy to create and use with customers. Each row is a choice set so they are shown four sets. For each set, a customer has to state their preference for the first or second factor when each is set at the level specified in the matrix. For our example problem, however, which consists of seven messages, you run into a problem because customers have to judge seven elements in a set. This is far too many. It is better to have a subset of the seven in each row of the matrix, but yet with all seven still appearing equally in the entire matrix to ensure fairness (i.e., balance) to each message. A design called a Balanced Incomplete Block Design (BIBD) could be used.

A BIBD is a design procedure for creating a design matrix such that each factor measured at one level appears an equal number of times in the matrix and each row of the matrix has fewer than the total number of factors. For consumer studies, a rule-of-thumb (ROT) is that 2-5 factors should appear in each row with four being a typical number. [3] For our example, you could have seven messages with each choice set containing four. Such a design matrix is shown in Table 5.2. See Paczkowski [2018] for the ROT.

This arrangement implies that 28 messages (= 7 × 4) are shown in total. Since the example started with only seven messages, each message is obviously replicated four times, enhancing their exposure. The arrangement of the messages into the seven sets of four each is the design matrix, or simply the design.

TABLE 5.2 BIBD design matrix example for seven messages in seven sets, each set with four messages. Each row is a choice set. Notice that each of the seven messages repeats four times throughout the design. Also, notice that each pair of messages (e.g., Message 6 and Message 3) appears the same number of times throughout the matrix: twice.

Message 6    Message 3    Message 7    Message 1
Message 1    Message 2    Message 5    Message 7
Message 5    Message 7    Message 4    Message 3
Message 7    Message 4    Message 6    Message 2
Message 2    Message 1    Message 3    Message 4
Message 4    Message 5    Message 1    Message 6
Message 3    Message 6    Message 2    Message 5

A BIBD design must meet three requirements:

  • 1. Each element must appear the same number of times as every other element.
  • 2. Each pair of elements must appear the same number of times throughout the matrix.
  • 3. An element cannot be duplicated in a single row of the matrix.

These conditions can be verified using an incidence matrix which shows the locations of each message. An incidence matrix corresponding to Table 5.2 is shown in Table 5.3. A pairwise matrix showing how many times a message pairs with another message is also helpful for assessing balance. A pairwise matrix for Table 5.2 is shown in Table 5.4.
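The three requirements can also be checked programmatically. The following is a minimal sketch that uses the choice sets of Table 5.2; the function name `check_bibd` and its return values are hypothetical and are not taken from any design software.

```python
from itertools import combinations

# Choice sets from Table 5.2; each row lists the message numbers in one set.
design = [
    [6, 3, 7, 1],
    [1, 2, 5, 7],
    [5, 7, 4, 3],
    [7, 4, 6, 2],
    [2, 1, 3, 4],
    [4, 5, 1, 6],
    [3, 6, 2, 5],
]
messages = range(1, 8)

# Incidence matrix: rows are blocks, columns are messages 1..7.
incidence = [[1 if m in block else 0 for m in messages] for block in design]

def check_bibd(design, messages):
    # Requirement 3: no message is duplicated within a row.
    no_duplicates = all(len(set(block)) == len(block) for block in design)
    # Requirement 1: every message appears the same number of times.
    replications = {m: sum(m in block for block in design) for m in messages}
    # Requirement 2: every pair of messages appears together equally often.
    pair_counts = {
        pair: sum(set(pair) <= set(block) for block in design)
        for pair in combinations(messages, 2)
    }
    balanced = (
        len(set(replications.values())) == 1
        and len(set(pair_counts.values())) == 1
    )
    return no_duplicates and balanced, replications, pair_counts

is_bibd, replications, pair_counts = check_bibd(design, messages)
print(is_bibd, set(replications.values()), set(pair_counts.values()))
```

For this design, every message is counted four times and every pair twice, which agrees with the caption of Table 5.2.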

TABLE 5.3 BIBD incidence matrix for Table 5.2. A “1” indicates that the message in the column header appears in that block; “0” indicates that it does not appear. A quick perusal shows that each message appears four times in its column.

Block   Message 1   Message 2   Message 3   Message 4   Message 5   Message 6   Message 7
1       1           0           1           0           0           1           1
2       1           1           0           0           1           0           1
3       0           0           1           1           1           0           1
4       0           1           0           1           0           1           1
5       1           1           1           1           0           0           0
6       1           0           0           1           1           1           0
7       0           1           1           0           1           1           0

TABLE 5.4 BIBD pairwise matrix for Table 5.2. The matrix is obviously a square matrix that is symmetric along the main diagonal. Only the upper triangle is shown here. The main diagonal shows the number of times each message occurs and the diagonal cell entries are the respective column sums of the incidence matrix, Table 5.3. Also note that the off-diagonal cell entries are all equal.

Message     Message 1   Message 2   Message 3   Message 4   Message 5   Message 6   Message 7
Message 1   4           2           2           2           2           2           2
Message 2               4           2           2           2           2           2
Message 3                           4           2           2           2           2
Message 4                                       4           2           2           2
Message 5                                                   4           2           2
Message 6                                                               4           2
Message 7                                                                           4

TABLE 5.5 This is a typical choice set presented to a customer. For our example problem, this is the first choice set in Table 5.2. All seven choice sets in Table 5.2 are shown to each customer.

Select the message you prefer the most and the one you prefer the least to motivate you to buy the product. Please select only two messages.

Message      Most Preferred   Least Preferred
Message 6    [ ]              [ ]
Message 3    [ ]              [ ]
Message 7    [ ]              [ ]
Message 1    [ ]              [ ]

Six parameters define a BIBD design matrix:

  • 1. t = number of messages (also called treatments);
  • 2. b = number of rows or blocks or choice sets in the final matrix;
  • 3. k = number of columns in the final matrix;
  • 4. n = total number of messages shown = b × k;
  • 5. r = number of times each message repeats;
  • 6. λ = number of times each pair of messages appears together.

A shorthand notation for a BIBD with these parameters is BIBD(t, b, r, k; λ). The n is not needed in the notation since it is defined by other parameters. For the example in Table 5.2, you have BIBD(t = 7, b = 7, r = 4, k = 4; λ = 2).

The construction of such a design matrix is not trivial. In fact, there are many situations in which a design matrix is not possible. In these cases, you have two options:

  • 1. settle for a non-optimal design; or
  • 2. change the number of messages until you get a BIBD.

A non-optimal design may not be terrible to work with and so should not be discounted. Changing the number of messages, especially reducing the number, may be an issue because the creative team and management have to approve the change.
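Before attempting a construction, it can help to screen a candidate parameter set against the standard necessary conditions for a BIBD: b × k = t × r, λ(t − 1) = r(k − 1), and Fisher's inequality b ≥ t. The sketch below is only a screen; passing it does not guarantee that a design exists. The function names are hypothetical, and `lam` stands for the pair count λ.

```python
def bibd_necessary_conditions(t, b, r, k, lam):
    """Return True if the parameters pass the standard necessary conditions."""
    return (
        k < t                              # blocks are incomplete
        and b * k == t * r                 # total slots counted two ways
        and lam * (t - 1) == r * (k - 1)   # pairs involving one message
        and b >= t                         # Fisher's inequality
    )

def smallest_candidate(t, k, max_r=50):
    """Smallest replication r (up to max_r) that passes the screen for t and k."""
    for r in range(1, max_r + 1):
        if (t * r) % k == 0 and (r * (k - 1)) % (t - 1) == 0:
            b = t * r // k
            lam = r * (k - 1) // (t - 1)
            if bibd_necessary_conditions(t, b, r, k, lam):
                return {"t": t, "b": b, "r": r, "k": k, "lam": lam}
    return None

print(bibd_necessary_conditions(7, 7, 4, 4, 2))
print(smallest_candidate(7, 4))
```

Running `smallest_candidate` over a few neighboring values of t shows how many choice sets a change in the number of messages would imply, which is useful input when negotiating that change with the creative team.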

Data collection

Each customer in a survey is shown a set of messages one set at a time. A possible choice set presentation is shown in Table 5.5. A customer is asked to select the message they most prefer and least prefer as motivating them to buy the product. This is the CTA. Since the most and least preferred messages are selected, the issue of how they interpreted a rating scale is eliminated.

Estimation

Since each customer is asked to make a selection from each choice set, the problem becomes a choice problem similar to the conjoint and discrete choice problems. In fact, MaxDiff is akin to discrete choice, so the estimation methods for discrete choice can be used, although with slight data coding changes usually handled automatically by software. Nonetheless, the estimation is the same. This means the MaxDiff result is a set of estimated utilities, one for each message. These utilities are usually scaled to lie between 0 and 1 and sum to 1.0 so they can, therefore, be interpreted as probabilities. See Paczkowski [2018] about utility scaling.
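One common way to obtain such a scaling is the logit rule, in which each utility is exponentiated and divided by the sum of the exponentiated utilities. I assume this rule here for illustration only; the scaling used by a particular software package may differ. The utilities below are hypothetical.

```python
import math

def scale_utilities(utilities):
    """Map raw utilities to shares in (0, 1) that sum to 1 (logit rule)."""
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(utilities.values())
    expu = {m: math.exp(u - shift) for m, u in utilities.items()}
    total = sum(expu.values())
    return {m: e / total for m, e in expu.items()}

# Hypothetical raw utilities for three messages (not estimates from a study).
raw = {"Message 1": 0.8, "Message 2": 0.1, "Message 3": -0.9}
shares = scale_utilities(raw)

for message, share in shares.items():
    print(message, round(share, 3))
```

The ordering of the messages is unchanged by the scaling; only the units change, from utilities to shares.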

Case study

A pharmaceutical company has a new allergy medication nearing market launch. The marketing and advertising teams developed seven messages to test before a marketing campaign is launched. The messages are:

  • 1. FDA approved.
  • 2. Take just once per day.
  • 3. No side effects.
  • 4. Available over the counter.
  • 5. Noticeable results in 12 hours.
  • 6. Available in pill or liquid form.
  • 7. Requires taking with food.

A BIBD(t = 7, b = 7, r = 4, k = 4; λ = 2) was created. This design matrix is the one in Table 5.2. A panel of consumers known to have the allergy the medication targets was recruited for an online survey. The questionnaire contained modules on the consumers’ general health (e.g., how long they had the allergy; its severity; medications currently taken) and routine demographics. The new medication was described in a module in the middle of the questionnaire. The consumers were then asked the MaxDiff questions. They were shown the seven choice sets and for each one they were asked to select the message that motivates them the most and the least to buy the new medication for their problem.

There were two responses for each choice set: the most preferred and the least preferred. Since there are seven choice sets, there were 14 responses per respondent. This is a respondent’s response pattern. The data for each respondent were recorded in the order Best/Worst for the first choice set, Best/Worst for the second choice set, and so on. The values recorded were the numbers, 1,..., 7, of the selected messages. The estimated utilities and scaled utilities, which are interpreted as probabilities or take rates, are shown in Figure 5.4.
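The response pattern layout can be made concrete with a short sketch. It splits each 14-value pattern into Best/Worst pairs and computes a simple count-based score, best minus worst, for each message. Note that this counting summary is a descriptive check I am substituting for illustration; it is not the choice-model estimation that produces the utilities. The two respondents are hypothetical.

```python
from collections import Counter

N_SETS = 7  # seven choice sets per respondent

def split_pattern(pattern):
    """Turn a 14-value response pattern into seven (best, worst) pairs."""
    if len(pattern) != 2 * N_SETS:
        raise ValueError("expected 14 values: Best/Worst for each of 7 sets")
    return [(pattern[i], pattern[i + 1]) for i in range(0, len(pattern), 2)]

def best_minus_worst(patterns, n_messages=7):
    """Count-based score: times chosen best minus times chosen worst."""
    best, worst = Counter(), Counter()
    for pattern in patterns:
        for b, w in split_pattern(pattern):
            best[b] += 1
            worst[w] += 1
    return {m: best[m] - worst[m] for m in range(1, n_messages + 1)}

# Two hypothetical respondents; values are message numbers 1..7 and
# follow the order of the choice sets in Table 5.2.
patterns = [
    [3, 7, 2, 7, 3, 7, 2, 7, 2, 4, 1, 6, 2, 5],
    [1, 7, 2, 7, 5, 7, 2, 7, 2, 3, 5, 6, 2, 6],
]
scores = best_minus_worst(patterns)
print(scores)
```

Because every best choice is matched by a worst choice, the scores always sum to zero, which makes a quick sanity check on the data file.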

The estimated take rates indicate that consumers prefer the message “Take just once per day” whereas “Requires taking with food” ranks last. The convenience of taking the medication just once per day is important and motivates them to buy the medication. The food requirement, however, is a deterrent to buying.

Extending the framework

While the message analysis framework described in the previous section will work, it may still be insufficient, not because of the approach but because it may be too restrictive. Consumers evaluate messages or claims on multiple dimensions, the importance of the dimensions varying among consumers. I will now discuss a multidimensional adjustment to this message analysis framework.

FIGURE 5.4 The MaxDiff utility estimates and scaled utilities are shown here. The scaled utilities lie between 0 and 1 and sum to 1.0 so they can be interpreted as probabilities or take rates. The bar chart shows the messages ranked by their take rates. There is some differentiation among these messages.

The issue for message assessment is the effectiveness of a message on a business metric such as sales. There are two ways to view effectiveness: perceived effectiveness before launch of the message and actual effectiveness after launch of the message. Perceived effectiveness addresses the question “Will the message work?” while actual effectiveness addresses the question “Did the message work?”. For perceived effectiveness, the only way to determine whether or not a message will work is by asking customers to assess the message quality. In a simplistic manner, this was done with the MaxDiff procedure. It was simplistic because quality assessment was one-dimensional: people either liked or disliked the message, or it motivated them or it did not motivate them, and so forth. Assessment, however, is not one-dimensional but multidimensional, with the message simultaneously appealing to different psychological factors. People could assess a message on its persuasiveness, its logic, its believability, just to name a few factors.

FIGURE 5.5 This illustrates an enhanced view of the key drivers for assessing message effectiveness, both perceived and actual, by incorporating impact and attitude measures.

In general, assessment factors could be grouped into two categories: impact measures and attitude measures. See Dillard et al. [2007b] for a discussion. Impact measures assess the effect a message will have on taking an action such as purchasing a product; that is, the CTA. They “shape” opinion or judgment as noted by Dillard et al. [2007b]. Is the message persuasive, compelling, believable, or convincing enough that the individual will buy the product? Attitude measures assess the judgment or acceptance of the message. Is the message plausible, sound, logical, or novel enough that the individual will pay attention to the message and possibly then buy the product? The outcome of both impact and attitude measures is the same - product purchase - but the path is different. In both cases, the individual will perceive the message to be effective and therefore behave a certain way - buy. The connection between the impact and attitude measures and effectiveness is illustrated in Figure 5.6.

Table 5.6 shows a list of potential impact measures while Table 5.7 shows a list of potential attitude measures. These were gleaned from Dillard et al. [2007a] and Dillard et al. [2007b].

FIGURE 5.6 This illustrates the key drivers for assessing message effectiveness, both perceived and actual.

TABLE 5.6 Sample list of descriptors that could be used to measure the impact of a message.

Persuasive
Effective
Compelling
Convincing
Desirable

TABLE 5.7 Sample list of descriptors that could be used to measure the attitude effect of a message. There is some overlap with the descriptors in Table 5.6.

Persuasive
Believable
Logical
Plausible
Sound
Motivating
Unique
Familiar
Memorable
Novel
Favorable
Friendly
Necessary
Good
Beneficial

Combinations of impact and attitude measures are sometimes used. Raghavarao et al. [2011] mention a study of Pennsylvania mushrooms involving nine message questions - which they refer to as brand concepts - and three scale questions: purchase intent, uniqueness, and believability. The purchase intent is the CTA while the uniqueness and believability are the attitude measures. Raghavarao et al. [2011] analyze only the purchase intent data using a method due to Landis and Koch [1977] that involves cumulative response proportions calculated from the responses on a five-point Likert scale ranging from 1 = Very Unlikely to Purchase to 5 = Very Likely to Purchase. The cumulative proportions are calculated for each brand concept although only four are maintained since the last is always 1.0. Let $p_{ij}$ be the $i$th cumulative proportion for the $j$th concept, $i = 1, \ldots, 4$ and $j = 1, \ldots, 9$. The cumulative proportions are converted to logit values: $\text{logit}_{ij} = \ln\left(p_{ij} / (1 - p_{ij})\right)$. Using appropriate dummy coding for the concepts, these logit values are modeled as a discrete choice model. See Raghavarao et al. [2011] for details. [4]
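The conversion from Likert responses to cumulative proportions and then to logits, ln(p / (1 − p)), can be sketched for a single concept as follows. The response counts are hypothetical and the function name is my own; the modeling step that follows in Raghavarao et al. [2011] is not shown.

```python
import math

def cumulative_logits(counts):
    """Cumulative proportions and their logits from five-point Likert counts."""
    total = sum(counts)
    cum, running = [], 0
    for c in counts[:-1]:          # drop the last, which is always 1.0
        running += c
        cum.append(running / total)
    # Each proportion must lie strictly between 0 and 1 for the logit to exist.
    logits = [math.log(p / (1 - p)) for p in cum]
    return cum, logits

# Hypothetical counts for one concept: responses 1 (Very Unlikely) to 5 (Very Likely).
counts = [10, 20, 30, 25, 15]
cum, logits = cumulative_logits(counts)
print(cum)
print([round(x, 3) for x in logits])
```

A concept with an empty lowest category would give a cumulative proportion of zero and an undefined logit, so sparse categories need attention before this step.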

Another approach uses the MaxDiff framework. Suppose there are three measures as for the mushroom example: purchase intent, uniqueness, and believability. A customer is presented with a choice set of messages based on a BIBD and is asked to select the most preferred and least preferred message as before, but this time for each measure: once for purchase intent, once for uniqueness, and once for believability. If there are seven choice sets, a customer is asked to make 21 (= 3 measures × 7 sets) choices. Experience has shown that this task actually proceeds quickly. I have found it works for designs of seven to nine choice sets and up to five measures. The final data set has two columns (best followed by worst) for each set in the design times the number of measures. With seven choice sets and three measures, there are 2 × 7 × 3 = 42 columns of data, preferably with the first 14 for the first measure, the next 14 for the second measure, and the last 14 for the third measure. Take rates are estimated for each measure as described above. For our example, there are three sets of take rates with each set covering the seven messages. These could be arranged in one data table that has seven rows and three columns.
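The preferred column ordering makes it easy to separate a respondent's record by measure before estimation. A minimal sketch, assuming the 42 values are stored in the order just described; the names `MEASURES` and `split_by_measure` and the respondent data are hypothetical.

```python
MEASURES = ["purchase intent", "uniqueness", "believability"]
N_SETS = 7

def split_by_measure(row):
    """Split one respondent's 42 values into 7 (best, worst) pairs per measure."""
    width = 2 * N_SETS  # 14 columns per measure
    if len(row) != width * len(MEASURES):
        raise ValueError("expected 2 x 7 x 3 = 42 values")
    out = {}
    for i, measure in enumerate(MEASURES):
        block = row[i * width:(i + 1) * width]
        out[measure] = [(block[j], block[j + 1]) for j in range(0, width, 2)]
    return out

# Hypothetical respondent who gives the same answers on every measure.
one_measure = [3, 7, 2, 7, 3, 7, 2, 7, 2, 4, 1, 6, 2, 5]
row = one_measure * 3
by_measure = split_by_measure(row)
print(by_measure["uniqueness"][:2])
```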

There are two ways to analyze the data table of take rates. The first is to plot the take rates for each column. The most common plot is a side-by-side bar chart. A better analysis is to recognize that the three measures are chosen to reflect the perceived effectiveness of the messages. One overall perceived effectiveness index (PEI) could be derived as the weighted average of the three measures, one weighted average for each message and one weight for each measure. The weights should sum to 1.0. This is a simple row weighted average for each message in the data table. The issue is the set of weights. There are four ways to derive them:

  • 1. assign varying weights based on judgement;
  • 2. assign a constant weight;
  • 3. calculate the range of the take rates for each measure and then divide each range by the sum of the ranges; and
  • 4. use the first principal component loading of the take rate table.

The first is obviously highly subjective and subject to challenge, especially if one person chooses the weights. A team of SMEs and KOLs could always decide on the weights, but this has the problem of assembling the right team. The second approach is a simple arithmetic average and is not insightful. The third is like the conjoint attribute importance method I described in Chapter 3. This has an intuitive appeal and is easily understood by management.
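The third option can be sketched as follows. The take rates are hypothetical values that I made up so that each measure's column sums to 1.0; the function names are also my own.

```python
measures = ["purchase intent", "uniqueness", "believability"]

# Hypothetical take rates: one row per message, one column per measure.
take_rates = {
    "Message 1": [0.20, 0.10, 0.25],
    "Message 2": [0.30, 0.20, 0.20],
    "Message 3": [0.15, 0.25, 0.15],
    "Message 4": [0.10, 0.15, 0.10],
    "Message 5": [0.12, 0.10, 0.15],
    "Message 6": [0.08, 0.15, 0.10],
    "Message 7": [0.05, 0.05, 0.05],
}

def range_weights(table):
    """Weight each measure by its take-rate range divided by the sum of ranges."""
    columns = list(zip(*table.values()))
    ranges = [max(col) - min(col) for col in columns]
    return [r / sum(ranges) for r in ranges]

def pei_scores(table, weights):
    """Row-wise weighted average of the take rates for each message."""
    return {
        msg: sum(w * x for w, x in zip(weights, rates))
        for msg, rates in table.items()
    }

weights = range_weights(take_rates)
pei = pei_scores(take_rates, weights)
for msg in sorted(pei, key=pei.get, reverse=True):
    print(msg, round(pei[msg], 3))
```

A measure on which the messages barely differ gets a small weight under this scheme, which matches the intuition that it does little to separate the messages.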

The principal component solution, outlined in the Appendix to Chapter 3, is more complicated because it involves knowing about principal components analysis (PCA), how to apply it, and how to interpret and use the results. Also, the method may be restricted by the number of messages relative to the number of observations. It is generally unclear what sample size should be used with PCA. Osborne [2004] and Shaukat et al. [2016] note that practitioners are divided between a sample size recommendation and a recommendation based on the ratio of sample size to number of items to use in the analysis, although most seem to gravitate to a ratio estimate. Osborne [2004] and Shaukat et al. [2016] note that ratio recommendations of 5:1 and 10:1 are common, although the latter due to Nunnally [1978] seems most commonly cited. See especially Shaukat et al. [2016] for citations on a number of recommendations. The focus on sample-item ratios is interesting since sampling theory per se does not concern itself with this ratio; the sample size to achieve a pre-specified precision, perhaps adjusted for the cost of collecting a sample, is the only issue. See Cochrane [1963] and Levy and Lemeshow [2008] for a good overview of sampling methodologies. The issue of sample size (or at least relative to the number of items) is generally important because, as noted by Osborne [2004], a large sample size tends “to minimize the probability of errors, maximize the accuracy of population estimates, and increase the generalizability of... results.” For PCA with little or no guidance for sample size, Osborne [2004] further notes that overfitting can result which in turn results in “erroneous conclusions in several ways, including the extraction of erroneous factors or mis-assignment of items to factors.”

This sample size issue (or a ratio issue) is important for using PCA with a multidimensional MaxDiff analysis of messages because the “sample” is actually the number of messages and the “items” are the dimensions. If there are 12 messages tested and three dimensions (e.g., purchase intent, uniqueness, and believability), then the ratio of sample to items is only 4:1, which may be insufficient for good results.

FIGURE 5.7 This PEI summary map shows the messages rank ordered from “1” being highest ranked to “7” being lowest ranked for each measure. The whole table is sorted in descending order by the PEI value.

I recommend the third method because of its simplicity and understandability.

Once the PEI is calculated for each message, the messages can be ranked in descending order by their PEI score. A bar chart could display the ranking. This shows which message is the overall “winner” in perceived effectiveness. The PEI scores could also be added to the take rate table as an additional column and the whole table sorted by the PEI score. A heatmap-like table could then be developed with the measures as rows and the PEI as columns. The table should be sorted in descending order by the PEI. It is helpful to know the rank order of each message on each measure so the cells of the table could contain the rank value of a message for a measure. An example is shown in Figure 5.7 for a case of seven messages and three measures such as in the mushroom study. The advantage of this PEI summary table is that the best message is clearly indicated and the reasons for its rank are evident.
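A summary table of this kind can be assembled with a few lines of code. The sketch below uses four hypothetical messages and a hypothetical set of weights to keep the output short; the layout (one row per message, sorted by PEI, with the rank on each measure) is my reading of the description above.

```python
measures = ["purchase intent", "uniqueness", "believability"]
weights = [0.50, 0.25, 0.25]  # hypothetical weights that sum to 1.0

# Hypothetical take rates: one row per message, one column per measure.
take_rates = {
    "Message 1": [0.40, 0.20, 0.30],
    "Message 2": [0.30, 0.40, 0.35],
    "Message 3": [0.20, 0.30, 0.25],
    "Message 4": [0.10, 0.10, 0.10],
}

def rank_desc(values):
    """Rank 1 = highest value; ties keep the order in which they appear."""
    order = sorted(values, key=values.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}

def pei_summary(table, weights):
    """Rows of (message, PEI, ranks per measure), sorted by PEI descending."""
    pei = {m: sum(w * x for w, x in zip(weights, r)) for m, r in table.items()}
    per_measure = [
        rank_desc({m: r[j] for m, r in table.items()})
        for j in range(len(weights))
    ]
    return [
        (m, round(pei[m], 3), [ranks[m] for ranks in per_measure])
        for m in sorted(pei, key=pei.get, reverse=True)
    ]

rows = pei_summary(take_rates, weights)
for row in rows:
    print(row)
```

In this made-up example the top message wins because it ranks first on two of the three measures, which is the kind of explanation the summary table is meant to surface.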

There is one further analysis that could be done based on the MaxDiff approach. The model I just described was estimated on an aggregate basis. This means the estimated utilities can be interpreted as averages of all the customers in the sample. Another way to estimate the utilities is at the individual level; that is, one set of utilities per customer in the sample. This estimation is obviously more complex. See Paczkowski [2016] and especially Paczkowski [2018]. One advantage of estimating at the individual level is that the resulting estimated utilities can be used in further analyses. One form of analysis is TURF, which I described in Chapter 4. Paczkowski [2018] provides a detailed example of using estimated utilities in a TURF analysis.

I outlined a procedure for message testing that involved querying customers about different emotional aspects of a message. I referred to these as dimensions which could be, for example, believability, desirability, and memorability. A weighted average of the average utility scores, where the utilities were from MaxDiff estimations, could be calculated and this weighted score used to rank the messages. The advantage of this approach is that you can identify the drivers for the highest ranked message. For example, the highest ranked message based on the weighted average may also rank highest on believability and memorability.

Although useful, this procedure does not tell you why the message ranked high on, say, believability and memorability. What drove or motivated customers to rate these two dimensions high and, presumably, other dimensions low? Also, what motivated them to rate other messages the way they did? More information is needed to dissect the ratings.

A possible approach is to use demographic information along with other ratings to estimate a model of the mean utility scores for the messages. The demographic data are usually collected in a survey so they should be available. This data tells you about the nature of the people that may have driven them to rate the messages as they did. A simple approach is to profile the respondents by their individual ratings. Another is to estimate a MaxDiff model at the individual level using a Hierarchical Bayes estimation procedure and then profile the customers and their responses. Data visualization tools (e.g., boxplots) could be used to study the distribution of utilities by message and demographic characteristic. Another approach is to use a decision tree methodology with the utility scores as the dependent variable and the demographic variables as the independent variables.

A better approach, however, is to model the ratings as a function of the demographics and features or characteristics of the messages, characteristics that transcend the dimensions such as Believability, Desirability, and Memorability. These characteristics could be:

  • terseness of the wording;
  • average length of the message wording;
  • tone (e.g., harsh, friendly, loving, threatening);
  • style (e.g., simple, complicated);
  • vocabulary;
  • directness (e.g., blunt and to the point); and
  • clarity and conciseness.

As noted by Sanders [1984]:

The way in which messages are styled can amplify, dampen, or entirely cancel the public reactions of respondents to communicated information. Certain options of phrasing and syntax have this impact by constraining what can follow in the unfolding text or transaction with minimal risk of misinterpretation, and without undesired inductions about the character and traits of the respondent. Such stylistic options are a resource for strategic communication when conventions and protocols for structuring discourse do not apply or are rejected.

Also see Dillard [2014] on styles.

To learn about the effect of the message characteristics, you could ask the customers when they take the MaxDiff survey to rate each message on a list of characteristics such as the ones above. These characteristic ratings and the demographics could be used as independent variables in a regression model. In this case, however, there are multiple dependent variables; that is, each message dimension rating (e.g., on believability, desirability, and memorability) is a dependent variable. There would thus be a set of dependent variables. The goal is to uncover the relationship between a set of independent variables and a set of dependent variables.

FIGURE 5.8 This illustrates a data format for using a PLS regression for estimating the effects of message characteristics ratings on message utility ratings. If there are N messages and m dimensions measured for each message (e.g., Believability, Desirability, and Memorability for m = 3), then the Y matrix has size N × m. The cells of the matrix are the mean utility ratings. If there are p characteristics for the messages, then the X matrix has size N × p. If demographics are included, then the mean utilities and characteristic ratings have to be grouped by logical combinations of the demographic variables, which increases the N. For example, if income, measured as low/medium/high, and gender are included then there are six logical combinations. Mean utilities and ratings are grouped by these six combinations.

This is the structure for a partial least squares regression as outlined in the Appendix to Chapter 3. A data structure is illustrated in Figure 5.8.

This methodology could be used to uncover the key drivers for the messages while the MaxDiff approach outlined above reveals the top-ranked message.

A/B digital testing

In the pre-Internet era, billboards, TV and radio spots, newspaper and magazine ads, and direct mail were the only means for promoting a message. Since the advent of the Internet, online digital ads are more dominant. This new form of promotion has introduced a new way to test messages online. This is A/B testing. The “A” and “B” refer to two variants of a message.

What is A/B testing?

A/B testing is a method to test:

  • website landing pages;
  • advertising and email messages;
  • promotional offers;
  • calls-to-action; and
  • price points

to mention a few uses. The aim is to determine the impact on a key business metric, usually sales (click rates are also possible). In its simplest form, there are two steps:

  • 1. Randomly assign each online store visitor to one of the two web pages, each with a variant of a message. Record the visitor’s action. The visitor either made a purchase (i.e., converted) or not.
  • 2. Arrange the data in a 2 × 2 table and then do a statistical test on the table to determine if there is a statistical difference in the metric, say sales.

The table might look like the one in Table 5.8. The number of “hits” is the number of consumers who visited the web page (the “visitors”) and were exposed to a message. The conversion rate is the percent of exposed consumers who purchased the product because of the message. If $h_i$ is the number of hits for message $i$, $i = A, B$, and $Buy_i$ is the number of the $h_i$ consumers who purchased, then the conversion rate is

$$CR_i = \frac{Buy_i}{h_i} \times 100\%.$$

As an example, suppose you conduct an online study on your website for two weeks by randomly showing one message, A, to one group of visitors and another message, B, to another group of visitors during the same period. Suppose the total hits in the two-week period is 16,188. A summary table is shown in Table 5.9.
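The conversion rates for this example can be checked with a few lines of Python. This is only a minimal sketch: the counts are the ones reported in Table 5.9, while the dictionary layout and the function name `conversion_rate` are illustrative assumptions, not part of the original study.

```python
# Counts from Table 5.9: message -> (hits, buyers)
table_5_9 = {"A": (7518, 234), "B": (8670, 504)}


def conversion_rate(hits: int, buyers: int) -> float:
    """Percent of exposed visitors (hits) who purchased."""
    return 100.0 * buyers / hits


rates = {msg: conversion_rate(h, b) for msg, (h, b) in table_5_9.items()}
total_hits = sum(h for h, _ in table_5_9.values())

for msg, rate in rates.items():
    print(f"Message {msg}: conversion rate = {rate:.1f}%")
print(f"Total hits: {total_hits:,}")
```

The two hit counts sum to the 16,188 total mentioned above, and the rates round to the 3.1% and 5.8% shown in the table.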

You can conduct a statistical test to determine if the proportion of online buyers is statistically the same for the two messages. If $p_A$ is the proportion of website visitors who saw message A and converted, and $p_B$ is the corresponding proportion for message B, then the Null Hypothesis is

$$H_0: p_A = p_B$$

vs. the alternative that the proportions differ.[3] The proportion p is generally low. The statistical test is a chi-square test of significance. There are two possible tests: the Pearson and the Likelihood-ratio Chi-square Tests. These are reviewed in Appendix 3.A.

TABLE 5.8 This is a generic setup for an A/B test. The values $h_{ij}$, i = 1,2; j = 1,2 are the number of customers who saw the message for the respective row and column headers. The conversion rate is the percent of consumers who purchased (assuming a hit is equivalent to a consumer), given that they saw the message in that row margin, so it is a marginal quantity.

| Message | Hits | Buy | Not Buy | Conversion Rate |
|---|---|---|---|---|
| A | | | | |
| B | | | | |

TABLE 5.9 Example A/B data.

| Message | Hits | Buy | Not Buy | Conversion Rate |
|---|---|---|---|---|
| A | 7,518 | 234 | 7,284 | 3.1% |
| B | 8,670 | 504 | 8,166 | 5.8% |

The two chi-square tests were conducted for the data summarized in Table 5.9. The results are shown in Figure 5.9 and Figure 5.10. The mosaic chart in Figure 5.9 simply shows that the overwhelming majority of site visitors did not buy so the conversion rate was low. The contingency table at the bottom of the figure shows the relevant proportions. Figure 5.10 shows the two chi-square test results. From Figure 5.10, the Pearson Chi-square is 67.493 and the Likelihood-ratio Chi-square is 69.440. These are very large. For both tests, the p-values are below α = 0.05, a traditional significance level, so the Null Hypothesis of equality is clearly rejected.

FIGURE 5.9 Contingency table summary.

FIGURE 5.10 Chi-square test results.
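The two chi-square statistics can be reproduced directly from the counts in Table 5.9. The following is a minimal sketch using only the Python standard library; the function name `chi_square_tests` is an assumption for illustration, and the p-value shortcut is valid only for one degree of freedom, which is the case for a 2 × 2 table.

```python
import math

# Observed counts from Table 5.9: rows = messages A, B; columns = Buy, Not Buy
observed = [[234, 7284],
            [504, 8166]]


def chi_square_tests(table):
    """Return (Pearson, likelihood-ratio) chi-square statistics for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = 0.0
    lr = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            pearson += (obs - expected) ** 2 / expected
            lr += 2.0 * obs * math.log(obs / expected)  # assumes every cell count is > 0
    return pearson, lr


pearson, lr = chi_square_tests(observed)

# With 1 degree of freedom, the chi-square upper-tail probability is erfc(sqrt(x / 2)).
p_pearson = math.erfc(math.sqrt(pearson / 2.0))
p_lr = math.erfc(math.sqrt(lr / 2.0))

print(f"Pearson chi-square = {pearson:.3f}, p-value = {p_pearson:.3g}")
print(f"Likelihood-ratio chi-square = {lr:.3f}, p-value = {p_lr:.3g}")
```

Both statistics agree with the values quoted from Figure 5.10 to three decimal places, and both p-values are far below 0.05.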

The test results in Figure 5.10 are based on the calculations in Appendix 3.A. This is standard in most elementary statistics textbooks. Another, more advanced approach is to use logistic regression to estimate a model fitting the Buy/Not Buy variable to the message indicator. Since the Buy/Not Buy variable is nominal with only two levels (Buy and Not Buy), OLS cannot be used to estimate parameters for three reasons:

  • 1. OLS can predict any range of values, but this class of problems has only two, such as Buy and Not Buy;
  • 2. OLS has a normally distributed disturbance term but this class of problems has a Bernoulli distribution; and
  • 3. OLS has a constant variance under the Classical Assumptions for regression analysis but this class of problems has a nonconstant variance. See Gujarati [2003] for a review of the Classical Assumptions.

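To see the Bernoulli disturbance for this problem, assume a linear-in-the-parameters model is specified as

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where $Y_i$ has dummy coded values, either 0 or 1. This is called a Linear Probability Model, or LPM. It assumes that $E(\epsilon_i) = 0$ so that $E(Y_i) = \beta_0 + \beta_1 X_i$. Now let

$$p_i = \Pr(Y_i = 1), \qquad 1 - p_i = \Pr(Y_i = 0).$$

Therefore,

$$E(Y_i) = 1 \times p_i + 0 \times (1 - p_i) = p_i.$$

This implies that

$$0 \le E(Y_i) = \beta_0 + \beta_1 X_i = p_i \le 1$$

so the mean (a linear function of X) must lie between 0 and 1, hence the name “linear probability model.”

If the model is still $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, then for $Y_i = 1$, $1 = \beta_0 + \beta_1 X_i + \epsilon_i$ or

$$\epsilon_i = 1 - \beta_0 - \beta_1 X_i$$

with probability $p_i$ since $\Pr(Y_i = 1) = p_i$. For $Y_i = 0$,

$$\epsilon_i = -\beta_0 - \beta_1 X_i$$

with probability $1 - p_i$. The disturbance term can have only two values: $1 - \beta_0 - \beta_1 X_i$ with probability $p_i$ and $-\beta_0 - \beta_1 X_i$ with probability $1 - p_i$, which means it is a Bernoulli, not a normal, random variable. This is not a serious issue since, for large samples, the OLS estimators are approximately normally distributed because of the Central Limit Theorem. See Gujarati [2003] for a discussion.

Now consider the variance of the disturbance given by

$$
\begin{aligned}
V(\epsilon_i) &= E\left[\epsilon_i - E(\epsilon_i)\right]^2 \\
&= E(\epsilon_i^2) \\
&= (1 - \beta_0 - \beta_1 X_i)^2 p_i + (-\beta_0 - \beta_1 X_i)^2 (1 - p_i) \\
&= (1 - p_i)^2 p_i + p_i^2 (1 - p_i) \\
&= p_i (1 - p_i)
\end{aligned}
$$

where the second line follows from $E(\epsilon_i) = 0$ and the fourth line follows from $E(Y_i) = \beta_0 + \beta_1 X_i = p_i$. The variance changes as $X_i$ changes so the disturbance is heteroskedastic. This is not too bad since weighted least squares can be used to make an adjustment. See Gujarati [2003] for a discussion.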

The big issue is that the estimated value of $Y_i$, $\hat{Y}_i$, may not lie in the range $0 \le \hat{Y}_i \le 1$ so you may predict something that cannot physically happen. You need a new variable or a transformation of the dependent variable to ensure the right magnitudes. That is, you need a probability model whose predicted probabilities satisfy $0 \le p_i \le 1$ for any value of $X_i$.

A cumulative distribution function, or CDF, defined as $\Pr(X_i \le x_i)$ will work. A model based on a CDF is

$$p_i = \Pr(Y_i = 1) = \frac{1}{1 + e^{-Z_i}}, \qquad Z_i = \beta_0 + \beta_1 X_i.$$

This is a logistic distribution CDF. For A/B testing, the $X_i$ is message A and message B and $Y_i$ is “Buy” and “Not Buy.”

What happens when $Z_i$ becomes large or small? Note that

$$p_i = \frac{1}{1 + e^{-Z_i}} = \frac{e^{Z_i}}{1 + e^{Z_i}}.$$

If $Z_i \to +\infty$, then $e^{-Z_i} \to 0$ and $p_i \to 1$. Similarly, if $Z_i \to -\infty$, then $p_i \to 0$.

Finally, note that $\frac{e^{Z_i}}{1 + e^{Z_i}} + \left(1 - \frac{e^{Z_i}}{1 + e^{Z_i}}\right) = 1$ so the probabilities add correctly.

This model can be given an economic interpretation as

$$p_i = \frac{e^{Z_i}}{1 + e^{Z_i}} = \frac{\text{choice}}{\text{no choice} + \text{choice}}.$$

The numerator represents the influence of the independent variables and thus represents “choice”. The “1” in the denominator represents “no choice”. The factor $1 + e^{Z_i}$ represents the total choice option. The model is a choice model in the family of conjoint, discrete choice, and MaxDiff models but in this case the choice set has only two options.

The model can be written as

$$\frac{p_i}{1 - p_i} = e^{Z_i}.$$

The ratio $p_i/(1 - p_i) = e^{Z_i}$ is the odds of choosing option j from the choice set of two options. Odds come from sporting events and show the likelihood of an event happening. Odds of 3 : 1 mean the event is 3x more likely to happen than not. The formula for the odds of any event happening is

$$\text{Odds} = \frac{\Pr(\text{Event})}{1 - \Pr(\text{Event})}.$$

Table 5.10 shows the relationship between probabilities and odds.

TABLE 5.10 This is a summary of the relationship between probabilities and odds.

| Probability of Event | Odds of Event |
|---|---|
| 0 - 0.5 | 0 - 1 |
| 0.5 - 1 | 1 - ∞ |

Taking the natural log of both sides of the odds, you get the “log odds”, or

$$L = \ln\left(\frac{p_i}{1 - p_i}\right) = Z_i = \beta_0 + \beta_1 X_i$$

where L is called the log odds or logit.[6] This is a logistic regression model, a member of a large regression family. The unknown parameters, $\beta_k$, k = 0, 1, of the choice probability can be estimated using maximum likelihood (ML).
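The links among the logistic CDF, the odds, and the logit can be illustrated numerically. This is a small sketch with hypothetical function names (`logistic`, `odds`, `logit`) and arbitrary values of Z chosen only to show the limiting behavior; it is not tied to any estimated model.

```python
import math


def logistic(z: float) -> float:
    """Logistic CDF: p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log odds; this inverts the logistic CDF."""
    return math.log(odds(p))


# Arbitrary illustrative values of Z: p moves toward 0 or 1 as Z becomes small or large.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    p = logistic(z)
    print(f"Z = {z:6.1f}   p = {p:.5f}   odds = {odds(p):.5f}   log odds = {logit(p):.5f}")
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, which is the boundary between the two rows of Table 5.10.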

Although you can estimate the logit parameters, $\beta_k$, k = 0, 1, you would have difficulty interpreting them. The parameter $\beta_1$ shows the change in the log odds when its associated variable’s value changes by one unit. This was not hard to understand for OLS because it is concerned with the change in Y for a change in X, the marginal effect. Now you have the change in the log odds. What does a change in log odds mean?

Assume a simple one-variable model for buying a product where X is discrete. Let X represent gender with the dummy coding: females = 0, males = 1. Then the log odds for females is

$$L_F = \beta_0 + \beta_1 \times 0 = \beta_0$$

and the log odds for males is

$$L_M = \beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1.$$

Exponentiating both log odds and forming the ratio of males to females gives

$$\frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}.$$

The exponentiation of $\beta_1$ is the ratio of the odds of males buying the product to the odds of females buying the product, that is, the odds ratio. If the odds ratio is, say, 3, then the odds of males buying the product are 3x the odds of females buying it.

The exponentiation rule is correct if dummy or indicator variable coding is used for the independent variable in estimation. Indicator coding uses “0” and “1” values for the coding where “0” represents the base. Recall that effects coding uses “-1” and “1” values where “-1” represents the base. The two coding schemes provide the same results; only the interpretations are different. Dummy coding shows the movement or difference from a base level while effects coding shows deviation from an overall mean. See Paczkowski [2018] for a thorough discussion of the two schemes. For the case considered here, the parameter estimate associated with indicator coding turns out to be twice the value of the parameter estimate based on effects coding. This means that if dummy coding is used, then the odds ratio is the exponentiated parameter; if effects coding is used, then it is the exponential of twice the parameter since $e^{\beta_1}/e^{-\beta_1} = e^{2\beta_1}$. That is,

$$\text{Odds Ratio} = \begin{cases} e^{\beta_1} & \text{for dummy coding} \\ e^{2\beta_1} & \text{for effects coding.} \end{cases}$$

The odds ratio also shows the degree of association between two variables, much like the correlation coefficient. This is summarized in Table 5.11.

A logit model was estimated for the Buy/Not Buy variable as a function of the message shown. The results are shown in Figure 5.11. The estimated parameter under effects coding is -0.3264781 while under indicator or dummy coding it is -0.652956. Notice that the indicator estimate is twice the effects estimate. The odds ratio using either (with the effects parameter multiplied by 2) is 0.52. This is interpreted as the odds of buying (“Yes”) versus not buying (“No”) when the message is “A” relative to those odds when it is “B.” This means that the odds of someone buying when the message is “A” are only about half the odds when it is “B”. Notice also that 1/0.52 = 1.92, which is the odds of buying when the message is “B” relative to when it is “A”; just the inverse. So the odds of buying are almost twice as high when message “B” is used rather than “A”. This 1.92 odds ratio value is the same one shown in Figure 5.10.
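The reported estimates can be checked without an iterative ML routine. With a single two-level predictor the logit model fits the two observed proportions exactly, so the ML estimates can be written in closed form from the log odds in Table 5.9. The sketch below relies on that shortcut rather than on the software used for Figure 5.11, and the coding of message A as 1 (with B as the base, coded 0 or -1) is an assumption chosen because it reproduces the signs reported above.

```python
import math

# Table 5.9 counts: message -> (buy, not buy)
counts = {"A": (234, 7284), "B": (504, 8166)}


def log_odds(buy: int, not_buy: int) -> float:
    """Observed log odds of buying."""
    return math.log(buy / not_buy)


logit_a = log_odds(*counts["A"])
logit_b = log_odds(*counts["B"])

# Dummy (indicator) coding, assuming X = 1 for message A and X = 0 for the base, B:
#   logit_b = beta0 and logit_a = beta0 + beta1
beta1_dummy = logit_a - logit_b

# Effects coding, assuming X = +1 for message A and X = -1 for message B:
#   logit_a = beta0 + beta1 and logit_b = beta0 - beta1
beta1_effects = (logit_a - logit_b) / 2.0

odds_ratio_a_vs_b = math.exp(beta1_dummy)  # equals exp(2 * beta1_effects)
odds_ratio_b_vs_a = 1.0 / odds_ratio_a_vs_b

print(f"dummy-coded estimate:   {beta1_dummy:.6f}")
print(f"effects-coded estimate: {beta1_effects:.7f}")
print(f"odds ratio A vs. B: {odds_ratio_a_vs_b:.2f}, B vs. A: {odds_ratio_b_vs_a:.2f}")
```

The dummy-coded value is exactly twice the effects-coded value, and both lead to the same odds ratio.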

TABLE 5.11 Odds ratio and association.

| Odds Ratio Value | Interpretation |
|---|---|
| Greater than 1.0 (upper bound is infinity) | Positive association. The larger the value, the stronger the positive association. |
| Equal to 1.0 | No association. The two variables are independent of one another. |
| Less than 1.0 (lower bound is 0) | Negative association. The smaller the value, the stronger the negative association. |

The 95% confidence intervals are based on

$$e^{\ln(OR) \pm 1.96 \times SE(\ln(OR))}$$

where OR is the odds ratio and the standard error of the log of the odds ratio is

$$SE(\ln(OR)) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

where “a”, “b”, “c”, “d” are the frequency counts in the four cells of the contingency table. For our example, a = 7284, b = 234, c = 8166, and d = 504. The square root of the sum of the inverses is 0.080730. Therefore, the upper limit for the 95% confidence interval is $e^{\ln(1.921212) + 1.96 \times 0.080730} = 2.25057$ as shown in Figure 5.10 and Figure 5.11. The same calculation holds for the lower limit except for the use of a negative sign. See Agresti [2002] for a discussion of the confidence interval calculation.
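The interval calculation can be sketched in a few lines. The cell counts and the 1.96 multiplier are the ones given in the text; the variable names are illustrative assumptions. The lower limit is computed the same way as the upper limit but with a negative sign.

```python
import math

# Cell counts of the contingency table as given in the text
a, b, c, d = 7284, 234, 8166, 504

odds_ratio = (d / c) / (b / a)  # odds of buying with message B relative to message A
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
z = 1.96  # multiplier for a 95% confidence interval

lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"odds ratio = {odds_ratio:.6f}")
print(f"SE of log odds ratio = {se_log_or:.6f}")
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

Because the whole interval lies above 1.0, it is consistent with the rejection of the Null Hypothesis by the chi-square tests.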

Experimental designs

An experimental design was used for conjoint, discrete choice, and MaxDiff studies. One should also be used here. What is the design? There are two possibilities, both involving a random assignment to each message but with a twist. For both possibilities, a visitor is randomly assigned to one of two messages. Suppose, however, that a visitor returns to the website as they might for an online store. For the first approach, a record is kept of their visits and so if they were randomly assigned to message “A” on the first visit, then they are assigned to “A” on each subsequent visit. For the second approach, a visitor is randomly assigned to one of the two messages regardless of whether it is their first, second, or tenth visit. Each visit is a random assignment. It is not clear which is preferable. The first requires more record keeping and will then be more costly and onerous. It will, however, ensure that the visitor is not inconvenienced, or given cause to become annoyed or angry, by seeing different messages each time they visit the website.[7]
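The two assignment schemes can be sketched as follows. This is only an illustration under stated assumptions: the visitor IDs are hypothetical, and the first scheme is implemented by hashing the ID, which is a substitute for the record keeping described above rather than the same mechanism. Hashing gives a repeatable assignment without a stored visit history, but it assumes a stable identifier is available for each visitor.

```python
import hashlib
import random

VARIANTS = ("A", "B")


def sticky_assignment(visitor_id: str) -> str:
    """First approach: a returning visitor always sees the same message.

    Hashing the (hypothetical) visitor ID yields a repeatable, roughly 50/50
    split, so no record of earlier visits has to be looked up.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return VARIANTS[digest[0] % 2]


def per_visit_assignment(rng: random.Random) -> str:
    """Second approach: every visit is an independent random assignment."""
    return rng.choice(VARIANTS)


rng = random.Random(2024)  # seeded only to make the illustration repeatable
for visit in range(1, 4):
    print(f"visit {visit}: sticky = {sticky_assignment('visitor-42')}, "
          f"per-visit = {per_visit_assignment(rng)}")
```

Under the first scheme the message shown to `visitor-42` never changes across visits, while under the second it can change from one visit to the next.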

5.1.4 Message delivery

Messages have to be delivered, deployed, dispersed, or spread through the market to be effective. Basically, you have to “get the word out.” At one time, this was easy.

FIGURE 5.11 These are the logit model fit results.

Billboards, large print ads, TV and radio spots were the only forms of message dissemination. There was also the time-honored method of word-of-mouth (WOM). Basically, one person told something to someone who in turn told someone else and so on. This type of message dissemination was inherently dyadic: involving a relationship between two people. It works well in small, local markets but is more difficult to use and justify in large dispersed markets. In addition, the wider the market, the higher the probability the original message would become distorted and degraded.

In our modern, high-tech environment, these forms have diminished in importance, although they are certainly still used. They have been either replaced at worst or subordinated at best to more digital and social media methods. The traditional forms are less important to today’s consumers who rely more on electronic, digital forms of communications. Most likely, this is due to the wider array of messaging channels such as chat rooms, emails, newsgroups, social media platforms, and so forth. See Christianson et al. [2008].

The various forms of social media are networks with complex interconnections and relationships. Marketing professionals, both academic and practitioners, and economists have been studying networks and their implications for a long time.

FIGURE 5.12 This shows two network hubs. See Leskovec et al. [2007] for a similar network chart.

Economists have mostly focused on the externalities generated by networks while marketers have focused on how to use them for practical purposes. See Mayer [2009] for a discussion of social networks in economics.

With the advent of social media, people in diverse disciplines have studied networks, trying to understand their complexities and characteristics. One major feature is that social media networks are scale-free, meaning network members are not homogeneously distributed. They are, instead, in clumps around central hubs or centralities. The example in Figure 5.12 shows clear groupings of people around a central person who is the hub. These hub people, because they touch so many others, can be influential both directly and indirectly. The collection of all hubs forms a set called the Key Influencer Set (KIS). The KIS contains all the people who have wide connections in the network and who, because of those connections, can greatly influence the thoughts, opinions, and behaviors of many others. They could be key opinion leaders (KOLs), writers such as columnists and bloggers, early adopters, and so forth. Christianson et al. [2008] refer to a key influencer as someone who has a lot of social ties within a social network.

A message sent to one member of a KIS can be spread to many other people just as a biological or computer virus can be spread to many others, and spread exponentially. This is the basis for viral marketing: the spreading of a message through a network by taking advantage of the network’s interconnections. Since the marketer has to send a message to only one person, the hub, who then starts a chain reaction throughout the network, the costs of message deployment are greatly reduced. The deployment cost is the cost of contacting a hub person, after which the marginal cost of deployment is zero. In addition, viral marketing is more effective because the influencers, those passing on the message, have credibility, are known entities (i.e., friends, coworkers, family members, recognized thought leaders, experts, etc.), and are viewed as believable, reliable, and trustworthy. A new product manager has every reason to want to use viral marketing: zero marginal cost, rapid dissemination of a message, and high credibility of those distributing the message. See Zhuang et al. [2013] for some discussion of viral marketing. See Barabasi and Albert [1999] for a discussion about network scaling.

The dispersion of a message from a hub via a social media network does not have the same message degradation issue as for the WOM dispersion in a small market because the social media dispersion is an electronic forwarding. Retweeting a tweet is the best example: the same tweet is spread and not changed as part of the forwarding.

Identifying a hub person, or actually identifying all members of a KIS, is a problem. A naive approach is to identify people meeting the characteristics of someone most likely to be a key influencer. Some characteristics are listed in Table 5.12. This is an interesting list but it does not lend itself to an operational method for identifying specific people. A higher level of aggregation of the characteristics of key influencers is shown in Figure 5.13 but this is also not useful for identifying specific people.

Identifying the KIS, as for the WOM, is not trivial. Zhuang et al. [2013] note that a naive, simplistic method is to scan a social media’s membership records for those with the most connections. This is naive because many people would belong to the same subset of the network, thus limiting their usefulness. A second approach is to look at all combinations of some people but the number of combinations would be huge, making this computationally intractable. See Zhuang et al. [2013] for a brief comment.

TABLE 5.12 This is a list of some characteristics of someone who might be classified as a key influencer or hub in a network. Partially based on Vollenbroek et al. [2014].

  • Active Mind
  • Trendsetter
  • Social Presence
  • Social Activity
  • Charismatic
  • Expertise
  • Communicative
  • Power
  • Shared Interests
  • Unique
  • Follow-up Activity
  • Innovative
  • Aware
  • Personal
  • Amount of Followers
  • Trustworthy
  • Early Adopter
  • Open-minded

FIGURE 5.13 This is a higher level of aggregation. Based on Hoffman et al. [2016].

Some social networks are huge, to say the least, with complex overlapping and interconnected relationships among their members. One person could be directly connected to 100 people, each of whom is directly connected to another 100 people, and so on. The number of pairs of connections is $N(N - 1)/2$. For N = 100, the number of pairs is 4,950; for N = 200, it is 19,900. In 2016, LinkedIn had 500 million members, with 106 million active, in 200 countries. Its active user base has been estimated as 260 million in March 2017.[8] For N of this size, as it is for all social media networks, the number of connections is astronomical. Despite this astronomical number, you have to identify only a small set of people, a key influential set (KIS), to start the epidemic.
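The growth in the number of pairs can be seen with a short calculation. The function name `connection_pairs` is an illustrative assumption; the first two network sizes are the ones used in the text and the third is the 106 million active-member figure quoted above.

```python
def connection_pairs(n: int) -> int:
    """Number of distinct pairs among n network members: n(n - 1)/2."""
    return n * (n - 1) // 2


for n in (100, 200, 106_000_000):
    print(f"N = {n:,}: {connection_pairs(n):,} pairs")
```

Doubling N from 100 to 200 roughly quadruples the number of pairs, which is why the count becomes unmanageable for networks with millions of members.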

Zhuang et al. [2013] propose selecting a KIS of M users by identifying those with the greatest number of connections that are not covered by other members of the set; that is, they are unduplicated. To do this, each member of the social network is first ranked by their number of connections. The one with the largest number is added to the KIS. That person and all his/her connections are then deleted from the main social media database. After the deletion, the now smaller database is sorted again by the number of connections each person has and the one with the largest number of connections is added to the KIS. That person and his/her connections are then deleted from the main social media database. This process is continued until either a target number of influencers M in the KIS is reached or the main database is depleted. In either case, the members of the KIS have the largest number of unique friends; they each reach the largest number of people without any duplication of coverage. This is the TURF analysis I described in Chapter 4.
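The selection steps just described can be sketched as a simple greedy loop. This is a minimal reading of the procedure as summarized here, not the authors' implementation: the network is assumed to be a dictionary mapping each member to the set of members they are connected to, the toy network and names are hypothetical, and ties are broken alphabetically only to make the result repeatable.

```python
def select_kis(connections, m):
    """Greedy selection of up to m key influencers with unduplicated reach.

    connections: dict mapping each member to the set of members they are
    connected to (assumed symmetric).
    """
    remaining = set(connections)  # the "main database" still available
    kis = []
    while remaining and len(kis) < m:
        def reach(person):
            # Count only connections that have not already been deleted.
            return len(connections[person] & remaining)

        # Largest remaining reach first; alphabetical order breaks ties.
        best = min(remaining, key=lambda person: (-reach(person), person))
        kis.append(best)
        # Delete the selected person and all of his/her connections.
        remaining -= connections[best] | {best}
    return kis


# Hypothetical toy network with two hubs, "a" and "e", linked through "d"
network = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d", "f", "g"},
    "f": {"e"},
    "g": {"e"},
}

print(select_kis(network, m=2))
```

In this toy network the loop stops after two selections either way, because the database is depleted once both hubs and their connections have been removed.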

Christianson et al. [2008] outline a system and methods (an “invention”) for developing, testing, and tracking messages targeted to influencers for the WOM spreading of messages. The heart of the method is surveys with three key questions to get at WOM:

  • 1. purchase intent;
  • 2. message advocacy; and
  • 3. message amplification.

The purchase intent question could be a 5-point Likert Scale question such as “How likely are you to buy this product?” A message advocacy question could ask how likely someone is to share information about the product. Message amplification measures “the potential WOM spread of the message.”[9] A question could ask how many friends or family members the respondent would tell about the product. Christianson et al. [2008] suggest a question such as “Of your 10 friends whom you talk with most often, how many of them would you tell about this idea?”

For these questions, and others of their ilk, scores are calculated for the responses for each survey respondent. If five-point Likert Scales are used, simple means over all respondents can be calculated. Another approach is to assign weights to each point of the scales, the weights dependent on how consumers in past studies responded to similar questions and then eventually took the required action. For example, for purchase intent, if, as mentioned by Christianson et al. [2008], 100% of the people in past studies who said they would buy actually did buy and 50% who said they are not likely to buy eventually did buy, then the weights of 1.0 and 0.5 can be applied to the responses for these response options. A weighted average index or score could then be calculated.
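A weighted score of this kind can be sketched as the average of the weights attached to each respondent's answer. Only the 1.0 and 0.5 weights come from the example above; the scale labels, the other three weights, and the sample responses are hypothetical placeholders that would, in practice, come from a database of past studies.

```python
# Weight = share of past respondents choosing that option who eventually bought.
# Only the 1.0 and 0.5 values are from the example in the text; the labels and
# the remaining weights are hypothetical.
PURCHASE_INTENT_WEIGHTS = {
    "definitely would buy": 1.0,
    "probably would buy": 0.8,       # hypothetical
    "might or might not buy": 0.6,   # hypothetical
    "not likely to buy": 0.5,
    "definitely would not buy": 0.1, # hypothetical
}


def weighted_score(responses, weights):
    """Weighted average index: mean of the weights of the chosen options."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(weights[r] for r in responses) / len(responses)


# Hypothetical responses from four survey respondents
sample = [
    "definitely would buy",
    "not likely to buy",
    "not likely to buy",
    "definitely would buy",
]
print(f"Weighted purchase intent score: {weighted_score(sample, PURCHASE_INTENT_WEIGHTS):.2f}")
```

The same function could be applied to the advocacy and amplification questions with their own weight tables.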

An interesting feature of the Christianson et al. [2008] method is the maintenance of a database of prior surveys, messages, and responses. This is the potential source for the weights mentioned above. It can also be the source for comparing results of a current study to those of past studies.

Other methods rely on principles from Graph Theory, a mathematical subdiscipline concerned with analyzing the properties of graphs, but not any graph. The graphs of Graph Theory are network graphs, such as the ones shown in Figure 5.12. There are (at least) four centrality identification methods that have been developed for such networks. Landherr et al. [2010] discussed these methods and reviewed their strengths and weaknesses. The four methods are:

Degree of Centrality (DC): the number of direct contacts of any one node in a network. The larger the degree, the wider the distribution of information. This is a simple method, one that is easily interpretable and intuitive for nontechnical people.

Closeness Centrality (CC): how close one node is to the other nodes in the network, based on the distance from that node to the others. Nodes that are closer allow faster, more efficient distribution of information. Close friends, relatives, and associates are examples of close nodes while distant relatives or people you rarely contact are examples of distant nodes.

Betweenness Centrality (BC): the extent to which a node lies on the paths between other pairs of nodes, that is, how often it is an intervening layer between them. The fewer the number of intervening layers between two nodes, the more direct and undistorted the information flow between the two nodes. Also, information distribution would be faster the fewer the layers the information has to flow through.

Eigenvector Centrality (EC): the degree of “well-connectedness” of a node. A node is well-connected if it is connected to other nodes that are themselves well-connected. Information distribution would be wider and faster for high EC.

Table 5.13 summarizes these measures.
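Two of these measures are simple enough to sketch with the standard library alone. The code below computes degree centrality as a raw count of direct contacts and closeness centrality as the inverse of the average shortest-path distance to the reachable nodes, which is one common formulation and not necessarily the exact one reviewed by Landherr et al. [2010]. The two-hub network is hypothetical; betweenness and eigenvector centrality are omitted to keep the example short.

```python
from collections import deque


def degree_centrality(graph):
    """Number of direct contacts of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """Inverse of the average shortest-path distance to all reachable nodes."""
    result = {}
    for source in graph:
        # Breadth-first search gives shortest-path distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical network with two hubs, "a" and "e", linked through "d"
graph = {
    "a": {"b", "c", "d"},
    "b": {"a"},
    "c": {"a"},
    "d": {"a", "e"},
    "e": {"d", "f", "g"},
    "f": {"e"},
    "g": {"e"},
}

degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
for node in sorted(graph):
    print(f"{node}: degree = {degree[node]}, closeness = {closeness[node]:.3f}")
```

In this toy network the two hubs have the highest degree, but the node linking them has the highest closeness, which shows that the measures can point to different people.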

Another approach to determining the KIS is based on association rules, a methodology used in market research to determine the items that “go together” in a shopping cart. The set of items is called an item set or market basket. An example of an item set is hot dogs, hot dog buns, mustard, relish. These are four items someone planning a picnic typically buys. Erlandsson et al. [2016] argue that the methodology used to develop item sets can also be used to identify members of a KIS. See Erlandsson et al. [2016] for a discussion. Also see Qiao et al. [2017] for another approach based on entropy theory. There are many other approaches that are not documented in any professional literature because they are proprietary to consultants. These are basically black boxes that are most likely some form of one of the methods outlined here.

TABLE 5.13 Summary of centrality measures.

| Centrality Measure | Characteristic |
|---|---|
| Degree of Centrality (DC) | Widespread of information |
| Closeness Centrality (CC) | Less degradation of information; faster, more efficient spreading of information |
| Betweenness Centrality (BC) | More direct distribution of information; less distortion; higher trust/greater willingness to forward information to next layer |
| Eigenvector Centrality (EC) | Widespread and fast |

Once a KIS is identified, it has to be used in some process. A simple, naive viral marketing process is illustrated in Figure 5.14. Data on a social network (e.g., Facebook) is collected and then sent through a KIS engine that applies one of the above methods. A KIS is produced and then messages are sent to all members of the KIS. A more complex approach, illustrated in Figure 5.15, shows the process from Figure 5.14 on the right but with the “Deploy to Social Media” box moved and a Scoring Engine inserted. The left side of the figure shows a typical direct marketing process involving predictive modeling. This process is coupled with the KIS process because the KIS could be too large to be useful; only the most important members of the KIS have to be used. The process on the left takes the KIS as input, does the predictive modeling of the KIS members, and returns the model results to the Scoring Engine. Only a subset of the KIS is contacted. Basically, the whole process takes the social media database (which could be large, as for LinkedIn with over 100 million active members as noted above) and targets a smaller portion, the most influential social media members.

FIGURE 5.14 A simple application of a KIS involves sending messages to all members of the KIS.

FIGURE 5.15 A more advanced application of a KIS involves sending messages to a smaller subset of the members of the KIS.

  • [1] Likelihood to buy; • Likelihood to recommend;
 