Measurement Issues in the Uncanny Valley: The Interaction between Artificial Intelligence and Data Analytics
Artificial intelligence (AI) researchers call it “the uncanny valley”: numerous studies indicate that people fear and distrust machines that seem nearly human, but not quite. There are many theories about this, but nothing conclusive (“Uncanny Valley,” Wikipedia, 2020).
The phenomenon can be extended, admittedly without strong empirical evidence of its existence and extent, to many subject areas relating to AI—or machine learning (ML). The latter term has become more popular than the original among researchers still queasy about the history, over the past 50 years or so, of AI being oversold and over-promised, with disappointing outcomes. In particular, it is worthwhile to address the question of how well we know whether an AI/ML system has performed satisfactorily. If it did not work exactly as its developers predicted, how good was it? How sure are we about that assessment? In short, the performance of AI/ML systems is a subject area that clearly requires better measurement and assessment—which, of course, is exactly what good data analytics is about.
The dependence operates in the opposite direction, as well. There are many efforts, recent and ongoing, to improve data analysis using AI/ML techniques. Reviewing and assessing these efforts is well beyond the scope of this chapter—and the author’s expertise. What is relevant, however, is the resulting paradox: if AI/ML methods can quickly solve data analysis problems that defy traditional inference techniques, how sure can we be that the AI/ML solution is correct? How can we better assess whether to trust the AI/ML answer?
These are not merely theoretical issues. AI/ML systems are increasingly in use in a number of application areas, some of which involve literally life-or-death decisions. AI/ML systems are contemplated to direct swarm and counter-swarm warfare involving thousands of supersonic unmanned vehicles. This is a situation in which humans would be incapable of making judgments sufficiently quickly, much less translating those judgments into thousands of movement and action orders in seconds. Despite the best intentions and admonitions from the AI/ML research community, we could therefore soon have AI/ML systems making kill decisions. Knowing how much to trust the machines is critically important and becoming even more so.
A Momentous Night in the Cold War
The issue of how much to trust machines is not new. On the night of September 26, 1983, Lieutenant Colonel Stanislav Yevgrafovich Petrov had the watch command in the Soviet air and missile defense system. The shooting down of Korean Air Lines Flight 007 had occurred just three weeks before, and Soviet commanders were eager to improve their ability to distinguish true threats from false alarms. They had, therefore, upgraded their primary satellite-based sensor system. Now that system was reporting the launch of five ICBMs from the United States toward the Soviet Union.
All eyes in the command center were on LTC Petrov. He recounted later, “I felt as if I was sitting in a hot frying pan.” In a real attack, a few minutes’ delay could cost millions of the lives he was there to protect. However, reacting to a false alarm would precipitate an immense catastrophe, literally ending human civilization as we have known it.
Fortunately, he had another warning system, based on ground-based radars, that he could check. He decided to wait for the ground-based radars to confirm the launches. “I just had this intuition,” he explained later, “that the U.S. would not launch five missiles. Either they would launch one or two, to show they were serious, and give us an ultimatum, or they would launch all 1,053. So I did—nothing. I was afraid that informing my superiors, in accordance with my orders, would start a process that would acquire a momentum of its own” (Petrov obituary, New York Times, September 18, 2017).
LTC Petrov’s story has many implications, but one in particular is noteworthy here: he made the right decision because he acted counter to his orders and refused to trust the machine’s conclusion. His intuition was correct. He had contextual information the machine-based system did not. (The Soviets had analyzed and wargamed what attack profiles the US might employ in various situations.) But suppose the Soviets, or the Americans, or anyone else developed a new detection and warning system that was more powerful, more reliable, and arguably more trustworthy. How could such a system be taught the intuition on which LTC Petrov relied? How would we know that the system had enough such intuition to be trusted? At this time, these questions remain unanswered in AI/ML research.
At least it is possible in kinetic combat to assess some results quickly. In non-kinetic conflict, such as economic and diplomatic confrontations, effects take much longer to appear and are then much harder to link back to causes. In information systems conflicts, the difficulty is even greater. Douglas W. Hubbard, a well-known expert on measurement and risk assessment, declares, “The biggest risk in cybersecurity is not measuring cybersecurity risk correctly” (Hubbard, 2016). What he meant by this is that the threats are mostly events that have never happened, so estimating their probability of occurrence becomes a highly judgmental exercise. The use of Bayesian methods is promising, but then the analyst faces the danger of introducing overly influential biases in the choice of prior probability distributions and in the selection of the presumed process to be modeled. Training analysts to calibrate their estimates of uncertainty—that is, to have a much better understanding of how uncertain they are about their conjectures—improves the resulting risk assessments. In contrast, many popular methods and techniques increase estimators’ confidence in their estimates without actual improvement. If, as in cybersecurity, we cannot avoid relying on opinions, we can at least train the people forming those opinions to be more realistic about how likely they are to be right.
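The influence of the prior when the event of interest has never been observed can be illustrated with a small sketch. It assumes, purely for illustration, a Beta prior on the yearly probability of a breach and a record of ten years with no breaches; the function name, the priors, and the ten-year record are hypothetical and do not come from Hubbard or any real data set.

```python
def posterior_mean(prior_a, prior_b, events, trials):
    """Posterior mean of an event probability under a Beta(prior_a, prior_b)
    prior, after observing `events` occurrences in `trials` independent periods.
    (Standard Beta-Binomial conjugate update.)"""
    return (prior_a + events) / (prior_a + prior_b + trials)


# Hypothetical priors an analyst might choose; none is "the" correct one.
PRIORS = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "optimistic Beta(1, 99)": (1.0, 99.0),
    "pessimistic Beta(5, 5)": (5.0, 5.0),
}


def sensitivity_table(events, trials):
    """Posterior mean under each candidate prior, for the same observed data."""
    return {
        name: posterior_mean(a, b, events, trials)
        for name, (a, b) in PRIORS.items()
    }


if __name__ == "__main__":
    # Assumed record: zero breaches observed in ten years.
    for name, estimate in sensitivity_table(events=0, trials=10).items():
        print(f"{name:26s} -> estimated yearly probability {estimate:.4f}")
```

With identical data, the estimated yearly probability ranges from under 1 percent (optimistic prior) to 25 percent (pessimistic prior). When nothing has been observed, the answer is mostly the analyst's opinion restated, which is why calibrating that opinion matters.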
It is also useful to get senior decision-makers accustomed to the fact that analyses based on guesses do not produce certainty or even near-certainty. Moreover, analyses based on highly imprecise data cannot produce conclusions any more precise than the least precise input. Better trained analysts can more effectively insist on these principles when senior decision-makers push for unrealistic assurances. This is one of the areas in which better data analytics can drive better AI.
If anything, Hubbard seems to have been optimistic: it is not clear that many organizations with major responsibilities in the field can even define cybersecurity risk—or even just what cybersecurity is. Some influential organizations advocate developing and applying a maturity model to information management: if everything is forced into well-defined processes, and those processes are rigorously followed, then perfect security will ensue. This approach has a fundamental flaw. The most important fact anyone can know about a detection system is what it can’t detect. However, no metrics of observed adherence to defined processes yield any information on this all-important subject. Only challenge testing does. This, in turn, has the limitation that one cannot test the system’s response to challenges one never imagined. Still, with metrics depicting the range of challenges the system has detected, it is at least possible to compare systems and rank them in terms of demonstrated responsiveness to these specified types of threats.
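One way such challenge-test metrics might be tabulated is sketched below. The challenge types, the two systems, and their results are invented for illustration, and ranking by the unweighted mean of per-type detection rates is an assumption, not a method proposed in this chapter.

```python
from collections import defaultdict


def coverage(results):
    """Per-challenge-type detection rate.

    `results` is a list of (challenge_type, detected) pairs from challenge tests.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for challenge_type, detected in results:
        totals[challenge_type] += 1
        hits[challenge_type] += int(detected)
    return {c: hits[c] / totals[c] for c in totals}


def blind_spots(results):
    """Tested challenge types the system never detected."""
    return sorted(c for c, rate in coverage(results).items() if rate == 0.0)


def rank_systems(all_results):
    """Rank systems by the mean of their per-type detection rates, best first."""
    scores = {}
    for name, results in all_results.items():
        rates = coverage(results)
        scores[name] = sum(rates.values()) / len(rates)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical challenge-test outcomes for two hypothetical systems.
TESTS = {
    "system_a": [("phishing", True), ("phishing", True),
                 ("malware", True), ("malware", False),
                 ("insider", False), ("insider", False)],
    "system_b": [("phishing", True), ("phishing", False),
                 ("malware", True), ("malware", True),
                 ("insider", True), ("insider", False)],
}

if __name__ == "__main__":
    for name, score in rank_systems(TESTS):
        print(name, round(score, 3), "blind spots:", blind_spots(TESTS[name]))
```

The table only covers challenge types someone thought to test. A type missing from the results is not reported as a blind spot, which is exactly the limitation noted above.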