The Winograd Schema Challenge

In 2012, the computer scientists Hector Levesque, Ernest Davis, and Leora Morgenstern proposed a new challenge as a more appropriate test of intelligence. They presented it as an alternative to the TT, which, as we have seen, suffers from many limitations both as a test of “intelligence” and as a test of the adequacy of a simulative model of cognition. The test is called the Winograd Schema Challenge (named after Terry Winograd, the developer of the SHRDLU system described earlier in the book). The challenge consists of resolving referential ambiguity (in particular, the problem of anaphoric pronoun resolution) in a “Winograd schema”, which the authors describe as follows: “A pair of sentences differing in only one or two words and containing an ambiguity that is resolved in opposite ways in the two sentences and that requires the use of world knowledge and reasoning for its resolution” (Levesque, Davis, and Morgenstern, 2012: 557). In this context, for a program to pass the Winograd Schema Challenge means being able to solve this referential task with “near human levels of success; presumably close to 100%” on a collection of sentences built according to this schema. In such a collection, the correct resolution of the pronoun is obvious to humans but cannot be obtained by machines using classical statistical techniques. Levesque, Davis, and Morgenstern (2012: 554) describe the following four features of the questions used for the challenge: [1]

Examples of these questions (two Winograd schemas, each consisting of a pair of sentences) are as follows:

Sentence 1: I poured water from the bottle into the cup until it was full. What was full?

Answer 0: the bottle Answer 1: the cup

Sentence 2: I poured water from the bottle into the cup until it was empty. What was empty?

Answer 0: the bottle Answer 1: the cup

Sentence 3: Paul tried to call George on the phone but he wasn’t successful. Who was not successful?

Answer 0: Paul Answer 1: George

Sentence 4: Paul tried to call George on the phone but he wasn’t available. Who was not available?

Answer 0: Paul Answer 1: George
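The structure shared by these two schemas can be made concrete with a small data structure. The following Python code is only an illustrative sketch, not a format defined by Levesque, Davis, and Morgenstern: the class `WinogradSchema`, its field names, and the `{word}` placeholder convention are all assumptions introduced here. It encodes the two schemas above and relies on the defining property that swapping the special word for the alternate word flips the correct answer.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """Hypothetical container for one schema (a pair of sentences)."""
    template: str                 # sentence with a {word} slot
    question: str                 # question, possibly with a {word} slot
    candidates: Tuple[str, str]   # (Answer 0, Answer 1), in order of mention
    special_word: str
    alternate_word: str
    special_answer: int           # index of the correct answer for the special word

    def items(self) -> Iterator[Tuple[str, str, Tuple[str, str], int]]:
        """Yield (sentence, question, candidates, correct_index) for both variants."""
        variants = (
            (self.special_word, self.special_answer),
            # Replacing the special word flips the answer to the other party.
            (self.alternate_word, 1 - self.special_answer),
        )
        for word, answer in variants:
            yield (
                self.template.format(word=word),
                self.question.format(word=word),
                self.candidates,
                answer,
            )


bottle = WinogradSchema(
    template="I poured water from the bottle into the cup until it was {word}.",
    question="What was {word}?",
    candidates=("the bottle", "the cup"),
    special_word="full",
    alternate_word="empty",
    special_answer=1,
)

phone = WinogradSchema(
    template="Paul tried to call George on the phone but he wasn't {word}.",
    question="Who was not {word}?",
    candidates=("Paul", "George"),
    special_word="successful",
    alternate_word="available",
    special_answer=0,
)

for schema in (bottle, phone):
    for sentence, question, candidates, answer in schema.items():
        print(sentence, question, "->", candidates[answer])
```

Storing a single template per schema, rather than two independent sentences, makes explicit that the two variants differ only in the special word.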

As these two pairs of sentences show, in both cases the question asks for a correct disambiguation of the pronoun (“it” and “he” in the examples, respectively) by assigning it to the correct referent in the sentence. It is also evident from these examples that there is always a “special word” used in both the sentence and the question (in the examples, the special and alternate words are “full”, “empty”, “successful”, and “available”). The task goes beyond retrieval or statistical matching in large corpora (the authors say that it is “Google-proof”: i.e., an automatic system using Google and statistical techniques will not be able to reliably disambiguate these sentences) and requires resorting to some sort of explicit model and reasoning.

The main advantages of the Winograd Schema Challenge with respect to the TT are the following: (1) there is no subjectivity involved: the answer is clear from a “human perspective” and the results can be quantitatively and qualitatively evaluated; and (2) the challenge does not require an expert judge, since the wide range of questions concerns commonsense knowledge and reasoning that every speaker of a natural language can handle (in this case, English speakers).

The test also has some weaknesses. It is, again, language-centric and anthropocentric (as with the original TT), and therefore cannot be considered either a “general” test for intelligence or a general test for human intelligence. In addition, the same behavioural critiques apply here: the mechanisms through which a system passes the test need not be structurally valid and may be purely “functional”. As a consequence, the test is not suitable for evaluating simulative models of cognition and cannot be used to ascribe “intelligence” (in the human-like sense) to a system. It can, however, be used to compare the performance of the developed systems with that of humans (which for this task is close to 100%).

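Another weakness of the test concerns the binary choice. Because each question has only two candidate answers, a system that guesses at random is expected to answer 50% of the questions correctly. This artificially reduces the chance of error for the system and sets a very high baseline for a task meant to measure human-level performance.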
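The effect of the binary choice can be illustrated with a short calculation. The following Python code is a minimal sketch under stated assumptions: the function names `accuracy` and `prob_at_least_by_guessing` are hypothetical, and the guessing model assumes independent random guesses with a success probability of 0.5 per question. It computes a system's accuracy and the probability that pure guessing would reach a given score.

```python
from math import comb
from typing import Sequence


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of questions where the predicted index (0 or 1) matches the gold index."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def prob_at_least_by_guessing(n_items: int, n_correct: int, p: float = 0.5) -> float:
    """Binomial tail: probability of at least n_correct right answers out of n_items
    when each answer is an independent guess that succeeds with probability p."""
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(n_correct, n_items + 1)
    )


# Four questions (e.g. the two schemas above), one wrong answer.
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score)

# Chance that random guessing gets all four questions right.
print(prob_at_least_by_guessing(4, 4))
```

On a small collection, a high score is therefore weak evidence by itself: with four questions, random guessing answers all of them correctly one time in sixteen. The guessing probability only becomes negligible as the number of questions grows.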

[1]
1. Two parties are mentioned in a sentence by noun phrases. They can be two males, two females, two inanimate objects, or two groups of people or objects.
2. A pronoun or possessive adjective is used in the sentence in reference to one of the parties, but is also of the right sort for the second party. In the case of males, it is “he/him/his”; for females, it is “she/her/her”; for inanimate objects it is “it/it/its”; and for groups it is “they/them/their.”
3. The question involves determining the referent of the pronoun or possessive adjective. Answer 0 is always the first party mentioned in the sentence (but repeated from the sentence for clarity) and Answer 1 is the second party.
4. There is a word (called the “special word”) that appears in the sentence and possibly the question. When it is replaced by another word (called the “alternate word”), everything still makes perfect sense, but the answer changes.