The GenLink Algorithm

Creating a good linkage rule by hand, that is choosing and combining appropriate operators and thresholds, is non-trivial and time-consuming. One way to reduce this effort is to use supervised learning to generate links from existing reference links, which contain pairs of entities labeled as matches or non-matches. Creating such reference links is much easier than writing linkage rules as it requires no previous knowledge about similarity measures or the specific linkage rule format used by the system. Usually, reference links are created by domain experts who confirm or reject the equivalence of a number of entity pairs from the input datasets.

In [8] we have presented the GenLink algorithm for automatically learning linkage rules from a set of reference links. The algorithm is based on genetic programming and generates linkage rules that can be understood and further improved by humans.

Genetic programming starts with a randomly created population of individuals, where each individual is represented by a tree which is a potential solution to the given problem. In Silk, in order to reduce the search space, before generating the initial population we build, given a set of positive reference links, a list of property pairs which hold similar values, and then use this set to build a random linkage rule.

Starting with the initial population, the genetic algorithm breeds a new population by evolving selected linkage rules using the reproduction, crossover and mutation genetic operators. A fitness function is used to assign a value to each linkage rule which indicates how close the rule is to the desired solution. We use the fitness measure based on Matthews correlation coefficient, and penalize linkage rules based on their number of operators. Based on the fitness of each linkage rule, the selection method, tournament selection in our case, selects the rules to be evolved. The algorithm iteratively evolves the population until either a linkage rule has been found which covers all reference links or a configured maximum number of 50 iterations is reached.

The experimental evaluation [8] shows that the GenLink produces better results than the state-of-art genetic programming approaches for identity resolution.

The ActiveGenLink Algorithm

For most real-world datasets it is not feasible, however, to manually create reference links covering a big enough subset of pairs of entities. Moreover, in order for the supervised learning algorithms to perform well on unknown data, reference links need to include all relevant corner cases. For example, while for most pairs of movie descriptions comparing titles is enough to establish that the two refer to the same movie, there might exist variations in titles of the same movie, or different movies having the same title but different release dates. The user has to label a very large number of randomly selected pairs in order to include these rare corner cases reliably.

To reduce the number of candidates to be labeled, we employ active learning, which in our case means iteratively selecting the most informative entity pair to be labeled by the user as matching or non-matching. In [9] we introduced the ActiveGenLink algorithm, which evolves a population of candidate solutions iteratively while building a set of reference links. The algorithm starts with a random population of linkage rules and an initially empty set of reference links. In each iteration, it selects a link candidate for which the current population of linkage rules is uncertain, from a pool of unlabeled links using a so called query strategy. After the link has been labeled by a human expert, the algorithm evolves the population of linkage rules using the GenLink algorithm and the extended set of reference links.

The query strategy Silk implements is based on a query-by-committee strategy: the selected link candidate is determined from the voting of all members of a committee, which consists of the current linkage rules in the population. We take as a baseline the widely used query-by-vote-entropy, which selects the candidate for which the members in the committee disagree the most, and introduce an improved strategy as follows. Firstly, as the unlabeled links are not distributed uniformly across the similarity space, we aim at distributing the links onto different clusters by selecting links that, based on the Jensen-Shannon divergence, are different from already labeled links. Secondly, the voting committee, i.e. the evolved population of linkage rules, may contain suboptimal linkage rules that do not cover all reference links. We implement the restricted committee voting, in which only linkage rules which fulfill a specific reference link are allowed to vote. Our experiments show that the improved query strategy outperforms the query-by-vote-entropy baseline.

The performance of the ActiveGenLink algorithm was evaluated on the same datasets as we used to evaluate the supervised GenLink algorithm [8]. The results show that by labeling a small number of links, ActiveGenLink achieves a comparable performance to GenLink on the complete reference link set. One of the datasets on which the evaluation was performed is SiderDrugBank from the Ontology Alignment Evaluation Initiative (OAEI) 2010 data interlinking track[1].

Our evaluation showed that with ActiveGenLink only about 30 links had to be labeled until a linkage rule could be learned which achieves an F-measure similar to the one GenLink gets using all 1718 reference links.

Two other experiments were done on the datasets that have been used frequently to evaluate the performance of different record linkage approaches: Cora and Restaurant datasets[2]. The results show that labeling a small number of links is enough to reach high performance. In addition, we successfully evaluated how the learned linkage rules compare to rules manually created by a human expert for the same dataset, and studied the scalability of the proposed approach. For the details of all the evaluation experiments the reader is referred to [9].

In order to further improve the linking precision we have developed the Silk Free Text Preprocessor [12], an extension of Silk for producing a structured representation for linking the data that contains or is derived from free text.

  • [1]
  • [2] XML version: duplicate detection.html
< Prev   CONTENTS   Next >