Knowledge Base Creation, Enrichment and Repair
Abstract. This chapter focuses on transforming data to RDF and Linked Data, and on improving existing or extracted data, especially with respect to schema enrichment and ontology repair. Tasks concerning the triplification of data are mainly grounded on existing and well-proven techniques, which were refined during the lifetime of the LOD2 project and integrated into the LOD2 Stack. Triplification of legacy data, i.e. data not yet in RDF, represents the entry point for legacy systems to participate in the LOD cloud. While existing systems are often very useful and successful, there are notable differences between the ways knowledge bases and Wikis or databases are created and used. One of the key differences in content is the importance and use of schematic information in knowledge bases. This information is usually absent in the source system and therefore also in many LOD knowledge bases. However, schema information is needed for consistency checking and finding modelling problems. We present a combination of enrichment and repair steps to tackle this problem, based on previous research in machine learning and knowledge representation. Overall, the chapter describes how to enable tool-supported creation and publishing of RDF as Linked Data (Sect. 1) and how to increase the quality and value of such large knowledge bases when published on the Web (Sect. 2).
Linked Data Creation and Extraction
DBpedia, a Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia
Wikipedia is the 6th most popular website, the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages, which range in size from a couple of hundred articles up to 3.8 million articles (English edition). Besides free text, Wikipedia articles contain different types of structured data such as infoboxes, tables, lists, and categorization data. Wikipedia currently offers only free-text search capabilities to its users. Using Wikipedia search, it is thus very difficult to find all rivers that flow into the Rhine and are longer than 100 km, or all Italian composers who were born in the 18th century.
Fig. 1. Overview of DBpedia extraction framework
The DBpedia project [9, 13, 14] builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions in 111 languages. Wikipedia editions are extracted by the open source "DBpedia extraction framework" (cf. Fig. 1). The largest DBpedia knowledge base, which is extracted from the English edition of Wikipedia, consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The extracted knowledge is encapsulated in modular dumps as depicted in Fig. 2. This knowledge base can be used to answer expressive queries such as the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge base is also useful within further application domains such as data integration, named entity recognition, topic detection, and document ranking.
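The river question from the introduction can be phrased as a SPARQL query over such a knowledge base. The following is a minimal sketch, not the project's own tooling: the helper names `rivers_into` and `request_url` are hypothetical, and the terms `dbo:River`, `dbo:riverMouth` and `dbo:length` (assumed to be given in metres) are assumptions that should be checked against the DBpedia release being queried. The `format` parameter is likewise assumed to be understood by the endpoint.

```python
from urllib.parse import urlencode

# Assumed namespace prefixes; verify against the DBpedia release in use.
PREFIXES = """PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
"""


def rivers_into(river: str, min_length_km: float) -> str:
    """Build a SPARQL query for rivers that flow into `river` and exceed a length."""
    # Assumption: dbo:length values are in metres, so convert the threshold.
    min_length_m = min_length_km * 1000
    return PREFIXES + f"""SELECT ?river ?length WHERE {{
  ?river a dbo:River ;
         dbo:riverMouth dbr:{river} ;
         dbo:length ?length .
  FILTER (?length > {min_length_m:.0f})
}}
ORDER BY DESC(?length)"""


def request_url(query: str, endpoint: str = "https://dbpedia.org/sparql") -> str:
    """Encode the query as a GET URL for a SPARQL endpoint (no request is sent)."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


query = rivers_into("Rhine", 100)
url = request_url(query)
```

The sketch only builds the query text and the request URL. Sending the request and interpreting the result are left out, since the answer depends on the data available at the endpoint.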
Fig. 2. Overview of the DBpedia data stack.
The DBpedia knowledge base is widely used as a test-bed in the research community, and numerous applications, algorithms and tools have been built around or applied to DBpedia. Due to the continuous growth of Wikipedia and improvements in DBpedia, the extracted data provides an increasing added value for data acquisition, re-use and integration tasks within organisations. While the quality of extracted data is unlikely to reach the quality of completely manually curated data sources, it can be applied to some enterprise information integration use cases and has been shown to be relevant in several applications beyond research projects. DBpedia is served as Linked Data on the Web. Since it covers a wide variety of topics and sets RDF links pointing into various external data sources, many Linked Data publishers have decided to set RDF links pointing to DBpedia from their data sets. Thus, DBpedia became a central interlinking hub in the Web of Linked Data and has been a key factor for the success of the Linked Open Data initiative.
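To make the notions of "served as Linked Data" and "RDF links pointing into external data sources" concrete, the sketch below prepares a content-negotiated request for a resource URI and filters outgoing `owl:sameAs` links from N-Triples text. It is an illustration under stated assumptions: the function names are hypothetical, the `SAMPLE` triples are made up for the example (the `example.org` target is not a real DBpedia link), and the regular expression only handles triples whose three terms are all IRIs.

```python
import re
from urllib.request import Request

SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Simplified pattern: matches only triples of three IRIs, not literals or blank nodes.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")


def linked_data_request(resource_uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) a request asking for an RDF serialisation."""
    return Request(resource_uri, headers={"Accept": media_type})


def external_links(ntriples: str, home: str = "http://dbpedia.org/") -> list:
    """Return (subject, target) pairs for owl:sameAs links that leave `home`."""
    links = []
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == SAME_AS and not match.group(3).startswith(home):
            links.append((match.group(1), match.group(3)))
    return links


# Made-up sample data for illustration only.
SAMPLE = """<http://dbpedia.org/resource/Rhine> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/river/rhine> .
<http://dbpedia.org/resource/Rhine> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/River> .
"""

req = linked_data_request("http://dbpedia.org/resource/Rhine")
links = external_links(SAMPLE)
```

A production setting would use an RDF parser rather than a regular expression. The regex keeps the sketch free of third-party dependencies.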
The structure of the DBpedia knowledge base is maintained by the DBpedia user community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology unifies different template structures, both within single Wikipedia language editions and, currently, across 27 different languages. The maintenance of the different language editions of DBpedia is spread across a number of organisations, each of which is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee. The DBpedia Association provides an umbrella on top of all the DBpedia chapters and tries to support DBpedia and the DBpedia Contributors Community.
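The idea of mapping differently named template fields onto one ontology can be sketched as a lookup table. The following is a simplified stand-in written for illustration and does not reproduce the syntax of the community-maintained mappings: the template names, the field names and the function `map_infobox` are hypothetical, and the ontology terms are assumptions.

```python
# Hypothetical mappings: (language, template name) -> ontology class and properties.
MAPPINGS = {
    ("en", "Infobox river"): {
        "class": "dbo:River",
        "properties": {"length": "dbo:length", "mouth": "dbo:riverMouth"},
    },
    ("de", "Infobox Fluss"): {
        "class": "dbo:River",
        "properties": {"LÄNGE": "dbo:length", "MÜNDUNG": "dbo:riverMouth"},
    },
}


def map_infobox(lang: str, template: str, subject: str, fields: dict) -> list:
    """Translate raw infobox fields into triples that use ontology terms."""
    mapping = MAPPINGS.get((lang, template))
    if mapping is None:
        return []  # no mapping defined for this template yet
    triples = [(subject, "rdf:type", mapping["class"])]
    for key, value in fields.items():
        prop = mapping["properties"].get(key)
        if prop is not None:  # unmapped fields are skipped
            triples.append((subject, prop, value))
    return triples


en = map_infobox("en", "Infobox river", "dbr:Neckar", {"length": "362 km", "basin": "n/a"})
de = map_infobox("de", "Infobox Fluss", "dbr:Neckar", {"LÄNGE": "362 km"})
```

Both calls yield triples with the same class and the same `dbo:length` predicate, although the source templates name the field differently. This is the unification across language editions that the mappings are meant to provide.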