The acquisition of data from different sources

When acquiring data from different sources, two essential problems must be taken into consideration:

Data enrichment and fusion

Data acquisition frequently draws on several information sources. A mechanism is therefore needed to recognize when records from different sources describe the same instance, so that they can be fused. For example, one source could provide information on the identity of an artist while another provides detailed information on the artist's works.

Recognizing that the two sources refer to the same artist is an essential step toward fusing the data and providing a comprehensive representation of this artist. CIDOC CRM treats this identification problem as a data integration problem in which an entity can have different names, for example Mona Lisa and Gioconda, or Yalta and Jalta. CIDOC CRM allows several identities to be associated with the same resource using the class E41_Appellation and its subclasses, such as E82_Actor_Appellation and E44_Place_Appellation.

Figure 6.10. An extract of the detailed information on YALTA in the TGN thesaurus

Furthermore, CIDOC CRM allows the use of external knowledge sources such as TGN (Getty Thesaurus of Geographic Names) and ULAN (Union List of Artist Names). These provide detailed identification information on instances such as an author or a location, including preferred and alternate names in different languages.

The semantic web also offers mechanisms to resolve this problem. For example, OWL (Web Ontology Language) provides the property owl:sameAs to state explicitly that two resources denote the same instance. Establishing such links can be automated using techniques such as machine learning, ontology alignment and semantic similarity.
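To illustrate how owl:sameAs statements support fusion, the following minimal sketch groups identifiers that are linked, directly or transitively, and merges their descriptions into one record. It is not the implementation used in our system: the identifiers, the records and the function names find and fuse are hypothetical, and a real application would read the links from an RDF store.

```python
from collections import defaultdict

# Hypothetical identifiers and records, invented for illustration only.
SAME_AS = [
    ("sourceA:artist/42", "sourceB:person/leonardo"),
    ("sourceB:person/leonardo", "sourceC:agent/7"),
]

RECORDS = {
    "sourceA:artist/42": {"name": "Leonardo da Vinci", "born": "1452"},
    "sourceB:person/leonardo": {"works": ["Mona Lisa"]},
    "sourceC:agent/7": {"alt_names": ["Leonardo di ser Piero da Vinci"]},
    "sourceA:artist/99": {"name": "Raphael"},
}


def find(parent, x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x


def fuse(records, same_as):
    """Merge the records whose identifiers are connected by sameAs links."""
    parent = {}
    for a, b in same_as:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_b] = root_a
    fused = defaultdict(dict)
    for identifier, fields in records.items():
        # Simplification: if two sources give the same field,
        # the record processed last overwrites the earlier value.
        fused[find(parent, identifier)].update(fields)
    return dict(fused)


if __name__ == "__main__":
    for representative, description in fuse(RECORDS, SAME_AS).items():
        print(representative, description)
```

Because owl:sameAs is symmetric and transitive, the three identifiers end up in a single group even though the first and third are never linked directly. Conflicting values for the same field would need an explicit resolution policy, which this sketch leaves out.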

In fact, more and more data are being published according to linked data principles. This is very useful because data in different knowledge bases are linked to one another and can be accessed programmatically. For example, a SPARQL query on DBpedia produces a result that includes links to resources in YAGO, which in turn refer to other resources in WordNet, and so on.
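As a sketch of such programmatic access, the code below builds a SPARQL query asking for the owl:sameAs links of one DBpedia resource and encodes it as a GET request URL following the SPARQL protocol. The helper names same_as_query and request_url are hypothetical, the public endpoint address is assumed to be https://dbpedia.org/sparql, and the request itself is deliberately not sent, so the example runs without network access.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed address of the public DBpedia SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def same_as_query(resource_iri, limit=20):
    """Build a query listing the resources declared identical to resource_iri."""
    return (
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?other WHERE {\n"
        f"  <{resource_iri}> owl:sameAs ?other .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    query = same_as_query("http://dbpedia.org/resource/Leonardo_da_Vinci")
    print(query)
    print(request_url(query))
```

The resulting URL can be passed to any HTTP client. The result format is normally negotiated through the Accept header, for instance application/sparql-results+json, and each returned IRI can then be dereferenced to follow the links into other knowledge bases.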

We have established dynamic mapping between the CIDOC CRM ontology and the knowledge in various knowledge bases using our automatic classification approach, which we briefly described in section

This is very useful for enriching local data and interconnecting them with external knowledge in a harmonized way, creating a rich semantic network that can be exploited by various semantic web technologies.

As an example, a user wishing to add information on the painter of the Mona Lisa in our local knowledge base could start typing the name Leonardo da Vinci in the “Painter name” form field to obtain a list of artist names extracted dynamically from various external knowledge bases [AMA 16].

We only propose the names of artists for this form field, which means that the user obtains semantically disambiguated suggestions classified according to the CIDOC CRM ontology used in our knowledge base.
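The filtering behind such a form field can be sketched as follows. This is a simplified illustration, not our actual implementation: the candidate list is invented, the class labels E21_Person and E53_Place are chosen only as examples of CIDOC CRM classes, and the Candidate type and suggest function are hypothetical names. In the real system the candidates come from external knowledge bases and are classified automatically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    label: str      # name shown to the user
    crm_class: str  # CIDOC CRM class assigned by the classification step
    source: str     # knowledge base the candidate was extracted from


# Invented sample data, for illustration only.
CANDIDATES = [
    Candidate("Leonardo da Vinci", "E21_Person", "ULAN"),
    Candidate("Vincent van Gogh", "E21_Person", "ULAN"),
    Candidate("Vinci", "E53_Place", "TGN"),
]


def suggest(prefix, expected_class, candidates=CANDIDATES):
    """Return labels that start with prefix and belong to expected_class."""
    needle = prefix.casefold()
    return sorted(
        c.label
        for c in candidates
        if c.crm_class == expected_class and c.label.casefold().startswith(needle)
    )


if __name__ == "__main__":
    # The place "Vinci" is excluded because the field expects a person.
    print(suggest("vin", "E21_Person"))
```

Restricting the suggestions to the class expected by the field is what removes the ambiguity: the same typed prefix yields a painter in the "Painter name" field and would yield a town in a field that expects a place.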
