Benchmarking Semantic Named Entity Recognition Systems
Named entity recognition (NER) has become one of the most widely used means of information extraction and content enrichment. NER systems detect text fragments that identify entities and classify those entities into a set of pre-defined categories. This is usually a fixed set of coarse-grained classes such as the CoNLL set (PERSON, ORGANIZATION, LOCATION, MISCELLANEOUS), or classes from an ontology, such as the DBpedia Ontology. However, a recent trend is for NER systems such as DBpedia Spotlight to go beyond this type classification and also uniquely identify the entities using URIs from a knowledge base such as DBpedia or Wikipedia. During LOD2, we have created a collection of tools belonging to this new class of Wikification, Semantic NER or Entity Linking systems and contributed it to the Wikipedia page about Knowledge Extraction.
While these Semantic NER systems are gaining popularity, there is as yet no overview of their performance in general or in specific domains. To fill this gap, we have developed a framework for benchmarking NER systems. Since an NER system might perform better in one domain and worse in another, we have also developed two datasets annotated with entities, the News dataset and the Tweets dataset. The Tweets dataset consists of a very large number of short texts (tweets), while the News dataset consists of standard-length news articles.
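The text does not state which measures the benchmarking framework reports. As a minimal sketch, assuming the common choice of precision, recall and F1 over exact matches of character offsets and entity URI, a single document could be scored as follows. The `Annotation` type, the `score` function and the sample annotations are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One entity mention: character offsets plus the linked entity URI."""
    begin: int
    end: int
    uri: str


def score(gold, predicted):
    """Return (precision, recall, f1) counting only exact matches.

    Assumption: a prediction is correct only if its offsets and its URI
    are both identical to a gold annotation.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold and system annotations for one short text.
gold = [
    Annotation(0, 6, "http://dbpedia.org/resource/Berlin"),
    Annotation(24, 31, "http://dbpedia.org/resource/Germany"),
]
predicted = [
    Annotation(0, 6, "http://dbpedia.org/resource/Berlin"),
    Annotation(10, 14, "http://dbpedia.org/resource/City"),
]
precision, recall, f1 = score(gold, predicted)
```

Exact matching is the strictest option; a framework could equally credit overlapping spans or score the class label separately from the URI, which would change the numbers for the same system output.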
A prerequisite for benchmarking different NER tools is achieving interoperability at the technical, syntactic and conceptual levels. Regarding technical interoperability, most NER tools provide a REST API over the HTTP protocol. At the syntactic and conceptual levels we opted for the NIF format, which directly addresses both aspects. Syntactic interoperability is addressed by using RDF and OWL as standards for a common data model, while conceptual interoperability is achieved by identifying the entities and the classes using globally unique identifiers. For identification of the entities we opted for re-using URIs from DBpedia. Since different NER tools classify entities with classes from different classification systems (classification ontologies), we align those ontologies to the DBpedia Ontology.
In the future, we hope to exploit the availability of interoperable NIF corpora as described in .
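Returning to the class alignment step described above, the idea can be sketched as a lookup from a tool's native class labels to class URIs in the DBpedia Ontology. The mapping table below is an illustrative assumption for the CoNLL label set, not the alignment actually used in the framework, and `align_class` is a hypothetical helper name. Labels without a counterpart fall back to `owl:Thing`.

```python
DBO = "http://dbpedia.org/ontology/"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"

# Hypothetical alignment table: CoNLL-style labels to DBpedia Ontology classes.
CONLL_TO_DBPEDIA = {
    "PERSON": DBO + "Person",
    "ORGANIZATION": DBO + "Organisation",
    "LOCATION": DBO + "Place",
}


def align_class(label, mapping=CONLL_TO_DBPEDIA):
    """Map a tool-specific class label to a DBpedia Ontology class URI.

    Unknown labels (for example MISCELLANEOUS) map to owl:Thing so that
    every annotation still carries a globally unique class identifier.
    """
    return mapping.get(label.upper(), OWL_THING)


person_class = align_class("person")
misc_class = align_class("MISCELLANEOUS")
```

Once every tool's output is expressed with the same class URIs, results from tools with different native classification schemes can be compared against one gold standard.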