The ever-growing world of data is largely unstructured. It is estimated that information sources such as books, journals, documents, social media content and everyday news articles constitute as much as 90% of it. Making sense of all this data and exposing the knowledge hidden in it, while minimizing human effort, is a challenging task that often holds the key to new insights that can prove crucial to one's research or business. Still, understanding context and finding related information are hurdles that language technologies have yet to overcome.
Rozeta is a multilingual NLP and Linked Data tool wrapped around STRUTEX, a structured text knowledge representation technique used to extract words and phrases from natural language documents and represent them in a structured form. Originally designed for Wolters Kluwer Deutschland, to organize and search their database of court cases (based on numerous criteria, including case similarity), Rozeta provides automatic extraction of STRUTEX dictionaries in Linked Data form, semantic enrichment through link discovery services, a manual revision and authoring component, a document similarity search tool and an automatic document classifier (Fig. 3).
The Rozeta dictionary editor (Fig. 4) allows for a quick overview of all dictionary entries, as well as semi-automatic (supervised) vocabulary enrichment/link discovery and manual cleanup. It provides a quick-filter/AJAX search box that helps users swiftly browse through the dictionary by retrieving the entries that start with a given string, on-the-fly. The detailed view for a single entry shows its URI, text, class, any existing links to relevant LOD resources, as well as links to the files the entry originated from. Both the class and file origin information can be used as filters, which can help focus one's editing efforts on a single class or file, respectively.
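The quick-filter behaviour described above can be illustrated with a minimal sketch. It is not Rozeta's implementation: the `Entry` record, its field names and the `filter_entries` function are hypothetical, and case-insensitive prefix matching is an assumption, since the text only says that entries starting with a given string are retrieved.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical dictionary entry; the field names are assumptions."""
    uri: str
    text: str
    cls: str          # entry class, e.g. "noun" or "phrase"
    source_file: str  # file the entry was extracted from


def filter_entries(
    entries: Iterable[Entry],
    prefix: str = "",
    cls: Optional[str] = None,
    source_file: Optional[str] = None,
) -> List[Entry]:
    """Keep entries whose text starts with `prefix`, optionally
    restricted to a single class and/or a single source file."""
    needle = prefix.casefold()  # assumption: matching ignores case
    return [
        e
        for e in entries
        if e.text.casefold().startswith(needle)
        and (cls is None or e.cls == cls)
        and (source_file is None or e.source_file == source_file)
    ]
```

Calling `filter_entries(entries, "cou", cls="noun")` would return only the noun entries whose text begins with "cou", which mirrors combining the search box with the class filter.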
Fig. 3. Rozeta: dictionary selection
Fig. 4. Rozeta: dictionary management
To aid the user in enriching individual entries with links to other relevant linked data sources, Wiktionary2RDF recommendations are retrieved automatically. The user can opt for one of the available properties (skos:exactMatch and skos:relatedMatch) or generate a link using a custom one. Furthermore, the Custom link and More links buttons give the user the ability to link the selected dictionary phrase to any LOD resource, either manually, or by letting the system provide them with automatic recommendations through one of the available link discovery services, such as Sindice or a custom SPARQL endpoint.
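As a rough illustration of what such a link amounts to in Linked Data terms, the sketch below serializes a single link as an N-Triples statement. The `make_link` function and the `QUICK_LINKS` table are hypothetical and do not come from Rozeta; only the SKOS namespace and the two property names are standard. A custom property is assumed to be passed as a full IRI.

```python
# Standard SKOS namespace; the two quick-link properties named in the text.
SKOS = "http://www.w3.org/2004/02/skos/core#"
QUICK_LINKS = {
    "skos:exactMatch": SKOS + "exactMatch",
    "skos:relatedMatch": SKOS + "relatedMatch",
}

# Characters that may not appear unescaped inside an N-Triples IRI.
_FORBIDDEN = set('<>"{}|\\^` ')


def make_link(entry_uri: str, target_uri: str, prop: str = "skos:exactMatch") -> str:
    """Return one N-Triples line linking a dictionary entry to a LOD resource.

    `prop` is either one of the quick-link names above or, by assumption,
    the full IRI of a custom property.
    """
    predicate = QUICK_LINKS.get(prop, prop)
    for iri in (entry_uri, predicate, target_uri):
        if not iri or any(ch in _FORBIDDEN for ch in iri):
            raise ValueError("not a usable IRI: {!r}".format(iri))
    return "<{}> <{}> <{}> .".format(entry_uri, predicate, target_uri)
```

For example, `make_link("http://example.org/dict/court", "http://dbpedia.org/resource/Court")` yields a `skos:exactMatch` triple between the two resources; the example.org URI is a placeholder.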
Fig. 5. Rozeta: text annotation and enrichment
Text Annotation and Enrichment
The text annotation and enrichment module, used for highlighting the learned vocabulary entries in any natural language document and proposing potential links through custom services, can be launched from the dictionary editor, or used as a stand-alone application.
The highlighted words and phrases hold links to the corresponding dictionary entry pages, as well as linking recommendations from DBpedia Spotlight, or custom SPARQL endpoints (retrieved on-the-fly; sources are easily managed through an accompanying widget). The pop-up widget also generates quick-link buttons (skos:exactMatch and skos:relatedMatch) for linking the related entries to recommended Linked Open Data resources (Fig. 5).
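The highlighting step can be sketched as a dictionary lookup over the input text. The code below is a simplified illustration under stated assumptions, not the method Rozeta uses: the function names are hypothetical, the dictionary is assumed to be a plain mapping from phrase to entry URI, matching is case-insensitive on word boundaries, and longer phrases are preferred over shorter ones that they contain.

```python
import html
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entry URI)


def find_annotations(text: str, dictionary: Dict[str, str]) -> List[Span]:
    """Locate dictionary phrases in `text`, longest phrase first."""
    if not dictionary:
        return []
    # Sorting by length makes "court case" win over "court" at the same position.
    phrases = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {p.lower(): uri for p, uri in dictionary.items()}
    spans = []
    for m in pattern.finditer(text):
        uri = lookup.get(m.group(0).lower())
        if uri is not None:  # skip rare case-folding mismatches
            spans.append((m.start(), m.end(), uri))
    return spans


def render_html(text: str, spans: List[Span]) -> str:
    """Wrap each found span in a link to its dictionary entry page."""
    out, pos = [], 0
    for start, end, uri in spans:
        out.append(html.escape(text[pos:start]))
        out.append('<a href="{}">{}</a>'.format(
            html.escape(uri, quote=True), html.escape(text[start:end])))
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

In this sketch the link recommendations mentioned above are left out; they would be fetched per span from whichever service is configured.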