Corpus specificities

The corpus used for this work is an evaluation report of around 800 pages. It is made up of six volumes describing the scientific activities of each of the laboratory’s teams, with three volumes collecting financial data relating to human resources. To build the knowledge class hierarchy, we used only the first volume of this report, which runs to 300 pages and describes the laboratory’s activities from 2007 to 2010.

This document was compiled from the contributions of each of the laboratory’s teams to a common report, which gathers, for each team, a team presentation, its contractual activity and a presentation of its scientific themes. Its advantage is that it presents text with the same structure for all teams. Which parts are used depends on the information need: the “team presentation” part can lend itself to specific work on named entities such as “proper nouns” and “localization”, for example, while other parts can be used to extract other types of information (know-how, skills, competences, innovation needs, etc.). For our study, we chose to use only the “scientific themes” part, whose structure greatly facilitates the information extraction process.

The organization of the text into titles and subtitles yields a relatively precise conceptual tree for each scientific theme provided by the teams. The information extraction process can therefore rely on this editorial structure to build the hierarchy we need [SID 13a].
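
As an illustration only, the way a heading structure maps onto a concept tree can be sketched in a few lines of Python. This is not the procedure described in [SID 13a]: the dotted heading numbering ("1.2 Title"), the sample headings and the names HEADING, Node, build_tree and paths are all assumptions introduced for the example.

```python
import re
from dataclasses import dataclass, field

# Assumption: every heading starts with a dotted number such as "1.2 Title".
# A body line that itself starts with a number would be misread as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def build_tree(lines):
    """Turn numbered headings into a tree; non-heading lines are skipped."""
    root = Node("ROOT")
    stack = [(0, root)]  # (depth, node) pairs from the root to the current node
    for line in lines:
        match = HEADING.match(line.strip())
        if not match:
            continue
        depth = match.group(1).count(".") + 1
        node = Node(match.group(2).strip())
        # Climb back up until the top of the stack is the parent level.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


def paths(node, prefix=()):
    """Yield every root-to-node path as a tuple of heading labels."""
    for child in node.children:
        current = prefix + (child.label,)
        yield current
        yield from paths(child, current)


# Hypothetical headings, not taken from the actual report.
sample = [
    "1 Theme A",
    "Body text that is ignored.",
    "1.1 Subtheme A1",
    "1.2 Subtheme A2",
    "1.2.1 Topic A2a",
    "2 Theme B",
]

tree = build_tree(sample)
for path in paths(tree):
    print(" > ".join(path))
```

Each printed path corresponds to one candidate branch of the hierarchy. A report whose headings are marked by typography rather than numbering would need a different detection rule in place of the regular expression.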

The homogeneity of the discourse also favors the information extraction task. At the rhetorical level, the persuasive and argumentative character of the text is strongly apparent on reading the report. The quality of the writing is also rather homogeneous, both at the grammatical level (short, simple sentences) and at the lexical level (repeated terminology typical of writing in the hard sciences). At the structural level, each theme follows the classical IMRD (Introduction, Materials and methods, Results and Discussion) structure of scientific writing.

