NLP engineering applied to the corpus

For this work, we chose the XML format so that specific parts of the text could be processed. NooJ can process several document formats, including XML; in that case, the nodes to which the NLP processes apply must be declared. The software also produces a file listing each source document alongside the pattern extracted from it. By “pattern”, we mean the extracted form, whether simple (token, lexeme, etc.) or complex (phrases, structures, etc.). The goal of preparing the source documents is therefore to obtain a tuple (Source, Pattern) that specific scripts can reprocess, notably to feed the results into a data visualization program [SID 11a, SID 11b] or for re-indexing.

All the documents were thus converted into a series of XML documents by a set of Python scripts. For each team report, a first script used regular expressions to recognize the title and subtitle numbering and then divided the text into as many parts as necessary. Each team report thus yielded a corpus with the team name as its root and a tree of texts numbered from 1 to n (see Figure 7.8).

Figure 7.8. XML corpus used to extract information
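
The splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the heading pattern (a line beginning with "1", "1.2" or "1.2.3", optionally followed by a dot) and the names `HEADING_RE` and `split_report` are assumptions, since the text does not say how the reports number their headings.

```python
import re

# Assumed heading shape: "1 Title", "1. Title", "1.2 Title", "1.2.3 Title".
# The real reports may use a different numbering convention.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def split_report(text):
    """Split a report into (number, title, body) parts at numbered headings.

    Lines that appear before the first heading are ignored in this sketch.
    """
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # A new numbered heading opens a new part.
            current = {
                "number": match.group(1),
                "title": match.group(2).strip(),
                "body": [],
            }
            sections.append(current)
        elif current is not None:
            # Any other line belongs to the part currently open.
            current["body"].append(line)
    return [
        (s["number"], s["title"], "\n".join(s["body"]).strip())
        for s in sections
    ]
```

A body line that itself starts with a number followed by a space would be mistaken for a heading here, so a real script would probably need a stricter pattern tuned to the reports.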

A second script was used to transform these texts into XML files. An XML declaration line was added to each file, followed by three nodes: a first node, another corresponding to the title line, and a final one corresponding to the content under that title.
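
A hedged sketch of this conversion, using Python's standard `xml.etree.ElementTree` module, is given below. The node names are not preserved in the text, so `section`, `title` and `content` are placeholders chosen for illustration, as is the function name `section_to_xml`.

```python
import xml.etree.ElementTree as ET

# Declaration line written at the top of every file.
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'


def section_to_xml(title, body):
    """Return one text part as an XML document string.

    The element names below are hypothetical; the original names are unknown.
    """
    root = ET.Element("section")                 # enclosing node
    ET.SubElement(root, "title").text = title    # the title line
    ET.SubElement(root, "content").text = body   # the text under that title
    # ElementTree escapes characters such as "&" and "<" in the text nodes.
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")
```

Building the tree with a library rather than by string concatenation keeps the output well-formed even when a report contains characters that are special in XML.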

Finally, a third Python script renamed each file with the name taken from its title node, to make the output file easier to read. This output is a CSV file that gives first the file’s theme (Title) and then the extracted Pattern.
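
The renaming step, and the later reading of the (Title, Pattern) pairs, might look like the sketch below. Several points are assumptions: the title is taken to sit in a child node named `title` (a placeholder name), the CSV is taken to be comma-separated with the title in the first column and the pattern in the second, and the helper names `safe_name`, `rename_by_title` and `load_pairs` are invented for this example.

```python
import csv
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def safe_name(title):
    """Turn a title into a file-system friendly name."""
    name = re.sub(r"[^\w\-]+", "_", title.strip())
    return name.strip("_") or "untitled"


def rename_by_title(xml_path):
    """Rename an XML file after the text of its (hypothetical) title node."""
    xml_path = Path(xml_path)
    title = ET.parse(xml_path).getroot().findtext("title", default="")
    new_path = xml_path.with_name(safe_name(title) + ".xml")
    # Note: two parts with the same title would collide; not handled here.
    xml_path.rename(new_path)
    return new_path


def load_pairs(csv_path, delimiter=","):
    """Read the output file into a list of (title, pattern) tuples."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # Rows with fewer than two columns are skipped.
        return [(row[0], row[1]) for row in reader if len(row) >= 2]
```

The list returned by `load_pairs` is the tuple form that a visualization or re-indexing script could consume directly.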

Thus, for each team, we obtained a corpus averaging seven texts. These corpora (24 in total) were processed separately, with the lexical resources pooled across them.

NooJ uses lexical resources such as dictionaries and morphological grammars for textual labeling. We therefore created a simplified version of a nanoscience dictionary [SID 13b] to find domain terminology (physics, chemistry and materials science). Each entry of the dictionary we propose consists of the lexeme followed by its part of speech (N for noun, V for verb, etc.), using the abbreviations common to existing dictionaries. We also added several tags that locate the term conceptually within a tree view of the domain derived from the structure of the corpus. In each lexical entry, NP labels the result as a “noun phrase”, Lev2 represents the level in the “conceptual hierarchy”, Lev1 represents the “hierarchy level at n+1”, DOM represents the “thematic domain” to which the term belongs and TEAM is the “team” that deals with this subject at the IJL Institute.

Figure 7.9. Example of the NooJ class dictionary
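
To make the entry structure concrete, the sketch below formats and parses lines of the form `lexeme,CATEGORY+FEATURE=value`, which is the general layout of NooJ dictionary lines. The exact layout of the authors' entries is not reproduced in the text, so treating NP as a bare feature and the other tags as `name=value` pairs is an assumption, and the function names are illustrative.

```python
def format_entry(lexeme, category, features):
    """Build one dictionary line from a lexeme, a category and its tags.

    `features` is a list of (name, value) pairs; a value of None gives a
    bare tag such as "NP", any other value gives "name=value".
    """
    parts = [category]
    for name, value in features:
        parts.append(name if value is None else f"{name}={value}")
    return f"{lexeme},{'+'.join(parts)}"


def parse_entry(line):
    """Inverse of format_entry (lexemes containing commas are not handled)."""
    lexeme, _, info = line.partition(",")
    category, *raw_features = info.split("+")
    features = []
    for item in raw_features:
        name, sep, value = item.partition("=")
        features.append((name, value if sep else None))
    return lexeme, category, features
```

With this layout, a term, its part of speech and its Lev2, Lev1, DOM and TEAM tags travel together on a single line, so a script can regroup the extracted terms by domain or by team without consulting another resource.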

The grammars we created extract complex NPs and chemical formulas (see Figure 7.9). The first grammar extracts maximal NPs (NP_max) from level 1 (i.e. common nouns) to level 3 (a nesting of three nouns, either juxtaposed or separated by other lexemes). It identifies, for example, terms such as “spintronic”, “magnetic moments” or even “states of quantum sinks at ambient temperature”. The grammatical graph presented in Figure 7.7, and implemented in Figure 7.8, allows these elements to be extracted.

Figure 7.10. NooJ syntactical graphs for extracting NPs (NP or N")

The pattern, i.e. the nesting series, is stored in a variable written $(), which can then be reused to create dictionary entries or to label the text. The output variable is assigned a grammatical category and thereby becomes a dictionary entry. Iterating this grammar over the whole text allows thematic dictionaries to be built quickly. The same holds for NooJ’s “locate” function, which finds unknown terms through the tag reserved for them. We then reworked the list of unknown terms, assigning each a grammatical category and descriptive elements using the previously defined tags (see Figure 7.10).

The second type of information we wanted to extract concerns the chemical formulas discussed in the report. This element is very important in a technological or scientific observation process [LAM 11], because the use and combination of particular chemical elements is what distinguishes the research. These formulas are therefore decisive elements, which seemed just as important to study as the extracted NPs and patterns. The dictionaries and grammars describing these chemical structures can later be reused for a weak signal (WS) detection task [SID 11a, SID 11b] on new materials (see Figure 7.11).

Figure 7.11. Labeling chemical formulas present in a specialized text
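
The labeling itself is done with NooJ dictionaries and grammars. As a rough stand-in outside NooJ, the idea can be illustrated with a regular expression in Python. Everything in this sketch is an assumption made for illustration: the candidate pattern (at least two element symbols, each optionally followed by a count), the deliberately partial set of element symbols and the function name `find_formulas`.

```python
import re

# Illustrative subset of element symbols; a real run would list them all.
ELEMENTS = {
    "H", "B", "C", "N", "O", "F", "Mg", "Al", "Si", "Ti", "Cr", "Mn",
    "Fe", "Co", "Ni", "Cu", "Zn", "Ga", "As", "In", "Gd", "Tb", "Pt", "Au",
}

# A candidate is a word made of two or more "Symbol + optional count" units.
CANDIDATE_RE = re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")
SYMBOL_RE = re.compile(r"[A-Z][a-z]?")


def find_formulas(text):
    """Return candidate chemical formulas whose symbols are all known."""
    found = []
    for match in CANDIDATE_RE.finditer(text):
        symbols = SYMBOL_RE.findall(match.group())
        # Reject acronyms such as "XML" whose letters are not all elements.
        if all(symbol in ELEMENTS for symbol in symbols):
            found.append(match.group())
    return found
```

This sketch ignores parentheses, charges and single-element formulas, and an acronym made only of element letters would still slip through, which is one reason a dictionary-backed grammar is the more robust choice for real reports.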