The exploitation of digital archives in the humanities

Humanists have several methods of exploring archives at their disposal, each involving, as noted above, a more or less close cooperation with computer scientists. These methods also differ in the procedure they propose:

- the exploration of textual corpuses (primary archives, therefore in plain text);
- the exploitation of databases structured by researchers themselves;
- the exploitation of data from bibliographic tools used by these same researchers during their work (secondary archives).

The analysis of plain-text corpuses was the first computer-assisted method of analysis available to literary researchers: the disciplinary proximity between literary scholars and linguists, and the obligation of the former to master the basics of the latter’s discipline in order to pass the agrégation, favored cooperation from the outset. This is not the place to retell the story of the encounter between textual analysis and literary research, but it is worth recalling that these early experiments left a lasting mark on how humanists picture computing: as a tool that finds only with difficulty what a scholar’s informed reading could establish much more quickly. Counts of occurrences, co-occurrences and collocations, and all the finer functions offered by the various textometry tools (most recently TXM, produced by an ANR project at the ENS Lyon, but also Iramuteq, to mention only the best-known examples), proved far more pertinent to the study of vast journalistic corpuses than to shorter, formally elaborate texts, in which a sophisticated syntax could well nullify the tendencies that seemed to emerge from the lexicon.

To be pertinent in researchers’ eyes, computerized tools had to help with tasks that an individual could not perform alone. Their use was therefore reserved for very large corpuses, as in genre studies, which they incidentally helped to revive, together with the institutional injunction to set up complex team projects in order to justify funding. Since these supercorpuses fall under history or media communication more often than literature, it should come as no surprise that historians of literature and publishing are overrepresented in this work, and thus in our examples.
One example is the Prelia project, devoted to the socio-editorial field of small Symbolist magazines at the end of the 19th century, and a rare case of a project run by a literary specialist who knew how to code. Another is the super-database of Enlightenment literary correspondence devised at Stanford, “Mapping the Republic of Letters”, which gave rise to the Palladio tool discussed again later.
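The occurrence and co-occurrence counts mentioned above can be illustrated with a minimal sketch. It is not how TXM or Iramuteq work internally: the tokenizer, the window size and the sample sentence are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (accented letters included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words found within `window` tokens of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


# Invented sample sentence, standing in for a real corpus.
sample = "la guerre et la paix, la guerre sans la paix"
tokens = tokenize(sample)
occurrences = Counter(tokens)          # simple frequency list
pairs = cooccurrences(tokens, window=2)  # co-occurrence counts

print(occurrences.most_common(3))
print(pairs.most_common(3))
```

On so short a text the counts say nothing a reader would not see at once, which is the point made above: such measures only become useful on corpuses too large for one person to read.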

Finally, we have so far discussed the limits of plain-text research as if these “archives” were already established and exploitable. Yet a simple attempt to work with slightly older documents digitized by the National Library of France is enough to show that this enormous mass of data cannot simply be submitted to automatic processing: the limits of the OCR used and the complexity of page layout in certain publishing formats considerably complicate the data. This is the case with unpredictably placed scholarly apparatuses, and also, in the 19th-century press, with the mere existence of the “feuilleton”, an important site of narrative experimentation at the time, which appears on ordinary pages but is separated from the upper part of the page by a rule that algorithms find difficult to interpret. There is worse: we worked together on a Franco-German project involving the constitution of a vast bilingual corpus of war stories. While the indexing of named entities was manageable for the French team, the use of Fraktur type by German publishers at the time, which is particularly troublesome for OCR, meant going through a specialized organization that (even at academic rates) would sharply increase the cost of the project.

Most often, in order to produce interesting results, the “corpuses” accessible en masse for plain-text processing must first become true archives: here again we find the gesture of rupture that characterizes archiving according to Derrida, the gesture of separation according to de Certeau, which allows the corpus to be structured and made exploitable. Of course, it is tempting to exploit “loose” corpuses, especially when Google makes simple tools like the Ngram Viewer available to us, but the opacity of Google’s methods for establishing and structuring these corpuses makes the results difficult to evaluate except in terms of frequency [PEC 11].

Humanists who want rigorous analyses but dreamed that computer science would do the job for them also run into indexing problems. While these issues are familiar to researchers in information science, humanists are rarely trained in indexing, do not know the TEI, and do not have the resources around them that would allow them, for example, to index the “named entities” that they often need. The arduousness of indexing tasks is undoubtedly the reason that most archives exploited by humanists are databases, either already established or created over the course of the project. Most of the time, studies give rise to tabular databases, subject to various treatments, extractions and visualizations (and that is the best case, since loose “studies” in “.doc” files are often the first reflex).
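To give an idea of what indexing “named entities” in TEI involves, here is a minimal sketch that builds an index from a document in which the names have already been marked up by hand. The `persName` and `placeName` elements and the namespace are those of the TEI; the sample fragment and the `index_entities` function are invented for illustration and do not come from any of the projects discussed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace, bound to the prefix "tei" for path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: a real TEI file would also carry a teiHeader.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><persName>Jacques Delille</persName> est lu à <placeName>Paris</placeName>.</p>
    <p>On cite encore <persName>Jacques Delille</persName> et <persName>Victor Hugo</persName>.</p>
  </body></text>
</TEI>"""


def index_entities(xml_text, tags=("persName", "placeName")):
    """Return, for each tag, a Counter of the names marked up with it."""
    root = ET.fromstring(xml_text)
    index = {}
    for tag in tags:
        names = (
            "".join(el.itertext()).strip()
            for el in root.iterfind(f".//tei:{tag}", TEI_NS)
        )
        index[tag] = Counter(name for name in names if name)
    return index


index = index_entities(SAMPLE)
print(index["persName"].most_common())
print(index["placeName"].most_common())
```

The extraction itself is trivial; the arduous part described above is the manual markup that must exist before such a script has anything to count.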

Among many others, we can cite the database being built by the ambitious ANR project Anticipation (“Anticipation. Scientific futuristic novels at the turn of the 19th Century (1860-1940)”, directed by Claire Barel-Moisan, 2014-2018), which we were able to follow as technical consultants for the program. This database has not only a descriptive goal (recording the narrative structure and characters of the futuristic stories studied, with a structure based on categories drawn from narratology) but also a hermeneutic one, since metapoetic indicators sit alongside the description of the texts. As in other projects, building the database takes so long that the first results on a “complete” corpus, or one regarded as such, may not be available until the end of the program. In the meantime, the group members produce numerous and, incidentally, interesting “Ancien Régime” analyses, fed by the records already gathered, which consequently remain at the stage of preparatory notes[1]. This gap between research time and the time of institutional programs is an obstacle common to many humanities projects. These databases require long reflection to fix the framework of the study, and presuppose that almost the entire corpus has been examined before the powerful quantitative descriptions expected of them can be attempted. They are therefore necessarily out of step with the schedule of the researchers, who must publish during this period to justify their funding; the database is less a tool of the funded research than its product, just as an administrative archive can be. Here we see again a trait particular to the digital archive: it is as often, if not more often, the trace of a research task as an autonomous resource exploited by researchers (as bibliographical notes would be).
This lag between the period of program funding and the eventual delivery of an exploitable database draws attention to the need to think in advance about how the database will live on once the funding is over, and also about its interoperability: for what has become “an archive” in the literal sense of the word to be usable by other researchers, it must also be accessible online, on a durable server. The program’s server generally shuts down when the funding ends, and university laboratory sites offer only fragile guarantees unless they are backed by national organizations. The archiving service offered by the TGIR Huma-Num[2] now meets this need, provided that one knows of the service and anticipates the time needed to implement it.

The exploitation of an archive made up of bibliographic bookmarks, such as those provided by a data-harvesting tool like Zotero[3] or, to a lesser degree, Diigo[4], has the same character of a secondary archive: just as quantitative, but much more spontaneous. It assumes that researchers tag or index the works or citations they find on the Internet, or even in physical libraries in the case of Zotero: once the data has been exported in CSV format, visualizations can be produced from this library or from the tags the researcher has associated with it. In this case, usage evolves towards a sort of folksonomy: users add information without needing computer skills, and the bibliographic database, a true archive gathered in a shared library that benefits from everyone’s work, becomes a material just as precious as that of purpose-built databases. One can envisage exploiting the themes of these bibliographic notes with visualization tools built for Zotero, such as Paper Machine[5]. This tool offers quality visualizations, including Sankey diagrams, and its use is well documented by an effective tutorial. In practice, the difficulty lies in getting researchers involved, and not only those of a generation that has made little use of bibliographic indexing tools: while history faculties are starting to train their students in the use of bibliographic tools, literary specialists are still wondering what use these might be to them. Their use is far from commonplace or spontaneous, to the point that “I have to start using Zotero” is one of the phrases most often heard around various projects at the end of each presentation of its virtues.
We have used this approach in another project which aims, among other things, to detect mentions of an author, Jacques Delille, in writings published after his death in 1813, in order to study his trace in French publications, that is, the evolution of the fame of a man who was, in his lifetime, the Ancien Régime equivalent of Victor Hugo, and whose reputation very quickly turned into its complete opposite over the course of the century. Furthermore, the use of these bibliographic tools allows us to start populating the RDF database created by the University of Basel’s Digital Humanities Lab[6].
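The step from a CSV export to a visualization can be sketched as follows. This is only an illustration under stated assumptions: the column names “Publication Year” and “Manual Tags” and the semicolon separator are taken to match a Zotero CSV export and should be checked against the actual file; the three records and the `tag_counts_by_year` function are invented and are not data from the Delille project.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented records imitating a bibliographic export; not real project data.
SAMPLE_CSV = '''"Title","Publication Year","Manual Tags"
"Notice A","1820","Delille; éloge"
"Notice B","1820","Delille; parodie"
"Notice C","1865","Delille; oubli"
'''


def tag_counts_by_year(csv_file, tag_column="Manual Tags", year_column="Publication Year"):
    """Group tag frequencies by publication year (column names are assumptions)."""
    by_year = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        year = (row.get(year_column) or "").strip()
        raw_tags = (row.get(tag_column) or "").split(";")
        tags = [tag.strip() for tag in raw_tags if tag.strip()]
        by_year[year].update(tags)
    return by_year


by_year = tag_counts_by_year(io.StringIO(SAMPLE_CSV))
for year in sorted(by_year):
    print(year, dict(by_year[year]))
```

A table of this kind, tags counted per year, is the sort of input from which a chart of an author’s changing reception could then be drawn with any plotting tool.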

  • [1] Again, this ANR has not only profited from a rapid start, thanks to the anticipation of technical questions, but also from the closeness of certain researchers on the team with programming colleagues working in the same IUT (French Polytechnic University).
  • [2]
  • [3]
  • [4]
  • [5]
  • [6] The “Reconstructing Delille” project is financed by the Swiss National Science Foundation, but it benefits from the rare existence of a structure dedicated to the digital humanities, which has created a platform of RDF databases called Salsah and whose first sizeable project was HyperHamlet, a sort of database of reuses of Shakespearean verses in the works of more than 3,000 authors. See