Ownership, copyright, and the ethics of the unpublished

Emily C. Friedman

Depending on how you quantify it, Manuscript Fiction in the Age of Print 1750-1900 is a “small-scale” DH project, but with the challenges of a much larger one. The database makes more discoverable works of never-published fiction currently held by dozens of archives in the United States and Great Britain, many of which have only approximate dates of composition and little authorial or provenance information. In its first stage, the project provides bibliographic and genre-specific metadata (much of it linked and open) for each work, while later phases aim to provide scholarly full-text transcriptions compliant with Text Encoding Initiative (TEI) standards and a user interface (UI) that provides diplomatic display of page images alongside those transcriptions.

In this essay I address the murky nature of ownership of and responsibility over these items, particularly examples of anonymous, never-published fiction, and the effects of that murkiness on my project. Through my work over the last decade, I have confronted layers of claims upon these works: first, the physical corpora are commodities acquired by sale and donation and “owned” by several dozen archives, each with their own rules and regulations about photography and the later dissemination of those surrogates. And while the very real costs of digitization, storage, server space, and rights management are also significant factors to the feasibility and timeline of the project (as I will discuss), what has been more perplexing—and daunting—is understanding multinational copyright laws as they apply to often anonymous works of never-published manuscript fiction created during the age of print.

This essay discusses these logistical, budgetary, and legal challenges before turning to the still more-numinous ethical quandaries that arise when working with material that was never part of a commercial print marketplace. Just because something was written (and in some cases, even printed) and circulated does not mean that its authors intended for it to be widely disseminated, and to have a persistent and Googleable existence. I connect the work of this digital project to other projects that provide frameworks for thinking about this area of responsibility.

Manuscript fiction in the age of print: an introduction

The goal of Manuscript Fiction in the Age of Print, 1750-1900 is to make works of fiction that were produced but never printed in their authors’ lifetimes more discoverable as a coherent corpus. For some of these items, publication in print was explicitly undesirable, and the surviving physical object was intended for a focused audience, or at least to circulate outside of the sphere of print. For others, there is clear evidence that suggests or explicitly demonstrates that the author attempted to present their work to print publishers. The vast majority of these works, however, lie somewhere between these two poles of clearly demarcated intentionality. Further study and collection of surviving examples will be necessary in order to truly understand this aspect of literary history. That collection and study will be easier once there is an explicitly stated set of shared vocabulary for what these works are.

The bibliography—approximately 200 works at current count—is held in 36 archives and libraries in the United States, Great Britain, and New Zealand. As I discuss in this essay, these items are described using a variety of terms depending on the nature of acquisition, format, and other variables, which makes them trickier to locate. Given the nature of how these items are catalogued, it is certain that more will be enumerated and described over time. Thus, the project is imagined in phases: the first phase, to create a shared vocabulary to describe these items and to identify common features and connect them to a larger history of practice that extends into the present. For this reason, the initial database includes fields that describe the physical features of the work: the size of leaves, whether the work is bound and in what form, the type of writing media used, the number of hands, and so on. I draw together language from the study of printed books with the language of codicology (the study of manuscripts) in order to describe these features. A supporting data set that describes manuscripts of works that were printed in their author’s lifetimes, or were intended as such, allows us to begin to better understand the ways in which form might give insight into authorial intention.
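
The descriptive fields enumerated above can be pictured as a simple record structure. The sketch below is illustrative only: the field names and sample values are my own hypothetical rendering of the features the project records (leaf size, binding, writing media, number of hands), not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ManuscriptRecord:
    """One hypothetical bibliographic entry; field names are illustrative."""
    title: str                            # assigned or supplied title
    repository: str                       # holding archive or library
    date_range: Tuple[int, int]           # approximate earliest/latest year of composition
    leaf_height_mm: Optional[int] = None  # physical description, per codicological practice
    leaf_width_mm: Optional[int] = None
    binding: Optional[str] = None         # e.g. "stab-stitched", "unbound", "reused account book"
    writing_media: Optional[str] = None   # e.g. "iron gall ink", "pencil"
    hands: Optional[int] = None           # number of distinct hands identified
    notes: str = ""

# A record for a hypothetical item with uncertain dating and no known author
record = ManuscriptRecord(
    title="[Untitled novel fragment]",
    repository="Example Repository",
    date_range=(1820, 1840),
    binding="reused account book",
    hands=2,
)
```

Optional fields default to `None` because, as the essay notes, many of these features cannot be determined for every item.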

In addition to these descriptive but controlled fields, a set of fields compatible with Resource Description Framework (RDF) standards allows for the future ingestion of the entries into 18thConnect, one of the aggregation nodes of the Advanced Research Consortium (ARC), the umbrella organization that also includes Networked Infrastructure for Nineteenth-Century Electronic Scholarship (NINES), Medieval Electronic Scholarly Alliance (MESA), and a growing set of other nodes. These aggregation nodes provide a structure for peer review comparable to a scholarly journal or university press, and also host metadata (descriptive data about each object) that allows a user to discover relevant items through the aggregator’s search function. Through integration with ARC/18thConnect, information about these works will be incorporated into a much larger body of metadata, allowing for them to be discovered by researchers. Ultimately, the goal is to transform the database into a full-fledged text set of TEI-5 compliant XML digital editions, transformable into both reading texts and a machine-readable corpus. A future phase, with a UI that allows for a diplomatic display of the manuscript pages and their transcription, is also planned.
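
To make concrete what "RDF-compatible fields" means in practice, the sketch below serializes one hypothetical entry as N-Triples, the simplest RDF syntax. The subject URI and values are invented for illustration; the predicates use the real Dublin Core element set, though the project's actual vocabulary may differ.

```python
# Dublin Core element set namespace (real); everything else here is hypothetical
DC = "http://purl.org/dc/elements/1.1/"

def to_ntriples(subject_uri: str, fields: dict) -> str:
    """Serialize a flat dict of metadata fields as N-Triples statements."""
    lines = []
    for name, value in fields.items():
        lines.append(f'<{subject_uri}> <{DC}{name}> "{value}" .')
    return "\n".join(lines)

triples = to_ntriples(
    "https://example.org/msfiction/item/0042",  # hypothetical identifier
    {
        "title": "[Untitled novel fragment]",
        "date": "ca. 1820-1840",
        "type": "manuscript fiction",
    },
)
```

Aggregators like 18thConnect can harvest statements of this shape and merge them into a larger graph of metadata, which is what makes cross-repository discovery possible.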

In the following sections, I discuss the practical challenges that were known at the outset of the project: discoverability and access, transcription time, and digitization and rights cost. I then discuss the higher order challenges that were not initially known: namely, the numinous state of copyright for never-published literary manuscripts, and the still-more-numinous nature of ethical considerations when dealing with material whose author/creators are unknown.

Access challenges

As noted earlier, the first layer of challenge present when building this corpus was simply accessing the items. I will pass over the most obvious restrictions of geographic distance, as I suspect most researchers are familiar with the commonplace constraints of making one’s way to a collection or archive, and of the varying levels of gatekeeping that may or may not be present in any given institutional context. Instead, the primary access challenge that I found—and that my project seeks to reduce or eliminate—is tied to discoverability within the existing records.

The items I work with cause any number of headaches for cataloguers, and I do not envy them their task. Most lack names, many make precise dating difficult or impossible, and there is often no publisher’s information or place of creation. They are, simply, a mess. They may not have been acquired as fiction, but rather as a part of a larger body of papers, or even inside an item desired for other reasons. Nineteenth-century practices of using blank books designed for other purposes, like account books, for the writing of fiction mean that occasionally a book acquired for its nonfictional content will also contain fiction.

Depending on the cataloguing workflow and policies of an institution, there may be two records generated, or the genre considered less important may be mentioned in the unrestricted Notes field, the place where cataloguers put all material that does not fit within the constraints of existing controlled vocabularies. Because there is no specific language to distinguish never-published manuscript material from drafts or other forms, and because the boundaries between fiction and other genres can be difficult to assess while processing material efficiently, a variety of terms must be used on a given collection or archive’s finding aids and catalogues to yield a list of potential candidates, all of which must then be inspected, no matter how remote the likelihood of finding fiction in a given box of ‘miscellaneous literary works’ or commonplace book.

No one institution holds a majority of the items currently described in the database: the largest stakeholders have custody of at most 15% of the currently known items, and even then, they are not necessarily all under the purview of one area of specialization. To read all the unpublished manuscript fiction in the British Library, to name just one such example, one would need to visit three different reading rooms across two floors of the building, each staffed by a totally different set of personnel. This diffusion leads to a new challenge: negotiating with many different types and sizes of institutions in order to gain access to the materials, ideally to photograph or acquire high-quality scans.

Digitization

Digitization of materials in high-quality formats designed for dissemination is costly in a number of ways: digitization of unique material requires great care and precision, and is slow and costly if done right. Hosting large-scale images is itself a costly continuing expense, even if those images are only retained locally—and the price of appropriate server size increases if those images are available to the public. Failure to acknowledge these realities is to dismiss the very real working conditions of our colleagues and intellectual partners who labour in archives, special collections, museums, and libraries.

Most projects working with a modest number of items, like this one, begin as material held by an individual institution or a consortium. Where internal funding is insufficient for the task (as is often the case), granting agencies are structured to meet such needs. For example, the National Endowment for the Humanities funds a variety of grants in order that institutions may publish, digitize, or otherwise make more accessible material in their collections, often through the Division of Preservation and Access or the Office of Digital Humanities. While ambitious, very large projects do exist, they often emerge from a proof-of-concept based at one or a small number of holding institutions.

The Manuscript Fiction project differs from other digitization projects in nearly every possible way: first, as previously noted, no one institution holds more than 15% of the current corpus, and as it continues to grow beyond the initial 36 institutions, this percentage is likely to go down still further. Second, the digitization state of these items varies wildly, from the already digitized and publicly accessible to the undigitized and the unavailable for digitization. In between those poles, there are a variety of affordances and limitations: one special collection has its own high-quality document cameras for patrons, while another restricts photography or bans it entirely. Some collections allow patrons to place an item into their queue for in-house digitization (or pay to bump it to the head of the line), while others require enormous sums even for low-quality scans to be produced. I should note that in most (if not all) cases, there are solid contextual rationales for any of these choices. Further, digitization of unique manuscript materials constitutes a new creation that itself has a term of copyright. Where user photography is not allowed, the institution’s digital copy itself can be placed under copyright protection even if the original item is not, and negotiation for publication permissions must ensue to make these images available to the public.

Because institutional contexts vary so widely, I have been keeping track of the digitization and related rights costs for each item in my database, in the hopes of future grant support for creating a diplomatic display.

Transcription

Transcription of handwritten materials is a lengthy endeavour, and most large-scale attempts at new transcription projects have come from correspondence collections, not fiction. Logistically, letters tend to lend themselves, because of their length, to workflows that allow for teams with many members over time. Transcription of fiction is logistically challenging: while undergraduate researchers have been able to complete the shortest and cleanest works in the corpus when given a year, learning the quirks of both eighteenth- and nineteenth-century orthography and handwriting takes time. Most internal funding models for undergraduate labour only guarantee a semester or at most a year of funding for labour on top of the 18-25 credit hours per term carried by the ambitious undergraduates I work with.

Large-scale transcription tends to be in partnership with, if not wholly supervised by, the institutions that hold the manuscript material. This, too, makes sense: institutions can have long timelines and make use of various streams of labour. For example, Auburn University Special Collections’ collection of Civil War diaries and letters was digitized and transcribed in-house, the latter process done primarily by undergraduate student labour. Where on-site assistance is unavailable, or cannot meet the high demand, a variety of other strategies have been used. Professional transcription services run about $25 USD a page. Andrew Lang and Joshua Rio-Ross (2011) developed a model to use Amazon’s Mechanical Turk to transcribe and perform quality control on Frederick Douglass’s Diary, with a total cost of $25 USD for the 72-page diary. Other projects, like the Newberry’s Transcribing Modern Manuscripts project, rely on crowd-sourced transcriptions, and dozens of UIs designed to facilitate public crowd-sourced transcription exist for institutions that want help with the task of transcribing their holdings.1 Projects that are able to crowd-source their transcription have several features in common: (1) a single institution (or a consortium willing to work together) holds all of the material to be transcribed; (2) the material is often already digitized in a publicly accessible repository, and transcription is seen as the next phase of making the material accessible; (3) the work to be transcribed is understood to be out of copyright: generally, letters and diaries. Because the Manuscript Fiction project is not emerging from a consortium, and only some of the images are publicly available in high-quality surrogates, setting up an interface for crowd-sourced transcription was impractical.

There are other possibilities on the horizon. As of the time of this writing, one of the goals of machine learning is to produce a highly accurate form of Handwritten Text Recognition (HTR), so that the first pass of transcription can be automated, in the same way that most printed material goes through Optical Character Recognition (OCR) and is then cleaned up by humans in the next pass. While very targeted handwriting identification software has been trained to read inputs on forms or in other precisely controlled systems (e.g. addresses on envelopes), HTR in unstructured documents (like fiction) and in older forms of handwriting still lags behind the efficiency of OCR. That will almost surely change in the years to come: Google Cloud’s Vision API currently provides pretrained handwriting recognition that achieves a fair level of accuracy with even historical handwriting, though there is still a great deal of cleanup even on the clearest hands. In addition, it is neither simple nor free to set up: as far as I know, no open, user-friendly interface has yet been developed, and teams that wish to query the API will pay $1.50 USD per page after the first 1000 pages in a month. Like most crowd-sourcing solutions, Google Cloud’s Vision API also requires that the images be publicly available. This is not the case for Transkribus, the tool produced by the Recognition and Enrichment of Archival Documents (READ) project to work specifically with researchers and heritage institutions, which is currently among the best available open solutions. To custom-train Transkribus, the team currently requires a sample of 100 hand-transcribed pages. Unfortunately, very few of the works in the Manuscript Fiction corpus contain enough pages to make reliance on Transkribus reasonable: few of the manuscripts reach much past that length. I find that by the time a human has transcribed 100 pages, it takes less time to complete the transcription by hand than to begin the process of automation.
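
The cost comparison above is simple arithmetic, but it is worth making explicit. The sketch below uses only the per-page rates quoted in this essay (the Vision API figure as stated here, and the $25 USD per page for professional services); actual vendor pricing varies and should be checked directly.

```python
def vision_api_cost(pages: int, free_pages: int = 1000, rate_usd: float = 1.50) -> float:
    """Monthly HTR cost at the per-page rate quoted in this essay,
    with the first `free_pages` in a month free of charge."""
    return max(0, pages - free_pages) * rate_usd

def professional_cost(pages: int, rate_usd: float = 25.0) -> float:
    """Cost of a professional transcription service at ~$25 USD a page."""
    return pages * rate_usd

vision_api_cost(1500)   # 750.0  (500 billable pages)
vision_api_cost(800)    # 0      (under the monthly free allowance)
professional_cost(72)   # 1800.0 (vs. $25 total for the Mechanical Turk model)
```

The gap between $1,800 for a 72-page diary professionally transcribed and $25 via Mechanical Turk illustrates why projects experiment with these alternatives despite their quality-control overhead.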

For these reasons, I decided to focus on slower methods of hand transcription, working from a phased plan to first acquire images in good enough form (relative to price) for the use of study and transcription and make the encoded transcriptions available as they were completed. This plan was based on the belief that these transcriptions, unlike the photographs and scans, were outside of the period of copyright protection. Until fairly recently I believed that the challenges of paying for digitization, negotiating rights to digital images, and planning to host said images later on would be the largest expenses for creating a diplomatic display (text alongside original page images). This, as it turns out, was an inaccurate—or at least, incomplete—view of the situation.

Who owns anonymous? Copyright law for unpublished literary works

Digitization projects and text sets generally tend to focus on work understood to be out of copyright: letters, diaries, and published works whose period limits have expired. Digitized items are thus the work of named human beings whose rights can be negotiated or legally understood to have lapsed, or of now-unknown humans whose lives were so long ago that their rights have certainly lapsed. This, as I will discuss, is not the case with my text set in several ways.

Depending on the country of jurisdiction (which is itself a thorny issue), never-published works are in a shifting field of copyright. The institutions—or even private owners, like myself—do not necessarily control the intellectual content held within the physical item owned. Archivists and collections librarians are often on the front lines of helping researchers and other users navigate permissions and usage, which is a challenge that is further complicated when digitization enters the picture. The Society of American Archivists (1997-2020) hosts dozens of help pages and directs to many articles and white papers on the subject, and it is wise to begin any inquiries within this wealth of information.

And copyright matters. Ketzan and Kamocki (this volume) note in their chapter the 2014 US Court of Appeals for the Second Circuit decision in favour of Google and HathiTrust, which affirmed that creating searchable databases is transformative work. However, in practice researchers can search or even create corpora to analyse in HathiTrust, but cannot read the individual works in the corpora if the works are in copyright.

Most Anglophone countries are signatories to several international treaties relating to copyright. The Universal Copyright Convention (UNESCO 1952) addresses unpublished and anonymous works in Articles II, III, and IV. All Contracting States agreed to protect unpublished works by their citizens and those of other signatory countries, and to establish a minimum duration of copyright and legal means of protecting unpublished work “without formalities” (i.e. without the necessity of registering the work). Copyright protection duration follows the place of first publication, and for unpublished works protection length is based on the nationality of the author. This notion of the presumed nationality of an anonymous author was reaffirmed in Article 15 of the Berne Convention for the Protection of Literary and Artistic Works (World Intellectual Property Organization 1979). For works without a known author, I have been unable at the time of this writing to assess whether protection duration would then follow the laws of the country in which the material is held.

To give an example: Duke University’s Rubenstein Library holds a manuscript created no earlier than the 1820s (and likely not very much later). As I have written about elsewhere, the final private owner was convinced that the work was written by two Hampshire women in the 1780s—namely, members of the family of Jane Austen.2 Paper dating means this speculative attribution is erroneous, but it is unclear who did compose the manuscript—or where. It is highly likely that the work was created in England, though it is now held in the United States. And even though the work is almost surely 200 years old, by a long-dead author, what country the work “belongs” to matters a great deal.

In the earliest part of the twentieth century that would not have been the case, as both countries had perpetual copyright for unpublished work. The 1909 United States Copyright Act held that unpublished works were protected in perpetuity as long as they remained unpublished and unregistered. This perpetual copyright was revoked by the 1976 Copyright Act, which put into place a statutory term of the life of the author plus 50 years, later extended by the 1998 Sonny Bono Copyright Term Extension Act to 70 years. In 2003, works that were neither published nor registered for copyright by 1 January 1978 and whose authors died before 1933 entered the public domain. For works whose authors are unknown, never-published material enters the public domain 120 years after creation. Thus, to be safe for works in the United States, you would want to ensure that anonymous works were created prior to 1899. This was the rationale for setting the cutoff date for the Manuscript Fiction Project at 1900. In the United States, the Duke manuscript is thus safely out of copyright: it would be very hard to argue that the work was created after 1899.
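
The US rule described above is mechanical enough to sketch as a function. This is a simplification of the terms as summarized in this essay (terms run to the end of the calendar year), and it is of course not legal advice.

```python
def us_public_domain_unpublished(current_year: int,
                                 death_year: int = None,
                                 creation_year: int = None) -> bool:
    """Is a never-published, never-registered US work in the public domain?

    Known author: term is life of the author plus 70 years.
    Unknown author: term is 120 years after creation.
    Terms are treated as expiring at the end of the calendar year.
    """
    if death_year is not None:
        return current_year > death_year + 70
    if creation_year is not None:
        return current_year > creation_year + 120
    raise ValueError("need either a death year or a creation year")

us_public_domain_unpublished(2019, creation_year=1898)  # True
us_public_domain_unpublished(2019, creation_year=1899)  # False (enters PD in 2020)
us_public_domain_unpublished(2019, death_year=1932)     # True (PD since 2003)
```

The function reproduces the essay's 2003 milestone: authors who died before 1933 had their unpublished works enter the public domain in that year, and anonymous works created before 1899 were safely out of copyright by 2019.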

Australian copyright for unpublished literary work follows a similar pattern, though the law only changed in 2019: if the author died before 1949, the work’s copyright expired on 1 January 2019, and for authors who died in 1949 or later, copyright lasts until 70 years after the creator’s death. For never-published works made by now-unknown authors, copyright holds until 70 years after the work was created. In theory, unpublished works by anonymous creators made as late as 1949 are now out of Australian copyright. If the Duke manuscript had made a very circuitous route through the southern hemisphere, it would be even more securely outside of copyright.

However, the case is very different if the Duke manuscript’s copyright protections are governed by United Kingdom copyright law. While the Copyright, Designs and Patents Act of 1988 similarly ended perpetual copyright for unpublished materials, very old materials did not immediately enter the public domain. Instead, all unpublished literary works produced prior to 1 August 1989 by authors known to have died prior to 1969 will enter the public domain on 31 December 2039—50 years after the Act. Thus, at the time of this writing in 2019, the Duke manuscript would not be out of copyright in the United Kingdom for another two decades.

While there is a procedure for deeming works older than 100 years old “orphan works”, it only applies to potential uses in the United Kingdom (whatever it means to operate “in the United Kingdom” for a digital project), only lasts seven years (maximum), and licensure costs anywhere from £20 to £80 per work depending on the number of works being licensed. Moreover, that fee is for the application itself, not only for successfully “orphaned” works. Thus, a project that proposed to go live before 2025 would need to reapply three times before the copyright on these works truly “runs out”. This is also true in New Zealand: while the copyright term for unpublished works by known authors is 50 years after the creator’s death, unpublished work by unknown authors is covered by a blanket “50 years after 1995”, or 2045.
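
Unlike the US calculation, the UK and New Zealand rules for these old unpublished works reduce to fixed public-domain dates, which makes the contrast easy to state in code. The sketch below encodes only the blanket dates described above; it ignores the orphan-works licensing procedure and is a simplification, not legal advice.

```python
from datetime import date

# Blanket public-domain dates for never-published works covered by the rules
# summarized in this essay (exact end-of-year dates are my interpolation):
# UK:  pre-1989 unpublished works by authors dead before 1969 -> PD 31 Dec 2039
# NZ:  unpublished works by unknown authors -> PD 50 years after 1995, i.e. 2045
PD_DATES = {
    "UK": date(2039, 12, 31),
    "NZ": date(2045, 12, 31),
}

def in_copyright(jurisdiction: str, on: date = date(2019, 1, 1)) -> bool:
    """Is such a work still in copyright in the given jurisdiction on a given day?"""
    return on <= PD_DATES[jurisdiction]

in_copyright("UK")                    # True: still in copyright in 2019
in_copyright("UK", date(2040, 1, 1))  # False: public domain after 2039
```

The same manuscript can therefore be simultaneously out of copyright in the United States and in copyright in the United Kingdom for another two decades, which is precisely the Duke manuscript's predicament.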

The Duke manuscript is an example of the harder-to-address situations where neither authorship nor the date of publication is known definitively. To comply with these laws, particularly in the case of now-anonymous works, requires that the material have a definitive and provable date of creation—tricky for many unpublished manuscripts, especially unsigned ones. In such cases, it requires documenting what steps have been taken to discover creation date or author identity. For most works, this is sufficient until it is not—when a legal heir comes forward.

Ethical considerations

Copyright laws address works in their existence as commercial objects, determining who has the right to compensation or profit from a work. When a work was not designed as a commercial object, and thus fails to conform to the expectations of that framework, the system breaks down. For example, the Society of American Archivists’ (2009) flowchart for how to handle orphan works by deceased authors asks early on “Is the Document Relatively Significant and/or Reasonably Likely to Have Commercial Value?” If the answer is “yes”, then the archivist is thrown back into doing due diligence to identify a descendant with redoubled vigour. If “no”, then work stops after the first wave of searching. Nevertheless, given the dealer prices for works of manuscript fiction by even unknown authors, a librarian or archivist would be hard-pressed to make the case that a totally unique work was not “reasonably likely to have commercial value”. That said, a never-published work of now-anonymous fiction might also reasonably be understood to have “failed” to reach the market long ago. As of this writing, I have not yet seen either argument made in a court of law, and frankly I hope I will not. Intellectually, I am loath to see a judgment that once again declares these works “worthless” because they did not reach the commercial marketplace. Practically, I know my task would become even more difficult if these texts were assigned monetary value.

Assuming all legal considerations were dealt with, and vast sums of money were dedicated to the permissions, digitization, transcription, encoding, and hosting of this material, we are still left with one final question: should this material be made public to a mass audience? This is a question that archivists and librarians have struggled with for some time. It gave rise, for example, to Mukurtu (2007), a content management system designed with and for Indigenous communities so that communities could decide how to share their cultural heritage. Billed as “a safe keeping place”, the collaboration between members of the Warumungu community, Kim Christen, and Craig Dietrich allows for the storing, managing, and sharing of sacred materials in community-appropriate ways. It is decidedly not open access—Mukurtu allows for different kinds of login access for members of the community, to ensure that any given user only has access to materials the community’s protocols and cultural values have determined are appropriate for that particular user. As Yokoyama notes in her chapter on the open-access spectrum (this volume), that design was not without its critics, and the ensuing debate exposed the difficult territory where cries for the freedom of information touched dangerously close to settler-colonial appropriation and extractive logic. One person’s access can look like another person’s exploitation.

Another attempt to design creator-led protocols for community materials is the Zine Librarians’ Code of Ethics, which is best understood as a document of questions and points for discussion. Central to the Code is a sense that just because something was published and circulated does not mean that its authors intended for it to be widely disseminated, and to have a persistent and Googleable existence. Further, many zine creators are still alive to consent (or not) and to delineate the terms of their consent for work to be made more discoverable. Sensitivity to how works are acquired, how creators’ legal identities are named (or not) in a catalogue entry, and how permissions for digitization are sought all ultimately point back to the creator as the best arbiter, wherever possible (Freedman & Kauffman 2013; Berthoud et al. 2015).

These two examples address the needs of living communities of practice, who can be consulted on the preservation, dissemination, and transformation of their creations. In the case of unpublished fiction from the eighteenth or nineteenth century, it can be difficult—if not impossible—to determine authorial intention a century or more later. Whether the work was to its author a failed attempt at print, private work for a trusted few, or widely circulated by mail, this context has been stripped from the works in all but a few cases, and it is nearly impossible to recover. From my own perspective I tend to think of these works of fiction as rendered innocuous by the long passage of time and their general disconnect from life-writing. That said, further work with this corpus may reveal that those assumptions are wrong: that in the same way that zines are described by the Code as “so beautifully and wonderfully varied . . . often weird, ephemeral, magical, dangerous, and emotional”—so too may these earlier works, which I know to be varied, weird, and ephemeral already, prove to be magical, dangerous, and emotional as well.

Conclusion

In software development, one first designs for the average or general case: how one expects the typical program to run. But one must also plan for the uses that occur rarely, but catastrophically: the “edge cases”. Unpublished manuscript fiction provides an unusual and therefore useful edge case for discoverability, description, digitization, and dissemination. What is noted here is the current report from a work in progress: technology and the simple march of time will, in some ways, alleviate some of the specific challenges of transcription and copyright limits discussed in this essay. However, questions about our responsibility to material, and the double-edged sword of “access”, will remain as long as some notion of ownership, authority, and privacy remain—which is to say, for a long time to come.

Notes

  • 1 https://publications.newberry.org/digital/mms-transcribe/index
  • 2 I have also proven that this particular manuscript was a holograph copy of a published work of serialized fiction from the 1770s, but for the sake of this thought experiment I am working from our team’s initial assumption that the manuscript was an original work of fiction. For more about the Duke manuscript, see Friedman (2017).

References

Auburn University Digital Library., (2020). Civil war diaries and letters collections [online]. Available from: http://diglib.auburn.edu/collections/civilwardiaries/

Berthoud, H., Barton, J., Brett, J., Darms, L. et al., (2015). Zine librarians code of ethics [online]. Available from: https://zinelibraries.info/code-of-ethics/

Freedman, J. and Kauffman, R., (2013). Cutter and paste: A DIY guide for catalogers who don’t know about zines and zine librarians who don’t know about cataloging. In: M. Morrone, ed. Informed agitation: Library and information skills in social justice movements and beyond. Sacramento, CA: Library Juice Press, pp. 221-246. Available from: http://academiccommons.columbia.edu/item/ac:171812

Friedman, E., (2017). Becoming Catherine Morland: A cautionary tale of manuscripts in the archive. Persuasions. 39, 163-173.

Lang, A. and Rio-Ross, J., (2011). Using Amazon Mechanical Turk to transcribe historical handwritten documents. Code4Lib Journal [online]. 15. Available from: https://journal.code4lib.org/articles/6004

Mukurtu., (2007). Mukurtu: Our mission [online]. Available from: https://mukurtu.org/ about/

Newberry Library., (no date). Transcribing modern manuscripts [online]. Available from: https://publications.newberry.org/digital/mms-transcribe/index

Society of American Archivists., (2009). Orphan works: Statement of best practices [online]. Rev. 17 June. Available from: www2.archivists.org/sites/all/files/OrphanWorks-June2009.pdf

Society of American Archivists., (1997-2020). An introduction for users of archives and manuscript collections [online]. Available from: www2.archivists.org/publications/brochures/ copyright-and-unpublished-material

United Nations Educational, Scientific and Cultural Organization (UNESCO)., (1952). Universal copyright convention (as revised on 24 July 1971 and including Protocols 1 and 2), 6 September, U.N.T.S. No. 13444, vol. 943. pp. 178-325. Available from: www.refworld.org/docid/3ddb88377.html

World Intellectual Property Organization., (1979). Berne Convention for the Protection of Literary and Artistic Works. Geneva: World Intellectual Property Organization.
