Respectful Online Access
From the outset, the project realized that there is a difference between material being publicly available in a repository and what it can mean for individuals, families, and communities when that material is scanned and made available online. On the one hand, digitization and online access are needed to mitigate the significant geographical and financial barriers to consulting material in archival repositories. At the same time, it is important to carefully select what material will be broadcast online. This can be especially important in areas like the Southwest, where Native communities feel keenly the history of outside individuals extracting information and physical objects in colonial encounters, with effects that continue today.3 The Indigenous Digital Archive provides explanation and direction for creating respectful online access.4
Technological Frameworks to Empower Community Researchers
The Indigenous Digital Archive toolkit is a set of open source software components built over the Omeka-S content management system, which is widely popular with libraries, museums, and digital humanities projects, but designed to be modular so it is adaptable to other digital repository systems. The IDA toolkit was designed to leverage technological advances of the last half decade to meet community needs and challenges articulated in the library and archives professions.5
Since the beginning of social media and Web 2.0, those in charge of providing access to archives and special collections have talked about the need to create interfaces that bring in diverse voices and share the authority for describing material, particularly for detailed descriptions.6 However, even what was long a model experiment in adding user-driven information to archival records—the Polar Bear Expedition Digital Collections—is no longer available, and has left almost no trace online: without another major project, it could not be integrated with tools serving needs in preservation and access.8 The need for an effective toolkit compatible with the forward migration of content management systems and digital preservation has not flagged, however. In Kate Theimer’s 2011 volume collecting examples and plans of digital engagement and crowdsourcing in repositories, Yakel notes “an undiscovered power of the social web for cultural institutions” that can be engendered when “authority for the description and representation of [online materials] is shared.”9
In recent years, the impetus for allowing people to contribute to knowledge about collections online has given rise to projects like the Citizen Science projects of Zooniverse, Zooniverse’s new AnnoTate transcription project, and the Smithsonian Transcription Center. These, and similar recent projects, have developed strong tools that allow repositories to offer very specific tasks online, such as full-text transcription, correction of automated optical character recognition (OCR), and transcription of portions of regular records, such as weather logs, an institution’s register of bird specimens, or logs of troop movements, into structured data. In each case the design centers on a defined and specific task to accomplish, rather than an interface to aid exploration of digitized material. In a recent analysis of existing tools facilitating crowdsourced tasks, Ben Brumfield and Mia Ridge10 noted that the “line-at-a-time, queue-oriented, multi-track transcription workflow” characteristic of the Zooniverse interfaces doesn’t allow users to return to something they’ve worked on, or to see what others have done and discuss among themselves.
These interfaces are designed to collect a limited range of structured data according to a particular research design, and are becoming increasingly sophisticated at directing the workflow. However, they are not tools for exploring collections beyond a specific kind of encounter. The narrow task orientation and gamification of some interfaces can mean that a user is presented with a single image of text that is deeply interesting to them, only to see it whisked away after they’ve completed a transcription task, with no clue where it came from, nor any ability to see the whole item.
As a tool allowing more flexible interactions with content, FromThePage has made great advances in user interactions around transcriptions and translations, OCR correction, and exploring content based on indexed key terms. FromThePage also integrates and allows use of content stored in the Internet Archive or Omeka, an open source content management system adopted by many GLAMs (Galleries/Libraries/Archives/Museums), and allows users to create articles on indexed key terms. Most recently, FromThePage has also incorporated the International Image Interoperability Framework (IIIF) standard. Being able to work with images in an IIIF-compatible environment brings many advantages in user interface, such as smooth deep zooming and the ability to work with and compare material from multiple IIIF-compatible repositories in one browser interface.
However, tools are still needed to enable work with mass digitized documents. As Brumfield and Ridge11 note, FromThePage is designed for work with small collections, and its limited discovery interface means it would be prohibitive to use on collections of hundreds or thousands of documents. Additionally, indexing in FromThePage depends on a transcription in which the subjects are identified and hyperlinked. For mass digitized archival documents, access needs are not always met by transcription. This is not only because full transcription or OCR correction is usually much more time consuming than selecting a tag (a name, event, concept, or place) that would be meaningful for someone looking for the content, but also because what often would be used as a keyword does not actually appear in that text. (For example, a derivative, alternate, or misspelled form of a name is used, or what would receive a keyword tag of “boarding school deaths” appears in euphemistic language.)
The need to create online access to allow collections to reach a wider group of users means that repositories do continue to look to mass digitization as part of their strategies.12 Usability studies on the posting of a large backlog of tens of thousands of archival photographs with minimal processing found that users preferred to be able to access the material even with the most minimal metadata, and in some cases supplied information about the items that staff then used to augment the catalog record.13 At the Hoover Institution Archives at Stanford University, California, which holds over 27 linear miles of mainly 20th-century material, Miller14 has suggested mass digitizing unprocessed archival collections, with searchability provided through OCR, the idea being that this would give users a level of accessibility familiar to them from Google searches.
The need for more user-directed tagging while exploring content, as opposed to straight transcription or the narrow task orientation of Zooniverse, has led some projects to find a solution in outsourcing the content (and contributions) to commercial ventures such as Flickr. While people are eager to interact with digitized historic cultural content online in a collaborative environment, and Web 2.0 tools can enhance a person’s ability to find the digitized documents they’re looking for and to provide input into what is presented with the material,15 using these commercial content services introduces its own problems: the exposure of users to commercial data-mining now or in the future, the large expenditure of effort on creating material that is “locked” inside the company’s system, the need for additional digital preservation activities (web harvesting or web archiving of the site), and limitations of format (e.g., single images only). Additionally, such services are a poor match for potentially culturally sensitive material, or for communities whose members don’t feel comfortable with their contributions becoming part of a commercial venture rather than being curated by a museum or archive.
With respect to user-directed exploration and tagging, the Mukurtu content management system, with an emphasis on responding to cultural sensitivity issues in a digital repository, has built an effective system for gating and providing access to material appropriate to one’s tribal affiliation, clan, gender, age group, and other Indigenous community-defined considerations. Users may create tags and add commentary, but only at the level of the entire digital object and catalog record. For now, this constraint limits the ability of Mukurtu to provide access to mass digitized material with the same kind of interactivity, or to allow someone to show exactly where in a passage or multi-page document their comment applies.
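The kind of community-defined gating described above can be sketched roughly as follows. The field names (affiliation, clan, gender) and the rule logic are hypothetical illustrations of the general pattern, not Mukurtu's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Protocol:
    """One community-defined access rule attached to a digital object.
    An empty set means 'no restriction on this dimension' (illustrative)."""
    affiliations: set = field(default_factory=set)
    clans: set = field(default_factory=set)
    genders: set = field(default_factory=set)

    def permits(self, user) -> bool:
        return (
            (not self.affiliations or user.affiliation in self.affiliations)
            and (not self.clans or user.clan in self.clans)
            and (not self.genders or user.gender in self.genders)
        )

@dataclass
class User:
    affiliation: str = ""
    clan: str = ""
    gender: str = ""

def visible_items(items, user):
    """Return only the objects whose protocols all permit this user."""
    return [obj for obj, protocols in items
            if all(p.permits(user) for p in protocols)]
```

A repository would then filter search results through `visible_items` before display, so that restricted material is never shown to users outside the community-defined groups.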
Talking about the potentials for crowdsourcing in increasing access to collections at the May 2015 Crowd Consortium workshop (as part of the IMLS-funded National Forum in Crowdsourcing for Libraries and Archives), HathiTrust’s Jeremy York noted that while we hold collections for the benefit of our communities, “yet many in the community do not know what is in our collections, [and] they can’t find the useful materials; we are not being as effective as we could in fulfilling our vision.”16 This challenge is still urgent to address.
The IDA’s standards-based toolkits help bridge that gap. The IDA’s software tools interface with the Omeka-S digital content management system for cultural heritage institutions to create an online access and collaboration layer enabling effective access to mass-digitized archival documents, including typescript and print documents as well as other images often highly resistant to automated processes such as OCR, through a suite of interactive features based on the International Image Interoperability Framework standard (IIIF, http://iiif.io) and Open Annotation.
Use of IIIF enables a suite of abilities: a user can quote all or part of an image using just a URL; zoom deeply and seamlessly; add keyword tags17 or annotations to a portion of an image or even to a range of images; and, as seen in IIIF clients such as the Mirador viewer, view and work with objects from multiple IIIF-enabled repositories in one browser interface.18 Use of the Open Annotation format (the standard used in the web annotation software Hypothes.is, maintained by the W3C, the international organization responsible for the standards that let the web communicate across different languages and applications) addresses the need to create online collaborations, including crowdsourcing applications, where the data will stay linked to the source images. The data would also be maintained in an internationally agreed standard format that keeps it sustainable and useful outside of any one particular software application, aiding long-term digital stewardship.
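The two standards can be illustrated concretely. The sketch below, using a hypothetical server and image identifier, builds an IIIF Image API URL that "quotes" a single region of an image, and a W3C Web Annotation that pins a keyword tag to that same region via a media-fragment selector:

```python
import json

def iiif_region_url(base, identifier, x, y, w, h):
    """A URL citing just one region of an image, following the IIIF
    Image API pattern {id}/{region}/{size}/{rotation}/{quality}.{format}."""
    return f"{base}/{identifier}/{x},{y},{w},{h}/full/0/default.jpg"

def tag_annotation(target_image, x, y, w, h, tag):
    """A W3C Web Annotation attaching a keyword tag to an image region,
    so the contributed data stays linked to the source image."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "body": {"type": "TextualBody", "purpose": "tagging", "value": tag},
        "target": {
            "source": target_image,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }

# Hypothetical repository and document identifier, for illustration only.
url = iiif_region_url("https://example.org/iiif", "ledger-p12", 100, 200, 640, 480)
anno = json.dumps(tag_annotation("https://example.org/iiif/ledger-p12",
                                 100, 200, 640, 480, "boarding school deaths"))
```

Because both the region URL and the annotation are plain, standards-conformant data, they remain meaningful to any IIIF or Web Annotation client, independent of the software that produced them.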
The IIIF standard has been adopted by U.S. libraries and institutions such as ARTstor; the libraries of Stanford, Yale, Princeton, and Harvard universities; and now the Smithsonian Institution, among others. The Digital Public Library of America (DPLA) encourages it as a way for contributing repositories to have better representation of their content. Internationally, IIIF has been adopted by the Bibliothèque nationale de France; the National Libraries of Wales, Austria, Denmark, Israel, Poland, Serbia, Norway, Australia, and New Zealand; Oxford University’s Digital Bodleian; and others on an expanding list of participating organizations. Shims or patches were made for existing repository systems such as CONTENTdm and the Internet Archive ahead of formal adoption.
The IDA project also builds on the Universal Viewer, an open source project that enables cultural heritage institutions to present their digital artifacts in an IIIF-compliant and highly customizable user interface. The viewer was initially developed to provide an interface to the content of the British Library’s many different subunits and to the Wellcome Library’s emerging Digital Library Cloud Services (DLCS), a service providing IIIF-compliant image hosting with additional services for cultural heritage digital projects, including OCR indexing and searching, annotation storage, and easy-to-use APIs.
Tools the IDA has developed greatly complement new tools developed for Omeka-S, a rewrite of Omeka code that came from an IMLS National Leadership grant project that began one year earlier than ours. For example, the new LOD (linked open data) tools Omeka-S is implementing to aid standardization in collection- and object-level catalog metadata are complemented by the IDA tool allowing semantic tagging by users within the documents. Additionally, others could take the parts of the semantic tagging toolkit that best apply to their own applications.
We also began an experiment using natural language processing (NLP) techniques for computer-assisted indexing to build starting points for people to access the records. Our NLP toolkit identifies dates, geographic places, names of individuals, organizations, and tribes, and other topics.
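As a rough illustration of this kind of computer-assisted indexing: the production toolkit uses trained NLP models, but the date pattern and tiny gazetteer below stand in for those recognizers to show how extracted entities can seed starting points for access. All entries are invented for illustration:

```python
import re

# A month-day-year date pattern (one of many date forms a real model handles).
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)\s+\d{1,2},\s+\d{4}\b")

# Illustrative gazetteer; a real system would use trained entity recognizers
# and authority lists for people, organizations, tribes, and places.
GAZETTEER = {
    "Santa Fe": "PLACE",
    "Albuquerque": "PLACE",
}

def index_entities(text):
    """Return (surface form, entity type) pairs to seed an access index."""
    hits = [(m.group(0), "DATE") for m in DATE.finditer(text)]
    for term, label in GAZETTEER.items():
        if term in text:
            hits.append((term, label))
    return hits
```

Entities extracted this way become candidate facets and tags that users can confirm, correct, or extend, rather than final catalog data.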
Further, as the ultimate goal of this project is to build an interface focused on creating the best user experience possible, the IDA builds on emerging work in developing “generous interfaces”: user interfaces that, rather than presenting the user with just an empty search box, organize and present preliminary faceted data so as to give a person the most information in the most apprehendable way, with the most effective possibilities for next steps of interaction.
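The faceted overview behind a generous interface can be sketched minimally. The metadata fields and sample records here are invented for illustration; the point is that a landing page computes and displays value counts a user can browse before ever typing a query:

```python
from collections import Counter

# Hypothetical sample metadata records.
records = [
    {"tribe": "Diné", "decade": "1900s", "type": "letter"},
    {"tribe": "Diné", "decade": "1910s", "type": "photograph"},
    {"tribe": "Hopi", "decade": "1900s", "type": "letter"},
]

def facets(records, fields):
    """Count values per field so the landing page can present an overview
    of the collection instead of an empty search box."""
    return {f: Counter(r[f] for r in records if f in r) for f in fields}

overview = facets(records, ["tribe", "decade", "type"])
```

Each facet value then links to the matching subset of records, giving the user an informative first view and a concrete next step.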
We matched these national needs for tools among libraries, archives, and museums with the needs articulated among our local communities to refine the toolkits.