Scientific Datasets: Informetric Characteristics and Social Utility Metrics for Biodiversity Data Sources
Abstract This contribution places biodiversity datasets in relation to other central elements of the modern scientific communication system and defines quantitative analyses of the metadata of such datasets as belonging to the intersection of Scientometrics and Webometrics. The analyses show that rank distributions of social utility evidence, such as search events and retrieved and viewed dataset records over a given range of datasets, follow power-law characteristics. A variety of dataset usage index (DUI) metrics is exemplified and illustrated by dataset indicators from three US and Danish dataset providers of large, medium and small size, observed over a one-year period and compared to recent developments. The metrics discussed are absolute as well as relative in nature and include popularity, social attractiveness, and usage and interest impact scores.
Keywords Science communication • Biodiversity datasets • Webometric analysis • Social utility • Altmetrics • Dataset usage • Usage indicators • Rank distributions • Power law
Scientific datasets are becoming increasingly vital to understand as a central component of the modern scientific communication process (Fig. 1). As with academic publications indexed in traditional citation databases, such as the Web of Science, PubMed or SCOPUS, entire datasets are rarely deleted from the database or archive. Their original records are rarely edited or erased, but datasets, in particular biodiversity datasets, may indeed be updated and grow in number of records over time, or be modified or restructured. This characteristic is associated with the potential for change also observed in many Web-based documents. However, unlike references given in academic publications, which credit influence or direct knowledge import from other publications, no common standards are available for crediting scientific datasets across the array of disciplines (Green 2009). Thus, none of the aforementioned citation-based systems explicitly takes into account scientific datasets as targeted objects for use in academic work.
Fig. 1 The scientific communication process. Revised from Ingwersen (2011)
For biodiversity data, a task force was working on this issue in order to generate recommendations for the foundation of a workable citation mechanism (Moritz et al. 2011). In addition, a set of Data Usage Index (DUI) indicators has been developed (Ingwersen and Chavan 2011). The central indicators for the development of a DUI were based on search events and dataset download instances. The DUI is also intended to provide novel insights into how scholars make use of primary biodiversity data in a variety of ways. Similar to scientometric analyses that apply rank distributions, time series, impact measures and other calculations based on academic publications (Moed 2005), the social usage of primary biodiversity datasets has led to observations of their statistical characteristics as well as to the development of a family of indicators and other derived significant measures. The indicators can be regarded as a kind of social utility metrics which, like citations, ratings or recommendations, may be applied as impact measures in research evaluation and may form supporting relevance evidence for retrieval purposes (Ingwersen and Järvelin 2005).
Initially, the presentation places the biodiversity dataset indicators within the framework of Informetrics, as a sub-section of scientometric analysis and associated with Webometrics. This is followed by examples of selected rank distribution properties of biodiversity datasets, in order to observe whether such distributions are similar to those observed for academic journals and articles, i.e. whether they follow Bradford-like long-tail distributions. In such power-law-like cases it is expected that information management solutions similar to those used in repository management and libraries can be applied to biodiversity datasets. In addition, one may expect such statistical properties to lead to useful social utility-based research monitoring metrics. A selection of DUI indicators that are useful from this perspective, such as Usage and Interest Impact scores and relative data usage impact, will be highlighted and exemplified. The presentation ends with a brief discussion of the consequences of the biodiversity dataset characteristics from the perspectives of dataset management, retrieval and evaluation.
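As an illustration of what checking for a power-law-like rank distribution can involve, the following minimal Python sketch sorts usage counts in descending order and fits a straight line to log(count) against log(rank). It is not the procedure used in this contribution: the function name `fit_rank_power_law` is hypothetical, the counts are synthetic values generated for the example rather than observed dataset figures, and an ordinary least-squares fit on log-log axes is only a rough diagnostic of long-tail behaviour.

```python
import math


def fit_rank_power_law(counts):
    """Fit log(count) = intercept + slope * log(rank) by least squares.

    Hypothetical helper for illustration only. Returns the tuple
    (slope, intercept, r_squared) of the log-log fit.
    """
    # Rank 1 is the most used dataset; zero counts have no logarithm.
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the straight-line fit.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Synthetic counts for 50 made-up datasets: count = 1000 * rank^(-1.2).
# These are NOT observed values from any dataset provider.
synthetic_counts = [1000.0 * rank ** -1.2 for rank in range(1, 51)]
slope, intercept, r_squared = fit_rank_power_law(synthetic_counts)
print(f"slope={slope:.3f}, r_squared={r_squared:.4f}")
```

A slope that is clearly negative together with a high coefficient of determination is consistent with a long-tail rank distribution, but it does not establish a power law on its own; real usage data would call for a more careful statistical treatment.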