KBase Credit Metadata Schema

As part of KBase’s commitment to promote open science, we offer users the ability to obtain a DOI (Digital Object Identifier) for their work, which can then be cited in an associated scientific publication. To further support the community-wide shift towards FAIR (Findable, Accessible, Interoperable, Reusable) data, KBase is expanding its data descriptors so that KBase DOIs carry comprehensive citations for datasets, in addition to referencing the publications and software used in the workflow. This helps encourage a culture of attribution for all research inputs and outputs, which is standard practice for literature but still relatively new for software products and datasets. It also promotes open science by building trust that contributors get credit for their work, and accelerates knowledge discovery by supporting and incentivizing the release of data.

Users of KBase will be familiar with KBase Narratives, the reproducible notebooks in which data are analyzed and turned into discoveries. As covered in a previous news article, these notebooks can be converted into Static Narratives (https://www.kbase.us/static-narratives/), which capture a publishable snapshot of the research. Once the user has created a well-documented Static Narrative, they can ask KBase to obtain a DOI for it. This makes the Narrative Findable and Accessible (the FA in FAIR); KBase takes care of Interoperability and Reusability (the IR in FAIR). The end result is a research product with a citable, resolvable, persistent ID that adheres to the FAIR data principles [1].

There are already over 50 published FAIR Narratives with DOIs, including tutorials, teaching materials, and datasets (see https://github.com/kbase/credit_engine/blob/develop/docs/kbase_dois.md). To create a FAIR Narrative and get a DOI, contact us at [email protected].

KBase is engaged with open science efforts to ensure that credit and attribution are inherent to the research process. For example, when a FAIR Narrative DOI is minted, all software and references with DOIs are cited. As sample information, data, and data products are contributed to the KBase platform, we strive to ensure that sample and data ownership are recognized, and that all downstream analysis and reuse of those resources are tracked through the KBase provenance system. But KBase can do better [2]! The publishing world of persistent identifiers (PIDs) is growing, with more and better ways to track research products beyond just those published with DOIs. A new and improved system of recognition will ensure that the collectors of samples and the creators of data receive credit for their contributions. By providing rich metadata for all research products, and by leveraging the KBase provenance system, we can link derived data to the original work and quantify the reuse and impact of shared data. The goal is to foster a culture of collaborative science that accelerates discovery and innovation.

To support this mission and to provide a model for collaborators, we have created the KBase Credit Metadata Schema (KBCMS; https://kbase.github.io/credit_engine/) to capture citation metadata for data of all kinds. The initial targets for KBCMS metadata collection are samples and data (e.g., assemblies, genomes, and metagenomes), but we anticipate being able to extend the schema, with minimal changes, to cover other citable KBase products such as reports, tutorials, or workflows.

The fields in the schema are based on the Earth Science Information Partners (ESIP) recommendations for dataset citation [3], which include: author(s); date of publication; dataset title; publisher/repository; a persistent resolvable identifier (e.g., DOI); dataset version (optional); and the date the dataset was accessed, if appropriate. The KBCMS can also capture additional metadata, such as the role of each contributor to the project (using the Contributor Roles Taxonomy, CRediT [4]) and the funding source(s) for the project.
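
As a rough illustration of the ESIP-recommended fields listed above, the following Python sketch models a dataset citation record and renders it as a citation string. The class and field names are hypothetical and chosen for readability; they are not the actual KBCMS terms, which are defined in the schema documentation. The example values are invented placeholders, including the DOI.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Contributor:
    """A person credited on a dataset (hypothetical structure)."""
    name: str
    # Contributor roles, e.g. CRediT role names such as "Data curation".
    roles: List[str] = field(default_factory=list)


@dataclass
class DatasetCitation:
    """ESIP-style citation fields; names are illustrative, not KBCMS terms."""
    authors: List[Contributor]
    publication_year: int
    title: str
    publisher: str
    identifier: str  # persistent resolvable identifier, e.g. a DOI
    version: Optional[str] = None
    access_date: Optional[str] = None
    funders: List[str] = field(default_factory=list)

    def format(self) -> str:
        """Render the record as a single citation string."""
        names = "; ".join(a.name for a in self.authors)
        parts = [f"{names} ({self.publication_year}). {self.title}."]
        if self.version:
            parts.append(f"Version {self.version}.")
        parts.append(f"{self.publisher}.")
        parts.append(self.identifier)
        if self.access_date:
            parts.append(f"(accessed {self.access_date})")
        return " ".join(parts)


# Placeholder values for illustration only.
example = DatasetCitation(
    authors=[Contributor("Doe, J.", roles=["Data curation"])],
    publication_year=2023,
    title="Example metagenome assembly",
    publisher="KBase",
    identifier="https://doi.org/10.0000/example",
    version="1",
)
print(example.format())
```

The optional fields (version, access date) are only emitted when present, mirroring the "optional" and "if appropriate" qualifiers in the ESIP list.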

The KBCMS was designed by comparing existing schemas used by agencies that support or consume data citations, including DataCite, Crossref, the U.S. Department of Energy Office of Scientific and Technical Information (OSTI), and ORCID. The KBCMS was then tested against metadata available from existing and prospective data sources, including the Joint Genome Institute (JGI), the National Microbiome Data Collaborative (NMDC), Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE), the Environmental Molecular Sciences Laboratory (EMSL), and others.

Mappings (crosswalks) from KBCMS terms to DataCite, Crossref, OSTI, and ORCID are included in the schema file. Terms have also been mapped to schema.org terms to improve search engine indexing and to populate Google Dataset Search.
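
To give a sense of how such a crosswalk can be applied, here is a minimal Python sketch that renames fields in a citation record to schema.org Dataset properties and wraps the result as JSON-LD. The source field names and the mapping table are assumptions made for this example; the authoritative KBCMS mappings are the ones in the schema file. The target property names (name, creator, datePublished, publisher, identifier, version) are schema.org terms, though a production mapping would likely emit typed Person or Organization objects rather than plain strings.

```python
import json
from typing import Any, Dict

# Hypothetical crosswalk from illustrative field names to schema.org
# Dataset properties. The real KBCMS mappings live in the schema file.
TO_SCHEMA_ORG: Dict[str, str] = {
    "title": "name",
    "authors": "creator",
    "publication_date": "datePublished",
    "publisher": "publisher",
    "identifier": "identifier",
    "version": "version",
}


def to_json_ld(record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename mapped fields and wrap them as a schema.org Dataset."""
    doc: Dict[str, Any] = {"@context": "https://schema.org", "@type": "Dataset"}
    for source_key, value in record.items():
        target_key = TO_SCHEMA_ORG.get(source_key)
        if target_key is not None:  # fields without a mapping are skipped
            doc[target_key] = value
    return doc


# Placeholder record for illustration only.
record = {
    "title": "Example metagenome assembly",
    "authors": ["Doe, J."],
    "publication_date": "2023-01-01",
    "publisher": "KBase",
    "identifier": "https://doi.org/10.0000/example",
    "internal_note": "not mapped",
}
print(json.dumps(to_json_ld(record), indent=2))
```

Keeping the crosswalk as a plain lookup table means a mapping to another target (for example DataCite or Crossref) could in principle be added as another table without changing the conversion function.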

The schema is written in LinkML, with automated translations available as JSON Schema and Python classes. The KBCMS complies with the Joint Declaration of Data Citation Principles established by FORCE11 [5] and with community standards, and is interoperable with schema.org and JSON-LD.

For more information on creating a FAIR Narrative and getting a DOI, please contact [email protected].

The KBase Credit Metadata Schema and accompanying documentation can be found at https://kbase.github.io/credit_engine/

References

[1] Wilkinson, M.; Dumontier, M.; Aalbersberg, I. et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018. https://doi.org/10.1038/sdata.2016.18
[2] Wood-Charlson, E.M.; Crockett, Z.; Erdmann, C.; Arkin, A.P.; Robinson, C.B. (2022) Ten Simple Rules for Getting and Giving Credit for Data. PLoS Comput. Biol. 18, e1010476. https://doi.org/10.1371/journal.pcbi.1010476
[3] ESIP Data Preservation and Stewardship Committee (2019) Data Citation Guidelines for Earth Science Data, Version 2. ESIP. Online resource. https://doi.org/10.6084/m9.figshare.8441816.v1
[4] CRediT. Available online: https://credit.niso.org/
[5] Data Citation Synthesis Group (2014) Joint Declaration of Data Citation Principles. FORCE11. https://doi.org/10.25490/a97f-egyk

About the Authors

AJ Ireland

Elisha Wood-Charlson

Lawrence Berkeley National Laboratory