JGI-KBase Co-development

Aligning services for our user community

The Joint Genome Institute (JGI), located at Lawrence Berkeley National Laboratory, is a Department of Energy Office of Science user facility that supports a broad range of biological researchers and educators. Many KBase users benefit from JGI’s contributions, including data generated from community projects and public databases such as the Genome Portal and the MycoCosm and Phytozome genome resources, all directly accessible from within KBase. KBase also provides users with the JGI assembly service (KBase login required), a precursor step to submitting data to JGI’s Integrated Microbial Genomes & Microbiomes (IMG/M) system.

In addition to these data resources, JGI and KBase share many data science goals. Aligning production-quality, open source services enables scalable computing, makes more efficient use of shared resources, and ensures comparability of data across our platforms.

Co-development Projects

Interactive Sequence Similarity Service

Sequence similarity search is a core requirement for biologists and data scientists working with biological data. The ability to quickly find data similar to their data of interest underlies many of our users' research questions, as it provides context and framing. With respect to the mutual institutional goal of adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles (Wilkinson et al., 2016), a sequence database that is shared across programs creates results that are findable and accessible, facilitates interoperability, and enables reuse and reproducibility. The JGI-KBase co-development team has adopted UniProt as a baseline provider of curated public databases (e.g., UniRef100) to ensure that content generated by JGI and KBase is mapped to the same resource, enhancing the FAIRness and comparability of these data. The service is currently available through the JGI IMG/M portal’s GeneSearch, and is undergoing testing and integration at KBase.

GitLab open source repository: GeneSearch

ID Mapping Service

Internal to the projects, JGI and KBase are creating protein sequence mapping files based on non-redundant (NR) sequence files (e.g., IMG-NR, KBase-NR) that allow us to directly reference identical sequences across both platforms. The reference database is UniRef100. This utility also allows NR files to be mapped to functional databases, including the Gene Ontology (GO) and InterPro, enabling consistent annotation results regardless of platform.
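To illustrate the idea of referencing identical sequences across platforms, here is a minimal Python sketch. It is not taken from the Seqidmap repository and does not describe how that service is implemented. It assumes that "identical" means an exact match after trimming whitespace, upper-casing, and dropping a trailing stop symbol, and it uses a SHA-256 digest as the comparison key. The function names (`sequence_key`, `build_id_map`) and all identifiers in the usage note are hypothetical.

```python
import hashlib


def sequence_key(seq):
    """Return a digest that is equal for identical protein sequences.

    Assumption: sequences are compared after trimming whitespace,
    upper-casing, and removing a trailing '*' stop symbol.
    """
    normalized = seq.strip().upper().rstrip("*")
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_id_map(reference, **platforms):
    """Map each reference ID to the platform IDs with an identical sequence.

    reference: dict of reference ID -> sequence (stand-in for a reference set)
    platforms: keyword arguments, each a dict of local ID -> sequence
    """
    # Index the reference set by digest so each lookup is constant time.
    ref_by_key = {sequence_key(seq): ref_id for ref_id, seq in reference.items()}

    id_map = {ref_id: {} for ref_id in reference}
    for platform, records in platforms.items():
        for local_id, seq in records.items():
            ref_id = ref_by_key.get(sequence_key(seq))
            if ref_id is not None:
                id_map[ref_id].setdefault(platform, []).append(local_id)
    return id_map
```

For example, calling `build_id_map({"REF_1": "MKTAYIAK"}, img={"IMG_a": "mktayiak"}, kbase={"KB_x": "MKTAYIAK*"})` groups both platform identifiers under `REF_1`. Once sequences from each platform resolve to the same reference identifier, annotations attached to that identifier can be looked up consistently from either side.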

GitLab open source repository: Seqidmap

Data Transfer Service between BER-funded programs

The BER Data Transfer Service will make it easy to find and move files between data platforms, whilst ensuring people, organizations, and funders get credit for their contributions.

The transfer of data between BER-funded programs often requires researchers to manually download and upload files. This takes time, introduces the potential for both human error and data corruption, and strips any citation or credit information from the data being transferred. We are revamping an existing service that connects KBase search to an older JGI data portal.

The newly funded effort to build a BER Data Transfer Service (DTS) enables us to create a shared service that is platform agnostic. Once fully integrated, any participating data platform will be able to directly request data files, along with citation metadata, from another connected platform, or send them to one. This service will ensure data integrity, enable better tracking of data reuse across BER- and agency-funded data platforms, and reduce the time and effort required by researchers to accomplish their science.
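The two properties described above, data integrity and citation metadata that travels with the files, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the DTS design or API: the manifest layout, the field names, and the functions `sha256_of`, `build_manifest`, and `verify_transfer` are all hypothetical, and SHA-256 is chosen here only as a common checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths, citation):
    """Bundle per-file checksums with citation metadata (hypothetical layout)."""
    files = []
    for path in map(Path, paths):
        files.append({
            "name": path.name,
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {"citation": citation, "files": files}


def verify_transfer(manifest, dest_dir):
    """Return names of files that are missing or altered at the destination."""
    failures = []
    for entry in manifest["files"]:
        target = Path(dest_dir) / entry["name"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            failures.append(entry["name"])
    return failures
```

In this sketch the sending platform would build the manifest before the transfer and the receiving platform would call `verify_transfer` afterwards; an empty list means every file arrived intact. Because the citation block sits in the same manifest as the checksums, credit information stays attached to the files it describes.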