JGI-KBase Co-development

Aligning services for our shared user community

The Joint Genome Institute (JGI), located at Lawrence Berkeley National Laboratory, is a Department of Energy Office of Science user facility that supports a broad range of biological researchers and educators. Many KBase users benefit from JGI’s contributions, including data generated from community projects and public databases such as the Genome Portal and the MycoCosm and Phytozome genome resources, which are directly accessible from within KBase. KBase also provides users with the JGI assembly service (KBase login required), a precursor step to submitting data into JGI’s Integrated Microbial Genomes and Microbiomes (IMG/M) system.

In addition to these data resources, JGI and KBase share many data science goals. Aligning production-quality, open source services enables scalable computing, makes more efficient use of shared resources, and ensures comparability of data across our platforms.

Co-development Projects

Interactive Sequence Similarity Service

Sequence similarity search is a core requirement for biologists and data scientists working with biological data. The ability to quickly find data similar to the data of interest underlies many research questions, because it provides context and framing. JGI and KBase share the institutional goal of adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles (Wilkinson et al., 2016). A sequence database that is shared across programs produces results that are findable and accessible, facilitates interoperability, and enables reuse and reproducibility. The JGI-KBase co-development team has adopted UniProt as a baseline provider of curated public databases (e.g., UniRef100) to ensure that content generated by JGI and KBase is mapped to the same resource, enhancing the FAIRness and comparability of these data. The service is currently available through the JGI IMG/M portal’s GeneSearch and is undergoing testing and integration at KBase.

ID Mapping Service

Internal to the projects, JGI and KBase are creating protein sequence mapping files based on non-redundant (NR) sequence files (e.g., IMG-NR, KBase-NR) that allow us to directly reference identical sequences across both platforms, with UniRef100 as the reference database. This mapping also allows NR files to be mapped to functional databases, including the Gene Ontology (GO) and InterPro, enabling consistent annotation results regardless of the platform users are working with.
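
To make the idea concrete, the following is a minimal sketch of how identical protein sequences in two NR sets could be tied to a shared reference. It is not the production ID Mapping Service: the identifiers, the sequences, the normalization rules, and the function names are all hypothetical, and exact matching by SHA-256 digest is an assumption chosen for illustration.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq):
    """Return a digest that is equal only for identical normalized sequences."""
    # Assumed normalization: trim whitespace, uppercase, drop a trailing stop '*'.
    normalized = seq.strip().upper().rstrip("*")
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_id_mapping(reference, platform_sets):
    """Map each reference ID to the platform IDs that carry the same sequence.

    reference:     {reference_id: sequence}
    platform_sets: {platform_name: {local_id: sequence}}
    """
    ref_by_key = {sequence_key(seq): ref_id for ref_id, seq in reference.items()}
    mapping = defaultdict(lambda: defaultdict(list))
    for platform, records in platform_sets.items():
        for local_id, seq in records.items():
            ref_id = ref_by_key.get(sequence_key(seq))
            if ref_id is not None:  # sequences absent from the reference are skipped
                mapping[ref_id][platform].append(local_id)
    return {ref_id: dict(by_platform) for ref_id, by_platform in mapping.items()}


# Hypothetical records; none of these IDs or sequences are real database entries.
reference = {"UniRef100_EXAMPLE1": "MKTAYIAKQR", "UniRef100_EXAMPLE2": "MSLNDELLKG"}
img_nr = {"IMG_0001": "mktayiakqr", "IMG_0002": "MAAAA"}
kbase_nr = {"KB_0001": "MKTAYIAKQR*", "KB_0002": "MSLNDELLKG"}

mapping = build_id_mapping(reference, {"IMG-NR": img_nr, "KBase-NR": kbase_nr})
```

In this sketch, a sequence present on both platforms resolves to the same reference ID, so any functional annotation attached to that reference entry (for example a GO term or an InterPro family) could be looked up identically from either side.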

Future work – Data Transfer Service between BER-funded programs

The transfer of data between BER-funded programs often requires researchers to manually download and upload files. This takes time, adds the potential for error (both human error and loss of data integrity), and strips any citation or credit information from the data being transferred. We are revamping the Data Transfer Service (DTS) that connects KBase search to the JGI data portal. The new service will be platform agnostic, allowing any BER-funded program to directly request or send data objects together with their appropriate citation metadata. This service will ensure data object integrity and enable better tracking of data reuse across the BER-funded portfolio, while also reducing the time and effort researchers need to accomplish their science.
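
As an illustration of the two guarantees described above, integrity and retained credit, the sketch below pairs a data object with a small manifest that carries a checksum and a citation string. This is not the DTS design or its API: the manifest fields, the function names, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferManifest:
    """Hypothetical record that travels with a data object."""
    object_id: str
    sha256: str    # checksum computed by the sender
    citation: str  # credit information kept alongside the data


def make_manifest(object_id, payload, citation):
    """Build a manifest for the bytes that are about to be sent."""
    return TransferManifest(object_id, hashlib.sha256(payload).hexdigest(), citation)


def verify_transfer(manifest, received):
    """Return True only if the received bytes match the sender's checksum."""
    return hashlib.sha256(received).hexdigest() == manifest.sha256


# Placeholder object and citation, not real records.
payload = b">contig_1\nACGTACGTACGT\n"
manifest = make_manifest("example-object-001", payload, "Example Lab (placeholder citation)")

intact = verify_transfer(manifest, payload)          # bytes arrived unchanged
corrupted = verify_transfer(manifest, payload[:-1])  # one byte lost in transit
```

Because the citation is stored in the same record as the checksum, a receiving platform that accepts the object also receives the credit information, which is the property that manual download and upload loses.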