Use dsub in the All of Us Researcher Workbench

Overview 

This document describes dsub support in the All of Us Researcher Workbench.

Audience 

Researchers with Controlled Tier access who are processing large volumes of data. This document assumes basic familiarity with Google Cloud, including buckets and Docker.

Notes

This document focuses on using dsub for genomic data, but it is applicable to processing any data that is accessible in Researcher Workbench.

Introduction 

dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud. With dsub, you can write a shell script and then submit it to a job scheduler from Jupyter. Unlike Cromwell (with WDL) and Nextflow, dsub does not use a domain-specific language: your job is an ordinary bash script. dsub supports Google Cloud as the backend batch job runner. Refer to the dsub documentation for additional guidance on writing dsub commands and more example dsub scripts.
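For illustration, a minimal dsub submission might look like the sketch below. The project name, bucket paths, and script name are placeholders, not values from the Workbench; substitute your own workspace values.

```shell
# Sketch of a minimal dsub submission (placeholder project and bucket names).
# --provider google-cls-v2 selects the Google Cloud Life Sciences API backend,
# which is the executor used in the Researcher Workbench.
# dsub localizes the --input file for the script and copies the --output
# file back to the bucket when the job finishes.
dsub \
  --provider google-cls-v2 \
  --project my-gcp-project \
  --regions us-central1 \
  --logging gs://my-bucket/dsub/logs \
  --input INPUT_FILE=gs://my-bucket/data/sample.vcf.gz \
  --output OUTPUT_FILE=gs://my-bucket/results/sample-ids.txt \
  --image us.gcr.io/broad-gatk/gatk:4.2.5.0 \
  --script my_script.sh \
  --wait
```

The `--wait` flag blocks until the job completes; omit it to submit the job and return immediately, then poll with dstat.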

Refer to the dsub Tutorial Notebook for more guidance on getting started with dsub. This Notebook walks through:

  • Setting up dsub within the Researcher Workbench.
  • Best practices with dsub and how to debug dsub workflows.
  • Extracting sample IDs from a VCF file (from Alpha 3) with dsub, using a bash script from the DataBiosphere repository.
  • Accessing PLINK files (from Alpha 3) in parallel and counting the number of lines in each.
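As a local illustration of the pattern the last bullet describes (assumed shapes, not the notebook's exact code): dsub stages each `--input` file onto the job's VM and exposes its local path through an environment variable, so the script you submit stays a plain bash script. You can test such a script on your own machine by setting the variables yourself:

```shell
# count_lines.sh: the kind of script you would pass to dsub via --script.
# dsub sets ${INPUT_FILE} to the local path of the staged --input file
# and ${OUTPUT_FILE} to a local path that is copied back to the bucket.
cat > count_lines.sh <<'EOF'
#!/bin/bash
set -o errexit -o nounset
wc -l < "${INPUT_FILE}" > "${OUTPUT_FILE}"
EOF

# Simulate the environment dsub would provide, using a local test file.
printf 'a\nb\nc\n' > sample.bim
INPUT_FILE=sample.bim OUTPUT_FILE=line_count.txt bash count_lines.sh
cat line_count.txt   # prints 3
```

Because the script only refers to the environment variables, the same file works unchanged whether it is run locally for debugging or submitted with dsub.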

Within the Researcher Workbench, the Google Cloud Life Sciences API is the executor of dsub tasks. 

If you have any feedback or questions on using dsub, reach out to support@researchallofus.org.

Suggestions for running dsub tasks

With dsub, you can check the status of a job at any time by running the dstat command. This will work both in a Notebook and in a Terminal session. 
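For example (the job ID and project name below are placeholders):

```shell
# Check the status of a submitted job. --status '*' shows jobs in any
# state, and --full prints detailed fields, including the events and
# error messages that are useful for debugging failed tasks.
dstat \
  --provider google-cls-v2 \
  --project my-gcp-project \
  --jobs 'my-job--user--240101-120000-00' \
  --status '*' \
  --full
```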

See https://github.com/DataBiosphere/dsub/blob/main/docs/troubleshooting.md for more examples.

Limitations

Only Google Container Registry is supported for Docker images

All tasks must run public Docker images from Google Container Registry (GCR). Therefore, Docker Hub images will not work in Researcher Workbench workflows. GCR image URLs typically start with `us.gcr.io/` (e.g., `us.gcr.io/broad-gatk/gatk:4.2.5.0` for the GATK 4.2.5.0 image in GCR), as opposed to the short names used by Docker Hub (e.g., `broadinstitute/gatk:4.2.5.0` for the same image on Docker Hub).

If you have a specific image that you want to use and cannot find it in GCR, please reach out to our support team at support@researchallofus.org.

Docker container images in GCR

Each of these Docker images is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries, and settings. Below is a list of common tools and the associated public Docker images in GCR that work with dsub:
