Workflows in the All of Us Researcher Workbench: Nextflow and Cromwell

Overview: This document describes workflow support in the All of Us Researcher Workbench (RWB), including what workflows are and when to use them.

Audience: Researchers with Controlled Tier access who are processing large volumes of data. This document assumes basic familiarity with Google Cloud, including Cloud Storage and Docker.  

Notes: This document focuses on workflows for genomic data, but workflows are applicable to processing any data that is accessible in RWB.

Introduction

The Researcher Workbench supports two workflow engines: Nextflow (version 21.03.0-edge and above) and Cromwell/WDL (version 76 and above). Using either engine effectively requires familiarity with Google Cloud Storage, virtual machines (VMs), the Google Cloud cost model for both compute and storage, the command-line interface (CLI) in RWB, and Docker.

Within the Researcher Workbench, the Google Cloud Life Sciences API is the executor of Cromwell and Nextflow workflows. 

For an overview of batch processing in the Researcher Workbench, as well as limitations relevant to all batch processing, refer to Overview of Batch Processing.

What are workflows?

Workflows consist of multiple processing steps that are performed by an external compute engine. Building a workflow typically involves defining analysis tasks, chaining them together, and parallelizing their execution. In the Workbench, each task runs in a Docker container on its own virtual machine (VM), which starts when the task starts and shuts down when the task completes. Files are read from, and written to, cloud buckets. Before a task starts, input files are copied to the VM (“localization”), and when a task finishes, output files are copied to a destination bucket (“delocalization”). Each task can have its own runtime characteristics, which describe the VM specs (e.g., RAM and number of CPUs) and the Docker image.
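The task lifecycle described above (localize, run, delocalize) can be sketched in a few lines of Python. This is a toy model only, not how Cromwell or Nextflow is implemented: a dictionary stands in for a cloud bucket, a temporary directory stands in for the task's VM disk, and the bucket paths and the count_lines task are hypothetical.

```python
import tempfile
from pathlib import Path

# Toy stand-in for a cloud bucket: object URL -> file contents.
# A real engine copies objects to and from Cloud Storage instead.
bucket = {"gs://example-bucket/inputs/sample.txt": "ACGT\nTTGA\n"}


def run_task(input_url, output_url, command):
    """Run one task in three phases: localize, execute, delocalize."""
    with tempfile.TemporaryDirectory() as vm_disk:  # stands in for the VM disk
        # Localization: copy the input object onto the "VM".
        local_in = Path(vm_disk) / input_url.rsplit("/", 1)[-1]
        local_in.write_text(bucket[input_url])

        # Execution: the task itself only ever sees local files.
        local_out = Path(vm_disk) / "output.txt"
        command(local_in, local_out)

        # Delocalization: copy the result to the destination "bucket".
        bucket[output_url] = local_out.read_text()
    # Leaving the with-block deletes the directory, like the VM shutting down.


def count_lines(src, dst):
    """Hypothetical analysis step: write the number of lines in src to dst."""
    dst.write_text(str(len(src.read_text().splitlines())))


run_task(
    "gs://example-bucket/inputs/sample.txt",
    "gs://example-bucket/outputs/line_count.txt",
    count_lines,
)
print(bucket["gs://example-bucket/outputs/line_count.txt"])  # prints 2
```

The point of the sketch is that the task code never handles bucket URLs itself; only the engine-like wrapper does.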

Important features of workflows

Workflow engines automate manual steps that do not scale. For example, the workflow engines detailed in this document do not require users to manually copy files from cloud buckets, a step that many analysis software packages (e.g., PLINK, vcftools) otherwise require.

Feature: Localization of cloud files
Cromwell? Yes
Nextflow? Yes
Why is this important? Users do not have to manage file copying when running tasks. For example, the file inputs for a task can be cloud URLs, because the workflow engine automatically copies each file locally for the tool.
Notes: Cromwell also supports optional localization when the tool itself can read files directly from a cloud location (e.g., the Genome Analysis Toolkit (GATK)).

Feature: Automated parallelization
Cromwell? Yes
Nextflow? Yes
Why is this important? The workflow engine automatically determines which tasks can be run in parallel. Users need only map the inputs and outputs.

Feature: Metadata
Cromwell? Yes
Nextflow? Yes
Why is this important? Users can track the status of each task in a workflow, even after it completes.

Feature: Output writes to a bucket
Cromwell? Yes
Nextflow? Yes
Why is this important? Workflow output is retained even when the cloud environment is deleted.
Notes: This cannot be disabled.

Feature: Optional workflow and task inputs
Cromwell? Yes
Nextflow? Yes
Why is this important? Allows for default values, including files.

Feature: Separate output bucket for successfully completed workflows and failed workflows
Cromwell? Yes
Nextflow? No
Why is this important? Keeping outputs from failed workflows separate allows easier cleanup.
Notes: Cromwell can automatically write failed outputs to a different directory than successful ones. Nextflow will output to buckets for successfully completed workflows, but cannot separate failed and successful workflows.

Feature: Re-entrancy
Cromwell? Yes*
Nextflow? Yes
Why is this important? The workflow engine will not rerun successful tasks when a downstream task failed or was changed.
Notes: Currently, call-caching is disabled for Cromwell in the Researcher Workbench (a file-based database is required and this is not yet enabled). Checkpointing is enabled for Cromwell. Call-caching and checkpointing can be enabled for Nextflow.
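To illustrate what "users need only map the inputs and outputs" means for automated parallelization, here is a minimal Python sketch that derives which tasks can run at the same time from their declared inputs and outputs. It is a toy illustration of the idea, not the scheduler that Cromwell or Nextflow actually uses, and the task and file names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task lists the files it reads and writes.
tasks = {
    "align_sample_a": {"inputs": ["a.fastq"], "outputs": ["a.bam"]},
    "align_sample_b": {"inputs": ["b.fastq"], "outputs": ["b.bam"]},
    "joint_call": {"inputs": ["a.bam", "b.bam"], "outputs": ["cohort.vcf"]},
}


def parallel_waves(tasks):
    """Group tasks into waves; tasks in one wave do not depend on each other."""
    # Map every output file to the task that produces it.
    producer = {out: name for name, t in tasks.items() for out in t["outputs"]}
    # A task depends on whichever tasks produce its inputs.
    graph = {
        name: {producer[i] for i in t["inputs"] if i in producer}
        for name, t in tasks.items()
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(parallel_waves(tasks))
# prints [['align_sample_a', 'align_sample_b'], ['joint_call']]
```

The two alignment tasks share no files, so they land in the same wave; the joint call waits for both because it reads their outputs.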

 

Picking a workflow engine

If you decide to run a workflow within the Researcher Workbench, you can use either Cromwell or Nextflow to execute it. We recommend the following criteria for choosing a workflow engine:

  1. If you already have a pipeline that uses Cromwell or Nextflow, we recommend starting with that pipeline. We also recommend searching for existing pipelines that do what you want, or that are close enough to modify for your use case.

  2. If equivalent pipelines exist for both Nextflow and Cromwell, we recommend that you choose based on your comfort level with each engine and on what others in your institution use.

Workflow options in the Researcher Workbench

With the initial launch of workflow support in RWB, two workflow engines are available from a Jupyter Notebook or the Terminal: Cromwell and Nextflow. Please note that these workflow engines are not compatible with each other: a workflow written for one cannot be run by the other.

Cromwell + WDL

Cromwell is a workflow management system geared towards scientific workflows. Documentation for Cromwell can be found at the Cromwell wiki. Cromwell executes scripts written in Workflow Description Language (WDL), a community-driven domain-specific language (DSL) designed for data-intensive workflows. WDL allows users to define tasks, including scripts written in bash, and to specify the connections between tasks.

Refer to the Cromwell Tutorial Notebook for more guidance on getting started with Cromwell. This notebook will walk through:

  1. Setting up Cromwell within RWB

  2. Using GATK to validate variant call format (VCF) files

For information on the structure of WDL, please see Terra - Getting Started with WDL.

Nextflow

Nextflow is a workflow engine that uses a DSL (Groovy with workflow-specific extensions). Processes describe a task to be run; they can be written in any scripting language supported by Linux, and a process runs one task for each set of inputs. Channels manage the flow of data from one process to the next. Workflows define the interaction between processes and channels. Documentation for Nextflow can be found at the Nextflow wiki.

Refer to the Nextflow Tutorial Notebook for more guidance on getting started with Nextflow. This notebook will walk through:

  1. Setting up Nextflow within RWB

  2. Using GATK to validate variant call format (VCF) files

Workflow Limitations 

For a full list of limitations applicable to all batch processing, refer to Overview of Batch Processing.

Manual cleanup of runs

In the default configuration, both Cromwell and Nextflow keep the intermediate files of each workflow run. These files can be useful for debugging or as outputs in their own right, but every run generates them, even a run that fails, so storage costs grow for data that may not be useful. We recommend periodically deleting the execution buckets of failed or obviated workflow runs once they are no longer needed.
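As one possible way to make this cleanup routine, the Python sketch below builds gsutil deletion commands for runs you have marked as failed or obsolete. It is a sketch under assumptions: the run records, their status labels, and the bucket paths are hypothetical and would come from your own bookkeeping, not from Cromwell or Nextflow. It defaults to a dry run that only prints the commands, because deletion cannot be undone.

```python
import subprocess

# Hypothetical bookkeeping: one record per workflow run, maintained by you.
runs = [
    {"url": "gs://example-bucket/executions/run-001/", "status": "succeeded"},
    {"url": "gs://example-bucket/executions/run-002/", "status": "failed"},
    {"url": "gs://example-bucket/executions/run-003/", "status": "obsolete"},
]


def cleanup_commands(runs, delete_statuses=("failed", "obsolete")):
    """Return one `gsutil -m rm -r <url>` command per run that should be deleted."""
    return [
        ["gsutil", "-m", "rm", "-r", run["url"]]
        for run in runs
        if run["status"] in delete_statuses
    ]


def cleanup(runs, dry_run=True):
    """Print the deletion commands, or execute them when dry_run is False."""
    for cmd in cleanup_commands(runs):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # irreversible: deletes the objects


cleanup(runs)  # dry run: prints two commands, deletes nothing
```

Review the printed commands before calling cleanup(runs, dry_run=False).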

Cromwell-Specific Limitations

Cromwell within the Researcher Workbench does not support full Cromwell-as-a-service functionality. In particular, jobs are not tracked for your Cromwell workflow as they would be when running a dedicated Cromwell server.

Nextflow-Specific Limitations

Nextflow recommends launching only one Nextflow instance in a given directory at a time. More details are available in Demystifying Nextflow resume.

Suggestions for running workflows:

Managing timeouts when running a workflow

By default, RWB pauses after 30 minutes of inactivity. You can change the auto pause setting in the Cloud Analysis Environment panel to extend that timeout period, which we recommend doing while running genomic workflows.

Use screen with Cromwell 

You can use the screen command to run Cromwell in the background, so that Cromwell keeps running if the notebook times out. Note that deleting the cloud environment will end the Cromwell process. Here are instructions via Terminal:

  1. screen -S cromwell

  2. start Cromwell

  3. Detach: while holding Control, press A and then D

  4. Do other things in terminal

  5. screen -ls - shows running sessions

  6. screen -r cromwell - reconnect to Cromwell

  7. Control D - exit session when Cromwell completes

Use screen with Nextflow 

You can use the screen command to run Nextflow in the background, so that Nextflow keeps running if the notebook times out. Note that deleting the cloud environment will end the Nextflow process. Here are instructions via Terminal:

  1. screen -S Nextflow

  2. start Nextflow

  3. Detach: while holding Control, press A and then D

  4. Do other things in terminal

  5. screen -ls - shows running sessions

  6. screen -r Nextflow - reconnect to Nextflow

  7. Control D - exit session when Nextflow completes
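The interactive steps above assume a Terminal. If you would rather start the session from a notebook cell, one option is to have screen create the session already detached, using its -d -m options together with -S. The Python sketch below builds such a command for either engine. It is a sketch under assumptions: the launch command (nextflow run main.nf) and the file name main.nf are placeholders, so substitute whatever you normally type to start Cromwell or Nextflow.

```python
import shlex
import subprocess


def screen_command(session_name, engine_command):
    """Build a command that starts engine_command in a detached screen session."""
    # -d -m starts the session detached; -S names it so that
    # `screen -r <session_name>` can reattach to it later.
    return ["screen", "-dmS", session_name, *engine_command]


def launch(session_name, engine_command, dry_run=True):
    """Print the screen command, or run it when dry_run is False."""
    cmd = screen_command(session_name, engine_command)
    if dry_run:
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


# Placeholder launch command; replace it with your own.
launch("Nextflow", ["nextflow", "run", "main.nf"])
# prints: screen -dmS Nextflow nextflow run main.nf
```

A session started this way closes as soon as the engine exits, so consider redirecting the engine's output to a log file if you want to read it afterwards. While the session is alive, screen -ls and screen -r Nextflow work from the Terminal exactly as in the steps above.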

FAQ

  • Do all tasks require a docker image?

Yes. If the Docker image is left blank, the task will run using the default image. Please see FAQ #2 below.

 
