Selecting genomic data: using the Genomic Extraction tool

Selecting genomic variant data for analysis

To access our genomics data, you can use our point-and-click tools to extract variant data from the genomics dataset and save it as VCF (Variant Call Format) files. You can then export these files to a Jupyter Notebook for analysis with Hail, PLINK, or other tools of your choosing. We also provide VCF files ($WGS_VCF_MERGED_STORAGE_PATH) and a Hail Matrix Table ($WGS_HAIL_STORAGE_PATH) for the entire dataset.
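As a minimal sketch, the two locations above could be looked up from a Python notebook like this. It assumes the paths are exposed as environment variables, as the `$` notation suggests; the helper name `dataset_paths` is hypothetical and not part of the Workbench.

```python
import os


def dataset_paths(env=os.environ):
    """Look up the full-dataset VCF and Hail Matrix Table locations.

    Assumption: both names are environment variables in the notebook
    environment. A variable that is not set maps to None.
    """
    return {
        "vcf": env.get("WGS_VCF_MERGED_STORAGE_PATH"),
        "hail_mt": env.get("WGS_HAIL_STORAGE_PATH"),
    }


# Print whatever is available in the current environment.
for name, path in dataset_paths().items():
    print(f"{name}: {path if path else '(not set in this environment)'}")
```

If Hail is installed, the Matrix Table path could then be passed to `hl.read_matrix_table`.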

Note: Use the extraction process described below only when you want to analyze a smaller subset of participants (fewer than 5,000).

Genomic Cohort Extraction Cost

Extracting a genomic cohort to VCF files incurs a cost, similar to the costs accrued by your cloud analysis environment. Typically, the cost is approximately $0.02 per extracted sample, but it may vary because WGS data size differs slightly across samples. When you run a genomic extraction on a cohort, only participants with corresponding WGS data are extracted to the resulting VCF files, and only these participants affect the cost. To see the exact number of participants with WGS data in your cohort, add the criterion “Whole Genome Sequence” to the cohort.
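To get a rough sense of scale before extracting, the per-sample figure above can be turned into a quick estimate. This is only a sketch. The function name is hypothetical, and the result is approximate because the actual cost varies with the data size of each sample.

```python
# Approximate per-sample cost quoted in this article; actual cost varies.
COST_PER_SAMPLE_USD = 0.02


def estimate_extraction_cost(n_wgs_participants, cost_per_sample=COST_PER_SAMPLE_USD):
    """Estimate the extraction cost in USD.

    n_wgs_participants should count only the cohort members who have
    WGS data, since only those participants are extracted and billed.
    """
    if n_wgs_participants < 0:
        raise ValueError("participant count cannot be negative")
    return round(n_wgs_participants * cost_per_sample, 2)


# Example: a cohort in which 1,200 participants have WGS data.
print(estimate_extraction_cost(1200))
```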

As with other analyses in the workspace, costs are billed either to the workspace creator’s initial credits or to the associated billing account. When you use your own billing account, charges for VCF extraction appear as “BigQuery Analysis” and can be identified by the label “extraction_uuid” in the GCP billing export. Please note that costs and credits are not automatically refunded for canceled or failed extraction jobs.
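If you have loaded rows from your billing export into Python, extraction charges could be totaled per job along these lines. This is a sketch under assumptions. Each row is represented here as a simplified dictionary with a `labels` mapping and a `cost` value, which is not the exact export schema. The sample rows are made up purely for illustration.

```python
from collections import defaultdict

# Label key named in this article for identifying extraction charges.
EXTRACTION_LABEL = "extraction_uuid"


def extraction_costs(rows):
    """Sum cost per extraction job, ignoring rows without the label.

    Assumption: each row is a dict shaped like
    {"labels": {key: value, ...}, "cost": float}.
    """
    totals = defaultdict(float)
    for row in rows:
        uuid = row.get("labels", {}).get(EXTRACTION_LABEL)
        if uuid is not None:
            totals[uuid] += row["cost"]
    return dict(totals)


# Made-up rows for illustration only.
sample_rows = [
    {"labels": {"extraction_uuid": "job-a"}, "cost": 1.25},
    {"labels": {}, "cost": 3.00},
    {"labels": {"extraction_uuid": "job-a"}, "cost": 0.75},
    {"labels": {"extraction_uuid": "job-b"}, "cost": 0.50},
]
print(extraction_costs(sample_rows))
```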

Choosing the Genomics Dataset

To analyze genomic data for a smaller cohort (5,000 participants or fewer), you can choose our prepackaged genomics cohort or concept sets, which consist of genomic variant data to include in your analysis. For larger cohorts, we suggest you start with the prepackaged Hail Matrix Table or the set of VCF files for the entire dataset.

To choose the genomics dataset:

  1. Build your cohort using the Cohort Builder tool. You can use all participants, or create and save a custom cohort.
  2. Choose your saved cohort under Select Cohorts in the far left column.
  3. Check the box next to All whole genome variant data under Select Concept Sets (rows) in the middle column.
  4. Choose VCF files under Select Values (columns) in the far right column.
  5. Click Create Dataset in the bottom right of the screen.

Genomic_Dataset_Builder.png

A pop-up screen will appear, asking you to name and save your dataset:

Genomic_Dataset_Builder2.png

Next, select Analyze at the bottom right of the screen to begin creating your Jupyter Notebook environment. A pop-up screen will appear, asking whether you would like to run the extraction process. Please note that this process uses cloud compute credits to generate code and files from the genomic dataset for use in your analysis environment. Be sure you are ready to begin, because the process can be expensive, depending on the amount of data you are analyzing. Genomic data extraction runs in the background, and you will be notified when the files are ready for analysis.

Alternatively, you can choose Skip and still save the Dataset on the next screen without beginning the extraction process or incurring any credit charges.

Genomic_Dataset_Builder4.png

You will then be asked to Export the Dataset to a Jupyter Notebook. You can choose Python or R as the coding language, and the corresponding analysis tool(s) will be indicated. You can select one of our recommended tools, Hail or PLINK, or choose Other VCF-compatible tool to generate a code snippet that simply retrieves the VCF files. We recommend using Python for the extraction process and then creating an R notebook or another Python notebook for the main analysis. It is good practice to keep these two tasks (extraction and analysis) separate. Finally, select Export to begin the Jupyter Notebook environment.

Genomic_Dataset_Builder3.png

If you chose to run the extraction in the background while saving your dataset, you can check the status of your VCF extraction by clicking the DNA icon in the help sidebar on the right side of the screen. There you can also see the full list of VCF files you have saved:

Genomic_Dataset_Builder5.png

We encourage you to routinely check your Workbench account for costs incurred while using the Workbench. To learn how to check your account balance, optimize your workspaces to reduce costs, and see examples of project costs, please see the support articles below:

Initial credits and how to create a billing account

Suggested Optimizations for Cost and Compute Time

Estimate how much your project will cost

How to work with All of Us Genomic data
