Types of All of Us data and how they are organized

Data Organization

All of Us data are organized into tables according to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) version 5.2, when possible. Extensive information about the OMOP CDM relational database can be found at www.ohdsi.org. Participant-provided information (PPI), physical measurements (PM), and electronic health record (EHR) data are arranged into tables according to the OMOP CDM convention as shown below. Self-reported demographic data from the Basics survey (a PPI survey) populate the Person table. Other data obtained from PPI surveys are found in the Observation table (see the survey codebooks for more information about PPI surveys). Program physical measurements and EHR measurements both populate the Measurement table. EHR data concerning visits, procedures, drugs, and conditions are arranged into their respective tables. All tables relate to the Person table, and the tables containing procedure, drug, condition, and measurement data also relate to the Visit table.

[Figure: image-0.png]
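The relationships described above can be sketched with a toy in-memory database. This is a minimal illustration, not the Researcher Workbench schema: the table and column names follow the OMOP CDM, only a few columns are shown, and every row is invented.

```python
import sqlite3

# Toy in-memory database. Table and key names follow the OMOP CDM,
# but only a handful of columns are included and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER
);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    visit_start_date TEXT
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    visit_occurrence_id INTEGER REFERENCES visit_occurrence(visit_occurrence_id),
    measurement_concept_id INTEGER,
    value_as_number REAL
);
INSERT INTO person VALUES (1, 1970), (2, 1985);
INSERT INTO visit_occurrence VALUES (10, 1, '2019-03-01'), (11, 2, '2019-04-15');
INSERT INTO measurement VALUES
    (100, 1, 10, 3004249, 128.0),
    (101, 2, 11, 3004249, 117.0);
""")

# Every measurement row links to a person and, through
# visit_occurrence_id, to the visit during which it was recorded.
rows = conn.execute("""
    SELECT p.person_id, v.visit_start_date, m.value_as_number
    FROM measurement AS m
    JOIN person AS p ON p.person_id = m.person_id
    JOIN visit_occurrence AS v
        ON v.visit_occurrence_id = m.visit_occurrence_id
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

The same join pattern applies to the procedure, drug, and condition tables, which also carry person and visit keys.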

In the Cohort Builder tool, data are organized by “program data” and “domains.” Program data include demographics, surveys, and physical measurements. EHR data are arranged by domain (conditions, procedures, drugs, measurements, and visits).

Program Data:

Demographics include age, gender, race, ethnicity, and deceased status. Demographic data are self-reported (collected via surveys) and subject to privacy methodology.

Surveys consist of the questions, and their associated response options, that participants completed.

All of the survey questions and potential answers are added to the All of Us specific PPI vocabulary and assigned a concept_id (source concept ID). When possible, the PPI concepts are then mapped to standard vocabularies such as Logical Observation Identifiers Names and Codes (LOINC), International Classification of Diseases (ICD), or Systematized Nomenclature of Medicine (SNOMED), and their associated “standard” concept_id. When mapping is not possible, the PPI concept_id serves as both the source and standard.

Survey questions and answers are stored in the Observational Medical Outcomes Partnership (OMOP) observation table and can be searched and analyzed via either the standard or source vocabulary concept_id.
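As a minimal sketch of searching survey responses by either vocabulary, the example below builds a toy observation table and queries it by standard and by source concept_id. The column names follow the OMOP CDM; the concept_ids (8000001, 9000001, 9000002) and the answers are hypothetical placeholders, not real All of Us concepts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation (
    observation_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    observation_concept_id INTEGER,         -- standard concept_id
    observation_source_concept_id INTEGER,  -- PPI (source) concept_id
    value_as_string TEXT
);
-- Hypothetical IDs: PPI concept 9000001 maps to standard concept 8000001.
-- PPI concept 9000002 has no mapping, so it serves as source and standard.
INSERT INTO observation VALUES
    (1, 1, 8000001, 9000001, 'answer A'),
    (2, 2, 8000001, 9000001, 'answer B'),
    (3, 1, 9000002, 9000002, 'answer C');
""")

ALLOWED_COLUMNS = {"observation_concept_id", "observation_source_concept_id"}

def answers(column, concept_id):
    """Return (person_id, answer) pairs matching a concept_id in one column."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unexpected column: {column}")
    sql = (
        "SELECT person_id, value_as_string FROM observation "
        f"WHERE {column} = ? ORDER BY observation_id"
    )
    return conn.execute(sql, (concept_id,)).fetchall()

by_standard = answers("observation_concept_id", 8000001)
by_source = answers("observation_source_concept_id", 9000001)
unmapped = answers("observation_concept_id", 9000002)
print(by_standard, by_source, unmapped)
```

For a mapped question, both searches return the same rows; for an unmapped question, the PPI concept_id works in either column.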

For more information on how to use survey data in your research, see Resources for Survey Data Research and the How to Work with All of Us Survey Data notebooks in the Researcher Workbench tutorial workspaces.

Physical Measurements are taken at the time of participant enrollment and include blood pressure, heart rate, height, weight, body mass index (BMI), waist and hip circumference, pregnancy status, and wheelchair use.

Physical measurements are assigned an All of Us specific PPI concept_id (source concept ID). The PPI concepts are then mapped to the standard LOINC vocabulary and the associated “standard” concept_id, and stored in the OMOP Measurement table. We recommend using the program-collected measurement data when possible. To distinguish between measurements recorded at enrollment and those recorded in a participant’s EHR, use the source concept_ids. See the table below for more detail.

Note: If you need physical measurement data and you filter on measurement_concept_id, you must also specify the data source, which is found in the measurement_ext table. If you filter on measurement_source_concept_id, you do not need to specify the data source.

| measurement_concept_id | Standard | measurement_source_concept_id | Source |
| --- | --- | --- | --- |
| 3004249 | Systolic blood pressure | 903109 | 1st systolic blood pressure |
| 3004249 | Systolic blood pressure | 903114 | 2nd systolic blood pressure |
| 3004249 | Systolic blood pressure | 903130 | 3rd systolic blood pressure |
| 3012888 | Diastolic blood pressure | 903110 | 1st diastolic blood pressure |
| 3012888 | Diastolic blood pressure | 903129 | 2nd diastolic blood pressure |
| 3012888 | Diastolic blood pressure | 903106 | 3rd diastolic blood pressure |
| 3025315 | Body weight | 903121 | Weight |
| 3027018 | Heart rate | 903112 | 1st heart rate |
| 3027018 | Heart rate | 903105 | 2nd heart rate |
| 3027018 | Heart rate | 903108 | 3rd heart rate |
| 3036277 | Body height | 903133 | Height |
| 40759207 | Adult waist circumference protocol | 903127 | 1st waist circumference |
| 40759207 | Adult waist circumference protocol | 903134 | 2nd waist circumference |
| 40759207 | Adult waist circumference protocol | 903128 | 3rd waist circumference |
| 40765148 | PhenX - hip circumference protocol | 903117 | 1st hip circumference |
| 40765148 | PhenX - hip circumference protocol | 903125 | 2nd hip circumference |
| 40765148 | PhenX - hip circumference protocol | 903123 | 3rd hip circumference |
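The difference between the two filters can be sketched as follows. The standard and source concept_ids for systolic blood pressure come from the table above; the toy measurement rows, and the placeholder source concept_id 4000001 used for the EHR reading, are invented for illustration.

```python
import sqlite3

# Concept IDs taken from the table above.
SYSTOLIC_STANDARD_ID = 3004249
PROGRAM_SYSTOLIC_SOURCE_IDS = (903109, 903114, 903130)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    measurement_concept_id INTEGER,
    measurement_source_concept_id INTEGER,
    value_as_number REAL
);
-- Invented rows: three enrollment readings and one EHR reading.
-- Source concept_id 4000001 is a made-up placeholder for an EHR code.
INSERT INTO measurement VALUES
    (1, 1, 3004249, 903109, 130.0),
    (2, 1, 3004249, 903114, 126.0),
    (3, 1, 3004249, 903130, 122.0),
    (4, 1, 3004249, 4000001, 141.0);
""")

# Filtering on the standard concept_id returns program AND EHR readings.
all_systolic = conn.execute(
    "SELECT COUNT(*) FROM measurement WHERE measurement_concept_id = ?",
    (SYSTOLIC_STANDARD_ID,),
).fetchone()[0]

# Filtering on the source concept_ids returns only enrollment readings.
placeholders = ", ".join("?" for _ in PROGRAM_SYSTOLIC_SOURCE_IDS)
program_values = [
    row[0]
    for row in conn.execute(
        "SELECT value_as_number FROM measurement "
        f"WHERE measurement_source_concept_id IN ({placeholders}) "
        "ORDER BY measurement_id",
        PROGRAM_SYSTOLIC_SOURCE_IDS,
    )
]
print(all_systolic, program_values)
```

In this toy data the standard filter matches four rows, while the source filter isolates the three readings taken at enrollment.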

For more information on how to use physical measurement data in your research, see the How to Work with All of Us Physical Measurement Data notebooks in the Researcher Workbench tutorial workspaces. 

Additionally, you may browse or download all of the All of Us concepts via ATHENA, the OMOP community's searchable database of the standardized vocabularies it supports. To browse All of Us PPI concepts, select "Vocabulary" in the left-side navigation bar and scroll to "PPI."

Electronic Health Records (EHR)

EHR data are transformed into standard vocabulary across 14 structured tables. Click here for information on how privacy rules may affect access to information within a participant's EHR and see the Data Dictionary for Registered Tier Curated Data Repository (CDR) for a detailed description of EHR data available within each table (listed below).

  • Person
  • Visit Occurrence
  • Condition Occurrence
  • Drug Exposure 
  • Measurement
  • Procedure Occurrence
  • Observation
  • Location*
  • Provider*
  • Device Exposure
  • Death
  • Care Site*
  • Fact Relationship
  • Specimen

      *Suppressed information

EHR Domains:

Conditions come from EHRs and are listed by ICD9, ICD10, or SNOMED standard codes.

Procedures come from EHRs and are listed by ICD9, ICD10, CPT, or SNOMED standard codes.

Drugs or medications come from EHRs and are listed by ingredient and organized by therapeutic uses according to the Anatomical Therapeutic Chemical (ATC) Classification System.

Measurements include laboratory tests and vital signs found in the EHR and are organized in the LOINC (Logical Observation Identifiers Names and Codes) code hierarchy.

Visits describe the type of facility where the participant received medical care (e.g., emergency room, outpatient, or inpatient).

Data Not Structured According to OMOP CDM

Wearable Device Data

Fitbit data are available in a series of four tables within the Registered Tier dataset, so researchers can parse the data themselves. The following grid displays all currently available tables and associated fields.

[Figure: Picture1.jpg]

Below are the tables with the data format for each field, along with some notes to consider when using these data.

Daily Activity Summary

[Figure: Picture2.jpg]

Heart Rate

[Figure: Picture3.jpg]

Intraday Steps

[Figure: Picture4.jpg]

Considerations

  1. Daily summary data and daily goals for elevation (elevation, floors) are only included for users with a device that includes an altimeter.

  2. The steps field in Daily Activity Summary entries is included only for activities that have steps (e.g., "Walking," "Running").

  3. Calorie burn goal (CaloriesOut) represents either the dynamic daily target from the premium trainer plan or a manual calorie burn goal. Goals are included in the response only for today and the 21 preceding days.

  4. Calorie Count is the top-level time series for calories burned, inclusive of basal metabolic rate (BMR), tracked activity, and manually logged activities.

  5. Calories BMR only includes BMR calories.

  6. Activity Calories are the number of calories burned during the day for periods of time when the user was active above sedentary level.
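Because Calorie Count is inclusive of BMR while Calories BMR covers BMR alone, subtracting one from the other gives the calories burned above BMR. The sketch below illustrates that arithmetic only; the field names (`calorie_count`, `calories_bmr`) and the values are hypothetical, so check the Daily Activity Summary table for the actual field names.

```python
from dataclasses import dataclass

@dataclass
class DailySummary:
    # Hypothetical field names, used for illustration only.
    date: str
    calorie_count: float  # total calories burned, inclusive of BMR
    calories_bmr: float   # calories attributed to BMR alone

def calories_above_bmr(day: DailySummary) -> float:
    """Calories from tracked and manually logged activity, excluding BMR."""
    return day.calorie_count - day.calories_bmr

# Invented example days.
days = [
    DailySummary("2020-01-01", 2400.0, 1600.0),
    DailySummary("2020-01-02", 2100.0, 1600.0),
]
above = {day.date: calories_above_bmr(day) for day in days}
print(above)
```

This difference is not the same quantity as Activity Calories, which counts only periods when the user was active above sedentary level.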

Genomic Data 

The All of Us genomic dataset contains whole genome sequencing (WGS) data and microarray genotype data (Array). The genomic data are accessible through the Researcher Workbench. Bucket locations for accessing the data in analysis notebooks can be found in the Controlled CDR Directory. We provide variants in Variant Call Format (VCF), Hail MatrixTables (MT), and PLINK 1.9 bed/bim/fam triplets. PLINK files are provided only for the array variants. We provide the auxiliary tabular data, such as the joint callset QC flagged samples or related pairs, as tab-separated values (tsv) files with the column headers in the first row.
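Because the auxiliary files are tab-separated with headers in the first row, they can be read with a standard TSV parser. The following is a minimal sketch using Python's csv module; the in-memory sample, its column names (`sample_id`, `flag`), and its values are hypothetical stand-ins for a real auxiliary file.

```python
import csv
import io

# Hypothetical stand-in for an auxiliary file: tab-separated,
# with the column headers in the first row.
sample_tsv = "sample_id\tflag\nS1\tqc_flagged\nS2\tqc_flagged\n"

def read_tsv(handle):
    """Parse a tab-separated file whose first row holds the column headers."""
    return list(csv.DictReader(handle, delimiter="\t"))

rows = read_tsv(io.StringIO(sample_tsv))
flagged_ids = {row["sample_id"] for row in rows}
print(sorted(flagged_ids))
```

With a real file, pass an open file handle to `read_tsv` in place of the `io.StringIO` object.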

For more detailed information on how the genomic data are organized, please see this article.

Externally Sourced Socioeconomic Status Data

A selection of socioeconomic status summary statistics, sourced from the U.S. Census American Community Survey via a three-digit ZIP code linkage, is available within the Controlled Tier. These data are stored in an appended table and cover the following domains on a per Census block basis:

  • proportion of the population receiving assisted income benefits within the past 12 months
  • proportion of the population aged 25 years or older with educational attainment of at least high school or GED equivalent
  • median household income in the past 12 months (in 2015 inflation-adjusted dollars)
  • proportion of the population with no health insurance coverage
  • proportion of the population with income below the federal poverty level within the past 12 months
  • proportion of houses that are vacant
  • a deprivation index (see here for more info)

Note: Participant-level concepts for the following related data elements are available via the Basics survey: educational attainment, household income, and health insurance coverage.

Datasets Unlinked from the Registered Tier CDR

All of Us SARS-CoV-2 Antibody Study

Data from the All of Us SARS-CoV-2 Antibody study [1] are available in a series of five tables that are not linked to the Registered Tier CDR data (see tables below). These data are provided for the purpose of study replication and will not be updated with future Registered Tier CDR data releases. Keep two considerations in mind when reproducing this study: not all positive controls used in the study analyses could be included in the datasets, because of data use restrictions set by the sample provider, and race and ethnicity are combined categories within the paper. For more information about replicating this research, please see the “How to Reproduce the All of Us SARS-CoV-2 Antibody Study” notebook found in the Featured Workspaces section of the Researcher Workbench.

The serology dataset is contained only in the All of Us Registered Tier Dataset v4 CDR R2020Q4R2 and must therefore be accessed through that CDR version, following the rules associated with accessing old CDR versions.

Serology Person Table

| Field Name | Field Description | Field Type | Enumerators | Registered Tier Rules |
| --- | --- | --- | --- | --- |
| serology_person_id | A person id created specifically for the study | | | Distinct id generated for this dataset; not linked in any way to research_id |
| person_id | | | | Suppressed column in Registered Tier |
| state | State in which the individual/patient lives | string | | State of residence is generalized into a single group for participants who reside in any non-US state or Washington DC (Guam, Palau, Puerto Rico, American Samoa, Micronesia, Marshall Islands, Virgin Islands) |
| race | Based on individual’s self-reported race, generalized according to existing Registered Tier privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules |
| ethnicity | Based on individual’s self-reported ethnicity, generalized according to existing Registered Tier privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules |
| sex_at_birth | Based on individual’s self-reported sex, generalized according to existing privacy requirements | string | | Generalized based on Registered Tier CDR generalization rules |
| age | Individual’s age at date of specimen collection | numeric | | Participants older than 89 years are generalized into one group |
| control_status | | string | Positive / Negative / Non-Control | |

Serology Test Table

| Field Name | Field Description | Field Type | Enumerators | Notes |
| --- | --- | --- | --- | --- |
| test_id | A primary key of test table | | | |
| sample_id | ID created for each sample tested | | | |
| serology_person_id | Foreign key to the person table | | | |
| test_code | Code corresponding to test_name | | | From the original flat files |
| test_name | Name of test (e.g. Abbott, EuroImmune, etc.) | | | From the original flat files |
| batch | Batch number | | | |
| run_date_time | | date/time | | From the original flat files |
| instrument_name | | | | From the original flat files |
| position | | | | From the original flat files |

Serology Results Table

| Field Name | Field Description | Field Type | Enumerators | Notes |
| --- | --- | --- | --- | --- |
| result_id | A primary key of Result | | | |
| test_id | A foreign key linked to Test | | | |
| result_name | | | | From the original flat files |
| result_value | | | | From the original flat files |
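The key relationships among the person, test, and results tables can be sketched with a toy join. The column names come from the field lists above; the table names (`serology_person`, `serology_test`, `serology_result`), the `result_name` value, and all rows are assumptions made for illustration, and only a subset of columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table names are assumed; column names follow the field lists above.
CREATE TABLE serology_person (
    serology_person_id INTEGER PRIMARY KEY,
    control_status TEXT
);
CREATE TABLE serology_test (
    test_id INTEGER PRIMARY KEY,
    serology_person_id INTEGER REFERENCES serology_person(serology_person_id),
    test_name TEXT
);
CREATE TABLE serology_result (
    result_id INTEGER PRIMARY KEY,
    test_id INTEGER REFERENCES serology_test(test_id),
    result_name TEXT,
    result_value TEXT
);
-- Invented rows for illustration only.
INSERT INTO serology_person VALUES (1, 'Non-Control'), (2, 'Negative');
INSERT INTO serology_test VALUES (10, 1, 'Abbott'), (11, 2, 'EuroImmune');
INSERT INTO serology_result VALUES
    (100, 10, 'example_result', '0.42'),
    (101, 11, 'example_result', '0.05');
""")

# Results link to tests through test_id; tests link to people
# through serology_person_id.
joined = conn.execute("""
    SELECT p.serology_person_id, p.control_status, t.test_name, r.result_value
    FROM serology_result AS r
    JOIN serology_test AS t ON t.test_id = r.test_id
    JOIN serology_person AS p ON p.serology_person_id = t.serology_person_id
    ORDER BY r.result_id
""").fetchall()
print(joined)
```

Because serology_person_id is generated specifically for this dataset, the join stays within the serology tables and cannot be extended to the Registered Tier CDR tables.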

Validation Results Table

| Field Name | Field Description | Field Type | Enumerators | Notes |
| --- | --- | --- | --- | --- |
| person_serology_id | | | | From the original flat files |
| sample_id | | | | |
| roche_date | (Original field is “Final Date”) Date the test was reported in the Mayo system | | | From the original flat files |
| roche_result | (COVTI) Test result for Roche test | | Positive / Negative / TNP / Pending | From the original flat files |
| roche_raw_result | (COVTS) Raw data for Roche test results | | A value equal to or greater than 1 is considered positive | From the original flat files |
| ortho_date | (Original field is “Final Date”) Date the test was reported in the Mayo system | | | From the original flat files |
| ortho_result | (VSARS) Test result for Ortho test | | Positive / Negative / TNP / Pending | From the original flat files |
| ortho_raw_result | (SCO7) Raw data for the Ortho test result | | A value equal to or greater than 1 is considered positive | From the original flat files |

Titer

| Field Name | Field Description | Field Type | Enumerators | Notes |
| --- | --- | --- | --- | --- |
| sample_id | | numeric | | |
| serology_person_id | Foreign key to person | numeric | | |
| batch | | numeric | | From the original flat files |
| assay_type | | | | From the original flat files |
| material | | | | From the original flat files |
| test | | | | From the original flat files |
| result | | | | From the original flat files |
| comment | | | | From the original flat files |

References

1. Althoff, K., Schlueter, D.J., Anton-Culver, H., Cherry, J., Denny, J., Thomsen, I., ... Schully, S. (in press). Antibodies to SARS-CoV-2 in All of Us Research Program participants, January 2 - March 18, 2020. Clinical Infectious Diseases, ciab519, https://doi.org/10.1093/cid/ciab519
