SQL Optimization

In this article, you will find helpful resources for optimizing SQL queries in Jupyter Notebooks within the Researcher Workbench. For more general information about Jupyter Notebooks, see the beginner guide linked here. Notebooks currently support analysis in R, Python, and SQL. When analyzing data in Jupyter Notebooks, consider the query optimization techniques below to reduce computational costs. Another helpful article on SQL optimization is also linked here.

 

Use Google Buckets to save your notebook files

It is important to save your files permanently in Google Buckets and to know how to load these data back into your notebook for continued analysis. Doing so saves notebook loading time and makes your queries more efficient, further reducing the use of your available credits. Please watch the linked video for detailed instructions.
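As a rough sketch of the idea (not the Workbench's official snippet), the workspace bucket URI is commonly exposed to notebooks through the WORKSPACE_BUCKET environment variable, and files can be copied there with gsutil. The bucket name and folder layout below are invented for illustration:

```python
import os

# Illustrative fallback only; in a real Workbench notebook WORKSPACE_BUCKET
# is already set for you. This bucket name is made up.
os.environ.setdefault("WORKSPACE_BUCKET", "gs://fc-secure-example-bucket")

def bucket_copy_command(local_path, bucket=None):
    """Build a gsutil command that copies a local notebook file to the bucket."""
    bucket = bucket or os.environ["WORKSPACE_BUCKET"]
    destination = f"{bucket}/data/{os.path.basename(local_path)}"
    return ["gsutil", "cp", local_path, destination]

cmd = bucket_copy_command("my_results.csv")
print(" ".join(cmd))
# In a notebook you would then execute it, e.g. subprocess.run(cmd, check=True)
```

Loading the file back later is the reverse copy (`gsutil cp <bucket path> .`), so the expensive query that produced it never has to be re-run.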

 

Use of OHDSI (Observational Health Data Sciences and Informatics) Standard Clinical Data Tables (GitHub link)

These tables show core information about the clinical events that occur longitudinally during valid Observation Periods for each Person, as well as demographic information for the Person. The image below provides an entity-relationship diagram highlighting the tables within the Standardized Clinical Data portion of the Observational Medical Outcomes Partnership (OMOP) Common Data Model. 

 

[Image Picture1.png: entity-relationship diagram of the OMOP Standardized Clinical Data tables]

 

SELECT fields instead of using SELECT *

When running exploratory queries, many SQL developers use SELECT * (read as “select all”) as shorthand to query all available columns from a table. However, if a table has many fields, this increases your costs by querying a lot of unnecessary data.

Using the SELECT statement will query only the data you need for your research. 

Inefficient:

     SELECT *

     FROM `{dataset}.measurement`

Efficient:

     SELECT

          person_id,

          measurement_date,

          measurement_concept_id,

          value_as_number

     FROM `{dataset}.measurement`

The second query is much cleaner and pulls only the required columns from the measurement table.

To keep an index of all tables and field names, run a query from a system table such as INFORMATION_SCHEMA or ALL_TAB_COLUMNS.
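As a toy illustration of the difference, here is a minimal Python sketch using the built-in sqlite3 module (the Workbench itself queries BigQuery, and the table and rows here are invented):

```python
import sqlite3

# Mock measurement table with six columns; real OMOP tables are much wider.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE measurement (
    measurement_id INTEGER, person_id INTEGER, measurement_date TEXT,
    measurement_concept_id INTEGER, value_as_number REAL, provider_id INTEGER)""")
con.execute("INSERT INTO measurement VALUES (1, 10, '2020-05-16', 3004410, 98.6, 7)")

all_cols = con.execute("SELECT * FROM measurement")
print(len(all_cols.description))   # 6 columns transferred

needed = con.execute(
    "SELECT person_id, measurement_date, measurement_concept_id, value_as_number "
    "FROM measurement")
print(len(needed.description))     # only the 4 columns the analysis needs
```

The wider the table, the more data (and cost) the explicit column list saves.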

 

Avoid SELECT DISTINCT (see the linked article)

SELECT DISTINCT is one way to remove duplicates from a query. It works by grouping all fields in the query to produce distinct results, which requires a large amount of processing power. Additionally, data may be grouped to the point of being inaccurate. To avoid SELECT DISTINCT, select more fields to create unique results.

 

Inefficient and inaccurate example:

     SELECT

        DISTINCT person_id

     FROM `{dataset}.measurement`

The query above does not account for people with more than one measurement type (e.g., height, weight): all of a person's rows are collapsed into one. In large databases, deduplicating a large number of person IDs will also cause this query to run slowly.

 

Efficient and accurate:

     SELECT

        person_id

     , measurement_concept_id

     , measurement_date

     FROM `{dataset}.measurement`

By adding more fields, unduplicated records were returned without using SELECT DISTINCT. The database does not have to group any fields, and the number of records is accurate.
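The collapsing behavior can be seen in a small sqlite3 sketch (a toy stand-in for the real BigQuery tables; the concept IDs are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE measurement (
    person_id INTEGER, measurement_concept_id INTEGER, measurement_date TEXT)""")
con.executemany("INSERT INTO measurement VALUES (?, ?, ?)",
                [(1, 3036277, '2020-01-01'),   # e.g. a height record
                 (1, 3025315, '2020-01-01'),   # e.g. a weight record
                 (2, 3036277, '2020-02-01')])

distinct_rows = con.execute("SELECT DISTINCT person_id FROM measurement").fetchall()
full_rows = con.execute(
    "SELECT person_id, measurement_concept_id, measurement_date "
    "FROM measurement").fetchall()
print(len(distinct_rows), len(full_rows))  # 2 vs 3
```

DISTINCT folds person 1's two measurement types into a single row, while selecting the extra fields keeps all three records without any deduplication work.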

 

Create joins with INNER JOIN (not WHERE)

Some SQL developers prefer to make joins with WHERE clauses, such as the following:

     SELECT

          m.person_id,

          m.measurement_source_concept_id,

          m.measurement_date,

          p.gender_concept_id

     FROM `{dataset}.person` p, `{dataset}.measurement` as m

     WHERE p.person_id = m.person_id

Use INNER JOIN instead:

     SELECT

          m.person_id,

          m.measurement_source_concept_id,

          m.measurement_date,

          p.gender_concept_id

     FROM `{dataset}.measurement` as m

     INNER JOIN `{dataset}.person` p ON p.person_id = m.person_id
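A minimal sqlite3 sketch of the join above (mock tables and invented rows; the real person and measurement tables live in BigQuery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (person_id INTEGER, gender_concept_id INTEGER)")
con.execute("""CREATE TABLE measurement (
    person_id INTEGER, measurement_source_concept_id INTEGER,
    measurement_date TEXT)""")
con.execute("INSERT INTO person VALUES (1, 8532)")
con.execute("INSERT INTO person VALUES (2, 8507)")
con.execute("INSERT INTO measurement VALUES (1, 903133, '2020-05-16')")

rows = con.execute("""
    SELECT m.person_id, m.measurement_source_concept_id,
           m.measurement_date, p.gender_concept_id
    FROM measurement m
    INNER JOIN person p ON p.person_id = m.person_id
""").fetchall()
print(rows)  # only person 1 has a measurement, so one joined row comes back
```

The explicit INNER JOIN states the join condition up front, which keeps the intent clear and avoids accidentally producing a cross product if the WHERE clause is forgotten.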

 

Use WHERE instead of HAVING to define filters

The goal of an efficient query is to pull only the required records from the database. Per the SQL Order of Operations, HAVING statements are calculated after WHERE statements. If the intent is to filter a query based on conditions, a WHERE statement is more efficient.

     SELECT

          m_ext.src_id,

          COUNT(1)

     FROM `{dataset}.measurement` m

     INNER JOIN `{dataset}.measurement_ext` m_ext ON m.measurement_id = m_ext.measurement_id

     GROUP BY m_ext.src_id

     HAVING m.measurement_date >= '2020-05-16'

In comparison, WHERE clauses limit the number of records pulled:

     SELECT

          m_ext.src_id,

          COUNT(1)

     FROM `{dataset}.measurement` m

     INNER JOIN `{dataset}.measurement_ext` m_ext ON m.measurement_id = m_ext.measurement_id

     WHERE m.measurement_date >= '2020-05-16'

     GROUP BY m_ext.src_id

HAVING should only be used when filtering on an aggregated field.
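Both roles can be shown side by side in a sqlite3 sketch (toy table and rows; dates and site IDs are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measurement (src_id TEXT, measurement_date TEXT)")
con.executemany("INSERT INTO measurement VALUES (?, ?)",
                [("EHR site 100", "2019-12-01"),
                 ("EHR site 100", "2020-06-01"),
                 ("EHR site 200", "2020-07-01")])

# WHERE trims rows before GROUP BY ever sees them.
filtered = con.execute(
    "SELECT src_id, COUNT(1) FROM measurement "
    "WHERE measurement_date >= '2020-05-16' "
    "GROUP BY src_id ORDER BY src_id").fetchall()
print(filtered)  # [('EHR site 100', 1), ('EHR site 200', 1)]

# HAVING is for conditions on aggregates, computed after grouping.
repeated = con.execute(
    "SELECT src_id, COUNT(1) AS n FROM measurement "
    "GROUP BY src_id HAVING COUNT(1) > 1 ORDER BY src_id").fetchall()
print(repeated)  # [('EHR site 100', 2)]
```

The first query aggregates only the two rows that survive the date filter; the second shows HAVING's legitimate use, filtering on the aggregate COUNT(1) itself.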

 

Use wildcards at the end of a phrase only

When searching plaintext data, such as cities or names, wildcards create the widest search possible. However, the widest search is also the most inefficient search.

When a leading wildcard is used, especially in combination with an ending wildcard, the database is tasked with searching all records for a match anywhere within the selected field.

Consider the query below to count rows by EHR site:

     SELECT

          mm.src_id,

          COUNT(1) as n

     FROM `{dataset}.device_exposure` as m

     LEFT JOIN `{dataset}.device_exposure_ext` as mm on m.device_exposure_id = mm.device_exposure_id

     WHERE lower(mm.src_id) like '%ehr site%'

Since EHR site IDs all start with the words `ehr site`, with no other words before them, a more efficient query would be:

     SELECT

          mm.src_id,

          COUNT(1) as n

     FROM `{dataset}.device_exposure` as m

     LEFT JOIN `{dataset}.device_exposure_ext` as mm on m.device_exposure_id = mm.device_exposure_id

     WHERE lower(mm.src_id) like 'ehr site%'
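In this toy sqlite3 sketch (invented rows standing in for the real extension table), both patterns return the same rows, but only the prefix-only pattern lets an engine stop matching early, or use an index on a real system:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE device_exposure_ext (src_id TEXT)")
con.executemany("INSERT INTO device_exposure_ext VALUES (?)",
                [("EHR site 100",), ("EHR site 200",), ("PPI/PM",)])

# Leading wildcard: every character position of every row must be checked.
leading = con.execute(
    "SELECT src_id FROM device_exposure_ext "
    "WHERE lower(src_id) LIKE '%ehr site%' ORDER BY src_id").fetchall()

# Prefix-only wildcard: a row is rejected as soon as its start fails to match.
prefix = con.execute(
    "SELECT src_id FROM device_exposure_ext "
    "WHERE lower(src_id) LIKE 'ehr site%' ORDER BY src_id").fetchall()

print(leading == prefix)  # True: same result, cheaper search
```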

 

Use LIMIT to sample query results

Before running a query for the first time, ensure the results will be desirable and meaningful by using a LIMIT statement (some DBMSs use TOP instead of LIMIT). The LIMIT statement returns only the number of records specified. Using a LIMIT statement prevents taxing the production database with a large query, only to find out the query needs editing or refinement.

In the query below, the output is limited to 10 records:

     SELECT  

          concept_name, 

          count(distinct person_id) as n_participants

     FROM 

     `{dataset}.concept` 

     JOIN `{dataset}.concept_ancestor` on (concept_id=ancestor_concept_id)

     JOIN `{dataset}.observation` on (descendant_concept_id=observation_concept_id)

     JOIN `{dataset}.observation_ext` using(observation_id)

     WHERE observation_concept_id not in (40766240,43528428,1585389) --hcau vocab issues

     AND concept_class_id='Module'

     AND src_id='PPI/PM'

     AND questionnaire_response_id is not null

     GROUP BY 1

     ORDER BY n_participants desc

     LIMIT 10

You can see by the sample whether you have a usable data set or not.
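The effect of LIMIT is easy to see in a sqlite3 sketch (toy table with invented rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE observation (person_id INTEGER, concept_name TEXT)")
con.executemany("INSERT INTO observation VALUES (?, ?)",
                [(i, "The Basics") for i in range(1000)])

# LIMIT caps the rows returned, so a first dry run touches only a handful.
sample = con.execute(
    "SELECT person_id, concept_name FROM observation LIMIT 10").fetchall()
print(len(sample))  # 10 rows back, even though the table holds 1000
```

Once the sample confirms the query's shape is right, the LIMIT can be removed for the full run.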

 

Run your query during off-peak hours

In order to minimize the impact of your analytical queries on the production database, talk to a DBA (Database Administrator) about scheduling the query to run at an off-peak time. The query should run when concurrent users are at their lowest number, which is typically the middle of the night (3 a.m. – 5 a.m.).

The more of the following criteria your query meets, the more likely it is a candidate to run at night:

  • Selecting from large tables (>1,000,000 records)
  • Cartesian Joins or CROSS JOINs
  • Looping statements
  • SELECT DISTINCT statements
  • Nested subqueries
  • Wildcard searches in long text or memo fields
  • Multiple schema queries

Initial Optimization Checks

  • pip install user-query-optimizer

Additional Tips

  • Avoid for loops (at all costs): these are extremely slow in both R and Python
  • Don’t reinvent the wheel. Take advantage of libraries such as data.table (CRAN introduction) or dplyr for R, and numpy and pandas for Python
  • Profile your code to find out which part is slow
  • Use approximate algorithms (e.g., approx_distinct() instead of COUNT(DISTINCT …))
  • Select the columns you want explicitly, rather than using SELECT *
  • Filter on partitioned columns
  • Extract nested subqueries into a WITH clause
  • Filter on the most effective columns, according to parquet-file ordering
  • Eliminate filtering overhead for partitioned columns
  • Replace UNION with UNION ALL if duplicates do not need to be removed
  • Aggregate a series of LIKE clauses into one regexp_like expression
  • Push down a complex join condition into a subquery
  • Specify GROUP BY targets with numbers for expressions
  • Avoid connecting to the database and creating a cursor for each row in a loop; moving these steps outside the loop is much faster
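The last tip above can be sketched with the built-in sqlite3 module (toy table and invented rows): one connection is opened once and the inserts are batched, instead of reconnecting and re-creating a cursor per row.

```python
import sqlite3

# Open the connection once, outside any loop.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (person_id INTEGER, value REAL)")

# Batch all rows into a single executemany call rather than one
# connect/cursor/execute cycle per row.
rows = [(i, i * 0.5) for i in range(100)]
con.executemany("INSERT INTO results VALUES (?, ?)", rows)

n = con.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(n)  # 100
```

On a real client-server database the saving is much larger, since each connection carries network and authentication overhead.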
