selnekovic.com

BigQuery tip – Updating data in sharded tables

If you work with sharded tables like GA4 export to BigQuery, you may encounter a situation where raw or source data needs to be updated across multiple tables. Manually tackling such updates can be tedious and error-prone.

How can we do it better? The answer is Dynamic Query Execution. The EXECUTE IMMEDIATE statement allows us to construct and run SQL commands on the fly, which is perfect for the sharded table updates.

Here is a practical example: imagine we need to anonymize user_id across all sharded tables within a specific date range. Below is a BigQuery SQL snippet. It demonstrates how to achieve this efficiently using scripting.

-- Change the query location settings in the BigQuery console as needed.

DECLARE date_suffix STRING; -- Holds the formatted date for each table.
DECLARE base_query STRING; 
DECLARE start_date DATE DEFAULT DATE '20241001'; -- Start date for the range.
DECLARE end_date DATE DEFAULT DATE '20241101'; -- End date for the range.
DECLARE new_user_id STRING DEFAULT 'anonymous'; -- New user ID to set.

-- Loop through each date between the start and end date.
FOR date_iter IN (
    SELECT FORMAT_DATE('%Y%m%d', day) AS date_suffix
    FROM UNNEST(GENERATE_DATE_ARRAY(start_date, end_date)) AS day
)
DO
  -- Construct the UPDATE query for the current date's table.
  SET base_query = FORMAT("""
    UPDATE `project.dataset.events_%s`
    SET user_id = '%s'
    WHERE user_id = 'uid_007';
  """, date_iter.date_suffix, new_user_id);

  -- Execute the constructed query.
  EXECUTE IMMEDIATE base_query;
END FOR;



This approach is an excellent example of how BigQuery’s advanced scripting capabilities can save time and simplify workflows.

Share the wisdom