BigQuery - User Defined Functions and Scripting

A quick intro to UDFs and Scripting in BigQuery

ยท

8 min read

Introduction ๐Ÿ‘‹

These are two useful features of BigQuery that can either accelerate your queries or extend query capabilities to perform things such as loops.

User Defined Functions (UDFs) are user written functions that can be called in a similar way to using built-in BigQuery functions. Instead of writing a regular expression function over and over to clean a string, create an UDF once and call that function instead. Your colleagues can call the function too, which is a win-win!

Scripting enables users to write multi-statement SQL queries to run in a single execution and can incorporate coding logic such as loops and conditional statements.

User Defined Functions (UDF) ๐Ÿ‘ฉโ€๐Ÿ’ป

Introduction ๐Ÿ“˜

BigQuery has many useful built-in functions, but there may be code you're repeating that would be easier to use by calling it as a function. That's where User Defined Functions (UDF) comes in handy.

UDF can be:

  • SQL or JavaScript
  • Temporary or Persistent (although I think Persistent is much more useful)

SQL โŒจ๏ธ

Creating UDFs โœ๏ธ

The query structure for defining a persistent UDF is the following: CREATE FUNCTION [your_dataset].[desired_function_name](vars datatypes) AS (your code here)

Here are a few examples of creating SQL UDFs

-- Convert a string number with comma and return as an integer
CREATE FUNCTION my_dataset.parse_comma_number_as_int(num STRING) 
   OPTIONS (description="Convert a string number with comma and return as an integer") AS (
   CAST(REPLACE(num, ',', '') AS INT64)
);
-- Calculate area of a circle
CREATE OR REPLACE FUNCTION my_dataset.area_of_circle(pi FLOAT64, radius FLOAT64) RETURNS FLOAT64 AS (
   pi * POWER(radius, 2)
);

image.png

These functions will be placed in my_dataset. If you don't specify a project, like in these examples, they will be placed in the same project where the queries are run.

If you don't want to specify a datatype then use ANY TYPE and BigQuery won't throw an error about data types.

It's possible to create a temporary function by using CREATE TEMP FUNCTION instead, however I personally don't find them that useful. Persistent UDFs can be called by other users if they have IAM permission (bigquery.routines.get) to the dataset that stores them.

CREATE TEMP FUNCTION fumi_test.median_func(number ANY TYPE) AS (
   PERCENTILE_CONT(number, 0.5) OVER()
);

Calling UDFs ๐Ÿ“ฑ

To use the above UDFs we call them in a similar way to built-in functions except you will need to write the dataset where they are stored.

SELECT `my_dataset`.parse_comma_number_as_int("4,23423") as cleaned_number
, `my_dataset`.area_of_circle(3.14, 5) as area_of_circle

The result of this query is

image.png

It's also possible to reference persistent UDFs in another persistent UDF.

JavaScript โŒจ๏ธ

If you prefer to create a UDF in JavaScript or have a library you want to use then that's possible in BigQuery.

Creating UDFs โœ๏ธ

Here is an example from Google's documentation.

CREATE FUNCTION myFunc(a FLOAT64, b STRING)
  RETURNS STRING
  LANGUAGE js
  OPTIONS (
    library=["gs://my-bucket/path/to/lib1.js", "gs://my-bucket/path/to/lib2.js"]
  )
  AS r"""
    // Assumes 'doInterestingStuff' is defined in one of the library files.
    return doInterestingStuff(a, b);
 """;

SELECT myFunc(3.14, 'foo');

Similar to SQL UDFs we start by defining the name, arguments and return type. The main difference is declaring LANGUAGE js AS r""" [javascript code here] """. Everything in the triple quotes is JavaScript and that is what will be run. To include libraries use the OPTIONS (library=[path-to-lib]) section and this will make the libraries available in the JS code block.

If you only need a single line of JavaScript code simply use double quotes.

Authorised UDFs ๐Ÿ‘ฎ

Similar to Authorised Views you can let users run UDFs even if they don't have access to the dataset or tables that the UDF references. For example, you have a UDF that returns a field from tableA. Even if the user doesn't have access to tableA they can still use the authorised UDF.

UDFs Limitations ๐Ÿ›‘

UDFs are great, but be aware there are some limitations that either restrict what you can do or how much data can be processed.

Firstly, UDFs have rate limits and quotas so be aware of these.. Additionally, persistent UDFs have more rules such as length of function name and no duplicate names. Finally, JavaScript UDFs have more constraints than SQL.

JavaScript UDFs cannot:

  • Interact with DOM objects such as Window or Node
  • Use native code (e.g. code that is run in a browser and isn't explicitly called)
  • Process more than 5MB per row
  • Length of JavaScript library path cannot exceed 5000 characters

Also, be aware that a JavaScript UDF can time out and that will prevent your query from finishing.

Scripting ๐Ÿ‘ฉโ€๐Ÿ’ป

Introduction ๐Ÿ“˜

As mentioned earlier Scripting provides users additional functionality to support use cases that is challenging or even impossible with standard SQL alone such as running loops, nested conditional statements or dynamically creating SQL queries.

Simple Example ๐Ÿ”Ž

To start with BigQuery scripting lets look at a simple example.

DECLARE part_date INT64;

SET part_date = (
    SELECT EXTRACT(YEAR from max(some_date)) as max_date
    FROM `my-project.my-dataset.table-1` 
);

SELECT *
FROM `my-project.my-dataset.table-2`
WHERE year = part_date
LIMIT 10

The two key words used in this example are:

  • DECLARE - this initializes variables and their data type to be used later in the script. If you want to initialize with a default value use DEFAULT. An example would be DECLARE d DATE DEFAULT CURRENT_DATE();
  • SET - assigns a value or result of a query to variable(s)

It's important to end each statement with a `;'. When you run a script you will see the results as separate rows in the console.

This is a short example demonstrating how to use scripting. Of course this example could be rewritten to use a subquery in the WHERE clause so let's look at another example using more scripting features.

Another Example ๐Ÿ”

In the below example this script looks for a specific table from the INFORMATION_SCHEMA and grabs all column names that are either an integer or a boolean. Then these columns are used to dynamically create a SQL query using IF statements and a loop. Finally, the composed query is executed.

If you need a primer on the INFORMATION_SCHEMA then read my other blog post here!

# Query to grab all columns of a data type then generate a query to grab those dynamically
DECLARE column_array ARRAY<STRING>;
DECLARE num_columns INT64;
DECLARE dynamic_strings STRING DEFAULT "SELECT ";
DECLARE final_query STRING;

SET column_array = (
    SELECT ARRAY_AGG(column_name) as columns
    FROM `your_dataset_here.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'your_table_name_here'
    AND data_type in ('INT64', 'BOOL')
);

SET num_columns = (SELECT ARRAY_LENGTH(column_array));

BEGIN 
    DECLARE start_int INT64;
    SET start_int = 0;
    LOOP 
        IF start_int < num_columns THEN
            IF start_int = 0 THEN
                SET dynamic_strings = CONCAT(dynamic_strings, column_array[OFFSET(start_int)]);
                SET start_int = start_int + 1;
            ELSE
                SET dynamic_strings = CONCAT(dynamic_strings, ", ", column_array[OFFSET(start_int)]);
                SET start_int = start_int + 1;
            END IF;
        ELSE 
            BREAK;
        END IF;
    END LOOP;
END;

SET final_query = CONCAT(dynamic_strings, " FROM `fumis_project.your_dataset_here.your_table_name_here`");

EXECUTE IMMEDIATE final_query;

This example is a bit longer than the previous one so let's break it down by each section.

Initialise Variables ๐ŸŽฌ

DECLARE column_array ARRAY<STRING>;
DECLARE num_columns INT64;
DECLARE dynamic_strings STRING DEFAULT "SELECT ";
DECLARE final_query STRING;

Initialise a few variables to be used in the entire script. The variable dynamic_strings begins composing the dynamic query by initialising it with the value SELECT. This will be used in the loop.

Set Variables โŒจ๏ธ

SET column_array = (
    SELECT ARRAY_AGG(column_name) as columns
    FROM `your_dataset_here.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'your_table_name_here'
    AND data_type in ('INT64', 'BOOL')
);

SET num_columns = (SELECT ARRAY_LENGTH(column_array));

Grab column names based on the criteria I have given and also set the num_columns variable as this will be used in the LOOP.

Loops โžฐ

BEGIN 
    DECLARE start_int INT64;
    SET start_int = 0;
    LOOP 
        IF start_int < num_columns THEN
            IF start_int = 0 THEN
                SET dynamic_strings = CONCAT(dynamic_strings, column_array[OFFSET(start_int)]);
                SET start_int = start_int + 1;
            ELSE
                SET dynamic_strings = CONCAT(dynamic_strings, ", ", column_array[OFFSET(start_int)]);
                SET start_int = start_int + 1;
            END IF;
        ELSE 
            BREAK;
        END IF;
    END LOOP;
END;

This section of the query loops through each column in the column_array and concatenates all of them to dynamically generate our final SQL query.

There are a few new keywords here so lets go through them:

  • BEGIN and END; - Define a block of statements where any variables declared only exist until END. Essentially a local variable in your script. In the example start_ind is a local variable that isn't accessible outside of this block.
  • LOOP and END LOOP; - Iterate over a set of statements between up until END LOOP. You use either LEAVE or BREAK keyword to exit the loop. Usually paired with an if statement. It isn't shown in this example, but BigQuery also supports WHILE [condition] DO and END WHILE;
  • IF [condition] THEN and END IF; - Conditional statements for controlling flow and outputs depending on condition(s). It will execute the following expression when the statement is true. It can be extended with ELSEIF.

Execute the Dynamic Query ๐Ÿ”š

SET final_query = CONCAT(dynamic_strings, " FROM `fumis_project.your_dataset_here.your_table_name_here`");

EXECUTE IMMEDIATE final_query;

image.png

The final query is produced and we execute it using the EXECUTE keyword, which executes a dynamic SQL. In my example it's a standard query, however you can also run DDL or single DML statement.

There are great examples on Google Cloud's documentation on other ways of using this keyword, such as storing the result in variable(s) or generating a dynamic query with the keyword USING. Read more on those approaches here.

Thank you for reading ๐Ÿ‘‹

Thank you for reading and I hope this helps demystify BigQuery's Scripting and User Defined Functions. Be on the lookout for more BigQuery related posts!

I would love to hear from you and any comments you have!