What the Struct are you talking about? These Nested Fields in BigQuery are confusing!

An introduction to the STRUCT and ARRAY data types in BigQuery

ยท

13 min read

Introduction ๐Ÿ‘‹

Have you ever stumbled upon a table in BigQuery which has columns with data for each row, but other columns only have one value with the rest as grey cells? Well, that is an example of an array, one of the data types in BigQuery. Additionally, have you seen a table where the column names are in the format [some_name].[col_name]? That is an example of a struct. This post will help clarify their purpose in BigQuery and how to use them.

Arrays ๐ŸŸช, ๐ŸŸจ, ๐ŸŸฆ

Introduction

An array is a data type in BigQuery that can store an ordered group of values in a list. The values it stores must all be the same type (e.g. [1,2,3]). The easiest way of creating an array in BigQuery is to use square brackets and declare your desired values like ['hello', 'goodbye', 'yes']. An array can contain structs, but they cannot contain arrays.

Creating an array manually may not be desirable, especially if you have many values, so there are plenty of built in array functions to use.

# This creates an array of dates by a 1 week interval
SELECT GENERATE_DATE_ARRAY('2021-04-11', '2021-06-11', INTERVAL 1 WEEK) as date_array

To access individual array elements use either OFFSET([num]) to grab an element at position using 0-based index. There is also ORDINAL([num]) to select an element by position using 1-based indexing.

# This example selects the first element and second element in the array
SELECT 
date_array[OFFSET(0)] as first_val
, date_array[ORDINAL(2)] as second_val
FROM (SELECT GENERATE_DATE_ARRAY('2021-04-11', '2021-06-11', INTERVAL 1 WEEK) as date_array)

Why bother?

They're a great way of adding additional information to a table even if the data is at a different level of granularity. Imagine if a person orders clothes from a retailer. That individual order will go through many stages to get to the customer - order received, packaged, sent, and delivered to purchaser. By embedding arrays into a table we can have information on all orders and the data on each stage and when it happened.

SELECT "Fumi" as customer, "123" as order_id, DATE(2021, 03, 05) as date_ordered, 
[STRUCT('RECEIVED' as status, DATE(2021, 03,05) as status_date), 
STRUCT('PACKAGED' as status, DATE(2021,03,06) as status_date), 
STRUCT('SENT' as status, DATE(2021,03,06) as status_date), 
STRUCT('DELIVERED' as status, DATE(2021,03,08) as status_date)] as status

The image below contains the results of the query. As previously mentioned for a single order we can see multiple status' and their dates. With Arrays we can create tables that contain different levels of data instead of referencing many tables for analysis.

image.png

Since arrays can store multiple values for each row you will most likely have to modify the format of the data to run certain queries.

SELECT "Fumi" as name, "Tetris" as game, [1, 4, 2, 8] as game_scores
UNION ALL 
SELECT "Fumi" as name, "Pacman" as game, [8, 6, 5, 10] as game_scores
UNION ALL 
SELECT "Peach" as name, "Tetris" as game, [3, 4, 7, 8] as game_scores

Lets say I have I have a table with score for each person for a particular game. Currently there are four scores per person per game. When selecting all of the data BigQuery will simply return the data - no fuss. However, if we want to write a query that returns the name of the player and their top score in descending order, BigQuery will throw an error. Certain functions cannot be applied to arrays or some queries cannot work when data is not at the same granularity, thus we must adjust our data by using UNNEST(). We use this function to flatten the array so that each element in the array is on an individual row.

image.png

Remember for each row in the table there is the games_score array. The example shows only three rows, and each row contains an additional four numbers to show different scores for the same game and person.

Lets write a query that uses UNNEST()

SELECT name, game, scores
FROM 
    (SELECT "Fumi" as name, "Tetris" as game, [1, 4, 2, 8] as game_scores
    UNION ALL 
    SELECT "Fumi" as name, "Pacman" as game, [8, 6, 5, 10] as game_scores
    UNION ALL 
    SELECT "Peach" as name, "Tetris" as game, [3, 4, 7, 8] as game_scores)
CROSS JOIN UNNEST(game_scores) as scores
ORDER BY scores DESC

The above query is telling BigQuery to take the games_score field and apply a correlated cross join (the short-hand is a , which you may see in other examples). This takes each row from the source table then joins in each element from the games_score field so there are four rows instead of one.

The result of the above query looks like this

image.png

There are many more array functions that are not in scope of this post. To view them please look at Google Cloud Platform's BigQuery documentation

Structs ๐Ÿ“ฆ

Introduction

The other data type is a STRUCT - this is used for storing nested data. A struct is a container of ordered fields with a type (required) and field name (optional). In many cases the data type can be determined from the inputs, so it's possible to create a struct by using parenthesis only. I personally prefer to use STRUCT as it's easier to read.

# a few examples
SELECT (1,2,3)
, ('a', 'b', 'c') as a_struct
, STRUCT('Peach' as name, 'cat' as type, 1 as age) as pet

The first two examples will add their own names, however it's much easier to use when you define names.

The struct can contain different data types and can contain structs and arrays.

SELECT STRUCT('Smokey' as name, 'cat' as type, 10 as age, ['Sleep', 'Treats'] as favourites) as pet
UNION ALL
SELECT STRUCT('Peach' as name, 'cat' as type, 1 as age, ['Catnip', 'Garden', 'Mice'] as favourites) as pet
UNION ALL
SELECT STRUCT('Ambear' as name, 'dog' as type, 8 as age, ['Walks'] as favourites) as pet

When querying a table with a struct you need to reference the struct name then the field name. To grab the names of all of the pets I would run SELECT pet.name FROM table1.

Why?

Firstly, if you have nested data types, like an XML or JSON, you may want to retain the format and that can be achieved with structs. Additionally, if you're running analytical queries on data that is spread across many tables, you are likely to be running many JOIN statements. In some cases this isn't an issue, but if you have a lot of data or have many joins (more than a few) then this can be time consuming and costly. By using Structs you can permanently store this data and categorise them neatly in your tables.

Nested and Repeated Fields โ›˜

We can combine the two data types ARRAY and STRUCT to create tables that can support data in semi-structured formats.

image.png

Before diving into an example there are two things to look out for in a schema to identify STRUCTS and ARRAYS.

  • If the TYPE is RECORD then it's a Struct.
  • If the MODE is REPEATED then it's an Array

If you see both RECORD and REPEATED then it's an array where each element contains a Struct.

SELECT STRUCT('Fumi' as owner, 'UK' as country) as owner
, [STRUCT('Peach' as name, 'cat' as type, 1 as age, ['Catnip', 'Garden', 'Mice'] as favourites)
, STRUCT('Smokey' as name, 'cat' as type, 10 as age, ['Sleep', 'Treats'] as favourites)] as pets
UNION ALL 
SELECT STRUCT('Barb' as owner, 'Canada' as country) as owner
, [STRUCT('Ambear' as name, 'dog' as type, 8 as age, ['Walks'] as favourites)
, STRUCT('Jo' as name, 'cat' as type, 5 as age, ['Humans', 'Meowing'] as favourites)] as pets

The schema for this data looks like this image.png

The data looks like this

image.png

For each owner they have one or more pets and there is additional information on each pet including their age and favourite activities. Accessing data from the owner struct is as simple as owner.owner, but to grab data from the pets struct, due to having an array we must unnest it.

image.png

Writing a query similar to the previous one in this post, if we use UNNEST we can flatten the array to grab the fields we want from the struct.

Audit Logs on GCP ๐Ÿ•ต๏ธ

The previous examples I provided are quite light in comparison to other tables using nested and repeated fields. One example is the audit logs that can be exported to BigQuery. This is helpful for businesses to monitor activity across GCP to ensure their environments are safe and used in an expected way. Creating this table is fairly straight forward - in Cloud Monitoring you create a sink that writes the activity to BigQuery. The resulting table uses a lot of nested and repeated fields and the table in the screenshot has over 100 columns! It can be quite intimidating if you don't use repeated or nested fields.

image.png

A simple query to extract a small subset of fields could look like this

SELECT 
protopayload_auditlog.methodName,
auth_info.permission,
auth_info.granted,
auth_info.resource
FROM `my_logs.cloudaudit_googleapis_com_activity` 
, UNNEST(protopayload_auditlog.authorizationInfo) as auth_info
WHERE DATE(timestamp) = "2021-03-11" 
LIMIT 1000

Which returns a result that looks like this

image.png

Thanks

I hope this helped clarify how to use Arrays and Structs and why you should consider incorporating them into your BigQuery tables. I would love to hear your comments or suggestions for other topics!