BigQuery (GCP) - Information Schema

What you need to know to get started with INFORMATION_SCHEMA on BigQuery

Introduction 👋

I love that the GCP documentation is thorough, but at the same time, my brain occasionally freezes at the sheer number of pages I need to read to learn something. This post gives a concise overview of INFORMATION_SCHEMA so you have the foundational knowledge to get started and are in a good place to expand your learning on this topic.

BQ Information Schema ℹ️

BigQuery has a set of views that provide metadata on your project's environment, which you can query with SQL. Available metadata includes dataset, job, job timeline, reservation, streaming, routine, table, and view metadata. I will cover some of these in this post.

Note: Querying INFORMATION_SCHEMA incurs a charge for bytes processed, but with one key difference: you are always billed for a minimum of 10 MB, even if your query scans less data than that. This applies only to on-demand pricing customers. Additionally, query results are not cached, so be aware that running the same query multiple times means paying for it multiple times.

Intro to INFORMATION_SCHEMA 🎬

Qualifiers 📇

Queries against INFORMATION_SCHEMA must be associated with a dataset or region, which limits the results to that particular dataset or region. These prefixes are called qualifiers.

Selecting from a metadata view should follow the format [project ID (optional)].[dataset or region].INFORMATION_SCHEMA.[view], and INFORMATION_SCHEMA and the view name must be written in capital letters because they are case sensitive.
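For example, all of the following are valid ways to qualify a view (my-project and my_dataset are placeholder names here):

# Qualified by dataset only (runs against your default project)
SELECT * FROM my_dataset.INFORMATION_SCHEMA.TABLES;

# Qualified by region, wrapped in backticks because of the hyphen
SELECT * FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA;

# Fully qualified with an explicit project ID
SELECT * FROM `my-project`.my_dataset.INFORMATION_SCHEMA.TABLES;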

Basic Examples ⌨️

Here are a few examples of querying INFORMATION_SCHEMA:

# Select table metadata using TABLES
SELECT table_name, creation_time
FROM your_dataset.INFORMATION_SCHEMA.TABLES;

# Select dataset metadata from SCHEMATA
SELECT catalog_name AS project, schema_name AS dataset, creation_time
FROM `region-eu`.INFORMATION_SCHEMA.SCHEMATA;

Most of my data is stored in the EU region, so that's what I used in the query. If you don't use a regional qualifier, it defaults to the US.

For a list of qualifiers, visit this page.

Table Metadata 📊

The TABLES view makes it easy to obtain table names, their type, and creation dates. This may be useful if you need to select only the tables created this year, or only native tables rather than views or external tables.

SELECT table_name, table_type, is_insertable_into, creation_time
FROM your_dataset.INFORMATION_SCHEMA.TABLES
WHERE EXTRACT(YEAR FROM creation_time) = 2021
AND table_type = 'BASE TABLE';

If you need to grab table metadata such as labels, descriptions, table or partition expiry, or which tables require a partition filter, use the TABLE_OPTIONS view.

Note: Many objects also have a view with _OPTIONS suffixed at the end that provides similar metadata, like descriptions and labels.

# It's TABLE_OPTIONS not TABLES_OPTIONS
SELECT *
FROM your_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'require_partition_filter';
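
As another quick example (assuming you've set descriptions on your tables), the same view can pull them back out, since option names like description appear as rows in the option_name column:

# Grab the description set on each table
SELECT table_name, option_value AS description
FROM your_dataset.INFORMATION_SCHEMA.TABLE_OPTIONS
WHERE option_name = 'description';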

Columns Metadata 📰

If you need to list all columns for each table, or get their data type, nullable status, or position in a table, then use the COLUMNS view.

# Return all columns that are an integer or float
SELECT table_name, column_name, ordinal_position, data_type
FROM your_dataset.INFORMATION_SCHEMA.COLUMNS
WHERE data_type IN ('FLOAT64', 'INT64');
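
And if you'd rather have a summary than a row per column, a small aggregation over the same view does the trick. Here's a minimal sketch for counting columns per table:

# Count how many columns each table has
SELECT table_name, COUNT(*) AS column_count
FROM your_dataset.INFORMATION_SCHEMA.COLUMNS
GROUP BY table_name
ORDER BY column_count DESC;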

For more information on the TABLES and COLUMNS views, see the documentation.

Job Metadata ๐Ÿ‘ฉโ€๐Ÿ’ป

Before running queries for job metadata, ensure you have the correct IAM permissions, as certain qualifiers require specific permissions to work. The wider your scope (e.g. user vs. organisation), the more permissions you need. Additionally, be mindful when running SELECT * queries: mine processed 2-4 TiB to view job metadata for my project with no filters.

Why would this be useful? Perhaps you want to know query times, the cost of a query, or to look at the query text itself.

# Define variables to be used in field calculations. It was easier to read the code this way
DECLARE gb_div INT64 DEFAULT CAST(POWER(1024, 3) AS INT64);
DECLARE tb_div INT64 DEFAULT CAST(POWER(1024, 4) AS INT64);
DECLARE cost_per_tb FLOAT64 DEFAULT 5.00;

SELECT project_id
, job_id
, user_email
, creation_time
, start_time
, end_time
, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS query_in_ms
, cache_hit
, total_slot_ms
, total_bytes_processed
, (total_bytes_processed / gb_div) AS total_gigabytes_processed
, (total_bytes_processed / tb_div) AS total_terabytes_processed
, (total_bytes_processed / tb_div) * cost_per_tb AS total_query_cost
, query
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_USER;

The above query defines variables for calculating the total gigabytes and terabytes processed, then uses the terabyte value to generate a price per query.
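Building on that, here's a sketch of a per-user cost summary. A few assumptions: it uses JOBS_BY_PROJECT (which needs broader permissions, as noted above) and the same $5.00 per TiB on-demand rate, and it filters on creation_time, which should keep the bytes scanned down since the JOBS views are partitioned on that column:

# Estimated on-demand cost per user over the last 7 days
SELECT user_email
, COUNT(*) AS job_count
, SUM(total_bytes_processed) / POWER(1024, 4) AS terabytes_processed
, SUM(total_bytes_processed) / POWER(1024, 4) * 5.00 AS estimated_cost_usd
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY user_email
ORDER BY estimated_cost_usd DESC;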

There are a lot of fields available in the JOBS_BY_* views. Find the rest here.

Conclusion 💡

These examples only scratch the surface of what is available in the INFORMATION_SCHEMA views. They are a great way of listing datasets, tables, columns, jobs, and more with SQL, enabling use cases such as knowing how many columns are in each table or how much each BigQuery job costs. In my next posts, I'll talk about BigQuery Scripting and User Defined Functions for levelling up your BigQuery experience by enabling even more use cases.

Thank you for reading 👋

Thank you for reading, and I hope this helps demystify BigQuery's handy dandy INFORMATION_SCHEMA. Be on the lookout for more BigQuery-related posts!

I would love to hear from you and any comments you have!