Data Engineering SQL Tips
This can be an ongoing list of small SQL scripts I am reusing in various projects.
I. Validate backfilling. There are cases where we need to check if a backfilling procedure has finished correctly or if a silent failure has happened. We can quickly compare and check if our database has records for all dates.
This is a simple script that generates a list of ALL dates and checks to see if we have records for each date. This is run in Snowflake but of course we can adopt to other db servers also.
— ANY Missing data points
WITH CTE_MY_DATE AS (
SELECT TIMEADD(‘hour’, row_number() over (order by 1), ‘2022–09–01 00:00:00’) date_t
FROM TABLE(GENERATOR(ROWCOUNT => 12*24)))
select date_t from cte_my_date where date_t not in(
select partition_year||’-’||partition_month||’-’||partition_day||’ ‘||partition_hour||’:00:00.000' from page_speed.metrics group by 1
)
We can break the SQL into 2 parts. First, we create a CTE, and we generate a list of all dates. So the first part generates a list like:
DATE_T
2022–09–01 01:00:00.000
2022–09–01 02:00:00.000
2022–09–01 03:00:00.000
2022–09–01 04:00:00.000
2022–09–01 05:00:00.000
2022–09–01 06:00:00.000
2022–09–01 07:00:00.000
2022–09–01 08:00:00.000
2022–09–01 09:00:00.000
2022–09–01 10:00:00.000
2022–09–01 11:00:00.000
2022–09–01 12:00:00.000
2022–09–01 13:00:00.000
2022–09–01 14:00:00.000…..
Now we ask to get all date_t that do not exist in page_speed.metrics table.
II. 16 MB limit for snowflake columns
As per Snowflake documentation, the VARCHAR column can store only 16MB data. The VARIANT data type allows 16MB of compressed data, so that might give you more storage than a VARCHAR column, which limits to 16MB of uncompressed data. (source)
The above tip is for Native Snowflake tables. But in my experience more important is what happens to External tables or when we try to load a really big (more than 16MB json source file to an external table)
There is an excellent article and detailed tutorial here on how to load and query json data in Snowflake. I do not want to repeat here the process. But if you store in the json file cars.json an array of objects like:
[
{“make”: “Honda”, “model”: “Civic”, “year”: “2017”},
{“make”: “Ford”, “model”: “Focus”, “year”: “2020”}
]
you can query the staged data and the returned result will have 1 row for each object:
select * from cars_staged_table;Row $11 {“make”: “Honda”, “model”: “Civic”, “year”: “2017”}2 {“make”: “Ford”, “model”: “Focus”, “year”: “2020”}
Of course, we can now access the json fields directly with the following queries:
select PARSE_JSON($1):"make"::VARCHAR AS car,
PARSE_JSON($1):"model"::VARCHAR AS model,
PARSE_JSON($1):"year"::INTEGER AS year
from cars_staged_tableROW car model year
1 Honda Civic 2017
2 Ford Focus 2020
III. Query directly AWS S3
If you have created an external stage in snowflake its pretty simple to query the data in snowflake. I know that this trick is one of the most used Snowflake features but the ability to use SQL and query CSV, JSON, Avro, ORC, Parquet, or XML files is amazing and one of my superpowers.
The Snowflake documentation is pretty thorough and detailed and I am copying here their example that we use to read a Zipped json file with the following contents
{"a": {"b": "x1","c": "y1"}},
{"a": {"b": "x2","c": "y2"}}
So here is the script that queries and fetches even nested fields from a zipped json file.
-- Create a file format
create or replace file format my_json_format type = 'json';
-- Create an internal stage
create or replace stage mystage2 file_format = my_json_format;
-- Stage the data file
put file:///tmp/data1.json @mystage2;
-- Query the repeating a.b element in the staged file
select parse_json($1):a.b from @mystage2/data1.json.gz;
+--------------------+
| PARSE_JSON($1):A.B |
|--------------------|
| "x1" |
| "x2" |
+--------------------+
IV: Group timestamp by hours
It is really easy to group or select a date from a timestamp column:
select start_timestamp::DATE from table
But how about selecting or grouping by Hours? This seems a little more complex but we can use datediff Snowflake function that calculates how many hours have passed from a starting datetime.
select datediff(hour, ‘2022–09–01’, stat_time) diff_hours, count(*)
from page_speed.vitals
where stat_time::DATE<’2022–09–02'
and stat_time::DATE>=’2022–09–01'
group by 1
order by 1
The above query returns the count per hour for the selected time period:
Row diff_hours count
1 0 20
2 1 25
3 2 33
4 3 45
5 4 66
6 5 700
7 6 900
8 7 1200
9 8 2320
...
V. Check for Duplicates
This is a classic query that I am pretty sure everyone is using more than once per day. Let me add it here so that we can have a more complete list of tricks. So here we group by all the key fields and return the ones that appear more than once.
SELECT
browser,
country, device,
page,
page_type,
render_type,
stat_time,
stat_type,
vital,
stat_value,
COUNT(*)
FROM page_speed.vital
WHERE partition_year || '-' partition_month '-' || partition_day ='2022-09-01
GROUP BY
1,2,3,4,5,6,7,8,9,10
HAVING
COUNT(*) > 1