Comparing query times of Parquet files with different compression codecs

Angelos Alexopoulos
2 min read · Jan 22, 2021

In this short article, we do a quick comparison of fetching data from S3 data lakes that store data in Parquet format with compression (either Snappy or gzip).

We compare Amazon Redshift and Presto as query engines. Both execution engines can distribute the execution of SQL commands across a cluster of machines.

Logistics

Our dataset is medium-sized and partitioned by time. We are mainly interested in the time spent scanning a specific folder. So let’s say we want to scan a folder that contains eight gzip-compressed Parquet files of equal size (about 35 MB each), roughly 5 million records in total.
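For reference, here is a minimal Python sketch (using boto3) of how such a partition folder could be listed and sized. The bucket name is a placeholder, and the Hive-style dl_partition_* path layout is an assumption based on the partition columns used in the queries below.

import boto3

s3 = boto3.client("s3")
bucket = "my-datalake-bucket"  # assumption: placeholder bucket name
prefix = ("google/sessions_desktop_hits_unnest/"
          "dl_partition_year=2021/dl_partition_month=01/dl_partition_day=20/")

resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in resp.get("Contents", []):
    # Expect 8 gzip-compressed Parquet files of roughly 35 MB each.
    print(obj["Key"], f"{obj['Size'] / 1024 / 1024:.1f} MB")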

Redshift Cluster: 8 Nodes of dc2.large machines.

Presto Cluster: 15+1 Nodes of r5d.4xlarge machines.

We are going to use the following 2 SQL queries.

A. Selecting the total count of records:

select count(1) from google.sessions_desktop_hits_unnest where dl_partition_year = '2021' and dl_partition_month = '01' and dl_partition_day = '20'

B. Selecting one specific column

select fullvisitorid from google.sessions_desktop_hits_unnest where dl_partition_year = '2021' and dl_partition_month = '01' and dl_partition_day = '20' and hitnumber > 0

Gzip Compression

Of course, it makes sense that Redshift is much slower than Presto here, since we use a much more powerful cluster of machines for Presto.

Snappy Compression

The first thing to notice is that the file size has increased to 55 MB.

Conclusions

From the above comparisons, we cannot draw solid conclusions about how the compression algorithm affects overall query speed. It seems to slow down select queries slightly, but we cannot be sure that compression was the main reason.

So why should we even consider using Snappy? Since Snappy produces larger files, which means higher storage costs and sometimes higher transfer costs, what is the main benefit? Well, Google created and uses the Snappy compression algorithm because it is fast to compress and not very CPU intensive.
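As a rough illustration of that trade-off, here is a minimal Python sketch using pyarrow that rewrites one Parquet file with each codec and reports the write time and output size. The input filename is a placeholder, and this is a local experiment rather than the cluster benchmark discussed above.

import os
import time
import pyarrow.parquet as pq

# Placeholder path: a local copy of one of the partition's Parquet files.
table = pq.read_table("part-00000.parquet")

for codec in ("gzip", "snappy"):
    out = f"out_{codec}.parquet"
    start = time.time()
    pq.write_table(table, out, compression=codec)  # same data, different codec
    elapsed = time.time() - start
    size_mb = os.path.getsize(out) / 1024 / 1024
    print(f"{codec}: wrote {size_mb:.1f} MB in {elapsed:.1f} s")

On a typical run, the Snappy file should come out larger but should be written noticeably faster and with less CPU time than the gzip one.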
