Mishaps in Redshift Temporary Tables

The database management system (DBMS) I’ll reference today is Redshift (AWS’s cloud data warehouse offering), which is based on Postgres. I’ll dive into some of the temporary table anti-patterns I’ve seen abused during my work experience, why they don’t make sense, and how to fix them.

Let’s get it out of the way… Temporary Tables are not “better” than Common Table Expressions (CTEs) and CTEs are not “better” than temporary tables. It is entirely dependent on the context, the data, the DBMS, the query, upstream tables, etc… because at the end of the day the stakeholder wants the 1MM row output in Excel so they can make a pivot table. They don’t care if it was procured via CTEs, TEMP tables, or that you put another hole in the home-office drywall.

A few other notes:

Internally, most analytics queries we run scan TBs of data (Redshift Spectrum costs $5 per TB of data scanned from S3, but hey… screw it, SELECT *, it’s a business expense, right? Right?). Most commonly, large transform or extract jobs are a series of temporary tables; there is nothing ~necessarily~ wrong with this. The issues come about when large chunks of unfiltered data are passed through this series of temporary tables, being read and written over and over again.

Issue 1: Create a temp table just to SELECT everything out of it

Uselessly writing the entire result set to disk just to read it all back out. Just run the final SELECT; there is no need for the temp table. The extra statement also presents an additional opportunity to get stuck in the queue behind other workloads.

----------------------------------------------
-- NOT GREAT, I SEE THIS WAY TOO OFTEN
----------------------------------------------
CREATE TEMP TABLE FINAL AS (
    SELECT EVERYTHING
    FROM SOME_SCHEMA.BIG_TABLE
);

SELECT *
FROM FINAL;

--------------------
-- BETTER
--------------------
SELECT EVERYTHING
FROM SOME_SCHEMA.BIG_TABLE;
Issue 2: Unnecessary data and losing optimizations

Redshift is a columnar database. Each column is stored separately, meaning if you don’t need a column Redshift can skip scanning its data entirely. Yes, if you’re using Spectrum for your OLAP queries and your data is sitting in S3 as a bunch of .csv files, Redshift will have to scan the whole file... but it’s probably time for Parquet? In the SQL below:

  1. Redshift is forced to read and store columns BT.COL4, BT.COL5, BT.COL6, and BT.COL7 in TEMP_1 even though they are never referenced again in the rest of the query.

  2. The query optimizer can’t see further down the query that all that was wanted were some fairly specific predicates; it could have skipped many blocks of data. Each of the SELECT statements is on its own island, in a way. Redshift stores data in 1MB blocks, with additional metadata about these blocks. This metadata (a.k.a. zone maps) is what allows the DBMS to avoid scanning irrelevant blocks.

    CREATE TEMP TABLE TEMP_1 AS (
        SELECT BT.COL1
             , BT.COL2
             , BT.COL3
             , BT.COL4
             , BT.COL5
             , BT.COL6
             , BT.COL7
        FROM THISISA.BIGTABLE BT
    );

    -----------------------------------------------------------------------------
    -- Now that we’ve read everything and written it to disk, let’s read it back out
    -----------------------------------------------------------------------------
    CREATE TEMP TABLE TEMP_2 AS (
        SELECT BT.COL1
             , BT.COL2
             , BT.COL3
             , ST.COL9
        FROM TEMP_1 BT
        INNER JOIN TINY.SMALLTABLE ST
                ON BT.COL1 = ST.COL1
               AND BT.COL2 = ST.COL2
    );

    ------------------------------------------------------------------------------------------
    -- FINAL SELECT, let’s read everything back out, after creating TEMP_2 for… no reason?
    ------------------------------------------------------------------------------------------
    SELECT *
    FROM TEMP_2
    WHERE COL1 = 'I Should Have'
      AND COL2 = 'Applied these predicates'
      AND COL3 = 'Earlier';
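
For contrast, here is a minimal sketch of the same output as a single statement, using the same placeholder tables and columns, so the optimizer sees the narrow column list and the predicates up front:

    -- One statement: Redshift can prune unused columns and skip blocks via zone maps
    SELECT BT.COL1
         , BT.COL2
         , BT.COL3
         , ST.COL9
    FROM THISISA.BIGTABLE BT
    INNER JOIN TINY.SMALLTABLE ST
            ON BT.COL1 = ST.COL1
           AND BT.COL2 = ST.COL2
    WHERE BT.COL1 = 'I Should Have'
      AND BT.COL2 = 'Applied these predicates'
      AND BT.COL3 = 'Earlier';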

SQL is a declarative language: you indicate to the DBMS what you want, and the DBMS goes off and determines the best way to retrieve it for you. The series of TEMP tables above interrupts this process and confines the DBMS to rudimentary optimizations without the broader context of the full query.

These optimizations can be implemented manually by the author of the query, but would otherwise be done automatically by the DBMS. It’s like paying your chauffeur a bunch of money to drive you to the airport, but, instead, you wrestle him for the keys, tie him up in the back seat of the car, turn on Maroon 5 (you love Maroon 5), and drive to the airport yourself. Why do that?

Issue 3: Sort Keys, Multiple Sort Keys???

Something that seems to be fairly common is defining a SORTKEY within the CTAS syntax on a column that should already have been filtered on during the creation of the temp table in the first place. This is essentially the same point as above about filtering after the temp table is created. Defining multiple sort keys at this point just makes things worse.

The kicker is that Redshift will attempt to determine the optimal SORTKEY and DISTKEY for any temp table based on the query plan. Manually specifying them in the temp table only to filter the data out later is just a bit painful to see and flies in the face of “aggregate early and often”. More info can be found in the Advanced Table Design Playbook. This is probably a good time to mention that it’s important to make sure table statistics are up to date.
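
As a quick sketch of that last point (the table name is just the placeholder from the earlier examples), statistics can be refreshed explicitly:

    -- Refresh optimizer statistics on a frequently queried table
    ANALYZE SOME_SCHEMA.BIG_TABLE;

    -- Or limit the work to columns actually used in filters and joins
    ANALYZE SOME_SCHEMA.BIG_TABLE PREDICATE COLUMNS;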

--------------------------------------------------------------------------------------
-- THE ADDITIONAL SORTING MAKES THE WRITE MORE EXPENSIVE WITH NO DOWNSTREAM BENEFIT
-- WE'VE DISTRIBUTED THE DATA ACROSS THE NODES IN THE CLUSTER USING COL4
--------------------------------------------------------------------------------------
CREATE TEMP TABLE TEMP_1
SORTKEY(COL1, COL2, COL3)
DISTKEY(COL4)
AS (
    SELECT BT.COL1
         , BT.COL2
         , BT.COL3
         , BT.COL4
         , BT.COL5
         , BT.COL6
         , BT.COL7
         , …
    FROM THIS_IS_A.BIG_TABLE BT
);

SELECT BTA.*
FROM TEMP_1 BTA
INNER JOIN ANOTHER_TABLE ATA
        ON BTA.COL4 = ATA.COL4 -- AT LEAST WE COLLOCATED THE JOIN COL...
WHERE BTA.COL1 = 'Yikes'       -- SHOULD HAVE FILTERED ABOVE
;

In regard to distributing on COL4, someone may think that distributing the temp table on the join column collocates the downstream join and saves a redistribution later. There is certainly some thought going into this argument, but there is no free lunch: redistributing the rows by COL4 while writing the temp table is itself a shuffle across the cluster, so the cost doesn’t disappear, it just moves earlier, and the late filter and extra sort are still paid for.
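
For what it’s worth, one hedged rewrite of the Issue 3 example (same placeholder names as above): filter and trim columns while creating the temp table and let Redshift pick the keys from the query plan. Whether keeping DISTKEY(COL4) is worth it depends entirely on the data and how dominant that join is.

    CREATE TEMP TABLE TEMP_1 AS (
        SELECT BT.COL1
             , BT.COL4
        FROM THIS_IS_A.BIG_TABLE BT
        WHERE BT.COL1 = 'Yikes'   -- filter during the write, not after it
    );

    SELECT BTA.*
    FROM TEMP_1 BTA
    INNER JOIN ANOTHER_TABLE ATA
            ON BTA.COL4 = ATA.COL4;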

When are temp tables a good choice?
  1. If the temporary table is going to be read from numerous times, it may be best to filter and aggregate once and read several times from the smaller result (see the sketch after this list). This was especially handy back when Redshift lacked ROLLUP, CUBE, and GROUPING SETS

  2. Using temp tables to chunk up pieces of a query can be very handy in development, as they are persisted for the duration of the session; things like tests, logical checks, and quick aggregations can be run against them

  3. Using a series of CTEs within a temp table can be a happy medium, doing more work and writing out less often

  4. Honestly, I’d take 100 temp tables over having to parse through some nested sub query mess

    SELECT FROM (
        SELECT FROM (
            SELECT FROM (
                SELECT FROM
                WHERE (COL1 IN (SELECT FROM))
            )
        )
    )
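
To illustrate point 1, a minimal sketch with made-up table and column names: aggregate once into the temp table, then run several cheap reads against the small result instead of rescanning the big table.

    -- Hypothetical names: SOME_SCHEMA.ORDERS, ORDER_DATE, REGION, AMOUNT
    CREATE TEMP TABLE DAILY_SALES AS (
        SELECT ORDER_DATE
             , REGION
             , SUM(AMOUNT) AS TOTAL_AMOUNT
        FROM SOME_SCHEMA.ORDERS
        WHERE ORDER_DATE >= '2023-01-01'   -- filter once, up front
        GROUP BY ORDER_DATE, REGION
    );

    -- Multiple cheap reads against the small aggregate
    SELECT REGION, SUM(TOTAL_AMOUNT) FROM DAILY_SALES GROUP BY REGION;
    SELECT ORDER_DATE, SUM(TOTAL_AMOUNT) FROM DAILY_SALES GROUP BY ORDER_DATE;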

