ETL to QE, Update 5, Optimizations
Date: 2023-10-06
Code Review and Optimizations
Today I showed my ETL pipeline to a friend of mine and got their opinion on why it was slow. Together we tried four separate optimizations of the Python psycopg2 insert script, three of which sped up inserting data into the database. To test the optimizations we ran the insert script on the same machine against a reset database and recorded the number of rows in raw_messages_t at 60 and 120 seconds.
Control Test
discordtest=# select count(*) from raw_messages_t;
count
-------
68998
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
125042
(1 row)
Commit less often
discordtest=# select count(*) from raw_messages_t;
count
--------
193886
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
328204
(1 row)
193886 / 68998 = 2.81 times faster at 1 minute compared to control
328204 / 125042 = 2.62 times faster at 2 minutes compared to control
The previous code was committing almost every record independently. That makes sense when developing but not when actually processing millions of records. The key to committing thousands of records at a time when the tables have constraints is SQL's ON CONFLICT syntax, so one conflicting row does not abort the entire batch and crash your program.
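Roughly, the change has the shape of the sketch below. The table name raw_messages_t matches the output above, but the column names, connection string, sample records, and batch size are illustrative stand-ins rather than my actual schema.

import json
import psycopg2

# Stand-in for the parsed export data the real script iterates over.
records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

BATCH_SIZE = 1000
pending = 0
for record in records:
    cur.execute(
        """
        INSERT INTO raw_messages_t (message_id, payload)
        VALUES (%s, %s)
        ON CONFLICT (message_id) DO NOTHING
        """,
        (record["id"], json.dumps(record)),
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # commit every BATCH_SIZE rows instead of every row
        pending = 0
conn.commit()  # flush the final partial batch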
executemany
discordtest=# select count(*) from raw_messages_t;
count
--------
193886
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
405587
(1 row)
193886 / 68998 = 2.81 times faster at 1 minute compared to control
405587 / 125042 = 3.24 times faster at 2 minutes compared to control
The values for the first minute are identical because the script writes to a series of tables, not just raw_messages_t. Here we can see a slight improvement using psycopg2's executemany compared to calling execute in a loop. The gain is modest, which matches how psycopg2 implements executemany: it still sends the INSERT statements to the server one at a time, so it mostly saves a bit of per-call Python overhead rather than any round trips.
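A sketch of the executemany variant, reusing the same illustrative schema and sample records as the sketch above:

import json
import psycopg2

records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# One call with a list of parameter tuples instead of a Python loop of execute calls.
cur.executemany(
    """
    INSERT INTO raw_messages_t (message_id, payload)
    VALUES (%s, %s)
    ON CONFLICT (message_id) DO NOTHING
    """,
    [(r["id"], json.dumps(r)) for r in records],
)
conn.commit()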
execute_batch
discordtest=# select count(*) from raw_messages_t;
count
--------
485593
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
987465
(1 row)
485593 / 68998 = 7.03 times faster at 1 minute compared to control
987465 / 125042 = 7.89 times faster at 2 minutes compared to control
execute_batch is where the serious speedup occurred. Instead of issuing each INSERT as its own round trip, execute_batch from psycopg2.extras joins many parameterized statements together and sends them to the server in far fewer round trips, cutting out most of the network and per-statement overhead.
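On the same illustrative schema, the change is essentially a swap to psycopg2.extras.execute_batch; the page_size of 1000 here is for illustration (the library default is 100).

import json
import psycopg2
from psycopg2.extras import execute_batch

records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# execute_batch packs up to page_size INSERTs into each round trip to the server.
execute_batch(
    cur,
    """
    INSERT INTO raw_messages_t (message_id, payload)
    VALUES (%s, %s)
    ON CONFLICT (message_id) DO NOTHING
    """,
    [(r["id"], json.dumps(r)) for r in records],
    page_size=1000,
)
conn.commit()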
JSONB rather than JSON
JSONB Test
discordtest=# select count(*) from raw_messages_t;
count
--------
405587
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
905420
(1 row)
405587 / 485593 * 100 = 83% of the rows inserted at one minute compared to the JSON type
905420 / 987465 * 100 = 91% of the rows inserted at two minutes compared to the JSON type
This test was done as an optimization on top of the execute_batch test. Here we notice there is actually reduced insert performance compared to plain JSON columns. This makes sense: JSONB is faster to query precisely because the value is parsed and decomposed into a binary representation at insert time, whereas a JSON column stores the text verbatim, so each JSONB insert does more work per row.
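For clarity, the only schema difference between the two runs is the column type of the raw payload. The table and column names below are illustrative, not my exact DDL.

import psycopg2

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# json stores the incoming text verbatim, so inserts are cheap.
cur.execute("CREATE TABLE IF NOT EXISTS raw_messages_json_t (message_id bigint PRIMARY KEY, payload json)")

# jsonb parses and decomposes the value into a binary format on insert,
# which costs more up front but makes later querying faster.
cur.execute("CREATE TABLE IF NOT EXISTS raw_messages_jsonb_t (message_id bigint PRIMARY KEY, payload jsonb)")

conn.commit()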
Notes for future testing
I am aware that this testing is not as rigorous as it probably should be. I should be using a script to run the select statements at exactly the same offsets rather than a stopwatch (something like the sketch at the end of these notes). I should document which files I am inserting rather than just running the script. I should document how many files were actually inserted into the database. I could insert into only one table to isolate the raw performance of each change. I could generate synthetic data with a specific seed so the data and the tests are reproducible by other people. These are all thoughts I will keep in mind for future testing. The moral of the story is: commit as infrequently as possible, and using execute_batch makes a meaningful difference.
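The sort of timing script I have in mind is sketched below: it samples the row count at fixed offsets from a shared start time instead of me watching a stopwatch. The table name matches the tests above; the offsets and connection string are just illustrative.

import time
import psycopg2

SAMPLE_POINTS = [60, 120]  # seconds after the insert script is started

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

start = time.monotonic()
for offset in SAMPLE_POINTS:
    # Sleep until the next sample point, then count the rows so far.
    time.sleep(max(0, offset - (time.monotonic() - start)))
    cur.execute("SELECT count(*) FROM raw_messages_t")
    print(f"{offset}s: {cur.fetchone()[0]} rows")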