ETL to QE, Update 5, Optimizations
Date: 2023-10-06
Code Review and Optimizations
Today I showed my ETL pipeline to a friend of mine and got their opinion on why it was slow. Together we tried four separate optimizations of the Python psycopg2 insert script, three of which sped up inserting data into the database. To test the optimizations we ran the insert script on the same machine against a reset database and recorded the number of rows in raw_messages_t at 60 and 120 seconds.
Control Test
discordtest=# select count(*) from raw_messages_t;
count
-------
68998
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
125042
(1 row)
Commit less often
discordtest=# select count(*) from raw_messages_t;
count
--------
193886
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
328204
(1 row)
193886 / 68998 = 2.81 times faster at 1 minute compared to control
328204 / 125042 = 2.62 times faster at 2 minutes compared to control
The previous code was committing almost every record independently. That makes sense when developing but not when actually processing millions of records. The key to committing thousands of records at a time when the tables have constraints is SQL's ON CONFLICT syntax, so one conflicting row does not abort the entire batch and crash your program.
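Roughly, the change has the shape of the sketch below. The table name raw_messages_t matches the output above, but the column names, connection string, sample records, and batch size are illustrative stand-ins rather than my actual schema.

import json
import psycopg2

# Stand-in for the parsed export data the real script iterates over.
records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

BATCH_SIZE = 1000
pending = 0
for record in records:
    cur.execute(
        """
        INSERT INTO raw_messages_t (message_id, payload)
        VALUES (%s, %s)
        ON CONFLICT (message_id) DO NOTHING
        """,
        (record["id"], json.dumps(record)),
    )
    pending += 1
    if pending >= BATCH_SIZE:
        conn.commit()  # commit every BATCH_SIZE rows instead of every row
        pending = 0
conn.commit()  # flush the final partial batch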
executemany
discordtest=# select count(*) from raw_messages_t;
count
--------
193886
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
405587
(1 row)
193886 / 68998 = 2.81 times faster at 1 minute compared to control
405587 / 125042 = 3.24 times faster at 2 minutes compared to control
The values for the first minute are identical because the script writes to a series of tables, not just raw_messages_t. Here we can see a slight improvement using psycopg2's executemany compared to calling execute in a loop. The gain is modest, which matches how psycopg2 implements executemany: it still sends the INSERT statements to the server one at a time, so it mostly saves a bit of per-call Python overhead rather than any round trips.
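A sketch of the executemany variant, reusing the same illustrative schema and sample records as the sketch above:

import json
import psycopg2

records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# One call with a list of parameter tuples instead of a Python loop of execute calls.
cur.executemany(
    """
    INSERT INTO raw_messages_t (message_id, payload)
    VALUES (%s, %s)
    ON CONFLICT (message_id) DO NOTHING
    """,
    [(r["id"], json.dumps(r)) for r in records],
)
conn.commit()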
execute_batch
discordtest=# select count(*) from raw_messages_t;
count
--------
485593
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
987465
(1 row)
485593 / 68998 = 7.03 times faster at 1 minute compared to control
987465 / 125042 = 7.89 times faster at 2 minutes compared to control
execute_batch is where the serious speedup occurred. Instead of issuing each INSERT as its own round trip, execute_batch from psycopg2.extras joins many parameterized statements together and sends them to the server in far fewer round trips, cutting out most of the network and per-statement overhead.
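On the same illustrative schema, the change is essentially a swap to psycopg2.extras.execute_batch; the page_size of 1000 here is for illustration (the library default is 100).

import json
import psycopg2
from psycopg2.extras import execute_batch

records = [{"id": 1, "content": "hello"}, {"id": 2, "content": "world"}]

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# execute_batch packs up to page_size INSERTs into each round trip to the server.
execute_batch(
    cur,
    """
    INSERT INTO raw_messages_t (message_id, payload)
    VALUES (%s, %s)
    ON CONFLICT (message_id) DO NOTHING
    """,
    [(r["id"], json.dumps(r)) for r in records],
    page_size=1000,
)
conn.commit()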
JSONB rather than JSON
JSONB Test
discordtest=# select count(*) from raw_messages_t;
count
--------
405587
(1 row)
discordtest=# select count(*) from raw_messages_t;
count
--------
905420
(1 row)
405587 / 485593 * 100 = 83% of the rows inserted at one minute compared to the JSON type
905420 / 987465 * 100 = 91% of the rows inserted at two minutes compared to the JSON type
This test was done as an optimization on top of the execute_batch test. Here we notice there is actually reduced insert performance compared to plain JSON columns. This makes sense: JSONB is faster to query precisely because the value is parsed and decomposed into a binary representation at insert time, whereas a JSON column stores the text verbatim, so each JSONB insert does more work per row.
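For clarity, the only schema difference between the two runs is the column type of the raw payload. The table and column names below are illustrative, not my exact DDL.

import psycopg2

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

# json stores the incoming text verbatim, so inserts are cheap.
cur.execute("CREATE TABLE IF NOT EXISTS raw_messages_json_t (message_id bigint PRIMARY KEY, payload json)")

# jsonb parses and decomposes the value into a binary format on insert,
# which costs more up front but makes later querying faster.
cur.execute("CREATE TABLE IF NOT EXISTS raw_messages_jsonb_t (message_id bigint PRIMARY KEY, payload jsonb)")

conn.commit()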
Notes for future testing
I am aware that this testing is not as rigorous as it probably should be. I should be using a script to run the select statements at exactly the same offsets rather than a stopwatch (something like the sketch at the end of these notes). I should document which files I am inserting rather than just running the script. I should document how many files were actually inserted into the database. I could insert into only one table to isolate the raw performance of each change. I could generate synthetic data with a specific seed so the data and the tests are reproducible by other people. These are all thoughts I will keep in mind for future testing. The moral of the story is: commit as infrequently as possible, and using execute_batch makes a meaningful difference.
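The sort of timing script I have in mind is sketched below: it samples the row count at fixed offsets from a shared start time instead of me watching a stopwatch. The table name matches the tests above; the offsets and connection string are just illustrative.

import time
import psycopg2

SAMPLE_POINTS = [60, 120]  # seconds after the insert script is started

conn = psycopg2.connect("dbname=discordtest")
cur = conn.cursor()

start = time.monotonic()
for offset in SAMPLE_POINTS:
    # Sleep until the next sample point, then count the rows so far.
    time.sleep(max(0, offset - (time.monotonic() - start)))
    cur.execute("SELECT count(*) FROM raw_messages_t")
    print(f"{offset}s: {cur.fetchone()[0]} rows")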