
ETL to QE, Update 1, SQLite to Postgres

Date: 2023-10-02

See Discord Binding for project context

Project Context TL;DR: I am trying to take the output of DiscordChatExporter, which exports data from Discord guilds to JSON files, and put it into a relational database.

Removing SQLite from ETL Pipeline

Today I practiced running my ETL pipeline from end to end. During this process I discovered that going from JSON files to SQLite to Postgres is not ideal. The reasons for converting to SQLite rather than directly to Postgres were the following:

  1. It is easier to load a SQLite file inside Jupyter notebooks than to host and connect to a Postgres database
  2. One can load only the particular Discord guilds they want analytics on into Postgres or any of a list of other data engineering tools
  3. It is much easier to share a SQLite file with someone than to share a dump from Postgres

The extra step in the procedure was making things much more complicated than they needed to be. For example, here is a list of problems I had:

  1. Conversion of type timestamp from SQLite to Postgres did not work. The unix time integer in the SQLite database could not be cast directly to type timestamp in Postgres without causing an error. The solution I came up with was to keep a raw_unix_timestamp column of type Integer, create an additional unix_timestamp column of type timestamp in the same table, then cast the unix time from SQLite into it; a sketch of the cast follows this list.
  2. Going from JSON tables to native SQL tables was a bad idea. My Python script reformatted all the JSON files from DiscordChatExporter into a series of Python dictionaries that were easy to work with. I thought it would be simple to just dump those dictionaries into JSON tables inside my database of choice and then perform the transformations in native SQL. The problem is that these transformations now need to be written for both SQLite SQL and Postgres SQL, because the two databases have different JSON functions (the second sketch after this list shows the split). This also added the complexity of having to decide which database performs the transformation. It would be better to go straight to native SQL tables rather than using JSON SQL tables.
  3. Each Discord guild had a separate SQLite database, so if one wanted to do analysis on two of them at the same time they had to either write complex connecting code or merge the SQLite databases together. Merging the SQLite databases is possible, though it is a hassle that adds additional complexity. I realized I can just put all the data into Postgres and output whatever data I want to SQLite if I really need to.
  4. I learned about DuckDB as an alternative to SQLite. DuckDB has a built-in Postgres connector, so I expect creating a portable dataset from Postgres in DuckDB will be much easier than with SQLite; the last sketch after this list shows what that could look like. DuckDB can even read Parquet files, a file format I should probably be using and will be coming back to....
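
To make item 1 concrete, here is a minimal sketch of the cast, assuming a psycopg2 connection and a hypothetical messages table whose raw_unix_timestamp column holds unix seconds:

    import psycopg2

    conn = psycopg2.connect("dbname=discord")  # connection details are an assumption

    with conn, conn.cursor() as cur:
        # Add the timestamp column next to the raw integer column.
        cur.execute("ALTER TABLE messages ADD COLUMN IF NOT EXISTS unix_timestamp timestamptz;")
        # to_timestamp() interprets the integer as unix seconds; a plain ::timestamp cast errors out.
        cur.execute("UPDATE messages SET unix_timestamp = to_timestamp(raw_unix_timestamp);")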
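
To illustrate the dialect split in item 2, here is the same field extraction written for both engines, assuming a hypothetical raw_json table with a single data column holding one exported message object per row:

    # SQLite, using the JSON1 functions
    sqlite_query = "SELECT json_extract(data, '$.author.name') FROM raw_json;"

    # Postgres, using the -> / ->> operators on json/jsonb
    postgres_query = "SELECT data -> 'author' ->> 'name' FROM raw_json;"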
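
And for item 4, a rough sketch of what the DuckDB route could look like, using DuckDB's postgres extension and its Parquet writer; the connection string, table names, and exact extension syntax are assumptions on my part, not something I have run yet:

    import duckdb

    con = duckdb.connect("discord_export.duckdb")
    con.execute("INSTALL postgres;")
    con.execute("LOAD postgres;")
    # Attach the live Postgres database and copy one table into the portable DuckDB file.
    con.execute("ATTACH 'dbname=discord host=localhost' AS pg (TYPE postgres);")
    con.execute("CREATE TABLE messages AS SELECT * FROM pg.public.messages;")
    # The same data can also go straight to Parquet.
    con.execute("COPY messages TO 'messages.parquet' (FORMAT parquet);")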

Experience Running the ETL Pipeline on an M1 MacBook

On my desktop I have about 10,000,000 messages loaded into my Postgres database. I took the same raw JSON files and ran the same script to load the messages into a Postgres database on my M1 MacBook. After about an hour and a half I had only loaded about 2.5 million messages, so I canceled the rest of the data transformation because I had enough data in the database to perform test analytics queries. This was taking a lot longer than I expected; here are some thoughts I had:

  1. Loading into SQLite on disk took much longer than loading into Postgres
  2. SQLite running in memory and saving to disk afterwards was ridiculously fast, taking about an hour to load everything; of course, each guild having its own database rather than everything sharing a single database would affect this timing (the first sketch after this list shows the in-memory approach)
  3. Migrating the data from SQLite to Postgres was faster than loading in the raw JSON data. I expect this is because of a couple of different factors:
    1. The data going from SQLite into Postgres is native SQL, not JSON SQL tables
    2. The batches committed to Postgres in the SQLite-to-Postgres migration were larger, allowing Postgres to optimize its inserts (see the batching sketch after this list)
    3. Postgres is writing everything to disk; maybe there are settings I can set, like giving it more RAM, that will speed up this process
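
A minimal sketch of the in-memory approach from item 2, assuming the per-guild JSON-to-SQLite transformation lives in a hypothetical load_guild() function:

    import sqlite3

    def export_guild(guild_db_path: str) -> None:
        # Build the database entirely in memory, then persist it to disk in one pass.
        mem = sqlite3.connect(":memory:")
        load_guild(mem)  # hypothetical: runs the JSON -> SQLite transformation
        disk = sqlite3.connect(guild_db_path)
        mem.backup(disk)  # copy the in-memory database into the on-disk file
        disk.close()
        mem.close()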
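
Following up on the last two points, a sketch of how I might batch the inserts and relax one session setting during the bulk load; the column names, placeholder row, and batch size are assumptions, and memory settings like shared_buffers live in postgresql.conf rather than in the script:

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("dbname=discord")  # connection details are an assumption

    with conn, conn.cursor() as cur:
        # Trade some durability for faster commits during the bulk load only.
        cur.execute("SET synchronous_commit = off;")
        # execute_values sends many rows per INSERT statement instead of one row at a time.
        rows = [(1, "someone", "hello", 1696204800)]  # placeholder batch
        execute_values(
            cur,
            "INSERT INTO messages (id, author, content, raw_unix_timestamp) VALUES %s",
            rows,
            page_size=10_000,
        )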