
ETL to QE, Update 14, Topic Modeling

Date: 2023-10-19

See Discord Binding for project context.

ETL to QE has been at a standstill the last couple of days as I try to prioritize and sort out its vision and roadmap. No code has been written since ETL to QE, Update 11, Posted Results on Discord, while I have been coming up with new queries to run, analyses to do, and features for the dashboard. I still have to answer simple questions like: Who is the dashboard for? What problem does it solve? What are the different component parts? Today I thought through what the Topic Modelling feature will look like.

Think Through Topic Modeling

Topic Modelling can take in thousands of Discord messages and produce small lists of words that represent topics extracted from those messages. Getting something like BERTopic up and running, taking in Discord messages and outputting topics, is pretty easy and straightforward. The problem is what to do with the topics after they are generated, and how to do this en masse for the 26 million messages I have in my Postgres database.
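For reference, this is roughly what that part looks like with BERTopic. The messages list here is a hypothetical stand-in for message strings pulled out of Postgres; in practice the model wants a few hundred documents or more.

from bertopic import BERTopic

# Hypothetical stand-in for message content pulled from Postgres
messages = [
    "anyone get the docker build working?",
    "the build fails on windows for me",
    # ... thousands more messages ...
]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(messages)

# Each topic comes back as a small list of (word, weight) pairs
for topic_id in sorted(set(topics)):
    if topic_id == -1:
        continue  # -1 is BERTopic's outlier bucket
    print(topic_id, topic_model.get_topic(topic_id))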

My vision for Topic Modelling is to feed every Discord channel and user into multiple topic modelling tools, then develop an interface for labeling which topics, and which words within those topics, provide notable insights. The end goal is to have all of this inside a graph database so the topic data can be interrogated alongside human-labeled data in a much more composable format. For now I am going to write a SQL schema to get a handle on how I plan to lay out the data.

-- Use DBML to define your database structure
-- https://dbdiagram.io/d
-- Docs: https://dbml.dbdiagram.io/docs

Table dataset {
  id VARCHAR [primary key]
  guild_id VARCHAR
  channel_id VARCHAR
}

Table raw_topic {
  id VARCHAR [primary key]
  num_key_words integer
  dataset_id VARCHAR
}

Table topic_to_keyword {
  id integer [primary key]
  topic_id VARCHAR
  keyword VARCHAR
}

Table keywords {
  id varchar [primary key]
}

Ref: raw_topic.dataset_id > dataset.id -- many topics per dataset

Ref: topic_to_keyword.topic_id > raw_topic.id -- many keyword links per topic

Ref: topic_to_keyword.keyword > keywords.id -- many keyword links per keyword
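To sanity-check the diagram, here is a minimal sketch of what that schema could look like as actual Postgres tables, assuming psycopg2 as the driver. The column names come from the DBML above; the types, constraints, and the connection string are my own guesses.

import psycopg2  # assumption: psycopg2 is the Postgres driver used by the project

# Rough Postgres translation of the DBML diagram above
SCHEMA_DDL = """
CREATE TABLE IF NOT EXISTS dataset (
    id         VARCHAR PRIMARY KEY,
    guild_id   VARCHAR,
    channel_id VARCHAR
);

CREATE TABLE IF NOT EXISTS raw_topic (
    id            VARCHAR PRIMARY KEY,
    num_key_words INTEGER,
    dataset_id    VARCHAR REFERENCES dataset (id)
);

CREATE TABLE IF NOT EXISTS keywords (
    id VARCHAR PRIMARY KEY
);

CREATE TABLE IF NOT EXISTS topic_to_keyword (
    id       SERIAL PRIMARY KEY,
    topic_id VARCHAR REFERENCES raw_topic (id),
    keyword  VARCHAR REFERENCES keywords (id)
);
"""

def create_schema(conn) -> None:
    """Create the topic modelling tables if they do not already exist."""
    with conn.cursor() as cur:
        cur.execute(SCHEMA_DDL)
    conn.commit()

# usage (connection string is a placeholder):
# conn = psycopg2.connect("dbname=etl_to_qe")
# create_schema(conn)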

Now let's walk through the Topic Modelling ETL steps (a sketch of these steps in code follows the list):

  • Get a list of all channels
  • Get the first x messages from the channel, where x is the number of messages that can be easily fed through the topic modelling software
  • Take every topic generated and insert it into the database, making sure topics are unique
  • Link every topic to the channel it was generated from
  • Take every word from each topic and upsert it into the database
  • For each word, link it to both the channel and the topic it came from
  • Repeat the same steps for each user, with a minimum threshold of messages, say 100
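Here is a rough sketch of those steps in Python against the schema above. The messages table and its guild_id / channel_id / content columns are assumptions standing in for however the raw messages are actually stored; the sketch only covers the per-channel pass, with the per-user pass following the same shape.

import uuid

from bertopic import BERTopic

def run_channel_topic_etl(conn, batch_size: int = 10_000) -> None:
    """One pass of the ETL steps above, per channel.

    conn is an open psycopg2 connection; the messages table and its
    guild_id / channel_id / content columns are assumptions.
    """
    with conn.cursor() as cur:
        cur.execute("SELECT DISTINCT guild_id, channel_id FROM messages")
        channels = cur.fetchall()

    for guild_id, channel_id in channels:
        # First x messages from the channel, where x is whatever the topic
        # modelling software can comfortably handle in one go
        with conn.cursor() as cur:
            cur.execute(
                "SELECT content FROM messages WHERE channel_id = %s LIMIT %s",
                (channel_id, batch_size),
            )
            docs = [row[0] for row in cur.fetchall()]
        if len(docs) < 100:  # same minimum-threshold idea as the per-user pass
            continue

        topic_model = BERTopic()
        topics, _ = topic_model.fit_transform(docs)

        dataset_id = str(uuid.uuid4())
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO dataset (id, guild_id, channel_id) VALUES (%s, %s, %s)",
                (dataset_id, guild_id, channel_id),
            )
            for topic_id in set(topics) - {-1}:  # -1 is the outlier bucket
                keywords = topic_model.get_topic(topic_id)  # list of (word, weight)
                raw_topic_id = str(uuid.uuid4())
                cur.execute(
                    "INSERT INTO raw_topic (id, num_key_words, dataset_id) VALUES (%s, %s, %s)",
                    (raw_topic_id, len(keywords), dataset_id),
                )
                for word, _weight in keywords:
                    # Upsert the keyword, then link it to the topic (and,
                    # through the dataset row, to the channel it came from)
                    cur.execute(
                        "INSERT INTO keywords (id) VALUES (%s) ON CONFLICT DO NOTHING",
                        (word,),
                    )
                    cur.execute(
                        "INSERT INTO topic_to_keyword (topic_id, keyword) VALUES (%s, %s)",
                        (raw_topic_id, word),
                    )
        conn.commit()

# The per-user pass would look the same, swapping channel_id for the message
# author and keeping the minimum threshold of roughly 100 messages.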