Let's Learn Web Scraping

Purpose

The goal is to write a web scraping tutorial using dentropydaemon-wiki/Software/List/Puppeteer to scrape dentropydaemon-wiki/Media/List/Fandom.com in its raw form for later processing in ETL pipelines.

Requirements

The code

dentropy/dentropys-web-scraping-tutorial

First Task

Second Task

  • Extract all URLs from within a web page (see the requests + Beautiful Soup sketch after this list)
    • We need to make a decision on how we want to scrape our data and commit to it
      • requests is better documented on the internet, plus why complicate things
    • How do we want to store this data?
      • All these tutorials online just print out the URLs or store them as a list
    • What is your purpose for scraping?
      • I want an index of everything on fandom.com so I can play with it in ETL pipelines
    • Then you need to decide on where you want to store your data
      • I like the idea of doing it all in sqlite, maybe using the html extension
        • The sqlite-html extension will only make things more complicated; we are not worried about optimizing anything right now
      • What information do we want to extract from HTML web pages?
      • Now we are designing a schema
      • What if people do not know SQL?
      • JSON files give me a headache, plus sqlite has JSON and CSV exports out of the box
      • Alright, we can do sqlite, no problem
      • What is the base primitive?
      • Random Primary Key, URL, DOMAIN, URL_PARSE_JSON, DATE_SCRAPED, HTML
      • What other tables would we need?
      • Well, we want to generate graphs based on Fandoms and wikis, so those tables will likely have to be specific
      • We are going to have to write a sqlite tutorial in here
      • Yes we will
    • Do we want to store raw HTML files with JSON metadata, as well as sqlite?
      • YES
    • Let's do raw HTML files then (a file-plus-JSON-sidecar sketch follows after this list)
    • Is there room to use dentropydaemon-wiki/Software/List/Hamilton here for an ETL pipeline?
      • Not yet, only after we have completed scraping
    • What fandom page do we want to use as an example?
    • We have successfully scraped the URLs and stored them as a file; how do we want to deal with this transformation within sqlite?
    • Do we create a separate table of URLs and then have a table of scraped URLs?
    • Yes, that would make sense
    • So do I just dump each URL into the URLs table if it does not exist? (see the INSERT OR IGNORE sketch after this list)
    • We just hit a classic SQL problem: I want to link all the URLs from the scraped page to see how they interconnect, but that is going to require a table in the middle to manage the relationship, so we would have URLS_T, SCRAPED_URLS_T, and PAGE_LINK_T
    • PAGE_LINK_T can have root_url, linked_url, internal (boolean)
    • Yes that would work
    • This looks more like an ETL pipeline now
    • Do we want to go all out and develop recursive scraping and all that?
    • NO NO NO, this is supposed to be a tutorial; we want this as simple as possible
    • Ah okay, so just URLs and SCRAPED_URLS
    • That works for me; then we can do the ETL stuff
    • What was that other project that required a sqlite schema?
    • Keybase Binding
    • Alright, where is our schema?
    • Here (a sqlite3 version of this schema is sketched after this list)
      • SCRAPED_URLS_T
        • SCRAPED_URL_ID
        • URL_ID
        • DATE_SCRAPED
        • HTML
      • URLS_T
        • URL_ID
        • RAW_URL
        • DOMAIN
        • URL_PARSE_JSON
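
A minimal sqlite3 sketch of the schema above. The column types, the autoincrement keys (instead of the random primary key mentioned earlier), and the UNIQUE constraint on RAW_URL are assumptions; RAW_URL is UNIQUE so the dedupe sketch further down can use INSERT OR IGNORE.

```python
import sqlite3

# Column types are assumptions; the schema above only names the columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS URLS_T (
    URL_ID          INTEGER PRIMARY KEY AUTOINCREMENT,
    RAW_URL         TEXT NOT NULL UNIQUE,  -- UNIQUE so INSERT OR IGNORE can deduplicate
    DOMAIN          TEXT,
    URL_PARSE_JSON  TEXT
);

CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T (
    SCRAPED_URL_ID  INTEGER PRIMARY KEY AUTOINCREMENT,
    URL_ID          INTEGER NOT NULL REFERENCES URLS_T (URL_ID),
    DATE_SCRAPED    TEXT,
    HTML            TEXT
);
"""


def init_db(db_path: str = "scraper.db") -> sqlite3.Connection:
    """Create the database file if needed and both tables."""
    connection = sqlite3.connect(db_path)
    connection.executescript(SCHEMA)
    return connection
```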
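
For the extract-all-URLs task, a minimal sketch using requests (as decided above) and Beautiful Soup (as discussed in the logs below). The start URL is just a hypothetical example page, and the returned dict mirrors the URL / DOMAIN / URL_PARSE_JSON / DATE_SCRAPED / HTML primitive from the discussion.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical example page; any fandom.com wiki page works the same way.
START_URL = "https://harrypotter.fandom.com/wiki/Main_Page"


def extract_urls(page_url: str) -> dict:
    """Fetch a page and return its raw HTML plus every URL linked from it."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    urls = []
    for anchor in soup.find_all("a", href=True):
        raw_url = urljoin(page_url, anchor["href"])  # resolve relative links
        parsed = urlparse(raw_url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        urls.append({
            "raw_url": raw_url,
            "domain": parsed.netloc,
            "url_parse_json": json.dumps(parsed._asdict()),
        })

    return {
        "url": page_url,
        "date_scraped": datetime.now(timezone.utc).isoformat(),
        "html": response.text,
        "linked_urls": urls,
    }


if __name__ == "__main__":
    page = extract_urls(START_URL)
    print(f"Found {len(page['linked_urls'])} links on {page['url']}")
```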
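
To dump each URL into URLS_T only if it does not already exist, INSERT OR IGNORE against the UNIQUE RAW_URL column handles the deduplication. This sketch assumes the page dict shape and the init_db() connection from the two sketches above.

```python
import json
import sqlite3
from urllib.parse import urlparse


def upsert_url(connection: sqlite3.Connection, record: dict) -> int:
    """Insert a URL into URLS_T if it is not already there and return its URL_ID."""
    connection.execute(
        "INSERT OR IGNORE INTO URLS_T (RAW_URL, DOMAIN, URL_PARSE_JSON) VALUES (?, ?, ?)",
        (record["raw_url"], record["domain"], record["url_parse_json"]),
    )
    row = connection.execute(
        "SELECT URL_ID FROM URLS_T WHERE RAW_URL = ?", (record["raw_url"],)
    ).fetchone()
    return row[0]


def store_scrape(connection: sqlite3.Connection, page: dict) -> None:
    """Record one scrape in SCRAPED_URLS_T and every linked URL in URLS_T."""
    parsed = urlparse(page["url"])
    root_id = upsert_url(connection, {
        "raw_url": page["url"],
        "domain": parsed.netloc,
        "url_parse_json": json.dumps(parsed._asdict()),
    })
    connection.execute(
        "INSERT INTO SCRAPED_URLS_T (URL_ID, DATE_SCRAPED, HTML) VALUES (?, ?, ?)",
        (root_id, page["date_scraped"], page["html"]),
    )
    for record in page["linked_urls"]:
        upsert_url(connection, record)
    connection.commit()
```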
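
And for the raw-HTML-plus-JSON-metadata decision, a sketch that writes each page to disk next to a JSON sidecar. The output directory and the sha256 filename scheme are assumptions, nothing settled above.

```python
import hashlib
import json
from pathlib import Path


def save_raw_html(page: dict, out_dir: str = "raw_pages") -> Path:
    """Write the raw HTML to disk with a JSON metadata sidecar next to it."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Hash the URL so the filename is filesystem-safe and stable across runs.
    name = hashlib.sha256(page["url"].encode("utf-8")).hexdigest()

    html_path = directory / f"{name}.html"
    html_path.write_text(page["html"], encoding="utf-8")

    metadata = {
        "url": page["url"],
        "date_scraped": page["date_scraped"],
        "linked_urls": [u["raw_url"] for u in page["linked_urls"]],
    }
    (directory / f"{name}.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return html_path
```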

Next Steps

  • Recursively scrape web pages
  • Track how the web pages are linked to one another
  • How do I track the kind of links? (see the PAGE_LINK_T sketch below)
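
For the link-tracking next steps, a sketch of the PAGE_LINK_T idea that was deferred above (root_url, linked_url, internal), where internal just means the linked URL shares the root page's domain. The column types and the page dict shape are carried over from the earlier sketches, so treat them as assumptions.

```python
import sqlite3
from urllib.parse import urlparse

PAGE_LINK_SCHEMA = """
CREATE TABLE IF NOT EXISTS PAGE_LINK_T (
    ROOT_URL    TEXT NOT NULL,
    LINKED_URL  TEXT NOT NULL,
    INTERNAL    INTEGER NOT NULL CHECK (INTERNAL IN (0, 1)),
    PRIMARY KEY (ROOT_URL, LINKED_URL)
);
"""


def record_links(connection: sqlite3.Connection, page: dict) -> None:
    """Store one row per link on the page, flagging same-domain links as internal."""
    connection.executescript(PAGE_LINK_SCHEMA)
    root_domain = urlparse(page["url"]).netloc
    for record in page["linked_urls"]:
        internal = 1 if record["domain"] == root_domain else 0
        connection.execute(
            "INSERT OR IGNORE INTO PAGE_LINK_T (ROOT_URL, LINKED_URL, INTERNAL) VALUES (?, ?, ?)",
            (page["url"], record["raw_url"], internal),
        )
    connection.commit()
```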

Additional Tasks

Research

Logs

  • 2023-03-17T22:13:19-04:00
    • Alright, how do I want to do this web scraping?
    • I just want to give it a wiki page and then index EVERYTHING
    • Where do we want to store EVERYTHING?
    • Why not just do sqlite? It is idiot-proof, right?
    • Well, if dentropydaemon-wiki/Software/List/Trilium Notes is going to store entire PDF files and images in there, what is the worst I can do? Plus SQLite has other extensions.
  • 2023-03-17T22:07:44-04:00
    • So are we using Beautiful Soup or Puppeteer?
    • Well, if we can learn Beautiful Soup in like an hour, why not
    • Alright
  • 2023-03-17T21:38:31-04:00