Let's Learn Web Scraping

Purpose

The goal is to write a web scraping tutorial using dentropydaemon-wiki/Software/List/Puppeteer to scrape dentropydaemon-wiki/Media/List/Fandom.com in its raw form for later processing in ETL pipelines.

Requirements

The code

dentropy/dentropys-web-scraping-tutorial

First Task

Second Task

  • Extract all URLs from within a web page (see the requests + Beautiful Soup sketch after this list)
    • We need to make a decision on how we want to scrape our data and commit to it
      • requests is better documented on the internet, plus why complicate things
    • How do we want to store this data?
      • All these tutorials online just print out the URLs or store them as a list
    • What is your purpose for scraping?
      • I want an index of everything on fandom.com so I can play with it in ETL pipelines
    • Then you need to decide on where you want to store your data
      • I like the idea of doing it all in sqlite, maybe using the html extension
        • The sqlite-html extension will only make things more complicated; we are not worried about optimizing anything right now
      • What information do we want to extract from HTML web pages?
      • Now we are designing a schema
      • What if people do not know SQL?
      • JSON files give me a headache, plus sqlite has JSON and CSV exports out of the box
      • Alright, we can do sqlite, no problem
      • What is the base primitive?
      • Random Primary Key, URL, DOMAIN, URL_PARSE_JSON, DATE_SCRAPED, HTML
      • What other tables would we need?
      • Well, we want to generate graphs based on Fandoms and wikis, so those tables will likely have to be specific
      • We are going to have to write a sqlite tutorial in here
      • Yes we will
    • Do we want to store raw HTML files with JSON metadata, as well as sqlite?
      • YES
    • Let's do raw HTML files then (a file-plus-JSON-sidecar sketch follows after this list)
    • Is there room to use dentropydaemon-wiki/Software/List/Hamilton here for an ETL pipeline?
      • Not yet, only after we have completed scraping
    • What fandom page do we want to use as an example?
    • We have successfully scraped the URLs and stored them as a file; how do we want to deal with this transformation within sqlite?
    • Do we create a separate table of URLs and then have a table of scraped URLs?
    • Yes, that would make sense
    • So do I just dump each URL into the URLs table if it does not exist? (see the INSERT OR IGNORE sketch after this list)
    • We just hit a classic SQL problem: I want to link all the URLs from the scraped page to see how they interconnect, but that is going to require a table in the middle to manage the relationship, so we would have URLS_T, SCRAPED_URLS_T, and PAGE_LINK_T
    • PAGE_LINK_T can have root_url, linked_url, internal (boolean)
    • Yes that would work
    • This looks more like an ETL pipeline now
    • Do we want to go all out and develop recursive scraping and all that?
    • NO NO NO, this is supposed to be a tutorial; we want this as simple as possible
    • Ah okay, so just URLs and SCRAPED_URLS
    • That works for me; then we can do the ETL stuff
    • What was that other project that required a sqlite schema?
    • Keybase Binding
    • Alright, where is our schema?
    • Here (a sqlite3 version of this schema is sketched after this list)
      • SCRAPED_URLS_T
        • SCRAPED_URL_ID
        • URL_ID
        • DATE_SCRAPED
        • HTML
      • URLS_T
        • URL_ID
        • RAW_URL
        • DOMAIN
        • URL_PARSE_JSON
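
A minimal sqlite3 sketch of the schema above. The column types, the autoincrement keys (instead of the random primary key mentioned earlier), and the UNIQUE constraint on RAW_URL are assumptions; RAW_URL is UNIQUE so the dedupe sketch further down can use INSERT OR IGNORE.

```python
import sqlite3

# Column types are assumptions; the schema above only names the columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS URLS_T (
    URL_ID          INTEGER PRIMARY KEY AUTOINCREMENT,
    RAW_URL         TEXT NOT NULL UNIQUE,  -- UNIQUE so INSERT OR IGNORE can deduplicate
    DOMAIN          TEXT,
    URL_PARSE_JSON  TEXT
);

CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T (
    SCRAPED_URL_ID  INTEGER PRIMARY KEY AUTOINCREMENT,
    URL_ID          INTEGER NOT NULL REFERENCES URLS_T (URL_ID),
    DATE_SCRAPED    TEXT,
    HTML            TEXT
);
"""


def init_db(db_path: str = "scraper.db") -> sqlite3.Connection:
    """Create the database file if needed and both tables."""
    connection = sqlite3.connect(db_path)
    connection.executescript(SCHEMA)
    return connection
```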
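
For the extract-all-URLs task, a minimal sketch using requests (as decided above) and Beautiful Soup (as discussed in the logs below). The start URL is just a hypothetical example page, and the returned dict mirrors the URL / DOMAIN / URL_PARSE_JSON / DATE_SCRAPED / HTML primitive from the discussion.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Hypothetical example page; any fandom.com wiki page works the same way.
START_URL = "https://harrypotter.fandom.com/wiki/Main_Page"


def extract_urls(page_url: str) -> dict:
    """Fetch a page and return its raw HTML plus every URL linked from it."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    urls = []
    for anchor in soup.find_all("a", href=True):
        raw_url = urljoin(page_url, anchor["href"])  # resolve relative links
        parsed = urlparse(raw_url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        urls.append({
            "raw_url": raw_url,
            "domain": parsed.netloc,
            "url_parse_json": json.dumps(parsed._asdict()),
        })

    return {
        "url": page_url,
        "date_scraped": datetime.now(timezone.utc).isoformat(),
        "html": response.text,
        "linked_urls": urls,
    }


if __name__ == "__main__":
    page = extract_urls(START_URL)
    print(f"Found {len(page['linked_urls'])} links on {page['url']}")
```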
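
To dump each URL into URLS_T only if it does not already exist, INSERT OR IGNORE against the UNIQUE RAW_URL column handles the deduplication. This sketch assumes the page dict shape and the init_db() connection from the two sketches above.

```python
import json
import sqlite3
from urllib.parse import urlparse


def upsert_url(connection: sqlite3.Connection, record: dict) -> int:
    """Insert a URL into URLS_T if it is not already there and return its URL_ID."""
    connection.execute(
        "INSERT OR IGNORE INTO URLS_T (RAW_URL, DOMAIN, URL_PARSE_JSON) VALUES (?, ?, ?)",
        (record["raw_url"], record["domain"], record["url_parse_json"]),
    )
    row = connection.execute(
        "SELECT URL_ID FROM URLS_T WHERE RAW_URL = ?", (record["raw_url"],)
    ).fetchone()
    return row[0]


def store_scrape(connection: sqlite3.Connection, page: dict) -> None:
    """Record one scrape in SCRAPED_URLS_T and every linked URL in URLS_T."""
    parsed = urlparse(page["url"])
    root_id = upsert_url(connection, {
        "raw_url": page["url"],
        "domain": parsed.netloc,
        "url_parse_json": json.dumps(parsed._asdict()),
    })
    connection.execute(
        "INSERT INTO SCRAPED_URLS_T (URL_ID, DATE_SCRAPED, HTML) VALUES (?, ?, ?)",
        (root_id, page["date_scraped"], page["html"]),
    )
    for record in page["linked_urls"]:
        upsert_url(connection, record)
    connection.commit()
```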
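
And for the raw-HTML-plus-JSON-metadata decision, a sketch that writes each page to disk next to a JSON sidecar. The output directory and the sha256 filename scheme are assumptions, nothing settled above.

```python
import hashlib
import json
from pathlib import Path


def save_raw_html(page: dict, out_dir: str = "raw_pages") -> Path:
    """Write the raw HTML to disk with a JSON metadata sidecar next to it."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Hash the URL so the filename is filesystem-safe and stable across runs.
    name = hashlib.sha256(page["url"].encode("utf-8")).hexdigest()

    html_path = directory / f"{name}.html"
    html_path.write_text(page["html"], encoding="utf-8")

    metadata = {
        "url": page["url"],
        "date_scraped": page["date_scraped"],
        "linked_urls": [u["raw_url"] for u in page["linked_urls"]],
    }
    (directory / f"{name}.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return html_path
```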

Next Steps

  • Recursively scrape web pages
  • Track how the web pages are linked to one another
  • How do I track the kind of links? (see the PAGE_LINK_T sketch below)
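
For the link-tracking next steps, a sketch of the PAGE_LINK_T idea that was deferred above (root_url, linked_url, internal), where internal just means the linked URL shares the root page's domain. The column types and the page dict shape are carried over from the earlier sketches, so treat them as assumptions.

```python
import sqlite3
from urllib.parse import urlparse

PAGE_LINK_SCHEMA = """
CREATE TABLE IF NOT EXISTS PAGE_LINK_T (
    ROOT_URL    TEXT NOT NULL,
    LINKED_URL  TEXT NOT NULL,
    INTERNAL    INTEGER NOT NULL CHECK (INTERNAL IN (0, 1)),
    PRIMARY KEY (ROOT_URL, LINKED_URL)
);
"""


def record_links(connection: sqlite3.Connection, page: dict) -> None:
    """Store one row per link on the page, flagging same-domain links as internal."""
    connection.executescript(PAGE_LINK_SCHEMA)
    root_domain = urlparse(page["url"]).netloc
    for record in page["linked_urls"]:
        internal = 1 if record["domain"] == root_domain else 0
        connection.execute(
            "INSERT OR IGNORE INTO PAGE_LINK_T (ROOT_URL, LINKED_URL, INTERNAL) VALUES (?, ?, ?)",
            (page["url"], record["raw_url"], internal),
        )
    connection.commit()
```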

Additional Tasks

Research

Logs

  • 2023-03-17T22:13:19-04:00
    • Alright, how do I want to do this web scraping?
    • I just want to give it a wiki page and then index EVERYTHING
    • Where do we want to store EVERYTHING?
    • Why not just do sqlite? It is idiot-proof, right?
    • Well, if dentropydaemon-wiki/Software/List/Trilium Notes is going to store entire PDF files and images in there, what is the worst I can do? Plus SQLite has other extensions.
  • 2023-03-17T22:07:44-04:00
    • So are we using Beautiful Soup or Puppeteer?
    • Well, if we can learn Beautiful Soup in like an hour, why not
    • Alright
  • 2023-03-17T21:38:31-04:00