Let's Learn Web Scraping
Purpose
The goal is to write a web scraping tutorial using dentropydaemon-wiki/Software/List/Puppeteer to scrape dentropydaemon-wiki/Media/List/Fandom.com in its raw form for later processing in ETL pipelines.
Requirements
- dentropydaemon-wiki/Software/List/vscode
- Git Binding
- Javascript Document Object Model (DOM) Inspection
- dentropydaemon-wiki/Software/Catagories/Web Scraping
The code
dentropy/dentropys-web-scraping-tutorial
First Task
- Download a web page and save it (see the requests sketch after this list).
- Do we want to download just the HTML, or all the assets as well?
- For now let's just download the HTML
- How do we want to store the data, as files or within a database?
- For now let's just dump it to a folder
- If we are storing HTML, do we want to store it within JSON so that it includes website metadata?
- Yes
- How can we know if a site is dynamic or static?
- Test if you can dentropydaemon-wiki/Software/List/curl or dentropydaemon-wiki/Software/List/wget it.
- What are different yet notable ways we can accomplish this task?
- dentropydaemon-wiki/Software/List/curl
- Works like a charm, even works with python via dentropydaemon-wiki/Software/List/python/subprocess
- dentropydaemon-wiki/Software/List/python/requests
- Works like a charm
- dentropydaemon-wiki/Software/List/pywebcopy
- Not ideal, gets blocked by fandom.com and wikipedia.com
- What web page do we want to use for example?
- fandom.com
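If curl or wget can pull the page down, the site is static enough that plain requests will do. Here is a minimal sketch of the first task in Python with requests, assuming we dump each page into a local scraped_pages/ folder as one JSON record per page; the folder name, the hash-based filenames, the metadata fields, and the Westworld Fandom example URL are all assumptions of mine, not something from dentropy/dentropys-web-scraping-tutorial.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse

import requests

OUT_DIR = Path("scraped_pages")  # illustrative output folder
OUT_DIR.mkdir(exist_ok=True)

def download_page(url: str) -> Path:
    """Fetch a page's raw HTML and dump it to a JSON file with some metadata."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    record = {
        "url": url,
        "domain": urlparse(url).netloc,
        "date_scraped": datetime.now(timezone.utc).isoformat(),
        "status_code": response.status_code,
        "html": response.text,
    }
    # Hash the URL so every page gets a valid, unique filename
    out_path = OUT_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return out_path

if __name__ == "__main__":
    print(download_page("https://westworld.fandom.com/wiki/Westworld_Wiki"))
```

A dynamic site that curl cannot handle would need dentropydaemon-wiki/Software/List/Puppeteer or another headless browser instead; this sketch only covers the static case.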
Second Task
- Extract all URLs from a web page (see the BeautifulSoup sketch after this list)
- We need to make a decision on how we want to scrape our data and commit to it
- requests is better documented on the internet, plus why complicate things?
- How do we want to store this data?
- All these tutorials online just print out the URLs or store them as a list
- What is your purpose for scraping?
- I want an index of everything on fandom.com so I can play with it in ETL pipelines
- Then you need to decide on where you want to store your data
- I like the idea of doing it all in sqlite, maybe using the html extension
- The sqlite-html extension will only make things more complicated; we are not worried about optimizing anything right now
- What information do we want to extract from HTML web pages?
- Now we are designing a schema
- What if people do not know SQL?
- JSON files give me a headache, plus sqlite has JSON and CSV exports out of the box
- Alright we can do sqlite no problem
- What is the base primitive?
- Random Primary Key, URL, DOMAIN, URL_PARSE_JSON, DATE_SCRAPED, HTML
- What other tables would we need?
- Well, we want to generate graphs based on Fandoms and wikis, therefore those will likely have to be specific
- We are going to have to write a sqlite tutorial in here
- Yes we will
- Do we want to do raw HTML files with JSON metadata, as well as sqlite?
- YES
- Let's do raw HTML files then
- Is there room to use dentropydaemon-wiki/Software/List/Hamilton here for an ETL pipeline?
- Not yet, only after we have completed scraping
- What fandom page do we want to use as an example?
- Why not dentropydaemon-wiki/Media/List/Person of Interest or dentropydaemon-wiki/Media/List/Westworld, my favourite shows?
- Let's do dentropydaemon-wiki/Media/List/Westworld because I finished watching it
- We have successfully scraped the URLs and stored them as a file; how do we want to deal with this transformation within sqlite?
- Do we create a separate table of URLs and then have a table of scraped URLs?
- Yes that would make sense
- So do I just dump each URL into the URLs table if it does not exist?
- We just hit a classic SQL problem: I want to link all the URLs from the scraped page to see how they interconnect, but that is going to require a table in the middle to manage the relationship, so we would have URLS_T, SCRAPED_URLS_T, and PAGE_LINK_T
- PAGE_LINK_T can have root_url, linked_url, internal (boolean)
- Yes that would work
- This looks more like an ETL pipeline now
- Do we want to go full out and develop recursive scraping and shit
- NO NO NO, this is supposed to be a tutorial we want this as simple as possible
- Ah okay, so just URLS and SCRAPED_URLS
- That works for me then we can do the ETL stuff
- What was that other project that required a sqlite schema?
- Keybase Binding
- Alright, where is our schema?
- Here (see the sqlite sketch after this list)
- SCRAPED_URLS_T
    - SCRAPED_URL_ID
    - URL_ID
    - DATE_SCRAPED
    - HTML
- URLS_T
    - URL_ID
    - RAW_URL
    - DOMAIN
    - URL_PARSE_JSON
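A sketch of the URL-extraction step with requests and BeautifulSoup (bs4), again against the Westworld wiki; the extract_urls helper name and the same-domain check behind the internal flag are assumptions of mine, not something from the tutorial repo.

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def extract_urls(page_url: str) -> list:
    """Return every absolute URL linked from page_url, flagged internal/external."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    base_domain = urlparse(page_url).netloc
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(page_url, anchor["href"])  # resolve relative hrefs
        if not absolute.startswith("http"):
            continue  # skip mailto:, javascript:, and similar links
        links.append({
            "url": absolute,
            "internal": urlparse(absolute).netloc == base_domain,
        })
    return links

if __name__ == "__main__":
    for link in extract_urls("https://westworld.fandom.com/wiki/Westworld_Wiki")[:10]:
        print(link)
```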
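And a sketch of the sqlite side following the URLS_T / SCRAPED_URLS_T schema above; the column types, the autoincrement ids standing in for the random primary key, and the get_or_create_url / record_scrape helpers are assumptions layered on the schema in the list, with PAGE_LINK_T left for the later ETL step as decided above.

```python
import json
import sqlite3
from datetime import datetime, timezone
from urllib.parse import urlparse

# Mirrors the two-table schema above; PAGE_LINK_T can be added later for the link graph.
SCHEMA = """
CREATE TABLE IF NOT EXISTS URLS_T (
    URL_ID         INTEGER PRIMARY KEY AUTOINCREMENT,
    RAW_URL        TEXT UNIQUE NOT NULL,
    DOMAIN         TEXT,
    URL_PARSE_JSON TEXT
);
CREATE TABLE IF NOT EXISTS SCRAPED_URLS_T (
    SCRAPED_URL_ID INTEGER PRIMARY KEY AUTOINCREMENT,
    URL_ID         INTEGER REFERENCES URLS_T(URL_ID),
    DATE_SCRAPED   TEXT,
    HTML           TEXT
);
"""

def get_or_create_url(conn: sqlite3.Connection, raw_url: str) -> int:
    """Dump a URL into URLS_T if it does not exist yet and return its URL_ID."""
    parsed = urlparse(raw_url)
    conn.execute(
        "INSERT OR IGNORE INTO URLS_T (RAW_URL, DOMAIN, URL_PARSE_JSON) VALUES (?, ?, ?)",
        (raw_url, parsed.netloc, json.dumps(parsed._asdict())),
    )
    return conn.execute(
        "SELECT URL_ID FROM URLS_T WHERE RAW_URL = ?", (raw_url,)
    ).fetchone()[0]

def record_scrape(conn: sqlite3.Connection, page_url: str, html: str, linked_urls: list) -> None:
    """Store one scraped page and seed URLS_T with every URL it links to."""
    url_id = get_or_create_url(conn, page_url)
    conn.execute(
        "INSERT INTO SCRAPED_URLS_T (URL_ID, DATE_SCRAPED, HTML) VALUES (?, ?, ?)",
        (url_id, datetime.now(timezone.utc).isoformat(), html),
    )
    for linked in linked_urls:
        get_or_create_url(conn, linked)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("scraping_tutorial.db")
    conn.executescript(SCHEMA)
    # e.g. record_scrape(conn, url, html, [link["url"] for link in extract_urls(url)])
```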
Next Steps
- Recursively scrape web pages (a rough sketch follows this list)
- Track how the web pages are linked to one another
- How do I track the kind of links?
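A rough breadth-first sketch of what the recursive step could look like, assuming we stay on one domain, cap the number of pages, and sleep between requests; the queue approach, the limits, and the delay are assumptions rather than anything decided above, and the link graph it returns is what would eventually feed PAGE_LINK_T.

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 50, delay_seconds: float = 1.0) -> dict:
    """Breadth-first crawl within one domain; returns {page_url: [linked_urls]}."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    link_graph = {}
    while queue and len(link_graph) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=30).text
        except requests.RequestException:
            continue  # skip pages that fail to download and keep crawling
        soup = BeautifulSoup(html, "html.parser")
        outgoing = []
        for anchor in soup.find_all("a", href=True):
            target = urljoin(url, anchor["href"]).split("#")[0]  # drop fragments
            if not target.startswith("http"):
                continue
            outgoing.append(target)
            # Only follow links that stay on the same domain
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
        link_graph[url] = outgoing
        time.sleep(delay_seconds)  # be polite to the wiki
    return link_graph

if __name__ == "__main__":
    graph = crawl("https://westworld.fandom.com/wiki/Westworld_Wiki", max_pages=10)
    print(f"crawled {len(graph)} pages")
```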
Additional Tasks
- Pay to get around Captchas
- Create accounts with Email Authentication
- Reverse Engineer API
- HTTPS Interception and decoding in Wireshark
- Detect and get around getting blocked
Research
- dentropydaemon-wiki/Software/Catagories/Web Scraping tools and dentropydaemon-wiki/Software/Catagories/Web Scraping Browser Extensions
- dentropydaemon-wiki/Wiki/Public Questions/What other some projects that scraped fandom.com
Logs
- 2023-03-17T22:13:19-04:00
- Alright how do I want to do this web scraping?
- I just want to give a wiki page then index EVERYTHING
- Where do we want to store EVERYTHING
- Why not just do SQLITE, it is idiot proof, right?
- Well, if dentropydaemon-wiki/Software/List/Trilium Notes is going to store entire PDF files and images in there, what is the worst I can do? Plus SQLite has other extensions.
- 2023-03-17T22:07:44-04:00
- So are we using Beautiful Soup or Puppeteer?
- Well, if we can learn Beautiful Soup in like an hour, why not
- Alright
- 2023-03-17T21:38:31-04:00
- The goal is to write a web scraping tutorial using dentropydaemon-wiki/Software/List/Puppeteer to scrape dentropydaemon-wiki/Media/List/Fandom.com
- Alright, what's required for that?