#tips Instead of crawling popular sites directly, consider crawling an alternative frontend instead (https://farside.link/), though that's maybe not the best option since those frontends can be missing data.

Puppeteer will generally serve you best: it is developed by Google and is effectively a headless Google Chrome instance, so it renders everything a real browser does.

Use the sitemap, which every website should have. You can always fetch it with curl since it is just an XML file, then use Puppeteer to crawl the pages you need; Puppeteer will get everything. Make sure your code keeps the browser running instead of launching it for every individual URL, which is inefficient (see the sketch below).

You will run into CAPTCHAs, and the best way to overcome them is a mix of residential IP proxies and CAPTCHA-solving services (https://proxidize.com/ is one provider in the proxy space; a proxy sketch follows below). If you want to avoid CAPTCHAs et al. but don't want to rely on external services, try https://github.com/ultrafunkamsterdam/undetected-chromedriver.

To automatically support multiple websites and to be resistant to site changes breaking your scraping script, find information without relying on ids/class names. For instance, if you want to collect information on each item in a list, sort elements by child count (maybe also with a filter for minimum text-content length) to find the one where everything branches out, then iterate over its child elements, applying regexes to their textContent, looking at tag attributes, and so on: things that do not depend on the names the devs or JavaScript compilers have chosen. A sketch of this heuristic is below.

For running a scraper on a schedule and keeping a history of the results in version control, see git scraping: https://simonwillison.net/2020/Oct/9/git-scraping/

Puppeteer vs Playwright?
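A minimal sketch of the sitemap-then-crawl flow with a single long-lived browser, assuming Node 18+ (for the global fetch) and the puppeteer npm package; the sitemap URL, the `<loc>` regex, and the per-page logic (just grabbing the title) are illustrative placeholders:

```ts
// Fetch the sitemap (plain XML, so no browser needed), then reuse one
// Puppeteer browser/page for every URL instead of relaunching Chrome per page.
import puppeteer from 'puppeteer';

async function sitemapUrls(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map(m => m[1].trim());
}

async function crawl(sitemapUrl: string): Promise<void> {
  const urls = await sitemapUrls(sitemapUrl);
  const browser = await puppeteer.launch(); // launched once for the whole run
  const page = await browser.newPage();
  try {
    for (const url of urls) {
      await page.goto(url, { waitUntil: 'networkidle2' });
      console.log(url, '->', await page.title()); // replace with real extraction
    }
  } finally {
    await browser.close();
  }
}

crawl('https://example.com/sitemap.xml').catch(console.error);
```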
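If you go the proxy route, the usual way to wire it into Puppeteer is Chromium's --proxy-server launch flag plus page.authenticate() for proxies that need credentials; the endpoint and login below are placeholders, not a real provider config:

```ts
// Route headless Chrome through a proxy: --proxy-server is a standard
// Chromium flag, and page.authenticate() supplies proxy credentials.
import puppeteer from 'puppeteer';

async function launchBehindProxy() {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:8000'], // placeholder endpoint
  });
  const page = await browser.newPage();
  await page.authenticate({ username: 'user', password: 'pass' }); // placeholder creds
  return { browser, page };
}
```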
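And a sketch of the structure-agnostic extraction heuristic described above: rank elements by child count (filtering out near-empty wrappers), treat the winner as the list container, and pull fields from each child with regexes on textContent. The 200-character threshold and the price regex are arbitrary examples you would tune per use case:

```ts
// Heuristic, selector-free extraction: find the element the listing
// "branches out" into, then regex each child's textContent.
import type { Page } from 'puppeteer';

async function extractItems(page: Page) {
  return page.evaluate(() => {
    const container = Array.from(document.querySelectorAll('*'))
      .filter(el => (el.textContent ?? '').trim().length > 200)  // skip near-empty wrappers
      .sort((a, b) => b.children.length - a.children.length)[0]; // most direct children first
    if (!container) return [];
    return Array.from(container.children).map(child => {
      const text = (child.textContent ?? '').trim();
      return {
        snippet: text.slice(0, 120),
        price: text.match(/\$\s?\d[\d.,]*/)?.[0] ?? null, // example regex, tune per site
        links: Array.from(child.querySelectorAll('a')).map(a => a.href),
      };
    });
  });
}
```

Because it keys on structure and text patterns rather than ids or class names, this kind of function tends to survive redesigns and can often be reused across different sites.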