An art project that aims to depict the vastness and colorfulness of the internet.
You can see the result of all the crawling and image-crunching at FavoriteIconsOfInternet.com
Our current goal is to bring the project to a state where we can keep the history of daily favicon changes for at least a million websites.
This project uses the Phantom Of The Cloud image to launch workers for parallelizable steps (3, 4, 6 and eventually 8). AWS auto-scaling groups can be used to speed up or slow down processing.
Step 1. Load domains ✅
Updates the list of domains in the database. Currently takes a list of Alexa Rankings as input.
Runs on central box. See steps_1_and_2.sh
Step 2. Get a list of domains to crawl ✅
Gets a list of domains to crawl (currently only active Alexa domains) and uploads them to a queue in chunks for crawlers to pick up.
Runs on central box. See steps_1_and_2.sh
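The chunking in Step 2 could look roughly like the sketch below. It is not taken from steps_1_and_2.sh: the function names, the chunk size of 100 and the JSON message format are all assumptions made for illustration, and the actual upload to the queue is left out.

```python
import json
from typing import Iterable, Iterator, List


def chunk_domains(domains: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most chunk_size domains (hypothetical helper)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for domain in domains:
        chunk.append(domain)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # The last chunk may be shorter than chunk_size.
        yield chunk


def build_messages(domains: Iterable[str], chunk_size: int = 100) -> List[str]:
    """Serialize each chunk as one message body (assumed JSON format)."""
    return [json.dumps({"domains": chunk})
            for chunk in chunk_domains(domains, chunk_size)]
```

Each string returned by build_messages would be sent as one queue message, so a crawler picks up a whole chunk at a time.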
Step 3. Fetch icons ✅
Listens for messages in a queue and crawls the sites listed in each message, finding their favorite icons and comparing them to the existing versions to see if they have changed.
Runs on crawler workers. See steps_3_and_4.sh
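One possible way to do the comparison in Step 3 is to store a digest of each icon and compare it with the digest of the freshly fetched bytes. This is only a sketch under that assumption. The project may compare icons differently, and the names icon_digest and classify_icon are hypothetical.

```python
import hashlib
from typing import Optional


def icon_digest(icon_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw icon bytes."""
    return hashlib.sha256(icon_bytes).hexdigest()


def classify_icon(previous_digest: Optional[str], icon_bytes: bytes) -> str:
    """Return 'new', 'changed' or 'unchanged' for a freshly fetched icon.

    previous_digest is the digest stored for this site on an earlier
    crawl, or None if the site has no stored icon yet.
    """
    if previous_digest is None:
        return "new"
    if previous_digest == icon_digest(icon_bytes):
        return "unchanged"
    return "changed"
```

Comparing raw bytes treats a re-encoded but visually identical icon as changed. Whether that is acceptable depends on how strictly "changed" should be interpreted.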
Step 4. Convert icons to PNG ✅
After all icons are fetched, convert them to PNG, calculate the average color and upload them to results storage together with a manifest describing which icons are new, which have changed, etc.
Runs on crawler workers. See steps_3_and_4.sh
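The average color and manifest from Step 4 can be sketched as follows. The code assumes the icon has already been decoded into a sequence of (R, G, B) tuples, and the manifest layout (domains grouped by status) is a guess for illustration, not the format the workers actually upload.

```python
from typing import Dict, Iterable, List, Tuple

Pixel = Tuple[int, int, int]


def average_color(pixels: Iterable[Pixel]) -> Pixel:
    """Return the per-channel mean of RGB pixels, rounded down."""
    total_r = total_g = total_b = count = 0
    for r, g, b in pixels:
        total_r += r
        total_g += g
        total_b += b
        count += 1
    if count == 0:
        raise ValueError("cannot average an image with no pixels")
    return (total_r // count, total_g // count, total_b // count)


def build_manifest(statuses: Dict[str, str]) -> Dict[str, List[str]]:
    """Group domains by status (hypothetical manifest layout).

    statuses maps a domain to 'new', 'changed' or 'unchanged';
    any other status raises KeyError.
    """
    manifest: Dict[str, List[str]] = {"new": [], "changed": [], "unchanged": []}
    for domain in sorted(statuses):
        manifest[statuses[domain]].append(domain)
    return manifest
```

A plain mean ignores transparency, so icons with large transparent areas would need the alpha channel taken into account before averaging.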
Step 5. Calculate tiles to be updated ✅
Gather all the results and update the database. Calculate a list of tiles that need to be updated (currently all tiles with predefined width/height ordered by Alexa ranking) and put each tile as a job into a queue.
Generate HTML and necessary JSON metadata.
Runs on central box. See step5.sh
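To make the tile calculation in Step 5 concrete, here is a minimal sketch that splits a ranking-ordered list of domains into fixed-size tiles and locates an icon inside its tile. The row-by-row fill order and the names plan_tiles and icon_cell are assumptions, not something step5.sh is known to do.

```python
from typing import List, Sequence, Tuple


def plan_tiles(ranked_domains: Sequence[str],
               tile_width: int, tile_height: int) -> List[List[str]]:
    """Split domains (ordered by ranking) into tiles of width * height icons."""
    per_tile = tile_width * tile_height
    if per_tile < 1:
        raise ValueError("tile dimensions must be positive")
    return [list(ranked_domains[i:i + per_tile])
            for i in range(0, len(ranked_domains), per_tile)]


def icon_cell(index_in_tile: int, tile_width: int) -> Tuple[int, int]:
    """Return (column, row) of an icon, assuming tiles are filled row by row."""
    row, column = divmod(index_in_tile, tile_width)
    return (column, row)
```

Each inner list returned by plan_tiles would correspond to one tile job placed on the queue. The last tile may be only partially filled.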
Step 6. Generate tiles 🔴
Grab images required for the tile (or sync them all) and generate a tile. Optimize the image using smu.sh and deploy to a CDN.
Runs on tile workers. TBD (To Be Developed)
Step 7. Move HTML and metadata to production 🔴
Once all tiles are done, move HTML and metadata chunks over to production!
Runs on central box. TBD (To Be Developed)
Step 8. Send emails, daily reports, etc. 🔴
Notify users (if any), send the daily newsletter, etc.
Runs on central box (and SMTP workers if load is high). TBD (To Be Developed)