How to crawl a quarter billion webpages in 40 hours

In this article, the author explains how they crawled 250,113,669 webpages in 39 hours and 25 minutes using 20 Amazon EC2 machine instances for just under 580 dollars. They also discuss the challenges of crawling a non-trivial fraction of the web and the ethical considerations surrounding web crawling. The author refrains from open-sourcing their crawler code due to concerns about its impact on website owners. The article concludes with a description of the architecture used in the crawling process.
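To put the headline figures in perspective, here is a minimal back-of-the-envelope sketch that derives the implied throughput and cost from the numbers quoted above. It is not taken from the author's crawler; the function and variable names are illustrative assumptions, the cost is treated as an upper bound of 580 dollars because the article says "just under", and the per-machine figure assumes the load was spread evenly across the 20 instances.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# All names here are hypothetical; only the input numbers come from the source.

PAGES_CRAWLED = 250_113_669          # pages, as stated
ELAPSED_SECONDS = 39 * 3600 + 25 * 60  # 39 hours 25 minutes
MACHINES = 20                        # EC2 instances, as stated
COST_UPPER_BOUND_USD = 580.0         # "just under 580 dollars", so an upper bound


def pages_per_second(pages: int, seconds: int) -> float:
    """Average aggregate crawl rate over the whole run."""
    return pages / seconds


def cost_per_million_pages(cost_usd: float, pages: int) -> float:
    """Cost in dollars per one million pages crawled."""
    return cost_usd / (pages / 1_000_000)


aggregate_rate = pages_per_second(PAGES_CRAWLED, ELAPSED_SECONDS)
# Assumes an even split of work across machines, which the summary does not confirm.
per_machine_rate = aggregate_rate / MACHINES
cost_per_million = cost_per_million_pages(COST_UPPER_BOUND_USD, PAGES_CRAWLED)

if __name__ == "__main__":
    print(f"Aggregate rate:   {aggregate_rate:,.1f} pages/s")
    print(f"Per-machine rate: {per_machine_rate:,.1f} pages/s")
    print(f"Cost (at most):   ${cost_per_million:.2f} per million pages")
```

Under these assumptions the run averages about 1,763 pages per second overall, about 88 pages per second per machine, and at most roughly 2.32 dollars per million pages.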