Improving the performance of our PHP-based crawler
Today a new major version of our homegrown crawler was released. The crawler powers our http-status-check, laravel-sitemap and laravel-link-checker packages. The headline feature of this release is greatly improved crawling speed, achieved by performing multiple requests concurrently.
Let's take a look at the performance improvement gained by using concurrent requests. In the video below, the crawler is started twice. On the left we have v1 of the crawler, which performs a single request and waits for the response before launching the next one. On the right we have v2, which uses 10 concurrent requests. Both crawls cover our entire company site, https://spatie.be.
Even though I gave v1 a little head start, it got blown away by v2. Where v1 is constantly waiting for a response from the server, v2 simply launches another request while it waits for the responses to previous ones.
To make requests, the crawler package uses Guzzle, the well-known HTTP client. A cool feature of Guzzle is that it supports concurrent requests out of the box. If you want to know more about that subject, read this excellent blog post by Hannes Van de Vreken. The relevant code in our package is built around that feature.
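Here's a minimal sketch of that approach, not the package's actual code, using Guzzle's Pool class. The list of URLs, the concurrency value and the callbacks are made up for illustration:

```php
use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;
use Psr\Http\Message\ResponseInterface;

$client = new Client();

$urls = [
    'https://spatie.be',
    'https://spatie.be/opensource',
    'https://spatie.be/about-us',
];

// Lazily turn every URL into a GET request.
$requests = function () use ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests(), [
    // Keep up to 10 requests in flight at the same time.
    'concurrency' => 10,
    'fulfilled' => function (ResponseInterface $response, $index) use ($urls) {
        echo "{$urls[$index]}: {$response->getStatusCode()}".PHP_EOL;
    },
    'rejected' => function ($reason, $index) use ($urls) {
        echo "{$urls[$index]} could not be reached".PHP_EOL;
    },
]);

// Kick off the transfers and wait until they have all completed.
$pool->promise()->wait();
```

While one response is still on its way, the pool already has other requests in flight, which is exactly why v2 spends so little time waiting.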
Together with crawler v2, new major versions of these packages were released, each making use of the new crawler:
- http-status-check: this command line tool can scan your entire site and report the HTTP status code of each page. We use this tool whenever we launch a new site at Spatie to check for broken links.
- laravel-sitemap: this Laravel package can generate a sitemap by crawling your entire site (a minimal usage example follows this list).
- laravel-link-checker: this one can automatically notify you whenever a broken link is found on your site.
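To give an idea of how little code that takes, generating a sitemap with laravel-sitemap boils down to a single chained call. The URL and output path below are placeholders, and the exact API may differ slightly between versions:

```php
use Spatie\Sitemap\SitemapGenerator;

// Crawl the given site and write the resulting sitemap to a file.
SitemapGenerator::create('https://example.com')
    ->writeToFile(public_path('sitemap.xml'));
```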
Integrating the crawler into your own project or package is easy. You can set a CrawlProfile to determine which URLs should be crawled, and a CrawlObserver to determine what should be done with the URLs that are found. Want to know more? Then head over to the crawler repo on GitHub.
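To give a taste of the API, a bare-bones integration could look like the sketch below. MyCrawlProfile and MyCrawlObserver are hypothetical classes implementing the package's CrawlProfile and CrawlObserver contracts; the exact methods they need to provide depend on the crawler version, so check the readme for the details.

```php
use Spatie\Crawler\Crawler;

Crawler::create()
    // Decides which URLs should be crawled (hypothetical CrawlProfile implementation).
    ->setCrawlProfile(new MyCrawlProfile())
    // Decides what to do with every crawled URL (hypothetical CrawlObserver implementation).
    ->setCrawlObserver(new MyCrawlObserver())
    ->startCrawling('https://spatie.be');
```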
If you like the crawler, be sure to also take a look at the many other framework-agnostic and Laravel-specific packages our team has created.