A better way to crawl websites with PHP
Our spatie/crawler package is one of the first packages I created. It allows you to crawl a website with PHP. It is used extensively in Oh Dear and our laravel-sitemap package.
Throughout the years, the API had accumulated some rough edges. With v9, we cleaned all of that up and added a bunch of features we've wanted for a long time.
Let me walk you through all of it!
Using the crawler
The simplest way to crawl a site is to pass a URL to Crawler::create() and attach a callback via onCrawled():
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        echo "{$url}: {$response->status()}\n";
    })
    ->start();
The callable gets a CrawlResponse object. It has these methods:
$response->status();              // int
$response->body();                // string
$response->header('some-header'); // ?string
$response->dom();                 // Symfony DomCrawler instance
$response->isSuccessful();        // bool
$response->isRedirect();          // bool
$response->foundOnUrl();          // ?string
$response->linkText();            // ?string
$response->depth();               // int
The body is cached, so calling body() multiple times won't re-read the stream. And if you still need the raw PSR-7 response for some reason, toPsrResponse() has you covered.
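To make that concrete, here's a short sketch (the variable names are mine; body() and toPsrResponse() come from the package):

```php
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlResponse;

Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // body() is cached: both calls return the same buffered string
        $html = $response->body();
        $sameHtml = $response->body();

        // drop down to the raw PSR-7 response when you need it
        $psrResponse = $response->toPsrResponse();
        $contentType = $psrResponse->getHeaderLine('Content-Type');
    })
    ->start();
```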
You can control how many URLs are fetched at the same time with concurrency(), and set a hard cap with limit():
Crawler::create('https://example.com')
    ->concurrency(5)
    ->limit(200) // will stop after crawling this amount of pages
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // ...
    })
    ->start();
There are a couple of other callbacks you can hook into:
Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response, CrawlProgress $progress) {
        echo "[{$progress->urlsProcessed}/{$progress->urlsFound}] {$url}\n";
    })
    ->onFailed(function (string $url, RequestException $e, CrawlProgress $progress) {
        echo "Failed: {$url}\n";
    })
    ->onFinished(function (FinishReason $reason, CrawlProgress $progress) {
        echo "Done: {$reason->name}\n";
    })
    ->start();
Each of these callbacks now receives a CrawlProgress object that tells you exactly where you are in the crawl:
$progress->urlsProcessed; // how many URLs have been crawled
$progress->urlsFailed;    // how many failed
$progress->urlsFound;     // total discovered so far
$progress->urlsPending;   // still in the queue
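These fields can drive a simple progress display. A sketch, assuming it runs inside one of the callbacks above:

```php
// Percentage of discovered URLs that have been processed so far.
// urlsFound grows as the crawl discovers new links, so guard against zero.
$percentage = $progress->urlsFound > 0
    ? round($progress->urlsProcessed / $progress->urlsFound * 100)
    : 0;

echo "Progress: {$percentage}% ({$progress->urlsPending} pending, {$progress->urlsFailed} failed)\n";
```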
The start() method now returns a FinishReason enum, so you know exactly why the crawler stopped:
$reason = Crawler::create('https://example.com')
    ->limit(100)
    ->start();

// $reason is one of: Completed, CrawlLimitReached, TimeLimitReached, Interrupted
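Because this is a native PHP enum, you can match on it exhaustively. A sketch using the case names listed above (the messages are mine):

```php
use Spatie\Crawler\FinishReason;

$message = match ($reason) {
    FinishReason::Completed => 'Crawled every URL that was found',
    FinishReason::CrawlLimitReached => 'Stopped at the limit() cap',
    FinishReason::TimeLimitReached => 'Ran out of time',
    FinishReason::Interrupted => 'The crawl was interrupted',
};
```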
Each CrawlResponse also carries a TransferStatistics object with detailed timing data for the request:
Crawler::create('https://example.com')
    ->onCrawled(function (string $url, CrawlResponse $response) {
        $stats = $response->transferStats();

        echo "{$url}\n";
        echo "  Transfer time: {$stats->transferTimeInMs()}ms\n";
        echo "  DNS lookup: {$stats->dnsLookupTimeInMs()}ms\n";
        echo "  TLS handshake: {$stats->tlsHandshakeTimeInMs()}ms\n";
        echo "  Time to first byte: {$stats->timeToFirstByteInMs()}ms\n";
        echo "  Download speed: {$stats->downloadSpeedInBytesPerSecond()} B/s\n";
    })
    ->start();
All timing methods return values in milliseconds. They return null when the stat is unavailable; for example, tlsHandshakeTimeInMs() will be null for plain HTTP requests.
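Since the timing methods are nullable, guard before formatting the value. A small sketch using the method names above:

```php
$tls = $stats->tlsHandshakeTimeInMs();

echo $tls === null
    ? "  TLS handshake: n/a (plain HTTP)\n"
    : "  TLS handshake: {$tls}ms\n";
```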
Throttling the crawl
I wanted the crawler to be a well-behaved piece of software. Running it at full speed with high concurrency could overload some servers. That's why throttling is a first-class feature of the package.
We ship two throttling strategies. The first, FixedDelayThrottle, waits a fixed amount of time between requests:
// 200ms between requests
$crawler->throttle(new FixedDelayThrottle(200));
AdaptiveThrottle adjusts the delay based on how fast the server responds: when the server responds quickly, the delay stays close to the minimum; when it responds slowly, the crawler automatically backs off.
$crawler->throttle(new AdaptiveThrottle(
    minDelayMs: 50,
    maxDelayMs: 5000,
));
Testing with fake()
Like Laravel's HTTP client, the crawler now has a fake mode that lets you define which response should be returned for each URL, without making any actual requests:
Crawler::create('https://example.com')
    ->fake([
        'https://example.com' => '<html><a href="/about">About</a></html>',
        'https://example.com/about' => '<html>About page</html>',
    ])
    ->onCrawled(function (string $url, CrawlResponse $response) {
        // your assertions here
    })
    ->start();
Faking responses keeps your test suite fast.
Driver-based JavaScript rendering
Like in our Laravel PDF, Laravel Screenshot, and Laravel OG Image packages, Browsershot is no longer a hard dependency. JavaScript rendering is now driver-based, so you can use Browsershot, a new Cloudflare renderer, or write your own:
$crawler->executeJavaScript(new CloudflareRenderer($endpoint));
In closing
I'm usually very humble, but I think that in this case I can say that our crawler package is the best crawler available in the entire PHP ecosystem.
You can find the package on GitHub. The full documentation is available on our documentation site.
This is one of the many packages we've created at Spatie. If you want to support our open source work, consider picking up one of our paid products.