
Building a crawler in PHP

Original – by Freek Van der Herten – 4 minute read

When Spatie unleashes a new site on the web, we want to make sure that all links on it, both internal and external, work. To facilitate that process we released a tool that checks the status code of every link on a given website. It can easily be installed via Composer:

composer global require spatie/http-status-check

For example, let's scan the Laracasts.com domain:

http-status-check scan

Our little tool will spit out the status code of every link it finds.

Once it has finished, a summary with the number of links per status code is displayed.

The package uses a homegrown crawler. Sure, there are already many other crawlers available. I built a custom one partly as a learning exercise, and partly because the other crawlers didn't do exactly what I wanted.

Let's take a look at the code that crawls all links on a piece of html:

/**
 * Crawl all links in the given html.
 *
 * @param string $html
 */
protected function crawlAllLinks($html)
{
    $allLinks = $this->getAllLinks($html);

    collect($allLinks)
        ->filter(function (Url $url) {
            return !$url->isEmailUrl();
        })
        ->map(function (Url $url) {
            return $this->normalizeUrl($url);
        })
        ->filter(function (Url $url) {
            return $this->crawlProfile->shouldCrawl($url);
        })
        ->each(function (Url $url) {
            $this->crawlUrl($url);
        });
}

So first we get all links. Then we'll filter out mailto-links. The next step normalizes all links. After that we'll let a crawlProfile determine if that link should be crawled. And finally the link will get crawled.

Determining which links there are on a page may sound quite daunting but Symfony's DomCrawler makes that very easy. Here's the code:

protected function getAllLinks($html)
{
    $domCrawler = new DomCrawler($html);

    return collect($domCrawler->filterXpath('//a')
        ->extract(['href']))
        ->map(function ($url) {
            return Url::create($url);
        });
}

The DomCrawler returns strings. Those strings get mapped to Url-objects to make it easy to work with them.
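To see what this extraction step does without installing anything, here is a minimal sketch that uses PHP's bundled DOM extension instead of Symfony's DomCrawler. It is not the package's code: the function name extractHrefs is made up for illustration, and it assumes the dom extension is enabled (it usually is). Unlike the snippet above, it skips anchors that have no href attribute at all.

```php
<?php

/**
 * Return the href of every <a> element in an HTML string.
 * Hypothetical helper, built on ext-dom rather than Symfony's DomCrawler.
 *
 * @param string $html
 *
 * @return string[]
 */
function extractHrefs(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $hrefs = [];

    // Same XPath idea as above, restricted to anchors that carry an href.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }

    return $hrefs;
}

$html = '<p><a href="/contact">Contact</a> '
    . '<a href="mailto:info@example.com">Mail</a> '
    . '<a name="top">No href</a></p>';

print_r(extractHrefs($html));
```

The result is a plain array of strings, which is the point where the crawler wraps each one in a Url-object.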

On a webpage, protocol-independent links (e.g. //domain.com/contactpage) and relative links (/contactpage) may appear. Our little crawler needs absolute links (https://domain.com/contactpage), so all links need to be normalized. The code to do that:

protected function normalizeUrl(Url $url)
{
    if ($url->isRelative()) {
        $url->setScheme($this->baseUrl->scheme)
             ->setHost($this->baseUrl->host);
    }
    if ($url->isProtocolIndependent()) {
        $url->setScheme($this->baseUrl->scheme);
    }
    return $url->removeFragment();
}

$baseUrl in the code above contains the url of the site we're scanning.
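The same rules can be tried out on plain strings. The sketch below is only an illustration: normalizeLink is a hypothetical function, and its $baseScheme and $baseHost parameters stand in for the scheme and host of $baseUrl. It handles the two cases named above (protocol-independent and root-relative links) and leaves everything else untouched, including path-relative links such as page.html.

```php
<?php

/**
 * Turn a link found on a page into an absolute url without a fragment.
 * Hypothetical string-based counterpart of normalizeUrl().
 */
function normalizeLink(string $link, string $baseScheme, string $baseHost): string
{
    // Drop the fragment (#section); it does not change which page is requested.
    $link = explode('#', $link, 2)[0];

    // Protocol-independent link: only the scheme is missing.
    if (strpos($link, '//') === 0) {
        return $baseScheme . ':' . $link;
    }

    // Root-relative link: scheme and host are both missing.
    if (strpos($link, '/') === 0) {
        return $baseScheme . '://' . $baseHost . $link;
    }

    // Already absolute, or a form this sketch does not handle.
    return $link;
}

echo normalizeLink('//domain.com/contactpage', 'https', 'domain.com') . PHP_EOL;
echo normalizeLink('/contactpage#team', 'https', 'domain.com') . PHP_EOL;
```

Both calls print https://domain.com/contactpage. The check for // has to come before the check for /, because a protocol-independent link also starts with a slash.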

Determining if a url should be crawled

The crawler delegates the question of whether a url should be crawled to a dedicated class that implements the CrawlProfile-interface:

namespace Spatie\Crawler;

interface CrawlProfile
{
    /**
     * Determine if the given url should be crawled.
     *
     * @param \Spatie\Crawler\Url $url
     *
     * @return bool
     */
    public function shouldCrawl(Url $url);
}

The package provides an implementation that will crawl all urls. If you want to filter out some urls, there's no need to change the code of the crawler. Just create your own CrawlProfile-implementation.

Crawling a url

Guzzle makes fetching the html of a url very simple.
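As an illustration, a profile that only follows links on a single host could look roughly like this. To keep the sketch runnable on its own, Url and CrawlProfile below are simplified stand-ins without a namespace, not the package's classes; the stand-in Url only models the public host property that normalizeUrl() reads. The class name CrawlInternalUrls is made up for this example.

```php
<?php

// Stand-in for the package's Url class; only the host is modelled.
class Url
{
    public $host;

    public function __construct(string $url)
    {
        $this->host = parse_url($url, PHP_URL_HOST);
    }
}

// Copy of the interface shown above, without the namespace.
interface CrawlProfile
{
    public function shouldCrawl(Url $url);
}

// Hypothetical profile: only crawl urls that live on one given host.
class CrawlInternalUrls implements CrawlProfile
{
    private $host;

    public function __construct(string $host)
    {
        $this->host = $host;
    }

    public function shouldCrawl(Url $url)
    {
        return $url->host === $this->host;
    }
}

$profile = new CrawlInternalUrls('domain.com');

var_dump($profile->shouldCrawl(new Url('https://domain.com/contactpage'))); // bool(true)
var_dump($profile->shouldCrawl(new Url('https://example.org/somewhere')));  // bool(false)
```

Because the decision lives in its own class, swapping the filtering rule means passing a different object to the crawler rather than editing the crawl loop.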

$response = $this->client->request('GET', (string) $url);
$this->crawlAllLinks($response->getBody()->getContents());

There's a little bit more to it, but the code above is the essential part.

Observing the crawl process

Again, you shouldn't have to touch the code of the crawler itself to add behaviour to it. When instantiating the crawler, it expects you to pass it an implementation of CrawlObserver.

Looking at the interface should make things clear:

namespace Spatie\Crawler;

interface CrawlObserver
{
    /**
     * Called when the crawler will crawl the url.
     *
     * @param \Spatie\Crawler\Url $url
     */
    public function willCrawl(Url $url);

    /**
     * Called when the crawler has crawled the given url.
     *
     * @param \Spatie\Crawler\Url                      $url
     * @param \Psr\Http\Message\ResponseInterface|null $response
     */
    public function hasBeenCrawled(Url $url, $response);

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling();
}

The http-status-check tool uses an implementation of this interface to display all found links in the console.
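To give an idea of what such an observer can do, here is a minimal sketch that prints each crawled url and counts urls per status code. It is not the tool's actual observer. Url, CrawlObserver and FakeResponse are stand-ins so the example runs without the package or a PSR-7 library, and StatusCodeCounter is a hypothetical name. The only response method it relies on is getStatusCode(), which PSR-7's ResponseInterface provides.

```php
<?php

// Stand-in for the package's Url class; it only needs to cast to a string here.
class Url
{
    private $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }

    public function __toString(): string
    {
        return $this->url;
    }
}

// Copy of the interface shown above, without the namespace.
interface CrawlObserver
{
    public function willCrawl(Url $url);

    public function hasBeenCrawled(Url $url, $response);

    public function finishedCrawling();
}

// Tiny fake response exposing the one method this sketch uses.
class FakeResponse
{
    private $statusCode;

    public function __construct(int $statusCode)
    {
        $this->statusCode = $statusCode;
    }

    public function getStatusCode(): int
    {
        return $this->statusCode;
    }
}

// Hypothetical observer: prints every url and tallies urls per status code.
class StatusCodeCounter implements CrawlObserver
{
    public $counts = [];

    public function willCrawl(Url $url)
    {
        // Nothing to do before the request in this sketch.
    }

    public function hasBeenCrawled(Url $url, $response)
    {
        // The interface allows a null response; count it separately.
        $code = $response === null ? 'no response' : $response->getStatusCode();
        $this->counts[$code] = ($this->counts[$code] ?? 0) + 1;

        echo $code . ' - ' . $url . PHP_EOL;
    }

    public function finishedCrawling()
    {
        foreach ($this->counts as $code => $count) {
            echo $code . ': ' . $count . ' url(s)' . PHP_EOL;
        }
    }
}

// Drive the observer by hand, the way a crawler would call it.
$observer = new StatusCodeCounter();
$observer->hasBeenCrawled(new Url('https://domain.com/'), new FakeResponse(200));
$observer->hasBeenCrawled(new Url('https://domain.com/missing'), new FakeResponse(404));
$observer->hasBeenCrawled(new Url('https://domain.com/contactpage'), new FakeResponse(200));
$observer->hasBeenCrawled(new Url('https://unreachable.invalid/'), null);
$observer->finishedCrawling();
```

The crawler only ever talks to the interface, so an observer like this could just as well write to a log file or a database instead of the console.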

In closing

That concludes the little tour of the code. I hope you've seen that creating a crawler is not that difficult. If you want to know more, just read the code of the Crawler-class on GitHub.

My colleague Sebastian had a great idea: create a Laravel package that provides an Artisan command to check all links of a Laravel application. You might see that appear amongst our current Laravel packages soon.
