
A Laravel package to crawl and index content of your sites

Original – by Freek Van der Herten – 12 minute read

The newly released spatie/laravel-site-search package can crawl and index the content of one or more sites. You can think of it as a private Google search for your sites. Like most Spatie packages, it is highly customizable: you have total control over what content gets crawled and indexed.

To see the package in action, head over to the search page of this very blog.

In this post, I'd like to introduce the package to you and highlight some implementation and testing details. Let's dig in!

Are you a visual learner?

In this stream on YouTube, I demo the package and dive into its source code and tests.

Why did we create this package?

In our ecosystem, there are already several options for creating a search index. Let's compare them with our new package.

Laravel Scout is an excellent package to add search capabilities for Eloquent models. In most cases, this is very useful if you want to provide a structured search. For example, if you have a Product model, Scout can help build up a search index to search the properties of these products.

The main difference between Scout and laravel-site-search is that laravel-site-search is not tied to Eloquent models. Like Google, it will crawl your entire site and index all content that is there.

Another nice indexing option is Algolia Docsearch. It can add search capabilities to open-source documentation for free.

Our laravel-site-search package may be used to index non-open-source content as well. Where Docsearch makes basic assumptions about how the content is structured, our package makes a best effort to index all kinds of content.

You could also opt to use Meilisearch Doc Scraper, which you can use for non-open-source content. It's written in Python, so it's not that easy to integrate with a PHP app.

Our package is, of course, written in PHP and can be customized very easily; you can even add custom properties.

To summarise: our package can be used for all kinds of content, and it can be easily customized when installed in a Laravel app.

Crawling your first site

First, you must follow the installation instructions. This involves installing the package and installing Meilisearch. The docs even mention how you can install and run Meilisearch on a Forge-provisioned server.

After you've installed the package, you can run this command to define a site that needs to be indexed.

php artisan site-search:create-index

This command will ask for a name for your index and the URL of your site that should be crawled. Of course, you could run that command multiple times to create multiple indexes.

After that, you should run this command to start a queued job. You should probably schedule that command to run every couple of hours so that the index is kept in sync with your site's latest content.

php artisan site-search:crawl
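To keep the index in sync, that command can be scheduled like any other Artisan command. Here's a minimal sketch using Laravel's standard console kernel; the three-hour interval is just an example, not something the package prescribes:

```php
// app/Console/Kernel.php
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // Re-crawl and re-index every few hours so the search
    // index stays in sync with the site's latest content.
    $schedule->command('site-search:crawl')->everyThreeHours();
}
```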

The job started by that command will:

  • create a new Meilisearch index
  • crawl your website using multiple concurrent connections to improve performance
  • transform crawled content to something that can be put in the search index
  • mark that new Meilisearch index as the active one
  • delete the old Meilisearch index

Finally, you can use Search to perform a query on your index.

use Spatie\SiteSearch\Search;

$searchResults = Search::onIndex($indexName)
    ->query('your query')
    ->get();

This is how you could render the results in a Blade view:

<ul>
    @foreach($searchResults->hits as $hit)
        <li>
            <a href="{{ $hit->url }}">
                <div>{{ $hit->url }}</div>
                <div>{{ $hit->title() }}</div>
                <div>{!! $hit->highlightedSnippet() !!}</div>
            </a>
        </li>
    @endforeach
</ul>

That is basically how you can use the package. On the search page of this very blog, you can see the package in action. I've also open-sourced my blog, so on GitHub, you'll be able to see the Livewire component and Blade view that power the search page.

Customizing what gets crawled and indexed

In most cases, you don't want to index all content that is available on your site. A few examples of this are menu structures or list pages (e.g. a list with blog posts with links to the detail pages of those posts).

We've made it easy to ignore such content. In the config file, there's an ignore_content_on_urls option. Your homepage probably contains no unique content, but rather links to the pages where the full content lives.

You can ignore the content on the homepage by adding /. We'll still crawl the homepage, but we won't put any of its content in the index.

/*
 * When crawling your site, we will ignore content that is on these URLs.
 *
 * All links on these URLs will still be followed and crawled.
 *
 * You may use `*` as a wildcard.
 */
'ignore_content_on_urls' => [
    '/'
],

You can also ignore content based on CSS selectors. There's an option ignore_content_by_css_selector in the config file that lets you specify any CSS selector.

If your menu structure is in a nav element, you can add nav. You could also introduce a data attribute that you could slap on any content you don't want in your index.

So with this configuration:

/*
 * When indexing your site, we will not add any content to the search
 * index that is selected by these CSS selectors.
 *
 * All links inside such content will still be crawled, so it's safe
 * to add a selector for your menu structure.
 */
'ignore_content_by_css_selector' => [
    'nav',
    '[data-no-index]',
],

... this div won't get indexed:

<div>
   This will get indexed
</div>
<div data-no-index>
   This won't get indexed but <a href="/other-page">this link</a> will still be followed.
</div>

Using a search profile

For a lot of users, the above config options will be enough. If you want to control what gets indexed and crawled programmatically, you can use a search profile.

A search profile determines which pages get crawled and what content gets indexed. In the site-search config file, you'll find a default_profile key; out of the box, it is set to Spatie\SiteSearch\Profiles\DefaultSearchProfile::class.

This default profile will instruct the indexing process:

  • to crawl each page of your site
  • to only index pages whose response had a 200 status code
  • to not index a page if the response had a header site-search-do-not-index

By default, the crawling process will respect the robots.txt of your site.

If you want to customize the crawling and indexing behaviour, you could opt to extend Spatie\SiteSearch\Profiles\DefaultSearchProfile or create your own class that implements the Spatie\SiteSearch\Profiles\SearchProfile interface. This is what that interface looks like:

namespace Spatie\SiteSearch\Profiles;

use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\Crawler;
use Spatie\SiteSearch\Indexers\Indexer;

interface SearchProfile
{
    public function shouldCrawl(UriInterface $url, ResponseInterface $response): bool;
    public function shouldIndex(UriInterface $url, ResponseInterface $response): bool;
    public function useIndexer(UriInterface $url, ResponseInterface $response): ?Indexer;
    public function configureCrawler(Crawler $crawler): void;
}
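To make that concrete, here's what a custom profile could look like. This is a hedged sketch under my own assumptions, not part of the package: the class name and the /admin exclusion are made up for the example.

```php
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\SiteSearch\Profiles\DefaultSearchProfile;

class NoAdminPagesProfile extends DefaultSearchProfile
{
    public function shouldIndex(UriInterface $url, ResponseInterface $response): bool
    {
        // Keep admin pages out of the search index,
        // but fall back to the default rules for everything else.
        if (str_starts_with($url->getPath(), '/admin')) {
            return false;
        }

        return parent::shouldIndex($url, $response);
    }
}
```

Pointing the default_profile key in the config file at a class like this is how you'd activate it.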

Indexing extra properties

Only the page title, URL, description, and some content are added to the search index by default. However, you can add any extra property you want.

You do this by creating a custom indexer and overriding the extra method.

use Spatie\SiteSearch\Indexers\DefaultIndexer;

class YourIndexer extends DefaultIndexer
{
    public function extra(): array
    {
        return [
            'authorName' => $this->functionThatExtractsAuthorName(),
        ];
    }

    public function functionThatExtractsAuthorName()
    {
        // add logic here to extract the author name using
        // the `$response` property that's set on this class
    }
}
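As an illustration of what that extraction logic could do, here's a plain-PHP sketch under my own assumptions: it pulls the author name out of a `<meta name="author">` tag in an HTML string, which you could feed the body of the `$response` property. The function name and the meta-tag convention are mine, not the package's.

```php
// Extract the author name from an HTML document, if present.
function extractAuthorName(string $html): ?string
{
    $document = new DOMDocument();

    // Suppress warnings caused by imperfect real-world HTML.
    @$document->loadHTML($html);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//meta[@name="author"]/@content');

    return $nodes->length > 0 ? $nodes->item(0)->value : null;
}
```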

The extra properties will be available on a search result hit.

$searchResults = Search::onIndex('my-index')->query('your query')->get();

$firstHit = $searchResults->hits->first();

$firstHit->authorName; // returns the author name

Let's take a look at the tests

When writing tests, I usually prefer to write feature tests. They give me the highest confidence that everything is working correctly.

In the case of this package, a proper feature test encompasses crawling and indexing a site, then performing a query against the built-up search index and verifying that the results are correct.

In our test suite, we do precisely that. Let's first take a look at the test itself.

it('can crawl and index all pages', function () {
   Server::activateRoutes('chain');

   dispatch(new CrawlSiteJob($this->siteSearchConfig));

   waitForMeilisearch($this->siteSearchConfig);

   $searchResults = Search::onIndex($this->siteSearchConfig->name)
       ->query('here')
       ->get();

   expect(hitUrls($searchResults))->toEqual([
       'http://localhost:8181/',
       'http://localhost:8181/2',
       'http://localhost:8181/3',
   ]);
});

The site that we're going to crawl is not a real site. The crawl_url used in $this->siteSearchConfig is set to localhost:8181. This site is served by a Lumen application that is booted whenever the tests run.

The first line of our test is Server::activateRoutes('chain'). This will make our Lumen application load and use a certain routes file. In this case, we will let our Lumen app use the chain.php routes file. This is what that routes file looks like:

$router->get('/', fn () => view('chain/1'));
$router->get('2', fn () => view('chain/2'));
$router->get('3', fn () => view('chain/3'));

So basically, our Lumen app is now a mini-site that serves a couple of chained pages.

In the following lines of our test, we're dispatching the job that will crawl and index that site.


// in our test

dispatch(new CrawlSiteJob($this->siteSearchConfig));

waitForMeilisearch($this->siteSearchConfig);

That waitForMeilisearch also deserves a bit of explanation. When something is saved to a Meilisearch index, that bit of info won't be searchable immediately: Meilisearch needs a bit of time to process everything. Our tests need to wait for that; otherwise, they may randomly fail because our expectations would sometimes run before the indexing is complete.

Luckily, Meilisearch has an API that can determine whether all updates to an index are processed. Here's the implementation of waitForMeilisearch. We simply wait for Meilisearch's processing to be done.

function waitForMeilisearch(SiteSearchConfig $siteSearchConfig): void
{
    $indexName = $siteSearchConfig->refresh()->index_name;

    while (MeiliSearchDriver::make($siteSearchConfig)->isProcessing($indexName)) {
        sleep(1);
    }
}

After Meilisearch has done its work, we will perform a query against the Meilisearch index and expect certain URLs to be returned.

// in our test

$searchResults = Search::onIndex($this->siteSearchConfig->name)
    ->query('here')
    ->get();

expect(hitUrls($searchResults))->toEqual([
    'http://localhost:8181/',
    'http://localhost:8181/2',
    'http://localhost:8181/3',
]);

With that Lumen test server and the waitForMeilisearch function, we can test most functionality of the package. Here's the test that makes sure the ignore_content_on_urls option is working.

When crawling the same chain as above, but adding /2 to ignore_content_on_urls, we expect that only / and /3 end up in the index.

it('can be configured not to index certain urls', function () {
    Server::activateRoutes('chain');

    config()->set('site-search.ignore_content_on_urls', [
        '/2',
    ]);

    dispatch(new CrawlSiteJob($this->siteSearchConfig));

    waitForMeilisearch($this->siteSearchConfig);

    $searchResults = Search::onIndex($this->siteSearchConfig->name)
        ->query('here')
        ->get();

    expect(hitUrls($searchResults))->toEqual([
        'http://localhost:8181/',
        'http://localhost:8181/3',
    ]);
});

This kind of test gives me a lot of confidence that everything in the package is working correctly. If you want to see more tests, head over to the test suite on GitHub.

In closing

I hope that you liked this little tour of the package. There are a lot of options not mentioned in this blog post: you can create synonyms, add extra properties, and much more.

We spent a lot of time making every aspect of the crawling and indexing behaviour customizable. Discover all the options in our extensive docs.

This isn't the first package that our team has made. Our website has an open source section that lists every package we've created. I'm pretty sure that there's something there for your next project. Currently, all of our packages combined are being downloaded 10 million times a month.

Our team doesn't only create open source; we also build paid digital products, such as Ray, Mailcoach and Flare. We also create premium video courses, such as Laravel Beyond CRUD, Testing Laravel, Laravel Package Training and Event Sourcing in Laravel. If you want to support our open source efforts, consider picking up one of our paid products.

Stay up to date with all things Laravel, PHP, and JavaScript.

Follow me on Twitter. I regularly tweet out programming tips, and what I myself have learned in ongoing projects.

Every two weeks I send out a newsletter containing lots of interesting stuff for the modern PHP developer.

Expect quick tips & tricks, interesting tutorials, opinions and packages. Because I work with Laravel every day there is an emphasis on that framework.

Rest assured that I will only use your email address to send you the newsletter and will not use it for any other purposes.
