Search Engine Crawling

What is Search Engine Crawling?

Search engines crawl websites to find out what they contain so that they can provide relevant results in response to users’ queries. Crawlers move across the web looking for new and updated content, and when a crawler discovers something, it sends a list of URLs back to the search engine’s servers. These URLs point to the individual pages on your website.

How to Help Search Engines Crawl Your Website?

There are two main ways to tell Google how to crawl and index your pages. One option is the robots.txt file, a plain text file placed at the root of your domain that tells crawlers which parts of the site they may and may not crawl. The second is to use canonical and noindex tags on the pages themselves.
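
As an illustration, a minimal robots.txt file might look like the sketch below; the paths and the sitemap URL are placeholders, not recommendations for any particular site:

  User-agent: *
  Disallow: /admin/
  Disallow: /cart/
  Sitemap: https://www.yourwebsite.com/sitemap.xml

Each Disallow line tells compliant crawlers to skip the matching path, and the optional Sitemap line points them to your XML sitemap.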

Canonicalization is most useful on domains that have a large number of pages with similar content. eCommerce stores, for example, may have several product pages that display the variants of a single item. Most of the content on those pages is likely to be identical, causing duplicate content issues. By implementing canonicalization, you can avoid this and direct search engine bots to the canonical page – the one you want crawled, indexed, and appearing in Google search results.

In simple terms, a canonical tag tells crawlers to disregard the content of a URL because it is a variation of the canonical one. Keep in mind that search engines treat each URL that leads to a page on your website as a separate page (for example, https://www.yourwebsite.com and http://www.yourwebsite.com count as two different URLs).

Hence, the use of canonical tags isn’t limited to product and similar pages – you can (and should) implement them on every page.
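
As a rough sketch, a canonical tag is a single line placed in the <head> section of a page, pointing to the preferred URL (the address below is a placeholder):

  <link rel="canonical" href="https://www.yourwebsite.com/product-name/" />

On the canonical page itself, the tag simply points to the page’s own URL, which is why it can safely be added to every page.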

Site Navigation

Search engine crawlers reach each page on your website through links, so if your website structure is solid, they should be able to understand your site hierarchy. In addition, using internal links to connect different pages ensures that all of your content is visible to crawlers by giving them a path of links to follow.

Including external links to domains with high authority is also good practice, as long as you are linking to a relevant page that provides additional value to the reader, since it can help crawlers understand what your page is about.
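
For example, an internal link is an ordinary anchor element pointing to another page on the same domain; the path and anchor text below are placeholders:

  <a href="/blog/crawl-budget-guide/">How crawl budget works</a>

Descriptive anchor text like this also gives crawlers (and readers) some context about the destination page.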

Sitemaps and Why to Use Them

A sitemap is a list of all the pages on your website. It helps search engines find new content on your site more easily, especially content that isn’t linked from any other page. If you’re using WordPress, you can install a free plugin such as Yoast SEO, which will generate a sitemap automatically.

Using both an HTML and XML sitemap can help crawlers understand all the content on your pages and crawl them more efficiently.
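
As a simplified sketch, an XML sitemap is a list of <url> entries, each containing a page address and, optionally, the date it was last modified; the URLs and date below are placeholders:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.yourwebsite.com/</loc>
      <lastmod>2024-01-15</lastmod>
    </url>
    <url>
      <loc>https://www.yourwebsite.com/about/</loc>
    </url>
  </urlset>

The sitemap is usually placed at the root of the domain (for example, /sitemap.xml) and can also be referenced from robots.txt, as in the earlier example.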

What is the Difference Between Crawling and Indexing?

Crawling is the process in which bots visit the URLs on your site and check whether they contain anything new or changed compared with what Google has indexed before. This process takes place every day. Indexing means that the bots add the content of your pages to Google’s index so it can appear in Google search results. Indexing happens at a slower pace than crawling, but it also happens regularly.

The Crawl Budget

Your crawl budget is the approximate number of pages Google’s crawlers will crawl on your website in a given period – in other words, a crawl frequency. You can check how often Google is crawling your site in Google Search Console. Although this is rarely an issue for small sites, websites with thousands of pages may want to make sure Google doesn’t spend too much time crawling less important pages instead of discovering new content and updates on their top pages. This can be done by implementing canonical or noindex tags on pages that you don’t want Google to crawl or index.
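
For instance, a noindex directive is usually added as a robots meta tag in the <head> of the page you want kept out of the index:

  <meta name="robots" content="noindex" />

Note that Google still has to crawl a page at least once to see this tag, so robots.txt and noindex solve slightly different problems: one limits crawling, the other limits indexing.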
