Crawler Management Flashcards
Why does crawl budget matter to Google?
Two reasons:
- They don’t want to overwhelm your servers
- The web is huge. Crawling costs money. They want to conserve resources and only use them on the most important pages.
What is ‘crawl rate’?
The number of requests that Google’s crawlers make to your site each second
Bonus tip: In ‘SEO Mythbusting’ Martin Splitt described crawl rate as the amount of stress Google can put on your server without overwhelming it
What affects crawl demand?
- The popularity of the site
- The size of the site, also known as perceived URL inventory (though a large, unpopular site may not necessarily be crawled very often)
- The perceived freshness of a site - e.g. a Wiki page on the history of WWI may not change as frequently as a news site
- How many pages on a site have not been crawled (especially in the case of a migration)
What is crawl demand?
How often Google wants to crawl your site
Hypothetically, if a site ranked really well but was not crawled very often (low crawl demand), why could that be?
The site has quality content, but the content does not change very often. This could be due to the nature of the content (dictated by the niche the website is in).
What are some ways you can indicate to Google how frequently content needs to be crawled?
- The <lastmod> date in the sitemap
- dateModified in structured data
- The ETag header (used to indicate a specific version of a resource)
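A minimal sketch of the ETag approach (the handler, the hash-based ETag, and the page content are assumptions for illustration): the server derives an ETag from the current content and answers a conditional request with 304 Not Modified when nothing has changed, signalling to crawlers that the resource does not need re-fetching.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory page content; in practice this would come from your CMS.
PAGE = b"<html><body>History of WWI</body></html>"

class EtagHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive a strong ETag from the current content so it only changes
        # when the page actually changes.
        etag = '"%s"' % hashlib.sha256(PAGE).hexdigest()[:16]
        if self.headers.get("If-None-Match") == etag:
            # Content unchanged since the client last fetched it.
            self.send_response(304)
            self.send_header("ETag", etag)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("ETag", etag)
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), EtagHandler).serve_forever()
```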
Which types of sites should worry about crawl budget?
- Large sites (1m+ pages)
- Medium or large sites (10,000+ pages) with very rapidly changing content
- Sites with very ‘flaky’ server setups (but then your issue is not really your site, but your servers)
List two ways to control crawl budget for large sites with UGC
- Use robots.txt to block crawling of content that you know to be low quality
- Nofollow internal links (rel="nofollow") to pages you do not want crawled
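A sketch of the robots.txt approach, using Python's urllib.robotparser to check which URLs Googlebot may fetch; the rules and paths below are hypothetical examples of low-value UGC sections:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a UGC-heavy site: block crawl-wasting sections
# (faceted search, user profiles) while leaving the rest crawlable.
ROBOTS_TXT = """
User-agent: *
Disallow: /search
Disallow: /users/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

for url in ("https://example.com/search?q=shoes",
            "https://example.com/users/12345",
            "https://example.com/articles/crawl-budget"):
    print(url, "->", "crawlable" if parser.can_fetch("Googlebot", url) else "blocked")
```

Note that robots.txt stops crawling (saving budget) but does not by itself remove URLs that are already indexed.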
Does crawl budget affect Google’s rendering of web pages?
Yes, because requests for rendering-related resources are counted in your crawl budget
How do you reduce the Google crawl rate?
- Set it manually using Google Search Console's crawl rate settings
- Temporarily serve 500, 503, or 429 responses; Google reduces its crawl rate when it receives a lot of these codes. If you do this for more than 1-2 days, pages may be dropped from the index
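A minimal sketch of the temporary back-off response, assuming a plain http.server handler; the Retry-After value is illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class OverloadedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Temporarily tell crawlers to back off; Google slows its crawl rate
        # when it sees many 503/429 responses. Do not leave this in place for
        # more than a day or two.
        self.send_response(503)
        self.send_header("Retry-After", "3600")  # seconds; value is illustrative
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), OverloadedHandler).serve_forever()
```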
How can you verify that it is Googlebot crawling your website?
You can use a reverse DNS lookup, using the host command in terminal.
E.g. host 66.249.66.1
Returns
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
Then, you must use a forward DNS lookup to verify that the domain name resolves back to the IP address you performed the reverse lookup on. This guards against spoofed reverse DNS (PTR) records.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
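The same two-step check can be scripted; this sketch uses Python's socket module and accepts the googlebot.com / google.com hostnames, following the example above:

```python
import socket

def is_googlebot(ip: str) -> bool:
    """Two-step verification: reverse DNS, then forward DNS back to the IP."""
    try:
        # Step 1: reverse lookup the IP to get a hostname.
        hostname, _, _ = socket.gethostbyaddr(ip)
        # The hostname must belong to Google's crawler domains.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward lookup the hostname and confirm it maps back to the
        # original IP, which guards against spoofed PTR records.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

print(is_googlebot("66.249.66.1"))
```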
What is the crawl capacity limit?
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
What are some ways to maximise the efficiency with which Google crawls a site?
- Manage URL inventory: consolidate duplicate content & block crawling of URLs using robots.txt (Gaston: use nofollow)
- Return a 404 or 410 for permanently removed pages, and clean up soft 404s
- Keep sitemaps up to date (see the sitemap sketch after this list)
- Avoid long redirect chains
- Serve fast-loading pages
- Adjust crawl settings
- Increase your server capacity
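A minimal sketch of keeping a sitemap fresh by regenerating it with accurate <lastmod> dates; the URLs and dates are hypothetical and would in practice come from your CMS:

```python
import xml.etree.ElementTree as ET

# Hypothetical pages with their last-modified dates.
PAGES = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/articles/crawl-budget", "2024-05-14"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    # <lastmod> tells crawlers when the page last changed, so they can
    # prioritise recrawling fresh content.
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```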
How do you figure out whether Google is encountering crawl availability issues?
Use the Crawl Stats report
Broadly, what data is available in the Crawl Stats report?
- Overview: request number, bytes, response time
- Response codes
- File type
- Host status (robots.txt, DNS, server availability)
- Googlebot type
- Purpose of crawling (discovery, refresh)
What affects Google’s crawl rate?
Largely the responsiveness of your server, but also whether you limit it in Search Console
Why would you want to limit Google’s crawl rate?
Because Google is overloading your servers with requests
What is crawl budget?
The number of URLs that Google can (crawl rate) and wants (crawl demand) to crawl
What are some ways to increase crawl budget?
- Faster servers
- Ensure crawl budget is not limited in Search Console settings
- Create valuable content that Google wants to crawl and that is popular with users (increases crawl demand)