Crawler Management Flashcards
Why does crawl budget matter to Google?
Two reasons:
- They don’t want to overwhelm your servers
- The web is huge. Crawling costs money. They want to conserve resources and only use them on the most important pages.
What is ‘crawl rate’?
The number of requests that Google’s crawlers make to your site each second
Bonus tip: In ‘SEO Mythbusting’ Martin Splitt described crawl rate as the amount of stress Google can put on your server without overwhelming it
What affects crawl demand?
- The popularity of the site
- The size of the site, also known as perceived URL inventory (though a large, unpopular site may not necessarily be crawled very often)
- The perceived freshness of a site - e.g. a Wiki page on the history of WWI may not change as frequently as a news site
- How many pages on a site have not been crawled (especially in the case of a migration)
What is crawl demand?
How often Google wants to crawl your site
Hypothetically, if a site ranked really well but was not crawled very often (low crawl demand), why could that be?
The site has quality content, but the content does not change very often. This could be due to the nature of the content (dictated by the niche the website is in).
What are some ways you can indicate to Google how frequently content needs to be crawled?
- The <lastmod> date in the sitemap
- dateModified in structured data
- The ETag header (used to indicate a specific version of a resource)
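A minimal sketch of the ETag approach (the handler, the hash-based ETag, and the page content are assumptions for illustration): the server derives an ETag from the current content and answers a conditional request with 304 Not Modified when nothing has changed, signalling to crawlers that the resource does not need re-fetching.

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory page content; in practice this would come from your CMS.
PAGE = b"<html><body>History of WWI</body></html>"

class EtagHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive a strong ETag from the current content so it only changes
        # when the page actually changes.
        etag = '"%s"' % hashlib.sha256(PAGE).hexdigest()[:16]
        if self.headers.get("If-None-Match") == etag:
            # Content unchanged since the client last fetched it.
            self.send_response(304)
            self.send_header("ETag", etag)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("ETag", etag)
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), EtagHandler).serve_forever()
```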
Which types of sites should worry about crawl budget?
- Large sites (1m+ pages)
- Medium or large sites (10,000+ pages) with very rapidly changing content
- Sites with very ‘flaky’ server setups (but then your issue is not really your site, but your servers)
List two ways to control crawl budget for large sites with UGC
- Use robots.txt to block crawling of content that you know to be low quality
- Nofollow internal links (rel="nofollow") to pages you do not want crawled
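A sketch of the robots.txt approach, using Python's urllib.robotparser to check which URLs Googlebot may fetch; the rules and paths below are hypothetical examples of low-value UGC sections:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a UGC-heavy site: block crawl-wasting sections
# (faceted search, user profiles) while leaving the rest crawlable.
ROBOTS_TXT = """
User-agent: *
Disallow: /search
Disallow: /users/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

for url in ("https://example.com/search?q=shoes",
            "https://example.com/users/12345",
            "https://example.com/articles/crawl-budget"):
    print(url, "->", "crawlable" if parser.can_fetch("Googlebot", url) else "blocked")
```

Note that robots.txt stops crawling (saving budget) but does not by itself remove URLs that are already indexed.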
Does crawl budget affect Google’s rendering of web pages?
Yes, because requests for rendering-related resources are counted in your crawl budget
How do you reduce the Google crawl rate?
- Set it manually using Google Search Console's crawl rate settings
- Temporarily serve 500, 503, or 429 responses; Google reduces its crawl rate when it receives a lot of these codes. If you do this for more than 1-2 days, pages may be dropped from the index
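A minimal sketch of the temporary back-off response, assuming a plain http.server handler; the Retry-After value is illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class OverloadedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Temporarily tell crawlers to back off; Google slows its crawl rate
        # when it sees many 503/429 responses. Do not leave this in place for
        # more than a day or two.
        self.send_response(503)
        self.send_header("Retry-After", "3600")  # seconds; value is illustrative
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), OverloadedHandler).serve_forever()
```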
How can you verify that it is Googlebot crawling your website?
You can use a reverse DNS lookup, using the host command in terminal.
E.g. host 66.249.66.1
Returns
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
Then, you must use a forward DNS lookup to verify that the domain name resolves back to the IP address you performed the reverse lookup on. This guards against spoofed reverse DNS (PTR) records.
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
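The same two-step check can be scripted; this sketch uses Python's socket module and accepts the googlebot.com / google.com hostnames, following the example above:

```python
import socket

def is_googlebot(ip: str) -> bool:
    """Two-step verification: reverse DNS, then forward DNS back to the IP."""
    try:
        # Step 1: reverse lookup the IP to get a hostname.
        hostname, _, _ = socket.gethostbyaddr(ip)
        # The hostname must belong to Google's crawler domains.
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Step 2: forward lookup the hostname and confirm it maps back to the
        # original IP, which guards against spoofed PTR records.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False

print(is_googlebot("66.249.66.1"))
```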
What is the crawl capacity limit?
Googlebot wants to crawl your site without overwhelming your servers. To prevent this, Googlebot calculates a crawl capacity limit, which is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a site, as well as the time delay between fetches. This is calculated to provide coverage of all your important content without overloading your servers.
What are some ways to maximise the efficiency with which Google crawls a site?
- Manage URL inventory: consolidate duplicate content & block crawling of URLs using robots.txt (Gaston: use nofollow)
- Return a 404 or 410 for permanently removed pages, and clean up soft 404s
- Keep sitemaps up to date (see the sitemap sketch after this list)
- Avoid long redirect chains
- Serve fast-loading pages
- Adjust crawl settings
- Increase your server capacity
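A minimal sketch of keeping a sitemap fresh by regenerating it with accurate <lastmod> dates; the URLs and dates are hypothetical and would in practice come from your CMS:

```python
import xml.etree.ElementTree as ET

# Hypothetical pages with their last-modified dates.
PAGES = [
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/articles/crawl-budget", "2024-05-14"),
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for loc, lastmod in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = loc
    # <lastmod> tells crawlers when the page last changed, so they can
    # prioritise recrawling fresh content.
    ET.SubElement(url, "lastmod").text = lastmod

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```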
How do you figure out whether Google is encountering crawl availability issues?
Use the Crawl Stats report
Broadly, what data is available in the Crawl Stats report?
- Overview: request number, bytes, response time
- Response codes
- File type
- Host status (robots.txt, DNS, server availability)
- Googlebot type
- Purpose of crawling (discovery, refresh)
What affects Google’s crawl rate?
Largely the responsiveness of your server, but also whether you limit it in Search Console
Why would you want to limit Google’s crawl rate?
Because Google is overloading your servers with requests
What is crawl budget?
The number of URLs that Google can (crawl rate) and wants (crawl demand) to crawl
What are some ways to increase crawl budget?
- Faster servers
- Ensure crawl budget is not limited in Search Console settings
- Create valuable content that Google wants to crawl and that is popular with users (increases crawl demand)