How Search Engines Work: Crawling, Indexing and Ranking Flashcards

1
Q

What are the three primary functions of search engines?

A

Crawl - Scour the internet for content, looking over the code/content for each URL they find
Index - Store and organise the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries
Rank - Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered by most relevant to least relevant

2
Q

What is search engine crawling?

A

Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content.

Content can vary in format - it could be a webpage, an image, a video, a PDF, etc. - but regardless of format, content is discovered by links.

Googlebot starts by fetching a few web pages and then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to Google's index, called Caffeine - a massive database of discovered URLs - to be retrieved later when a searcher is seeking information that the content at that URL is a good match for.

3
Q

What is a search engine index?

A

Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.

4
Q

What is search engine ranking?

A

When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hope of solving the searcher's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is for the query.

5
Q

Can you tell crawlers not to index pages or parts of your site?

A

Yes, and there are sometimes good reasons for doing this, but if you want pages to be visible to searchers, you have to make sure that your site is accessible to crawlers so that it can be indexed.

6
Q

Is there a way to see how many of your site’s pages are being indexed?

A

Yes, you can use the advanced search operator site:yourdomain.com in the Google search bar. This will return the pages Google has in its index for the specified site.

For more accurate results, you can use the Index Coverage report in Google Search Console.

7
Q

What are some possible reasons why you're not showing up anywhere in the search results pages?

A
  • Your site is brand new and hasn’t been crawled yet
  • Your site isn’t linked to from any external websites
  • Your site’s navigation makes it hard for a robot to crawl it effectively
  • Your site contains some basic code called crawler directives that is blocking search engines
  • Your site has been penalised by Google for spammy tactics
8
Q

Tell Google how to crawl your site

A

Most people think about making sure that Googlebot finds a site’s most important pages, but there are some pages you don’t want it to find. Examples include:

  • URLs that have thin content
  • Duplicate URLs (such as sort-and-filter parameters for eCommerce)
  • Special promo code pages
  • Staging or test pages

To direct Googlebot away from such pages, use robots.txt.

9
Q

What are robots.txt files?

A

Robots.txt files are located in the root directory of websites (e.g. yourdomain.com/robots.txt) and suggest which parts of your site search engines should and shouldn't crawl, as well as the speed at which they crawl your site, via specific robots.txt directives.
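
For example, a minimal robots.txt sketch (the paths below are hypothetical; note that Googlebot ignores the Crawl-delay directive, though some other crawlers honour it):

  User-agent: *
  Disallow: /staging/
  Disallow: /promo-codes/
  Crawl-delay: 10
  Sitemap: https://yourdomain.com/sitemap.xml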

10
Q

How does Googlebot treat robots.txt files?

A
  • If Googlebot can’t find a robots.txt file for a site, it proceeds to crawl the site
  • If Googlebot finds a robots.txt file for a site, it will usually abide by the suggestions and proceed to crawl the site
  • If Googlebot encounters an error while trying to access a site’s robots.txt file and can’t determine if one exists or not, it won’t crawl the site
11
Q

What is crawl budget?

A

Crawl budget is the average number of URLs Googlebot will crawl on your site before leaving. Crawl budget optimisation ensures that Googlebot isn't wasting time crawling your unimportant pages at the risk of ignoring your important ones. Crawl budget matters most on very large sites with tens of thousands of URLs, but it's never a bad idea to block crawlers from accessing content you don't care about.

Just make sure not to block crawlers’ access to pages you’ve added other directives on, such as canonical or noindex tags. If Googlebot is blocked from a page, it won’t be able to see instructions on that page.

12
Q

Not all web robots follow robots.txt

A

People with bad intentions (e.g. email address scrapers) build bots that don't follow this protocol. In fact, some bad actors use robots.txt files to find where you've located your private content. Although it might seem logical to block crawlers from private pages such as login and administration pages so that they don't show up in the index, placing the locations of those URLs in a publicly accessible robots.txt file also means that people with malicious intent can find them more easily. It's better to noindex these pages and gate them behind a login form rather than place them in your robots.txt file.

Learn more about robots.txt files here: https://moz.com/learn/seo/robotstxt#index

13
Q

Defining URL Parameters in GSC

A

Some sites (most commonly e-commerce sites) make the same content available on multiple different URLs by appending certain parameters to URLs. If you've ever shopped online, you've likely narrowed down your search using filters. For example, you may search for "shoes" on Amazon, and then refine your search by size, colour, and style. Each time you refine, the URL changes slightly. Google tends to do a pretty good job of figuring out the representative URL on its own, but you can use the URL Parameters feature in Google Search Console to tell Google exactly how you want it to treat your pages.

You can use this feature to tell Googlebot “crawl no URLs with __ parameter”. You’d be telling Googlebot to remove those pages from SERPs, which is what you’d want if those parameters are creating duplicate pages.

14
Q

Making sure that your important pages are being crawled

A

Sometimes crawlers will be able to find important pages on your site, but other times certain pages and sections might be obscured for one reason or another.

Ask yourself, can the bot crawl through your website and not just to it?

Is your content hidden behind login forms? If you require users to log in, fill out forms, or answer surveys before accessing content, search engines won't be able to see those protected pages. A crawler definitely won't log in.

Are you relying on search forms? Robots cannot use search forms. Some people believe that if they put a search box on their site, robots will be able to find everything their users search for, but that isn't the case.

Is text hidden within non-text content? Non-text media (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognising images, there's no guarantee they will be able to read and understand text inside them just yet. It's always best to add text within the markup of your page.

Can search engines follow your site navigation? Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your own site guiding it from page to page. If you’ve got a page you want search engines to find but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.

15
Q

What are some common navigation mistakes that can keep crawlers from seeing all of your site?

A
  • Having a mobile navigation that shows different results than your desktop navigation
  • Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations. Google has gotten much better at crawling and understanding JavaScript, but it's still not perfect. The most surefire way of making sure that something gets found by Google is to put it in the HTML
  • Personalisation, or showing unique navigation to a specific type of visitor versus others, could appear to be cloaking to a search engine crawler
  • Forgetting to link to a primary page on your website through your navigation - remember, links are the paths crawlers follow to new pages. This is why it’s essential that your site has a clear navigation and helpful URL folder structures.
16
Q

Do you have clean information architecture?

A

Information architecture is the practice of organising and labelling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning that users shouldn’t have to think very hard to navigate through your website.

17
Q

Are you utilising site maps?

A

A sitemap is just what it sounds like: a list of URLs on your site that crawlers can use to discover and index your content. One of the easiest ways to ensure Google is finding your highest-priority pages is to create a sitemap file that meets Google's standards and submit it through Google Search Console. While submitting a sitemap doesn't replace the need for good site navigation, it can certainly help crawlers follow a path to all of your important pages.

  • Ensure that you've only included URLs that you want indexed by search engines, and be sure to give crawlers consistent directions. For example, don't include a URL in your sitemap if you've blocked it via robots.txt, and don't include URLs that are duplicates rather than the preferred, canonical version
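
A minimal XML sitemap sketch following the sitemaps.org protocol (the URL and date below are hypothetical):

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://yourdomain.com/puppies/</loc>
      <lastmod>2020-01-01</lastmod>
    </url>
  </urlset>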
18
Q

Can sitemaps be beneficial if you don’t have any other sites linking to your site?

A

If your site doesn't have any other sites linking to it, you still might be able to get it indexed by submitting your XML sitemap in Google Search Console. There's no guarantee they'll include a submitted URL in their index, but it's worth a try.

19
Q

Are crawlers getting errors when they try to access your URLs?

A

In the process of crawling the URLs on your site, a crawler may encounter errors. You can go to Google Search Console's Crawl Errors report to detect URLs on which this might be happening - this report will show you server errors and not-found errors. Server log files can also show you this, as well as a treasure trove of other information such as crawl frequency.

Accessing and dissecting server log files is an advanced tactic that you can learn more about here: https://moz.com/blog/log-file-analysis

20
Q

What are 4xx Codes?

A

When search engines can’t access your content due to a client error.

4xx errors are client errors, meaning that the requested URL contains bad syntax or cannot be fulfilled. One of the most common 4xx errors is the 404 error. This might occur because of a URL typo, a deleted page, or a broken redirect, just to name a few examples. When search engines hit a 404, they can't access the URL. When users hit a 404, they can get frustrated and leave.

21
Q

What are 5xx errors?

A

When search engine crawlers can’t access your content due to a server error.

5xx errors are server errors, meaning the server the webpage is located on failed to fulfil the searcher or search engine’s request to access the page. In Google Search Console’s Crawl Error report, there is a tab dedicated to these errors. These typically happen because the request for the URL timed out, so Googlebot abandoned the request.

You can review Google’s Documentation to learn more about fixing server connectivity issues: https://support.google.com/webmasters/answer/35120?hl=en&visit_id=636743634369653164-3372171668&rd=1

22
Q

Is there a way to tell both searchers and search engines that your page has moved?

A

Yes, the 301 (permanent) redirect

Say you move a page from example.com/young-dogs/ to example.com/puppies/. Search engines and users need a bridge to cross from the old URL to the new. That bridge is a 301 redirect.

  • Without a 301, the authority from the previous URL is not passed on to the new version of the URL
  • Helps with indexing as it helps Google find the new version of the page
  • The presence of 404 errors on your site alone doesn't harm search performance, but letting ranking/trafficked pages 404 can result in them falling out of the index, with rankings and traffic going with them - yikes!
  • If a page is ranking for a piece of content, its ranking might drop if you move it to a page with different content because the query that made it rank for a particular search isn’t there anymore
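
A minimal sketch of that redirect, assuming an Apache server where rules are added to an .htaccess file:

  # Permanently redirect the old URL to its new home
  Redirect 301 /young-dogs/ /puppies/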
23
Q

What are 302 redirects?

A

You also have the choice of redirecting a page using a 302 redirect, but this should be reserved for pages that only require a temporary move, and for cases where passing link equity (ranking power) isn't as big of a concern. 302s are like a road detour: you're siphoning traffic through a certain route, but it won't be that way forever.
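
As with the 301 sketch above, a minimal Apache example (the paths here are hypothetical):

  # Temporary detour while the seasonal page is in use
  Redirect 302 /spring-sale/ /sale/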

24
Q

Watch out for redirect chains

A

Redirect chains occur when Googlebot has to go through multiple redirects to reach a page. Google recommends limiting these as much as possible. If you have a redirect chain of URL 1 > URL 2 > URL 3, it's best to cut out the middleman and redirect URL 1 straight to URL 3.
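
A minimal .htaccess sketch of collapsing a chain (the paths are hypothetical) - each old URL points straight at the final destination instead of hopping through an intermediate redirect:

  # Before: /old-page/ -> /newer-page/ -> /newest-page/
  # After: both old URLs go directly to the final page
  Redirect 301 /old-page/ /newest-page/
  Redirect 301 /newer-page/ /newest-page/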

25
Q

Can I see how Googlebot crawler sees my pages?

A

Yes, the cached version of your page will reflect a snapshot of the last time Googlebot crawled it. To view the cached version of a page, click the arrow next to the result in the SERPs and then click "Cached".
If in doubt, search “how to see cached version of site” in Google

Google crawls and caches web pages at different frequencies. Better-known sites will be crawled more frequently than less famous ones.

You can also check the text-only version of the cache to confirm that your important content is being crawled and cached effectively.

26
Q

Are pages ever removed from the index?

A

Yes, see possible reasons below:

  • The URL is returning a “not found” error (4XX) or server error (5XX) – This could be accidental (the page was moved and a 301 redirect was not set up) or intentional (the page was deleted and 404ed in order to get it removed from the index)
  • The URL had a noindex meta tag added – This tag can be added by site owners to instruct the search engine to omit the page from its index.
  • The URL has been manually penalized for violating the search engine’s Webmaster Guidelines and, as a result, was removed from the index.
  • The URL has been blocked from crawling with the addition of a password required before visitors can access the page.

You can get more detail on why a URL may no longer be in the index with the URL Inspection tool: https://support.google.com/webmasters/answer/9012289 or you can even use the Fetch as Google tool to try to get individual URLs submitted to the index: https://support.google.com/webmasters/answer/6066468?hl=en

GSC's fetch tool also has a render option that can help you find out how Google is interpreting your page and why it might not be getting indexed.

27
Q

How to tell search engines how you’d like them to index your site

A
  • Meta directives or meta tags are instructions you can give to search engines regarding how you want your web page to be treated. You can tell search engine crawlers things like "do not index this page in search results" or "don't pass any link equity to any on-page links." These instructions are executed via robots meta tags in the <head> of your HTML pages (most commonly used) or via the X-Robots-Tag in the HTTP header.
28
Q

Robots Meta Tag

A

The robots meta tag can be used within the <head> of the HTML of your webpage. It can exclude all or specific search engines. The following are the most common meta directives, along with the situations you might apply them in.

  • Index/noindex tells search engines whether the page should be crawled and kept in a search engine's index for retrieval. If you opt to use "noindex", you're communicating to crawlers that you want the page excluded from search results. By default, search engines assume they can index all pages, so using the "index" value is unnecessary
    You might opt to mark a page as noindex if you’re trying to trim thin pages from Google’s index of your site (ex: user generated profile pages) but you still want them accessible to visitors.

Follow/nofollow tells search engines whether links on the page should be followed or nofollowed. “Follow” results in bots following the links on your page and passing link equity through to those URLs. Or, if you elect to employ “nofollow” the search engines will not follow or pass any link equity through to the links on the page. By default, all pages are assumed to have the “follow” attribute.

When you might use: nofollow is often used together with noindex when you’re trying to prevent a page from being indexed as well as prevent the crawler from following links on page.

Noarchive is used to restrict search engines from saving a cached copy of each page. By default, the engines will maintain visible copies of all pages they have indexed, accessible to searchers through the cached link in the search results.

When you might use: if you run an e-commerce site and your prices change regularly. You might consider the noarchive tag to prevent searchers from seeing outdated pricing.
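
A minimal sketch of how these directives look in practice, placed in the <head> of a page (the combination shown here is just an example):

  <!-- Keep this page out of the index and don't follow or pass equity through its links -->
  <meta name="robots" content="noindex, nofollow">

  <!-- Prevent search engines from storing a cached copy of this page -->
  <meta name="robots" content="noarchive">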

29
Q

X-Robots-Tag

A

The X-Robots-Tag is used within the HTTP header response of your URL, providing more flexibility and functionality than meta tags if you want to block search engines at scale, because you can use regular expressions, block non-HTML files, and apply sitewide noindex tags.
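
For instance, a minimal sketch for an Apache server (assuming the mod_headers module is available; the file pattern is just an example) that keeps all PDF files out of the index:

  <FilesMatch "\.pdf$">
    Header set X-Robots-Tag "noindex, nofollow"
  </FilesMatch>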

For more info on robot meta tags: https://developers.google.com/search/reference/robots_meta_tag

30
Q

Wordpress tip for blocking search engine visibility

A

In Dashboard > Settings > Reading, make sure the "Search Engine Visibility" box is not checked - when checked, it blocks search engines from coming to your site via your robots.txt file!

31
Q

Ranking: How do search engines rank URLs?

A

How do search engines ensure that when someone types a query into the search bar, they get relevant results in return? That process is known as ranking, or the ordering of search results by most relevant to least relevant to a particular query.

To determine relevance, search engines use algorithms. There have been many algorithm changes over the years. Google, for example, makes algorithm changes almost every day - some of these updates are minor quality tweaks, whereas others are core/broad algorithm updates deployed to tackle a specific issue, like Penguin to tackle link spam.

Check out the following link for a history of algorithm changes: https://moz.com/google-algorithm-change

32
Q

What can you do if your site suffered following an algorithm change?

A

You can check the guidelines after an algorithm change to see why Google might have penalised your site:

  • https://support.google.com/webmasters/topic/6001971?hl=en&ref_topic=6001981
  • https://static.googleusercontent.com/media/www.google.com/en//insidesearch/howsearchworks/assets/searchqualityevaluatorguidelines.pdf
33
Q

Why does SEO appear different than in previous years?

A

Previously, it was easier to use tactics to rank higher through SEO without getting penalised for it. But as Google's algorithms are continually refined to improve the user experience, this has become much more difficult to do. Google are trying to encourage an authentic, useful experience for their users.

34
Q

Keyword stuffing example

A

Welcome to funny jokes! We tell the funniest jokes in the world. Funny jokes are fun and crazy. Your funny joke awaits. Sit back and read funny jokes because funny jokes can make you happy and funnier. Some funny favorite funny jokes.

35
Q

The role links play in SEO

A

Backlinks or inbound links - links from other sites to yours
Internal links - links you use to connect your own content

Links work like a real-life word-of-mouth scenario, where a person or business is recommended to another because of their service or product.

The importance of links is why PageRank was created. PageRank (part of Google's core algorithm) is a link analysis algorithm named after one of Google's founders, Larry Page. PageRank estimates the importance of a web page by measuring the quality and quantity of links pointing to it. The assumption is that the more relevant, important, and trustworthy a web page is, the more links it will have earned. The more natural backlinks you have from high-authority (trusted) websites, the better your odds are of ranking higher within search results.
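
For reference, the simplified PageRank formula from the original Page and Brin paper, where d is a damping factor (typically around 0.85), T1...Tn are the pages linking to page A, and C(T) is the number of outbound links on page T - Google's modern use of PageRank is far more complex than this:

  PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )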

36
Q

The role content plays in SEO

A

A big part of how Google decides if your content should rank highly is how well your content matches a searcher’s query and the query’s intent

37
Q

What is RankBrain?

A

RankBrain is the machine learning component of Google's core algorithm. If RankBrain notices that a lower-ranking URL is providing a better result to users than the higher-ranking URLs, you can bet that RankBrain will adjust those results, moving the more relevant results higher and demoting the less relevant pages as a byproduct.

We don't know exactly what comprises RankBrain; apparently, the folks at Google don't fully know either: https://www.seroundtable.com/google-dont-understand-rankbrain-21744.html

38
Q

Engagement metrics: Correlation, Causation or both?

A

With Google rankings, engagement metrics are most likely part correlation and part causation.

When we say engagement metrics, we mean data that represents how searchers interact with your site from search results. This includes things like:

  • Clicks (visits from search)
  • Time on page (amount of time the visitor spent on a page before leaving it)
  • Bounce rate (the percentage of all website sessions where users viewed only one page)
  • Pogo-sticking (clicking on an organic result and then quickly returning to the SERP to choose another result)

Many tests, including Moz's own ranking factor survey, have indicated that engagement metrics correlate with higher ranking, but causation has been hotly debated. Are good engagement metrics just indicative of highly ranked sites? Or are sites ranked highly because they possess good engagement metrics?

Google have said that if they notice that one organic search result gets more clicks than another, say 20% compared to 10%, they will adjust the ranking accordingly.

This doesn’t necessarily mean that clicks are a ranking signal though, so be careful with that distinction.

39
Q

What have tests shown us in terms of the engagement/rank relationship?

A
  • One test had 200 people click on a URL, and interestingly it went from position #7 to #1 - but this change seemed to be based on geo-location as well, with rank spiking in the US only.
  • Another test found that RankBrain seems to demote pages that people don't spend as much time on.
  • Another test found that user behaviour impacted map pack and local search results as well.

It's safe to say that engagement does affect the ranking of your URLs, but it seems to act like a fact-checker: objective ranking factors like backlinks are calculated first, and then engagement may be taken into account.

40
Q

What are some examples of search features?

A
  • Paid advertisements
  • Featured snippets
  • People Also Ask boxes
  • Local (map) pack
  • Knowledge panel
  • Sitelinks

Google add features all the time, and have even experimented with a "zero-results SERP" - a phenomenon where only one result from the Knowledge Graph was displayed on the SERP, with no results below it except for an option to view more results.

41
Q

Notice how different types of SERP features match the different types of query intents

A

Informational - Featured Snippet
Informational with one answer - Knowledge Graph / Instant Answer
Local - Map Pack
Transactional - Shopping

42
Q

Localised Search

A

If you are performing SEO for a local business, like a dentist, make sure that you claim, verify, and optimise a free Google My Business listing.

When it comes to localised search results, Google uses three main factors to determine ranking: Relevance, Distance and Prominence

43
Q

The three main factors Google uses to determine ranking for localised SERPs

A

Relevance - How well a local business matches what the user is looking for. To ensure that the business is doing everything it can to be relevant to searchers, make sure the business's information is fully and accurately filled out
Distance - Google use your geo-location to better serve you local results. Local search results are extremely sensitive to proximity, which refers to the location of the searcher and/or the location specified in the query (if the searcher specified one)

Organic search results are also sensitive to the user's location, though seldom as pronounced as in local pack results.

Prominence - With prominence as a factor, Google is looking to reward businesses that are well-known in the real world. In addition to a business’ offline prominence, Google also looks to some online factors to determine local ranking, such as:

Reviews: The number of reviews a local business receives, and the sentiment of those reviews, have a notable impact on their ability to rank in local results

Citations: A business citation or business listing is a web-based reference to a local business's "NAP" (name, address, phone number) on a localised platform (Yelp, Acxiom, YP, Infogroup, Localeze, etc.)

Local rankings are influenced by the number and consistency of local business citations. Google pulls data from a wide variety of sources in continuously building up its local business index. When Google finds multiple consistent references to a business's name, location, and phone number, it strengthens Google's trust in the validity of that data. This then leads to Google being able to show the business with a higher degree of confidence. Google also uses other information from the web, such as links and articles.

44
Q

How does SEO work for local businesses?

A

Best practices for general SEO also apply to local SEO

Although not listed by Google as a local ranking factor, the role of engagement is only going to increase as time goes on. Google is continuing to enrich its results by incorporating real-world data into analytics - think popular shop visit times and average length of visits.

Google are even offering people the option to ask businesses questions.