How Search Engines Work: Crawling, Indexing and Ranking Flashcards
What are the three primary functions of search engines?
Crawl - Scour the internet for content, looking over the code/content for each URL they find
Index - Store and organise the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries
Rank - Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered by most relevant to least relevant
What is search engine crawling?
Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content.
Content can vary - it could be a webpage, an image, a video, a PDF, etc. - but regardless of format, content is discovered by links.
Googlebot starts by fetching a few web pages and then follows the links on those pages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to Google's index, called Caffeine - a massive database of discovered URLs - to later be retrieved when a searcher is seeking information that the content on that URL is a good match for.
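As a rough illustration of this discovery process, here is a minimal sketch of a link-following crawler in Python. It is a toy under stated assumptions - the seed URL, the page limit and the simple queue are illustrative choices, and real crawlers like Googlebot are vastly more sophisticated:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    """Fetch a handful of pages, follow their links, and record every URL discovered."""
    discovered = {seed_url}              # stands in for the index of known URLs
    queue = deque([seed_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except (OSError, ValueError):
            continue                     # skip pages that can't be fetched
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)        # resolve relative links
            if absolute.startswith("http") and absolute not in discovered:
                discovered.add(absolute)
                queue.append(absolute)
    return discovered

# Example: crawl("https://example.com/") returns the set of URLs found by hopping along links.
```

The key idea matches the description above: every fetched page yields more links, and every new link becomes another candidate for the index.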
What is a search engine index?
Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.
What is search engine ranking?
When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hopes of solving the searcher's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is for the query.
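Real ranking involves a huge number of signals, but a toy sketch of "order results by relevance" might look like the Python below. Scoring by simple keyword overlap is a simplifying assumption for illustration, not Google's algorithm:

```python
def score(query, page_text):
    """Toy relevance score: how many of the query's words appear in the page."""
    words = set(query.lower().split())
    return sum(word in page_text.lower() for word in words)

def rank(query, pages):
    """Order pages from most to least relevant for the query."""
    return sorted(pages, key=lambda page: score(query, page), reverse=True)

results = rank("buy running shoes",
               ["Our running shoes are built for comfort",
                "A history of the marathon",
                "Buy lightweight running shoes online"])
# results[0] is the page that mentions the most query terms.
```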
Can you tell crawlers not to index pages or parts of your site?
Yes, and there may sometimes be good reasons for doing this, but if you want pages to be visible to searchers, you have to make sure that your site is visible to crawlers in order to be indexable.
Is there a way to see how many of your site’s pages are being indexed?
Yes, you can use the advanced search operator site:yourdomain.com in the Google search bar. This will return the pages Google has in its index for the site specified.
For more accurate results, you can use the Index Coverage report in Google Search Console.
What are some possible reasons why you're not showing up anywhere in the search results pages?
- Your site is brand new and hasn’t been crawled yet
- Your site isn’t linked to from any external websites
- Your site’s navigation makes it hard for a robot to crawl it effectively
- Your site contains some basic code called crawler directives that is blocking search engines
- Your site has been penalised by Google for spammy tactics
Tell Google how to crawl your site
Most people think about making sure that Googlebot finds a site’s most important pages, but there are some pages you don’t want it to find. Examples include:
- URLs that have thin content
- duplicate URLs (such as sort-and-filter parameters for eCommerce)
- Special promo code pages
- Staging or test pages
To direct Googlebot away from such pages, use robots.txt.
What are robots.txt files?
Robots.txt files are located in the root directory of websites (e.g. yourdomain.com/robots.txt) and suggest which parts of your site search engines should and shouldn't crawl, as well as the speed at which they crawl your site, via specific robots.txt directives.
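For illustration, here is how a well-behaved crawler might read a hypothetical robots.txt file, sketched in Python with the standard library's robotparser. The rules shown are made-up examples for a placeholder domain, not recommendations for any real site:

```python
from urllib import robotparser

# A hypothetical robots.txt, as it might appear at yourdomain.com/robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
Disallow: /promo-codes/
Crawl-delay: 10

Sitemap: https://www.yourdomain.com/sitemap.xml
"""

rules = robotparser.RobotFileParser()
rules.parse(SAMPLE_ROBOTS_TXT.splitlines())

# A polite crawler checks the rules before requesting each URL.
for url in ("https://www.yourdomain.com/products/",
            "https://www.yourdomain.com/staging/new-design"):
    verdict = "crawl" if rules.can_fetch("*", url) else "skip"
    print(url, "->", verdict)

print("Suggested delay between requests:", rules.crawl_delay("*"), "seconds")
```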
How does Googlebot treat robots.txt files?
- If Googlebot can’t find a robots.txt file for a site, it proceeds to crawl the site
- If Googlebot finds a robots.txt file for a site, it will usually abide by the suggestions and proceed to crawl the site
- If Googlebot encounters an error while trying to access a site’s robots.txt file and can’t determine if one exists or not, it won’t crawl the site
What is crawl budget?
Crawl budget is the average number of URLs Googlebot will crawl on your site before leaving. So, crawl budget optimisation ensures that Googlebot isn’t wasting time crawling through your unimportant pages at risk of ignoring your important pages. Crawl budget is most important on very large sites with tens of thousands of URLs, but it’s never a bad idea to block crawlers from accessing the content that you don’t care about.
Just make sure not to block crawlers’ access to pages you’ve added other directives on, such as canonical or noindex tags. If Googlebot is blocked from a page, it won’t be able to see instructions on that page.
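To see why this matters, here is a hedged sketch of the crawler's perspective: on-page directives such as a meta robots noindex tag live inside the page's HTML, so a crawler that is blocked by robots.txt never downloads the page and never sees them. The URL and the deliberately naive tag check below are purely illustrative:

```python
def crawler_view(url, allowed_by_robots, page_html):
    """Illustrative: on-page directives are only visible if the crawler may fetch the page."""
    if not allowed_by_robots:
        return "blocked by robots.txt - any noindex or canonical tag on this page goes unseen"
    # Extremely naive check for <meta name="robots" content="noindex">
    if 'name="robots"' in page_html and "noindex" in page_html:
        return "noindex seen - page is crawled but kept out of the index"
    return "no directives - page is eligible for indexing"

# A page that carries a noindex tag but is also blocked: the noindex never gets read.
html = '<head><meta name="robots" content="noindex"></head>'
print(crawler_view("https://www.yourdomain.com/old-promo",
                   allowed_by_robots=False, page_html=html))
```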
Not all web robots follow robots.txt
People with bad intentions (e.g. email address scrapers) build bots that don't follow this protocol. In fact, some bad actors use robots.txt files to find where you've located your private content. Although it might seem logical to block crawlers from private pages such as login and administration pages so that they don't show up in the index, placing the location of those URLs in a publicly accessible robots.txt file also means that people with malicious intent can more easily find them. It's better to noindex these pages and gate them behind a login form rather than place them in your robots.txt file.
Learn more about robots.txt files here: https://moz.com/learn/seo/robotstxt#index
Defining URL Parameters in GSC
Some sites (most commonly e-commerce) make the same content available on multiple different URLs by appending certain parameters to URLs. If you've ever shopped online, you've likely narrowed your search down via filters. For example, you may search for "shoes" on Amazon, and then refine your search by size, colour, and style. Each time you refine, the URL changes slightly. Google tends to do a pretty good job of figuring out the representative URL on its own, but you can use the URL Parameters feature in Google Search Console to tell Google exactly how you want it to treat your pages.
You can use this feature to tell Googlebot “crawl no URLs with __ parameter”. You’d be telling Googlebot to remove those pages from SERPs, which is what you’d want if those parameters are creating duplicate pages.
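As a rough illustration of why such parameters create duplicates, the Python sketch below collapses filtered URLs back to one representative URL by dropping the filter parameters. The parameter names and domain are made up, and this is not how Google or Search Console works internally:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical filter/sort parameters that don't change the underlying content.
IGNORED_PARAMS = {"size", "colour", "style", "sort"}

def representative_url(url):
    """Strip filter parameters so URL variants map back to one representative URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

variants = [
    "https://www.example-shop.com/shoes?colour=red&size=9",
    "https://www.example-shop.com/shoes?sort=price&style=trail",
    "https://www.example-shop.com/shoes",
]
print({representative_url(u) for u in variants})   # one URL, three ways of reaching it
```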
Making sure that your important pages are being crawled
Sometimes crawlers will be able to find important pages on your site, but other times certain pages and sections might be obscured for one reason or another.
Ask yourself: can the bot crawl through your website, and not just to it?
Is your content hidden behind login forms? If you require users to log in, fill out forms, or answer surveys before accessing content, search engines won't be able to see those protected pages. A crawler definitely won't log in.
Are you relying on search forms? Robots cannot use search forms, yet some people believe that if they put a search box on their site, robots will be able to find everything that their users search for.
Is text hidden within non-text content? Non-text media forms (images, video, GIFs, etc.) should not be used to display text that you wish to be indexed. While search engines are getting better at recognising images, there's no guarantee they will be able to read and understand them just yet. It's always best to add text within the markup of your page.
Can search engines follow your site navigation? Just as a crawler needs to discover your site via links from other sites, it needs a path of links on your own site guiding it from page to page. If you’ve got a page you want search engines to find but it isn’t linked to from any other pages, it’s as good as invisible. Many sites make the critical mistake of structuring their navigation in ways that are inaccessible to search engines, hindering their ability to get listed in search results.
What are some common navigation mistakes that can keep crawlers from seeing all of your site?
- Having a mobile navigation that shows different results from your desktop navigation
- Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations. Google has gotten much better at crawling and understanding JavaScript, but it's still not perfect. The most surefire way of making sure that something gets found by Google is by putting it in the HTML.
- Personalisation, or showing unique navigation to a specific type of visitor versus others, could appear to be cloaking to a search engine crawler
- Forgetting to link to a primary page on your website through your navigation - remember, links are the paths crawlers follow to new pages. This is why it’s essential that your site has a clear navigation and helpful URL folder structures.
Do you have clean information architecture?
Information architecture is the practice of organising and labelling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning that users shouldn’t have to think very hard to navigate through your website.
Are you utilising sitemaps?
A sitemap is just as it sounds, a list of URLs that crawlers can use to discover and index your content. One of the easiest ways to ensure Google is finding your highest priority pages is to create a file that meets Google’s standards and submit it through Google Search Console. While submitting a sitemap doesn’t replace the need for good site navigation, it can certainly help crawlers follow the path to all of your important pages.
- Ensure that you've only included URLs that you want indexed by search engines, and be sure to give crawlers consistent directions. For example, don't include a URL in your sitemap if you've blocked that URL via robots.txt, and don't include URLs that are duplicates rather than the preferred, canonical version.
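For illustration, here is a minimal sketch of generating a sitemap file in Python that follows the sitemaps.org XML format. The URLs are placeholders - a real sitemap would list your actual indexable pages and is usually produced by your CMS or a plugin:

```python
import xml.etree.ElementTree as ET

# Only pages you actually want indexed - no blocked or duplicate URLs.
PAGES = [
    "https://www.yourdomain.com/",
    "https://www.yourdomain.com/products/",
    "https://www.yourdomain.com/blog/how-we-make-shoes",
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in PAGES:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page

# Writes sitemap.xml, ready to upload to your root directory and submit in Search Console.
ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```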