MOZ HOW SEARCH ENGINES WORK: CRAWLING, INDEXING, AND RANKING chapter 2 Flashcards
What are search engines and what do they do?
Search engines are answer machines. They exist to discover, understand, and organize the internet’s content in order to offer the most relevant results to the questions searchers are asking.
How do search engines work?
Search engines work through three primary functions:
Crawling: Scour the Internet for content, looking over the code/content for each URL they find.
Indexing: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries.
Ranking: Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered by most relevant to least relevant.
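The three functions above can be sketched as a toy pipeline. This is only an illustration under loud assumptions: the `PAGES` dictionary is a made-up four-page site, and the word-count score stands in for real relevance signals, which are far more complex.

```python
from collections import defaultdict

# Hypothetical mini-web: URL -> (page text, outgoing links). Not real pages.
PAGES = {
    "/": ("welcome to our bakery", ["/bread", "/cakes"]),
    "/bread": ("fresh sourdough bread baked daily", ["/"]),
    "/cakes": ("birthday cakes and bread pudding", ["/bread"]),
    "/orphan": ("secret bread recipe", []),  # no other page links here
}

def crawl(start):
    """Crawling: follow links from `start` and return every URL reached."""
    seen, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        queue.extend(PAGES[url][1])
    return seen

def build_index(urls):
    """Indexing: map each word to the crawled URLs containing it, with counts."""
    index = defaultdict(dict)
    for url in urls:
        for word in PAGES[url][0].split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def rank(index, query):
    """Ranking: order matching URLs by a toy score (word count, then URL)."""
    hits = index.get(query, {})
    return sorted(hits, key=lambda u: (-hits[u], u))

crawled = crawl("/")
index = build_index(crawled)
results = rank(index, "bread")
print(results)
```

Note that `/orphan` mentions "bread" but never appears in the results: no link leads to it, so it is never crawled and never indexed.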
What is search engine crawling?
Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content.
How is content discovered?
Content is discovered through links, regardless of format (PDF, image, blog post, etc.).
What is a search engine index?
A huge database of all the content search engines have discovered and deem good enough to serve up to searchers.
What is search engine ranking?
The ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is to the query.
How can you check how many of your website’s pages are in the index?
Head to Google and type “site:yourdomain.com” into the search bar. This will return results Google has in its index for the site specified. However, the number Google displays isn’t exact. For more accurate results you can use Google Search Console.
What are robots.txt files?
Robots.txt files are publicly accessible files located in the root directory of websites (e.g., yourdomain.com/robots.txt). They suggest which parts of your site search engines should and shouldn’t crawl, as well as the speed at which they crawl your site.
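As a rough illustration of how such a file reads, the sketch below parses a made-up robots.txt with Python's standard `urllib.robotparser`. The rules, the `/admin/` path, and the `ExampleBot` user agent are all assumptions for the example, and support for `Crawl-delay` varies by crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real file lives at yourdomain.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name that falls under the "*" rules.
can_home = parser.can_fetch("ExampleBot", "https://yourdomain.com/")
can_admin = parser.can_fetch("ExampleBot", "https://yourdomain.com/admin/settings")
delay = parser.crawl_delay("ExampleBot")

print(can_home, can_admin, delay)
```

The file only makes suggestions: a well-behaved crawler checks these answers before fetching a URL, but nothing in the file enforces them.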
How does Googlebot treat robots.txt files?
If Googlebot can’t find a robots.txt file for a site, it proceeds to crawl the site.
If Googlebot finds a robots.txt file for a site, it will usually abide by the suggestions and proceed to crawl the site.
If Googlebot encounters an error while trying to access a site’s robots.txt file and can’t determine if one exists or not, it won’t crawl the site.
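The three cases above can be written as a small decision function. This is a sketch only: mapping HTTP status codes onto the three cases (404 as "not found", 200 as "found", anything else as "error") is an assumption made for illustration, not Google's documented rules.

```python
def crawl_decision(status_code):
    """Return a crawl decision for the result of requesting robots.txt.

    `status_code` is the HTTP status of the robots.txt request, or None
    if the request failed outright (assumed convention for this sketch).
    """
    if status_code == 404:
        # No robots.txt exists: proceed to crawl the site.
        return "crawl"
    if status_code == 200:
        # A robots.txt exists: crawl, usually abiding by its suggestions.
        return "crawl with robots.txt suggestions"
    # Error: cannot tell whether a robots.txt exists, so do not crawl.
    return "do not crawl"

for code in (404, 200, 503, None):
    print(code, "->", crawl_decision(code))
```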
How can you help Googlebot find your important pages?
Ask yourself this: Can the bot crawl through your website, and not just to it?
Search engine crawlers can’t see past login pages, can’t use search forms, and can’t read images very well, so it’s always best to put important text within the markup of your webpage.
A crawler also can’t guess where your pages are. It needs a path of links on your own site to guide it from page to page.
What are the common navigation mistakes that can keep crawlers from seeing all of your site?
Having a mobile navigation that shows different results than your desktop navigation
Any type of navigation where the menu items are not in the HTML, such as JavaScript-enabled navigations. Google has gotten much better at crawling and understanding JavaScript, but it’s still not a perfect process. The more surefire way to ensure something gets found, understood, and indexed by Google is by putting it in the HTML.
Personalization, or showing unique navigation to a specific type of visitor versus others, could appear to be cloaking to a search engine crawler
Forgetting to link to a primary page on your website through your navigation — remember, links are the paths crawlers follow to new pages!
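To see why links in the HTML matter, here is a minimal sketch of a link extractor built on Python's standard `html.parser`. The page markup and the `buildMenu` function are hypothetical, and the extractor is deliberately simpler than a real crawler: it reads only the HTML source and runs no JavaScript, whereas Googlebot can render JavaScript to some degree.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the HTML source."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page: two menu items in the HTML, one added by script.
HTML = """
<nav>
  <a href="/products">Products</a>
  <a href="/about">About</a>
</nav>
<div id="extra-menu"></div>
<script>
  buildMenu("extra-menu", ["/pricing"]);  // hypothetical function
</script>
"""

collector = LinkCollector()
collector.feed(HTML)
print(collector.links)
```

The `/pricing` page never shows up in `collector.links`, because its link exists only after the script runs. A crawler that relies on the HTML source would have no path to it.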
What is information architecture and what’s the best version of it?
Information architecture is the practice of organizing and labeling content on a website to improve efficiency and findability for users. The best information architecture is intuitive, meaning that users shouldn’t have to think very hard to flow through your website or to find something.
What is a sitemap?
A list of URLs on your site that crawlers can use to discover and index your content.
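A sitemap is commonly published as an XML file. The sketch below builds a minimal one with Python's standard `xml.etree.ElementTree`; the two URLs are placeholders, and only the required `loc` element of each entry is included.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a minimal XML sitemap listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs for illustration only.
sitemap_xml = build_sitemap([
    "https://yourdomain.com/",
    "https://yourdomain.com/products",
])
print(sitemap_xml)
```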