C.2 Searching the Web Flashcards
C.2.2
Distinguish between the surface web and the deep web
Surface web:
* Pages that are reachable (and indexed) by a search engine
* Pages that can be reached through links from other sites in the surface web
* Pages that do not require special access configurations
Deep web:
* Pages not reachable by search engines
* Substantially larger than the surface web
* Examples: pages that require authentication (private social media, email inboxes) and content blocked by paywalls (newspaper articles, Netflix)
C.2.3
Outline the principles of searching algorithms used by search engines
- The time a page has existed
- The time a page takes to load
- Dwell time (how long the user stays on the website)
- The frequency of search keywords on the page
C.2.3
What is the Page Rank Algorithm?
PageRank works by counting the number and quality of backlinks to a page to determine a rough estimate of how important the website is. A page with more backlinks is considered more important.
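The counting-and-spreading idea above can be sketched in a few lines of Python. This is a toy version over an in-memory link graph (the damping factor and iterative update follow the standard formulation, but the graph and page names are made up for illustration):

```python
# Minimal PageRank sketch: each page spreads its score across its outlinks.
DAMPING = 0.85  # standard damping factor

def pagerank(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - DAMPING) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = DAMPING * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# Page C receives backlinks from both A and B, so it ends up ranked highest.
```

Note that a page's score depends not just on how many backlinks it has, but on the rank of the pages linking to it — that is the "quality" part of the definition.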
C.2.3
What is the HITS Algorithm?
The HITS algorithm splits sites into hubs and authorities.
Authorities have many inlinks and contain the valuable information the user wants. An authority is considered good if it is linked to by many high-quality hubs.
Hubs contain outlinks to authorities. A hub is considered good if it links to many high-quality authorities.
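The mutual hub/authority definition can be computed iteratively. Below is a toy sketch over an in-memory link graph (real search engines run HITS on the subgraph of pages matching a query; the page names here are invented):

```python
# Toy HITS iteration: authority and hub scores reinforce each other.
def hits(links, iterations=20):
    """links: dict mapping each page to the pages it links to."""
    pages = set(links) | {t for out in links.values() for t in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores linking to it.
        auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise so scores do not grow without bound.
        for scores in (hub, auth):
            norm = sum(v * v for v in scores.values()) ** 0.5
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {"hub1": ["page", "other"], "hub2": ["page"], "page": []}
hub, auth = hits(graph)
# "page" is linked to by both hubs, so it gets the top authority score.
```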
C.2.4
Describe how a web crawler functions
A web crawler crawls through the web, downloading and indexing webpages from all over the internet. For each page it indexes, it extracts all the links in the webpage and adds them to the list of webpages to crawl.
The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed.
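The loop described above is essentially a breadth-first traversal. Here is a simplified sketch; `fetch_links` is an assumed helper standing in for real HTTP fetching and link extraction, and real crawlers additionally respect robots.txt, rate limits and politeness delays:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """fetch_links(url) is assumed to return the URLs that page links to."""
    frontier = deque(seed_urls)   # URLs waiting to be crawled
    visited = set()               # URLs already crawled/indexed
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)          # a real crawler would index the page here
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return visited

# Toy in-memory "web" standing in for real HTTP fetches:
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
crawled = crawl(["a"], lambda url: web.get(url, []))
```

Starting from the single seed `"a"`, the crawler discovers and visits every reachable page.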
C.2.5
Discuss the relationship between data in a meta tag and how it is accessed by a web-crawler
Meta tags are HTML tags meant for machines rather than human readers; they tell computers what the website is about.
The description meta tag provides the indexer with a short description of the page.
The keywords meta tag provides keywords describing the page's content.
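A sketch of how an indexer might read these tags, using Python's standard-library HTML parser (real crawlers use more robust parsers; the page content here is invented):

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collects name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]

page = """<html><head>
<meta name="description" content="Flashcards for searching the web">
<meta name="keywords" content="search engine, crawler, PageRank">
</head></html>"""

reader = MetaTagReader()
reader.feed(page)
# reader.meta now maps "description" and "keywords" to their content strings.
```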
C.2.6
Discuss the use of parallel web-crawling
The web is growing at an astonishing pace. As such, it is necessary to parallelise the crawling process to speed it up.
Advantages
* Faster
* Network load dispersion: as the web is geographically dispersed, dispersing crawlers disperses the network load
Disadvantages
* Web crawlers may overlap and index the same page more than once
* Parallel web crawlers need to communicate with each other to effectively crawl the web. This takes up communication bandwidth
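One common way to reduce the overlap and communication problems above is to partition URLs among crawlers by a hash of the host name, so each crawler owns a disjoint slice of the web. A sketch under that assumption (real systems also rebalance load and exchange links that cross partitions):

```python
import hashlib
from urllib.parse import urlparse

def assigned_crawler(url, num_crawlers):
    """Deterministically assign a URL to one of num_crawlers workers."""
    host = urlparse(url).netloc
    digest = hashlib.sha256(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers

# Every URL on the same host maps to the same crawler, so no two
# crawlers fetch the same page, and no coordination message is needed
# to decide who owns a URL.
worker = assigned_crawler("https://example.com/page", 4)
```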
C.2.7
Outline the purpose of web-indexing in search engines
Indexing websites allows search engines to quickly locate relevant information for users. Information is stored about each indexed website, such as its ranking, relevant keywords and metadata. This helps search engines rank websites and return helpful results for search queries.
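The core data structure behind this is the inverted index, which maps each word to the set of pages containing it. A minimal sketch (real indexes also store word positions, rankings and metadata per page; the pages here are made up):

```python
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)   # this word appears on this page
    return index

pages = {
    "a.html": "search engines index the web",
    "b.html": "crawlers download the web",
}
index = build_index(pages)
# index["web"] == {"a.html", "b.html"}; index["search"] == {"a.html"}
```

Answering a query is then a fast set lookup (and intersection, for multi-word queries) instead of scanning every page.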
C.2.8-9
Suggest how developers can create pages that appear more prominently in search engine results. Describe the different metrics used by search engines.
- How many websites link to this website.
- The clickthrough rate (how likely a user is to click on your website)
- The bounce rate (how likely a user is to immediately leave your site after clicking)
- Dwell time (how long a user stays on your webpage)
- Using more semantic tags in your HTML which tell the bot what your website is about (article tags, section tags, h1 tag, h2 tag, footer tag)
C.2.11
Discuss the use of white hat search engine optimisation
- Guest blogging: Writing a blog post in someone else’s blog. At the end of the blog post you can insert a link to your site, thereby increasing the number of incoming links to your site.
- Quality content: Writing quality content encourages users to stay longer, increasing dwell time.
- Link baiting: creating content that entices users to click on your link, increasing click-through rate.
C.2.11
Discuss the use of black hat search engine optimisation
- Keyword stuffing
- Link farming: Creating groups of websites with hyperlinks that all link to your own.
- Blog comment spamming: Automated posting of hyperlinks for promotion on any kind of publicly accessible online discussion board
C.2.12
Outline future challenges to search engines as the web continues to grow
As the web grows, it becomes harder to surface the most relevant information, and paid results (ads) play an increasingly important role in what users see.