06: Search Engines Flashcards

1
Q

Three Main Components of a Search Engine

A
  1. input agents
  2. database engine
  3. the query server

* in practice, these three components are distributed but conceptually can be thought of as services on the same machine

2
Q

Input Agents

(Three Main Components of a Search Engine)

A

web crawlers that surf the WWW requesting and downloading web pages

3
Q

Database Engine

(Three Main Components of a Search Engine)

A

manages the URLs and, in general, oversees the input agents

4
Q

Components of a Search Engine

A
5
Q

Web Crawlers

A

Refers to a class of software that:

  • downloads pages
  • identifies the hyperlinks
  • adds links to a database for future crawling
6
Q

Code for Simplified Crawler

A

//namespaces implied

public partial class _Default : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        //content in image
    }
}
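
(The body above is in an image. A minimal sketch of what a simplified crawler's Page_Load might contain, assuming it downloads a single page with WebClient and lists its hyperlinks via a simple regular expression; the URL is illustrative:)

using System;
using System.Net;
using System.Text.RegularExpressions;

public partial class _Default : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        // download the raw HTML of a single page (URL is illustrative)
        WebClient client = new WebClient();
        string html = client.DownloadString("http://funwebdev.com/");

        // identify the hyperlinks with a (simplistic) regular expression
        foreach (Match link in Regex.Matches(html, "href=\"(.*?)\"")) {
            Response.Write(link.Groups[1].Value + "<br>");
        }
    }
}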

7
Q

Code for Putting Web Crawler Results into a Table

A
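
(Content in image. A minimal sketch of the idea, assuming a SQL Server table named CrawledLinks with a single Url column; the table, column, and connection string are assumptions, not from the original card:)

using System.Data.SqlClient;

public class CrawlStore {
    // insert a discovered URL into the CrawledLinks table
    public static void SaveLink(string connectionString, string url) {
        using (SqlConnection conn = new SqlConnection(connectionString)) {
            conn.Open();
            SqlCommand cmd = new SqlCommand(
                "INSERT INTO CrawledLinks (Url) VALUES (@url)", conn);
            cmd.Parameters.AddWithValue("@url", url);
            cmd.ExecuteNonQuery();
        }
    }
}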
8
Q

Code for a Recursive Crawler

A
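
(Content in image. A minimal sketch of the idea: download a page, extract its links, and recurse on each unvisited link up to a depth limit; the class name and regular expression are illustrative:)

using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public class RecursiveCrawler {
    private HashSet<string> visited = new HashSet<string>();

    public void Crawl(string url, int depth) {
        // stop at the depth limit and never revisit a page
        if (depth <= 0 || !visited.Add(url)) return;
        try {
            string html = new WebClient().DownloadString(url);
            // recurse on every absolute hyperlink found in the page
            foreach (Match m in Regex.Matches(html, "href=\"(http[^\"]*)\"")) {
                Crawl(m.Groups[1].Value, depth - 1);
            }
        } catch (WebException) { /* skip unreachable pages */ }
    }
}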
9
Q

Robots Exclusion Standard

A

Implemented with a plain-text file named robots.txt stored at the root of the domain; it has two syntactic elements:

  • the user-agent we want to make a rule for (the special character * means all agents)
  • one disallow directive per line to identify patterns

Regular expressions are not supported

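For example, a robots.txt that keeps all agents out of /temp/ and shuts one named crawler out entirely (the path and agent name are illustrative):

# rule for every user agent
User-agent: *
Disallow: /temp/

# rule for a single crawler
User-agent: BadBot
Disallow: /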
10
Q

Scrapers

A
  • programs that identify certain pieces of information from the web to be stored in databases
  • sometimes combined with Crawlers
11
Q

Scraper Classes

A
  • URL scrapers
  • Email scrapers
  • Word scrapers
  • Media scrapers
12
Q

Word Scrapers

A
  • may want to parse out all of the text within a web page
  • words are the most difficult content to parse since the tags they appear in reflect how important they are to the page overall
    • words in large font more important than small ones at the bottom of a page
    • words that appear next to one another should be treated as related, while words at opposite ends of a page or sentence are less related
13
Q

To understand indexing, consider what a _____ and a _____ might identify from a web page and how they might _____ it.

A

To understand indexing, consider what a crawler and a scraper might identify from a web page and how they might store it.

14
Q

Reverse Index

A
  • indexes the words rather than the URLs
  • the mechanics of how this is done are not standardized
    • generally, word tables are created (for every word found in pages) so that each word can be referenced by a unique integer, and indexes of these references can be built for faster searches
  • demands on these indexes far exceed what a single database server can support
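
A minimal sketch of the word-table idea in C#, assuming each word gets a unique integer ID that maps to the list of page IDs containing it (the structures are illustrative, not a standard):

using System.Collections.Generic;

public class ReverseIndex {
    private Dictionary<string, int> wordIds = new Dictionary<string, int>();     // word table
    private Dictionary<int, List<int>> pages = new Dictionary<int, List<int>>(); // word ID -> page IDs

    public void AddWord(string word, int pageId) {
        if (!wordIds.ContainsKey(word)) {
            wordIds[word] = wordIds.Count;           // assign the next unique integer
            pages[wordIds[word]] = new List<int>();
        }
        pages[wordIds[word]].Add(pageId);
    }

    // every page that contains the given word
    public List<int> Search(string word) {
        return wordIds.ContainsKey(word) ? pages[wordIds[word]] : new List<int>();
    }
}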
15
Q

PageRank

A

method for computing a ranking for every web page based on the graph of the web

(graph of the web = hyperlinks between web pages)

* sites with thousands of backlinks are more important than sites with only a handful

16
Q

PageRank Definition Equation

A
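
(Content in image. The definition is commonly given as:)

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where d is the damping factor (typically 0.85), N is the total number of pages, M(p_i) is the set of pages linking to p_i, and L(p_j) is the number of outbound links on page p_j.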
17
Q

If a page has no links to other pages, it becomes a ____, and therefore ____ the ____ ____ ____. If the random surfer arrives at a sink page, it picks another ____ at random and continues surfing again.

A

If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again.

18
Q

The PageRank Theory holds that an imaginary surfer who is randomly clicking on links will eventually ____ ____.

A

The PageRank Theory holds that an imaginary surfer who is randomly clicking on links will eventually stop clicking.

19
Q

PageRank Algorithm Factors

A

Modern ranking algorithms take much more into account than simple backlinks, including:

  • Search History
  • Geographic Location
  • Authorship
  • Freshness of the pages
20
Q

Search Engine Optimization (SEO)

A

process a webmaster undertakes to make a website more appealing to search engines and, by doing so, increase its ranking in search results for the terms the webmaster is interested in targeting

21
Q

Two Categories of SEO

A
  • White-hat SEO: tries to honestly and ethically improve your site for search engines
  • Black-hat SEO: tries to game the results in your favor
22
Q

Importance of Title Tags

(White Hat SEO)

A
  • the “title” tag in the “head” is the single most important tag to optimize for search engines
    • make it unique on each page of your site
    • include enough keywords to make it relevant
23
Q

Meta Tags

(White Hat SEO)

A
  • indexing meta tags is less data-intensive than trying to index entire pages
  • meta tags include:
    • Description (contains a human-readable summary of your site)
    • Keywords
    • Robots
    • http-equiv
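
For example (the content values are illustrative):

<meta name="description" content="A human-readable summary of the site">
<meta name="keywords" content="search, engines, crawlers">
<meta name="robots" content="noindex, nofollow">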
24
Q

Http-Equiv Meta Tags

(White Hat SEO)

A
  • tags that use the http-equiv attribute can perform HTTP-like operations, such as redirects and setting headers
25
Q

Write an http-equiv meta tag indicating the page should not be cached.

A
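
One common form:

<meta http-equiv="cache-control" content="no-cache">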
26
Q

Write an http-equiv meta tag to redirect to http://funwebdev.com/destination.html after five seconds.

A
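
Using the refresh value:

<meta http-equiv="refresh" content="5; url=http://funwebdev.com/destination.html">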
27
Q

URLs

(White Hat SEO)

A
  • Search engines must, by definition, download and save URLs, since they identify the link to the resource
28
Q

How can bad SEO URLs be improved? Why should they be improved?

A
  • Bad SEO URLs work fine for programs but cannot be read by humans
  • Can be improved by adding:
    • Descriptive path components
    • Descriptive file names
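
For example (URLs illustrative): http://example.com/page.php?id=71 says nothing to a human, while http://example.com/products/flowers/red-rose.html carries descriptive path components and a descriptive file name.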
29
Q

Site Design

(White Hat SEO)

A
  • Sites that rely heavily on JavaScript or Flash for their content and navigation will suffer from poor indexing
30
Q

What should you do if your site includes a hierarchical menu to improve indexing?

A

nest the navigation inside of “nav” tags to demonstrate semantically that these links exist to navigate your site
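
For example (file names illustrative):

<nav>
  <ul>
    <li><a href="services.html">Services and Rates</a></li>
    <li><a href="contact.html">Contact Us</a></li>
  </ul>
</nav>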

31
Q

SiteMaps

(White Hat SEO)

A
  • Formal framework that captures website structure
  • Using XML, defines a urlset root element, then as many url items as desired for the site
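
A minimal sitemap following the sitemaps.org protocol (the URL and values are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://funwebdev.com/</loc>
    <lastmod>2014-01-01</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>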
32
Q

Anchor Text

(White Hat SEO)

A
  • the anchor text of links is indexed along with backlinks
  • generic anchor text (e.g. “Click here”) is discouraged, since it says little about what will be at that URL
    • e.g. a link to a page of services should read Services and Rates
33
Q

Images

(White Hat SEO)

A
  • The file name is the first element that can be optimized since it can be parsed for words
    • e.g. instead of 1.png, it should be rose.png
  • Use the alt attribute to give a textual description of the image that can help site ranking
  • Utilize anchor text if there is a link to the image
    • e.g. instead of “Full size”, it should be “Full size image of a red rose”
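
For example (file names illustrative):

<img src="rose.png" alt="Close-up of a single red rose">
<a href="rose-large.png">Full size image of a red rose</a>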
34
Q

Content

(White Hat SEO)

A
  • Search engines tend to prefer pages that are updated regularly over those that are static
  • If your website can permit users to comment or write content, you should consider enabling it
  • Having users generate content is now extremely important
35
Q

Why should you not use Black Hat SEO techniques?

A

Google and other search engines may punish or ban your site from their results

36
Q

Content Spamming

(Black Hat SEO)

A

Any technique that uses the content of a website to try to manipulate search engine results

37
Q

Four Methods of Content Spamming

(Black Hat SEO)

A
  • Keyword stuffing
  • Hidden content
  • Paid links
  • Doorway pages
38
Q

Keyword Stuffing

(Black Hat SEO - Content Spamming)

A
  • A technique whereby you purposely add keywords to the site in a most unnatural way, with the intention of increasing the affiliation between certain key terms and your URL
    • As keywords are added throughout a web page, the content becomes diluted with them
    • Meaningful sentences are replaced with content written primarily for robots, not humans
    • Any technique where you find yourself writing for robots before humans, as a rule of thumb, is discouraged
39
Q

Hidden Content

(Black Hat SEO - Content Spamming)

A

making irrelevant words the same color as the background to hide them

40
Q

Paid Links

(Black Hat SEO - Content Spamming)

A
  • Frowned upon by many search engines since their intent is to discover good content by relying on referrals (i.e. backlinks)
  • Purchased advertisements on a site are not considered paid links so long as they are well identified as such, and are not hidden in the body of a page
  • Many link affiliate programs (like Google’s own AdWords) do not impact PageRank because the advertisements are shown using JavaScript
41
Q

Doorway Pages

(Black Hat SEO - Content Spamming)

A
  • Pages written to be indexed by search engines and included in search results
  • Normally crammed full of keywords, and effectively useless to real users of your site
  • These doorway pages then link to your home page, which you are trying to boost in the search results
42
Q

Four Link Spam Techniques

(Black Hat SEO)

A
  • Hidden links
  • Comment spam
  • Link farms
  • Link pyramids
43
Q

Hidden Links

(Black Hat SEO - Link Spamming)

A
  • Same as hidden content
  • With hidden links, websites set the link color to match the background, hoping that
    • Real users will not see the links
    • Search engines will follow the links, thus manipulating the search engine without impacting the human reader.
44
Q

Comment Spam

(Black Hat SEO - Link Spamming)

A

Automated process utilizing bots that scour the web for comment sections and leave poorly auto-written spam with backlinks to their sites

(* be sure to secure a comment section on your site or you will be flagged as a source of comment spam)

45
Q

Link Farms

(Black Hat SEO - Link Spamming)

A

Set of websites that all link to one another with the intent of sharing any PageRank coming into one site with all the sites that are members of the link farm

46
Q

Link Pyramids

(Black Hat SEO - Link Spamming)

A
  • Similar to link farms in that there is a great deal of interlinking
  • Unlike a link farm, a pyramid has the intention of promoting one or two sites
47
Q

Other Spam Techniques

(Black Hat SEO)

A
  • Google Bowling
  • Cloaking
  • Duplicate content
48
Q

Google Bowling

(Black Hat SEO - Other Techniques)

A
  • Requires masquerading as the site you want to weaken/remove
    • black-hat techniques are applied as though you were working on that site’s behalf; this might include subscribing to link farms, keyword stuffing, commenting on blogs, and more
    • then report the competitor’s website to Google for all the black-hat techniques “they” employed!
49
Q

Cloaking

(Black Hat SEO - Other Techniques)

A
  • Process of identifying crawler requests and serving them content different from regular users
  • A simple script can detect when googlebot is the user-agent and redirect it to a different page, normally stuffed with keywords
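
A sketch of how such a script might detect a crawler in ASP.NET, shown only to illustrate the technique (the page class and file name are illustrative):

using System;

public partial class Product : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        // serve a keyword-stuffed page only when the request comes from a crawler
        string agent = Request.UserAgent ?? "";
        if (agent.ToLower().Contains("googlebot")) {
            Response.Redirect("keyword-stuffed.html");
        }
    }
}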
50
Q

Duplicate Content

(Black Hat SEO - Other Techniques)

A
  • Stealing content to build a fake site
  • To attribute content to yourself use the rel=author attribute

(* Google has also introduced a concept called Google authorship through their Google+ network to attribute content to the originator.)

  • Sometimes you have several versions of a page, for example, a display and print version
  • To prevent being penalized, you can use the canonical tag in the head section of duplicate pages to affiliate them with a single canonical version to be indexed
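
For example, in the head of the print version (URL illustrative):

<link rel="canonical" href="http://funwebdev.com/services.html">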
51
Q

What are the main capabilities required by an advanced search engine for processing the semantic web?

A
  • Triplet processing
  • Lack of Sensitivity to Vocabulary
  • Extracting Information from Several Resources
  • Searching RDF (OWL) ontologies
  • Merging ontologies
  • Integrating knowledge from different sources
  • Having Web Inference Capability
  • Efficiency in crawling, page ranking, and indexing
  • Handling trust
  • Valuing Security
52
Q

A majority of the elements mentioned, from crawling through page ranking to indexing, are needed for future semantic web engines as well!

The main difference is that they need to deal with _____ _____ rather than simple ____ files.

_____ play key roles for such possible future semantic web engines.

A

The main difference is that they need to deal with federative datasets rather than simple HTML files.

URIs play key roles for such possible future semantic web engines.