Search Engines and Crawlers Flashcards
Vertical search
A more limited form of web search, restricted to a specific topic or type of content (e.g. only news articles or only certain file formats)
Enterprise search
Searching an organisation’s internal documents (e.g. intranet pages, email, reports)
Desktop search
Searching the files stored on a user’s own computer (e.g. documents, email, web history)
Classification
Assigning documents to pre-defined categories or labels
Ad-hoc search
Searching a largely static collection of unstructured data with arbitrary, previously unseen queries
Relevance in information retrieval
o Topical relevance and user relevance
o Retrieval models – formal models of how documents are matched and ranked against a query
o Ranking algorithms
Evaluation in information retrieval
o Precision and recall – the fraction of retrieved documents that are relevant, and the fraction of relevant documents that are retrieved (see the sketch after this list)
o Test collections
o Clickthrough and log data
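A minimal Python sketch of precision and recall for a single query; the document IDs are made up for illustration.

```python
# Precision: fraction of retrieved documents that are relevant.
# Recall: fraction of relevant documents that were retrieved.
retrieved = {"d1", "d2", "d3", "d4"}   # documents the engine returned (hypothetical IDs)
relevant = {"d2", "d4", "d7"}          # documents judged relevant for the query

hits = retrieved & relevant
precision = len(hits) / len(retrieved)
recall = len(hits) / len(relevant)

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.67
```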
Information needs
o Query suggestions (e.g. autofill)
o Query expansion – suggesting or adding further, potentially relevant query terms
o Relevance feedback – refining the query using documents the user has identified as relevant
What are the primary goals of a search engine?
- Effectiveness – retrieving the most relevant set of documents possible
- Efficiency – processing queries as quickly as possible
Issues in search engines
o Performance - efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adapting for use on a variety of devices)
Document statistics
Gathering and recording statistical information about words and documents
How are document statistics used?
The gathered information is stored in lookup tables and used by ranking algorithms
Lookup table
Data structure designed for quick retrieval
Weighting
Calculating weight using document statistics and storing it in a lookup table
tf.idf weighting
Weighting a term by how often it appears in a document (tf) and giving higher weights to terms that appear in very few documents (idf)
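A rough sketch of one common tf.idf variant (raw term frequency times log-scaled inverse document frequency), using a toy in-memory collection; real systems read these counts from the index and the exact formula varies.

```python
import math
from collections import Counter

docs = {  # toy collection; a real engine would use the document statistics in the index
    "d1": "fish are caught by fishing boats".split(),
    "d2": "boats sail the sea".split(),
    "d3": "fish swim in the sea".split(),
}

def tf_idf(term, doc_id):
    tf = Counter(docs[doc_id])[term]                         # term frequency in the document
    df = sum(1 for words in docs.values() if term in words)  # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)                     # rare terms get a high idf

print(round(tf_idf("fishing", "d1"), 2))  # 1.10 - appears in only one document
print(round(tf_idf("sea", "d3"), 2))      # 0.41 - appears in two documents, so lower weight
```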
True or false: Weight is calculated during the query process
False! It can be calculated as part of the query process, but calculating during indexing makes querying more efficient
Inversion
Changing document-term info into term-document info
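A small sketch of inversion, assuming documents have already been tokenised: document-term lists are turned into a term-document index.

```python
from collections import defaultdict

doc_terms = {            # document-term information
    "d1": ["fish", "boats"],
    "d2": ["boats", "sea"],
    "d3": ["fish", "sea"],
}

inverted = defaultdict(list)             # term-document information (an inverted index)
for doc_id, terms in doc_terms.items():
    for term in terms:
        inverted[term].append(doc_id)

print(dict(inverted))
# {'fish': ['d1', 'd3'], 'boats': ['d1', 'd2'], 'sea': ['d2', 'd3']}
```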
Methods of query transformation
o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion
When search results are displayed, snippets are generated to…
o Summarise retrieved documents
o Identify related groups of documents
o Highlight important words and passages (see the sketch below)
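A very simplified sketch of snippet generation: pick the sentence containing the most query terms and mark those terms; real systems score passages far more carefully.

```python
import re

def snippet(text, query_terms):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Choose the sentence that mentions the most query terms.
    best = max(sentences, key=lambda s: sum(t.lower() in s.lower() for t in query_terms))
    # Highlight the query terms in the chosen sentence.
    for t in query_terms:
        best = re.sub(t, lambda m: f"**{m.group(0)}**", best, flags=re.IGNORECASE)
    return best

doc = "Crawlers download pages. A web crawler follows links to new pages. Indexing comes later."
print(snippet(doc, ["crawler", "links"]))  # A web **crawler** follows **links** to new pages.
```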
Document data store
A database that manages large numbers of documents and the structured data (usually metadata or links) associated with them
True or false: A document data store can be stored in a relational database
True, but some applications use other storage systems for faster retrieval
Scoring
Calculating a score for each retrieved document using a ranking algorithm
Performance optimisation
Designing ranking algorithms to decrease response time and increase query throughput
Methods of distributing ranked documents
o Query broker
o Caching the results of common search queries (see the sketch below)
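A minimal sketch of result caching; run_query stands in for the real ranking backend and is invented for this example.

```python
from functools import lru_cache

def run_query(query):
    # Placeholder for the real ranking backend (hypothetical).
    return (f"result for {query!r}",)

@lru_cache(maxsize=10_000)
def cached_search(query):
    # Repeated (common) queries are answered from the cache
    # without touching the ranking backend again.
    return run_query(query)

cached_search("fishing boats")   # computed and cached
cached_search("fishing boats")   # served from the cache
```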
Parser
Processes sequences of text tokens (usually words)
Stopping
Removes common words (stopwords) from the token stream to reduce index size
Stemming
Grouping words derived from a common stem (e.g. “fish” and “fishing”) and replacing them with a single designated form
Why is stemming used?
Words in queries and documents are more likely to match
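A toy sketch of stopping and stemming: a tiny stopword list and a crude suffix-stripping stemmer; real engines use curated stopword lists and algorithms such as the Porter stemmer.

```python
STOPWORDS = {"the", "a", "of", "to", "and", "in"}   # illustrative only

def stop(tokens):
    # Stopping: drop very common words to shrink the index.
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Crude suffix stripping so related word forms share one index term.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the fishing boats landed a catch of fish".split()
print([stem(t) for t in stop(tokens)])
# ['fish', 'boat', 'land', 'catch', 'fish'] - "fishing" and "fish" now match
```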
Information extraction
Using syntactic analysis to identify complex index terms
Classifier
o Identify class-related metadata
o Assign labels to documents representing topic categories
o Group documents without pre-defined categories
Crawler
Follows links to web pages to discover and download new pages
Web crawling
- Client program connects to domain name system (DNS) server
- DNS server translates host name into an internet protocol (IP) address
- Program attempts to connect to computer with that IP address
- Once the connection is established, the client program sends an HTTP request to the server (see the sketch below)
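A bare-bones sketch of those steps with Python's standard library (plain HTTP only, no redirects or error handling; example.com is a stand-in host).

```python
import socket

host = "example.com"                              # stand-in host name

# 1. DNS translates the host name into an IP address.
ip_address = socket.gethostbyname(host)

# 2. Connect to the computer with that IP address (port 80 for plain HTTP).
with socket.create_connection((ip_address, 80), timeout=10) as conn:
    # 3. Once connected, send an HTTP request for a page.
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    conn.sendall(request.encode("ascii"))
    response = b""
    while chunk := conn.recv(4096):
        response += chunk

print(response.split(b"\r\n", 1)[0])              # status line, e.g. b'HTTP/1.1 200 OK'
```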
Politeness policies
Standards that aim to reduce a crawler’s impact on a web server’s performance. This could include a politeness window (a minimum delay between requests to the same server) or being prevented from accessing certain pages (e.g. via robots.txt)
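A minimal sketch of a politeness policy: obey robots.txt and wait out a politeness window between requests to the same host; the 5-second delay and user agent name are illustrative.

```python
import time
from urllib import robotparser

POLITENESS_WINDOW = 5.0    # seconds between requests to the same host (illustrative)
last_fetch = {}            # host -> time of the most recent request

def polite_to_fetch(host, path, user_agent="ExampleCrawler"):
    # Pages the crawler is prevented from accessing are listed in robots.txt.
    rp = robotparser.RobotFileParser(f"http://{host}/robots.txt")
    rp.read()
    if not rp.can_fetch(user_agent, f"http://{host}{path}"):
        return False
    # Politeness window: wait before hitting the same host again.
    elapsed = time.monotonic() - last_fetch.get(host, float("-inf"))
    if elapsed < POLITENESS_WINDOW:
        time.sleep(POLITENESS_WINDOW - elapsed)
    last_fetch[host] = time.monotonic()
    return True
```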
Focused crawling
Relies on web pages linking to other pages on the same topic
How does focused crawling work?
- Can use popular pages as seeds
- Use text classifiers to determine what a page is about
- If the page is on topic, keep it and use its links to find related sites
How does a focused crawler decide which pages to visit next?
- Tracking topicality of downloaded pages to determine whether to download similar pages
- Anchor text data and topicality data can be combined to determine which pages to visit next (see the sketch below)
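A rough sketch of a focused-crawl loop; is_on_topic and fetch are hypothetical placeholders for a real text classifier and HTTP fetcher.

```python
from collections import deque

def is_on_topic(page_text):
    # Stand-in for a real text classifier.
    return "fishing" in page_text.lower()

def fetch(url):
    # Stand-in for a real HTTP fetch: returns (page text, outgoing links).
    return "a page about fishing boats", []

def focused_crawl(seed_urls, limit=100):
    # Popular pages are used as seeds; on-topic pages are kept and
    # their links are used to find related sites.
    frontier, seen, kept = deque(seed_urls), set(seed_urls), []
    while frontier and len(kept) < limit:
        url = frontier.popleft()
        text, links = fetch(url)
        if is_on_topic(text):
            kept.append(url)
            frontier.extend(l for l in links if l not in seen)
            seen.update(links)
    return kept

print(focused_crawl(["http://example.com/"]))  # ['http://example.com/']
```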
Deep web
Pages that are difficult for a crawler to find
Examples of deep web pages
- Sites that require an account
- Form results (e.g. flight timetables, product search)
- Scripted pages
How can web pages become easier to find?
Sitemaps – files listing a site’s URLs that a crawler can read to discover pages it might not find by following links
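A short sketch of reading page URLs out of a sitemap with the standard library XML parser; the sitemap content is inlined here so the example runs without a network.

```python
import xml.etree.ElementTree as ET

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/products?id=42</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)   # pages the crawler might never have reached by following links alone
```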
Issues in crawlers and how to solve them
- Text documents stored in incompatible formats - can be converted to tagged formats
- Crawling can be expensive in terms of CPU and network load - it may be more efficient to store copies of downloaded documents than to fetch them again
BigTable
A distributed database system in which the table is split into small pieces (tablets), which are served by thousands of machines
How are changes recorded in BigTable?
Changes are recorded in a transaction log and stored in a shared file system
True or false: If a BigTable tablet server crashes, the whole table is inaccessible
False! Another server will immediately read the tablet data and transaction log and take over
True or false: BigTable is a relational database model
False! Unlike relational databases, not all rows have the same columns.
Feed
A real-time stream of documents; this is how search engines acquire new documents
Push feed
Alerts subscribers to new documents
Pull feed
Requires the subscriber to check periodically
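A minimal sketch of a pull feed: the subscriber polls the feed URL on a schedule and reacts only when the content has changed; the URL and interval are invented for the example.

```python
import hashlib
import time
import urllib.request

FEED_URL = "http://example.com/feed.xml"   # hypothetical feed location

def poll_feed(interval=300):
    # Pull feed: the subscriber checks periodically;
    # a push feed would alert the subscriber instead.
    last_digest = None
    while True:
        with urllib.request.urlopen(FEED_URL) as resp:
            digest = hashlib.sha256(resp.read()).hexdigest()
        if digest != last_digest:
            last_digest = digest
            print("new documents available in the feed")
        time.sleep(interval)
```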