Search Engines and Crawlers Flashcards

1
Q

Vertical search

A

Web search restricted to a particular topic or domain (e.g. only travel, only shopping, or only one type of media)

2
Q

Enterprise search

A

Searching for information in the many kinds of files held across a company's intranet (e.g. company documentation)

3
Q

Desktop search

A

Searching the files stored on an individual computer, including their contents (e.g. documents and email messages)

4
Q

Classification

A

Assigning documents to pre-defined categories (e.g. labelling an email as spam or not spam)

5
Q

Ad-hoc search

A

Searching unstructured data with a free-form user query, where the range of possible queries is not known in advance

6
Q

Relevance in information retrieval

A

o Topical relevance and user relevance
o Retrieval models – formal descriptions of how a query is matched against documents, which form the basis of ranking algorithms
o Ranking algorithms

7
Q

Evaluation in information retrieval

A

o Precision and recall – precision is the proportion of retrieved documents that are relevant; recall is the proportion of relevant documents that are retrieved
o Test collections
o Clickthrough and log data
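
As an illustration of the first measure, here is a minimal sketch of computing precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved) for one query. The function name and the document ids are made up for the example; in practice the relevance judgments would come from a test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids returned by the search engine
    relevant:  document ids judged relevant (e.g. from a test collection)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and judgments for one query.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(p, r)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.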

8
Q

Information needs

A

o Query suggestions (e.g. autofill)
o Query expansion – adding related terms to the query to better capture the information need
o Relevance feedback – refining the query using the results the user marks as relevant

9
Q

What are the primary goals of a search engine?

A
  1. Effectiveness – retrieving the most relevant set of documents possible
  2. Efficiency – processing queries as quickly as possible
10
Q

Issues in search engines

A

o Performance - efficient searching and indexing
o Incorporating new data
o Scalability – growing with data and users (e.g. handling large amounts of traffic)
o Adaptability – tuning for applications (e.g. adapting for use on a variety of devices)

11
Q

Document statistics

A

Gathering and recording statistical information about words and documents

12
Q

How are document statistics used?

A

The gathered information is stored in lookup tables and used by ranking algorithms

13
Q

Lookup table

A

Data structure designed for quick retrieval

14
Q

Weighting

A

Calculating weight using document statistics and storing it in a lookup table

15
Q

tf.idf weighting

A

Giving high weights to terms that occur often in a document (tf) but appear in very few documents in the collection (idf)
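
A minimal sketch of one common tf.idf variant, assuming tf is the term's share of the document's tokens and idf is log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants exist; the function name and the toy collection are invented for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict of doc id -> list of tokens. Returns doc id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights


# Toy collection (hypothetical).
docs = {
    "d1": ["tropical", "fish", "tank"],
    "d2": ["fish", "recipes"],
    "d3": ["fish", "tank", "filter"],
}
w = tf_idf(docs)
print(w["d1"])
```

In this collection "fish" appears in every document, so its weight is zero, while "tropical" appears in only one document and gets the highest weight in d1.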

16
Q

True or false: Weight is calculated during the query process

A

False! It can be calculated as part of the query process, but calculating it during indexing makes querying more efficient

17
Q

Inversion

A

Changing document-term info into term-document info
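
A small sketch of inversion, assuming the document-term information is held as a dictionary from document id to its terms. The result maps each term to the list of documents that contain it. The names and the sample data are illustrative only.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn {doc id: [terms]} into {term: [doc ids]}."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):
        # set() so a term repeated in one document is listed only once
        for term in sorted(set(doc_terms[doc_id])):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical document-term data.
doc_terms = {
    "d1": ["tropical", "fish"],
    "d2": ["fish", "tank", "fish"],
}
index = invert(doc_terms)
print(index)
```

Looking up a query term in the inverted form gives the matching documents directly, without scanning every document.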

18
Q

Methods of query transformation

A

o Spell checking
o Query suggestion
o Suggesting additional terms via query expansion

19
Q

When search results are displayed, snippets are generated to…

A

o Summarise retrieved documents
o Identify related groups of documents
o Highlight important words and passages

20
Q

Document data store

A

A database that manages large numbers of documents and the structured data (usually metadata or links) associated with them

21
Q

True or false: A document data store can be stored in a relational database

A

True, but some applications use other storage systems for faster retrieval

22
Q

Scoring

A

Calculating scores for documents using a ranking algorithm

23
Q

Performance optimisation

A

Designing ranking algorithms to decrease response time and increase query throughput

24
Q

Methods of distributing ranked documents

A

o Query broker – distributes queries across machines and assembles the results
o Caching results of common search queries
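
A minimal sketch of the caching idea, using functools.lru_cache from the Python standard library. The run_query function is a hypothetical stand-in for the expensive ranking step, and the calls list exists only to show how often it actually runs.

```python
from functools import lru_cache

calls = []  # records every time the expensive step really runs


def run_query(query):
    """Stand-in for the expensive ranking step (hypothetical)."""
    calls.append(query)
    return (f"results for {query}",)


@lru_cache(maxsize=1024)
def cached_search(query):
    # Repeated queries are answered from the cache instead of being re-ranked.
    return run_query(query)


first = cached_search("tropical fish")
second = cached_search("tropical fish")
print(len(calls), cached_search.cache_info().hits)
```

The second call returns the stored result, so the ranking step runs only once for the repeated query.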

25
Q

Parser

A

Processes the sequence of text tokens (usually words) in a document to recognise structural elements such as titles, links, and headings

26
Q

Stopping

A

Removes common words (stopwords such as “the”, “of”, and “and”) from the text tokens to reduce repetition and index size

27
Q

Stemming

A

Grouping words derived from a common stem (e.g. “fish”, “fishes”, and “fishing”) and replacing them with one designated word

28
Q

Why is stemming used?

A

Words in queries and documents are more likely to match
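
A toy sketch of stopping followed by stemming. The stopword list is a tiny made-up sample, and the stemmer is a deliberately crude suffix stripper written for this example, not a real algorithm such as Porter's.

```python
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "for", "to"}  # sample only
SUFFIXES = ("ing", "es", "s")  # tried in this order


def stem(word):
    """Crude suffix stripping; keeps at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOPWORDS]


doc_terms = index_terms("Fishing for the tropical fish")
query_terms = index_terms("fishes")
print(doc_terms, query_terms)
```

Because the document and the query are processed the same way, the query "fishes" matches a document that only contains "fishing" and "fish".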

29
Q

Information extraction

A

Using syntactic analysis to identify complex index terms

30
Q

Classifier

A

o Identify class-related metadata
o Assign labels to documents representing topic categories
o Group documents without pre-defined categories

31
Q

Crawler

A

Follows links to web pages to discover and download new pages

32
Q

Web crawling

A
  • Client program connects to domain name system (DNS) server
  • DNS server translates host name into an internet protocol (IP) address
  • Program attempts to connect to computer with that IP address
  • Once connection is established, client program sends an HTTP request to the server
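
The steps above can be sketched with Python's socket module. This is a bare-bones illustration for plain HTTP only (no HTTPS, redirects, or error handling); the function names and the example URL are assumptions made for the sketch, and a real crawler would normally use an HTTP library.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of url."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")


def fetch(url, timeout=10.0):
    """Resolve the host name, connect, send the request, read the raw response."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)  # DNS: host name -> IP address
    with socket.create_connection((ip_address, port), timeout=timeout) as conn:
        conn.sendall(request)  # HTTP request sent once connected
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# Building the request needs no network access; fetch() does.
host, port, request = build_request("http://example.com/index.html")
print(request.decode("ascii"))
```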
33
Q

Politeness policies

A

Standards that aim to reduce a crawler’s impact on a web server’s performance. These could include a politeness window (a minimum time between page requests) or being prevented from accessing certain pages (e.g. via a robots.txt file)
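
One way a crawler could check such rules is with urllib.robotparser from the Python standard library. In this sketch the robots.txt content, the crawler name ExampleBot, and the URLs are all invented; a real crawler would download the file from the site and wait for the stated delay between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pages the crawler may and may not request.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")

# Seconds to wait between requests (the politeness window), if the site sets one.
delay = parser.crawl_delay("ExampleBot")
print(allowed, blocked, delay)
```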

34
Q

Focused crawling

A

Crawling only pages about a particular topic; relies on web pages linking to other pages on the same topic

35
Q

How does focused crawling work?

A
  • Can use popular pages as seeds
  • Uses text classifiers to determine what a page is about
  • If a page is on topic, keeps the page and uses its links to find related sites
36
Q

How does a focused crawler decide which pages to visit next?

A
  • Tracking topicality of downloaded pages to determine whether to download similar pages
  • Anchor text data and topicality data can be combined to determine which pages to visit next
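
A rough sketch of how the two signals might be combined, assuming each link gets a score that averages the topicality of the page it was found on with the share of topic terms in its anchor text. The equal weighting, the scoring function, and the URLs are assumptions for illustration; the frontier is a priority queue that returns the highest-scoring unvisited link first.

```python
import heapq


def score_link(page_topicality, anchor_text, topic_terms):
    """Combine source-page topicality (0..1) with anchor-text overlap (0..1)."""
    anchor_words = set(anchor_text.lower().split())
    anchor_score = len(anchor_words & topic_terms) / len(topic_terms)
    return 0.5 * page_topicality + 0.5 * anchor_score  # equal weights: an assumption


class Frontier:
    """Priority queue of URLs still to visit, best score first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def next_url(self):
        return heapq.heappop(self._heap)[1]


topic = {"tropical", "fish"}
frontier = Frontier()
frontier.add("http://example.com/guppies", score_link(0.9, "tropical fish care", topic))
frontier.add("http://example.com/cars", score_link(0.9, "used cars", topic))
frontier.add("http://example.com/tanks", score_link(0.2, "fish tanks", topic))

first = frontier.next_url()
second = frontier.next_url()
print(first, second)
```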
37
Q

Deep web

A

Pages that are difficult for a crawler to find

38
Q

Examples of deep web pages

A
  • Sites that require an account
  • Form results (e.g. flight timetables, product search)
  • Scripted pages
39
Q

How can web pages become easier to find?

A

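Sitemaps – files listing a site’s URLs (optionally with modification time and update frequency) that tell crawlers about pages they might not otherwise find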
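
A short sketch of how a crawler might read a sitemap, using xml.etree.ElementTree from the Python standard library. The sitemap content and URLs are invented for the example; the element names follow the sitemaps.org XML format.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for an example site.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://example.com/store/item?x=1</loc>
    <changefreq>daily</changefreq>
  </url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return one dict per <url> entry; missing optional fields are None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
        })
    return entries


entries = sitemap_urls(SITEMAP_XML)
print(entries)
```

The second URL is a form-style page that a crawler following links alone might never reach.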

40
Q

Issues in crawlers and how to solve them

A
  • Text documents stored in incompatible formats - can be converted to tagged formats
  • Crawling can be expensive in terms of CPU and network load - may be useful to store downloaded documents so they do not need to be crawled again
41
Q

BigTable

A

A distributed database system in which the table is split into small pieces (tablets), which are served by thousands of machines

42
Q

How are changes recorded in BigTable?

A

Changes are recorded in a transaction log and stored in a shared file system

43
Q

True or false: If a BigTable tablet server crashes, the whole table is inaccessible

A

False! Another server will immediately read the tablet data and transaction log and take over
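
A toy model of this recovery idea, not BigTable's actual implementation: every write is appended to a log in shared storage, so a replacement server can rebuild the tablet by loading the stored tablet data and replaying the log. All class and field names here are made up for the sketch.

```python
class SharedStorage:
    """Stand-in for the shared file system: stored tablet data plus a transaction log."""

    def __init__(self):
        self.tablet_files = {}  # row key -> {column: value}
        self.log = []           # (row key, column, value) entries, in write order


class TabletServer:
    def __init__(self, storage):
        self.storage = storage
        # Start-up and recovery are the same: load tablet data, then replay the log.
        self.rows = {key: dict(cols) for key, cols in storage.tablet_files.items()}
        for row_key, column, value in storage.log:
            self.rows.setdefault(row_key, {})[column] = value

    def write(self, row_key, column, value):
        self.storage.log.append((row_key, column, value))  # record the change first
        self.rows.setdefault(row_key, {})[column] = value

    def read(self, row_key, column):
        return self.rows.get(row_key, {}).get(column)


storage = SharedStorage()
primary = TabletServer(storage)
primary.write("example.com/index.html", "title", "Example")
del primary  # simulate the tablet server crashing

replacement = TabletServer(storage)  # another server takes over the tablet
recovered = replacement.read("example.com/index.html", "title")
print(recovered)
```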

44
Q

True or false: BigTable is a relational database model

A

False! Unlike relational databases, not all rows have the same columns.
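
To picture rows with different columns, a table can be modelled as a dictionary of rows, each holding only the columns it actually has. This is only an analogy in plain Python; the row keys and column names are invented.

```python
# Each row stores only its own columns; there is no fixed schema.
table = {
    "example.com/index.html": {"title": "Example", "anchor:other.com": "click here"},
    "example.com/about.html": {"title": "About us", "language": "en"},
}


def get_cell(table, row_key, column, default=None):
    """Look up one cell; a column the row does not have yields the default."""
    return table.get(row_key, {}).get(column, default)


missing = get_cell(table, "example.com/index.html", "language")
present = get_cell(table, "example.com/about.html", "language")
print(missing, present)
```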

45
Q

Feed

A

Real-time stream of documents; one way that search engines acquire new documents

46
Q

Push feed

A

Alerts subscribers to new documents

47
Q

Pull feed

A

Requires the subscriber to check periodically for new documents (e.g. an RSS feed)
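
The difference between the two feed types can be sketched with two toy classes. With a push feed the publisher calls the subscriber; with a pull feed the subscriber asks for anything new since its last check. The class and method names are hypothetical.

```python
class PushFeed:
    """Publisher notifies every subscriber as soon as a document arrives."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, doc):
        for callback in self._subscribers:
            callback(doc)


class PullFeed:
    """Publisher only stores documents; subscribers must come and check."""

    def __init__(self):
        self._docs = []

    def publish(self, doc):
        self._docs.append(doc)

    def fetch_since(self, position):
        return self._docs[position:]


# Push: the subscriber is alerted without asking.
received = []
push = PushFeed()
push.subscribe(received.append)
push.publish("doc1")

# Pull: the subscriber remembers how far it has read and polls periodically.
pull = PullFeed()
pull.publish("doc1")
pull.publish("doc2")
seen = 0
new_docs = pull.fetch_since(seen)
seen += len(new_docs)
nothing_new = pull.fetch_since(seen)
print(received, new_docs, nothing_new)
```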