Spiders and crawlers Flashcards

1
Q

Information Retrieval

A

A field concerned with the structure, analysis, organisation, storage, and retrieval of information.

2
Q

Goals of Search Engines

A

Effectiveness - quality - retrieve the most relevant set of documents possible

Efficiency - speed - process queries from users as quickly as possible

3
Q

Indexing Process

A

Text acquisition, transformation and index creation

4
Q

Text acquisition

A

Acquire documents (e.g. with a crawler or from a feed) and store the parsed data in a document data store

5
Q

Text transformation

A

Remove duplicates from the store, classify information and organise data

6
Q

Index creation

A

Create an index to quickly locate information; the index is stored in an index database that can be queried by the user
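
As an illustration of what "an index to quickly locate information" can look like, here is a minimal sketch of an inverted index. The function name `build_inverted_index` and the sample documents are assumptions made for this example, and tokenisation is simplified to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Simplified tokenisation: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, keyed by document id.
docs = {1: "tropical fish tank", 2: "fish and chips", 3: "tank battle"}
index = build_inverted_index(docs)
print(index["fish"])  # [1, 2]
```

Looking up a term is then a single dictionary access rather than a scan over every document.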

7
Q

Query Process

A

Interaction, evaluation and log data

8
Q

Interaction

A

User interacts with the search engine, which runs a ranking algorithm over the index and displays the most relevant results from the document data store

9
Q

Evaluation

A

Find out how relevant the results are, e.g. ask users to rate the relevance of the results so the algorithm can be updated

10
Q

Log data

A

Log user data and update algorithm

11
Q

Types of text acquisition

A

Crawler and feed

12
Q

Crawler

A

Web crawler is the most common method of text acquisition: opens a webpage and looks for links
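
The "opens a webpage and looks for links" loop can be sketched roughly as follows. This is a minimal illustration, not a production crawler: `extract_links`, `crawl` and the in-memory `fake_web` dictionary are assumptions made for the example, and the `fetch` argument stands in for a real HTTP download.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, queue any links not yet seen."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)  # returns None if the page cannot be fetched
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical three-page site held in memory instead of on the network.
fake_web = {
    "https://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">home</a>',
}
visited = crawl("https://example.com/", fake_web.get)
```

The `seen` set is what stops the crawler from looping forever when pages link back to each other.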

13
Q

Challenges faced by crawler

A

Pages are constantly updated, so crawlers must be run very frequently to keep up; they may not be able to handle huge volumes of new pages

Can only operate on a single website

14
Q

Focused crawler

A

Uses a classification technique to determine whether a page is relevant to a topic; links from pages deemed irrelevant are not followed, so irrelevant parts of the web are not accessed

15
Q

Feed

A

Real time stream of documents e.g. a news feed

Search engine acquires new documents simply by monitoring the feed

16
Q

Feed Conversion

A

Documents in feed are rarely plain text

Search engines require them to be converted into consistent text + metadata
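
As a rough illustration of conversion into "consistent text + metadata", the sketch below turns a small RSS 2.0 feed into a list of uniform records. The sample feed, the function name `feed_to_documents` and the choice of output fields are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with two items.
RSS = """<rss version="2.0"><channel>
<title>Example News</title>
<item><title>First story</title><link>https://example.com/1</link>
<description>Body one</description></item>
<item><title>Second story</title><link>https://example.com/2</link>
<description>Body two</description></item>
</channel></rss>"""


def feed_to_documents(rss_xml):
    """Convert each <item> into a plain-text body plus a metadata dict."""
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "text": item.findtext("description", default="").strip(),
            "metadata": {
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
            },
        })
    return docs


docs = feed_to_documents(RSS)
```

Every document, whatever its original format, ends up with the same two keys, which is what the later indexing steps rely on.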

17
Q

Document Data Store

A

Database to manage large numbers of documents and structured data (metadata) associated with them

18
Q

Types of text transformation

A

Parser, stopping and stemming

19
Q

Parser

A

Processes a sequence of text tokens

Uses knowledge of syntax to identify structure of text/information

20
Q

Stopping

A

Removes common words from the stream of tokens e.g. ‘the’, ‘of’, ‘to’, ‘for’

Reduces the size of the index, usually with little effect on quality (though queries made up mostly of common words can suffer)
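
A minimal sketch of stopping, assuming a tiny stopword list taken from the examples on this card (real lists are much longer). The names `STOPWORDS` and `remove_stopwords` are made up for the example.

```python
# Tiny illustrative stopword list; real systems use far larger ones.
STOPWORDS = {"the", "of", "to", "for"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = "The history of search engines".split()
kept = remove_stopwords(tokens)
print(kept)  # ['history', 'search', 'engines']
```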

21
Q

Stemming

A

Group words together that derive from a common stem e.g. ‘fish’, ‘fishes’ and ‘fishing’

May not be effective for all languages
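
To make the 'fish' example concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen only so the example words collapse to one stem.

```python
# Suffixes are tried in this order; longer ones first.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ("fish", "fishes", "fishing")]
print(stems)  # ['fish', 'fish', 'fish']
```

A rule set like this is tied to English suffixes, which is one reason stemming may not carry over to other languages.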

22
Q

User interaction

A

Query input, query transformation and results output

23
Q

Query input

A

A small number of keywords entered by the user to query the documents, e.g. a web query

24
Q

Query transformation

A

Tokenisation, stopping and stemming must be applied to the query, in the same way as to the documents, so that query terms can be compared with document terms

25
Q

Results output

A

Construct display of ranked documents - snippets of documents, important words/passages etc

26
Q

Logging

A

One of the most valuable sources for tuning and improving search engines

Ranking analysis uses log data to compare effectiveness of algorithm

Simulations

27
Q

Politeness Policy

A

Crawlers may limit themselves to one page request every x seconds per server, instead of fetching at maximum speed, to avoid overloading the server
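
One way to picture a politeness policy is a per-host timer that tells the crawler how long to wait before its next request to the same server. The sketch below is an assumption-laden illustration: the class name `PolitenessGate`, its methods and the 5-second delay are all made up for the example.

```python
import time
from urllib.parse import urlparse


class PolitenessGate:
    """Track, per host, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.next_allowed = {}        # host -> earliest allowed fetch time

    def wait_time(self, url):
        """Seconds the crawler should still wait before fetching this URL."""
        host = urlparse(url).netloc
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url):
        """Call right after fetching; starts the delay for that host."""
        host = urlparse(url).netloc
        self.next_allowed[host] = self.clock() + self.delay


gate = PolitenessGate(delay_seconds=5.0)
gate.record_fetch("https://example.com/page1")
# Another page on the same host must wait; a different host need not.
same_host_wait = gate.wait_time("https://example.com/page2")
other_host_wait = gate.wait_time("https://example.org/")
```

Because the delay is tracked per host, the crawler can stay busy fetching from other servers while it waits.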

28
Q

Rate limits

A

A limit on the number of times a single IP address is allowed to access a server within a given period
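
From the server's side, a rate limit can be sketched as a sliding window of recent request times per IP address. This is one possible scheme among several; the class name `RateLimiter`, the limit of 2 requests per 10 seconds and the sample IP address are assumptions made for the example.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.history = defaultdict(deque)  # ip -> times of recent requests

    def allow(self, ip, now):
        recent = self.history[ip]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


limiter = RateLimiter(max_requests=2, window_seconds=10)
# Requests from one IP at t = 0, 1, 2 and 11 seconds.
results = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 11)]
print(results)  # [True, True, False, True]
```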

29
Q

Deep Web

A

Private sites (accounts), form results (bus/flight timetables), scripted pages (JavaScript)

Difficult for a crawler to find

30
Q

Conversion Problem

A

Text documents are stored in many incompatible formats: PDF, raw text, RTF, HTML, XML and others

Sometimes the text is in PPT/Excel documents or obsolete formats

31
Q

BigTable

A

Used internally at Google

A distributed database system built for storing web pages; logically it is one very large table

Split into tablets served by thousands of machines; any changes are recorded in a transaction log

If a tablet server crashes, another server can read the data from the transaction log and take over
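
The recovery idea (record every change in a log, then replay the log on another machine) can be shown with a toy model. This is not Google's implementation: the class `TabletServer`, its methods and the use of a Python list as the shared log are all assumptions made for illustration.

```python
class TabletServer:
    """Toy tablet server: every write goes to a shared log before memory."""

    def __init__(self, log):
        self.log = log      # stands in for a durable, shared transaction log
        self.rows = {}      # in-memory state, lost if this server crashes

    def write(self, key, value):
        self.log.append((key, value))  # record the change first
        self.rows[key] = value

    @classmethod
    def recover(cls, log):
        """Rebuild a crashed server's state by replaying the log in order."""
        server = cls(log)
        for key, value in log:
            server.rows[key] = value
        return server


shared_log = []
primary = TabletServer(shared_log)
primary.write("example.com/", "<html>v1</html>")
primary.write("example.com/", "<html>v2</html>")
primary.write("example.org/", "<html>home</html>")

# Simulate a crash: the primary's memory is gone, but the log survives.
replacement = TabletServer.recover(shared_log)
```

Replaying in order matters: the later write to the same key must win, which is why the log is applied from start to end.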