Information retrieval Flashcards

Question 1

Q

What is the task of IR systems?

Answer

A

Finding results that are similar to the query

Question 2

Q

What is the difference between searching in an IR system and in a database?

Answer

A

A database query always returns exact matches, whereas an IR system retrieves documents that are similar to the query.

Question 3

Q

Different retrieval techniques, depending on the search query

Answer

A

Non-textual objects: meta description
Content: bag of words
Semantic tagging: what the pieces of text mean
Link analysis: indicates the importance of a document by its incoming links (authority)

Question 4

Q

Which types of retrieval models are there?

Answer

A

Boolean and vector space

Question 5

Q

Explain the boolean retrieval model

Answer

A

Each document is a bag of words. The user writes a Boolean query that tells the search system in more detail what to search for and how. The query contains Boolean operators: AND, OR, NOT. The model can only filter, not sort.
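
A minimal sketch of Boolean filtering, assuming a made-up three-document collection and a hypothetical helper `containing`. AND, OR and NOT map onto set intersection, union and difference; the result is an unranked set of matching documents.

```python
# Toy collection (invented for illustration); each document becomes a set of words.
docs = {
    "d1": "information retrieval finds similar documents",
    "d2": "a database returns exact matches",
    "d3": "retrieval from a database",
}
bags = {doc_id: set(text.split()) for doc_id, text in docs.items()}
all_ids = set(bags)

def containing(term):
    """Return the ids of all documents that contain the term."""
    return {doc_id for doc_id, words in bags.items() if term in words}

# AND = intersection, OR = union, NOT = difference from the full collection.
retrieval_and_database = containing("retrieval") & containing("database")
retrieval_or_database = containing("retrieval") | containing("database")
retrieval_not_database = containing("retrieval") & (all_ids - containing("database"))

print(sorted(retrieval_and_database))  # ['d3']
print(sorted(retrieval_or_database))   # ['d1', 'd2', 'd3']
print(sorted(retrieval_not_database))  # ['d1']
```

A document either matches or it does not, which is why the result carries no ordering by relevance.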

Question 6

Q

Explain the extended boolean retrieval model

Answer

A

The searcher has more control over the search process: the model considers text structure and the distance between words when it matches the query to a piece of text. No ranking, just sorting.

Question 7

Q

Explain vector space retrieval model

Answer

A

A vector is defined by its length and direction. The coordinates alone are enough to determine a vector's length, thanks to the Pythagorean theorem.

Question 8

Q

What is Bag of words

Answer

A

A collection of the words in a document, with the frequency of each word recorded; word order is ignored

The structure of the text is lost
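
A small sketch of a bag of words using Python's `collections.Counter`; the example sentences are made up. It shows that word frequencies are kept while word order is lost.

```python
from collections import Counter

text = "the cat sat on the mat the end"
bag = Counter(text.split())  # maps each word to its frequency
print(bag["the"])  # 3

# The same words in a different order produce exactly the same bag.
shuffled = "the end the mat on sat the cat"
same_bag = Counter(shuffled.split()) == bag
print(same_bag)  # True
```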

Question 9

Q

common retrieval models

Answer

A

similarity between document vectors
term weight (measuring importance of word)
evaluation of retrieval

Question 10

Q

What is the purpose of term vector similarity?

Answer

A

Finding the similarity between document vectors, for cases where the query is more than a few words.

Question 11

Q

Explain term vector similarity between documents

Answer

A

Term-document vector space:
Documents are represented as vectors in an n-dimensional space, where each dimension (axis) is a term (word) and each vector coordinate is the weight of that term in the document. If a word is present in a document it gets the value 1 (in the binary case), so the vector sits at position 1 on that axis. The direction of the vector is therefore determined by the words in the text, while its length depends on the number of words in the document (not important).
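
A sketch of binary term-document vectors, assuming two made-up documents and a hypothetical helper `binary_vector`. Each position in a vector corresponds to one term of the shared vocabulary.

```python
# Two invented documents.
docs = {"d1": "cats chase mice", "d2": "dogs chase cats and cats"}

# One dimension (axis) per distinct term in the collection.
vocabulary = sorted({word for text in docs.values() for word in text.split()})
print(vocabulary)  # ['and', 'cats', 'chase', 'dogs', 'mice']

def binary_vector(text):
    """1 on an axis if the term occurs in the text, otherwise 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

vectors = {doc_id: binary_vector(text) for doc_id, text in docs.items()}
print(vectors["d1"])  # [0, 1, 1, 0, 1]
print(vectors["d2"])  # [1, 1, 1, 1, 0]
```

Note that "cats" occurring twice in d2 still gives 1 in the binary case; a weighted scheme would record a larger value.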

Question 12

Q

How is similarity measured?

Answer

A

Common terms: straightforward, count the number of terms that q and d have in common.
Scalar product: multiply the corresponding coordinates of the two vectors and sum the products (x1x2 + y1y2). The result grows with the number of words in the document, so it is normalized by the vector lengths.
Cosine similarity: the similarity between two documents, calculated as a function of the angle between the term vectors of these documents.
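
The three measures can be sketched as follows. The function names and example vectors are illustrative assumptions, and the cosine function assumes neither vector is all zeros.

```python
import math

def common_terms(q, d):
    """Count the axes on which both vectors are non-zero."""
    return sum(1 for a, b in zip(q, d) if a > 0 and b > 0)

def scalar_product(q, d):
    """Multiply corresponding coordinates and sum the products."""
    return sum(a * b for a, b in zip(q, d))

def length(v):
    """Vector length from its coordinates (Pythagorean theorem)."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(q, d):
    """Scalar product normalized by both lengths; assumes non-zero vectors."""
    return scalar_product(q, d) / (length(q) * length(d))

q = [1, 1, 0]
d = [3, 3, 0]  # same direction as q, but three times as long

print(common_terms(q, d))                 # 2
print(scalar_product(q, d))               # 6
print(round(cosine_similarity(q, d), 3))  # 1.0
```

The scalar product grows with the length of d, while the cosine depends only on the angle between the vectors, so a longer document on the same topic is not favored.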

Question 13

Q

How is term weight used?

Answer

A

Measuring the importance of a word: instead of a binary outcome, a higher number means a more important term.

Term frequency: how often the term appears in a document.

Inverse document frequency (IDF): how unique the term is in the collection of documents. Low IDF means not unique, high IDF means unique. Computed as total number of documents / number of documents containing the term.

Term weight = term frequency * inverse document frequency

The weight is high if the term is frequent in a document and unique to a small subset of documents. It is both document-specific and collection-specific.
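
A sketch of the term weight calculation using the ratio given above (total documents divided by documents containing the term). The collection and function names are made up; many systems take the logarithm of this ratio, which is not done here. The sketch assumes the term occurs in at least one document.

```python
# Invented collection: each document is a list of its words.
docs = {
    "d1": ["apple", "apple", "banana"],
    "d2": ["apple", "cherry"],
    "d3": ["apple", "banana", "banana", "banana"],
}

def term_frequency(term, doc_id):
    """How often the term appears in one document."""
    return docs[doc_id].count(term)

def inverse_document_frequency(term):
    """Total documents / documents containing the term (no logarithm)."""
    containing = sum(1 for words in docs.values() if term in words)
    return len(docs) / containing  # assumes containing > 0

def term_weight(term, doc_id):
    return term_frequency(term, doc_id) * inverse_document_frequency(term)

print(term_weight("apple", "d1"))   # 2.0  (frequent, but present in every document)
print(term_weight("banana", "d3"))  # 4.5  (frequent and present in only two documents)
print(term_weight("cherry", "d2"))  # 3.0  (rare in the collection)
```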

Question 14

Q

How is retrieval evaluated?

Answer

A

Precision and recall

Question 15

Q

What is precision?

Answer

A

Fraction of retrieved documents that are relevant. Out of all the documents we retrieved, how many are relevant?

Question 16

Q

What is recall?

Answer

A

Fraction of relevant documents that are retrieved. Out of all the relevant documents, how many did we manage to retrieve?
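
Precision and recall can both be computed from the set of retrieved documents and the set of relevant documents. A minimal sketch with made-up document ids:

```python
retrieved = {"d1", "d2", "d3", "d4"}       # what the system returned
relevant = {"d2", "d4", "d5", "d6", "d7"}  # what it should ideally have returned

hits = retrieved & relevant                # relevant documents that were retrieved

precision = len(hits) / len(retrieved)     # share of the retrieved that is relevant
recall = len(hits) / len(relevant)         # share of the relevant that was retrieved

print(precision)  # 0.5
print(recall)     # 0.4
```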

Question 17

Q

Precision and recall curve (interpolated)

Answer

A

It is like an averaged precision and recall curve. For each given (standard) recall level, we find the largest measured precision among all recall values equal to or larger than (further to the right of) that level and plot it on the Y-axis (precision); this gives the interpolated precision value. Many curves can be summarized this way. The interpolated curve never rises as recall increases.
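
A sketch of the interpolation step, assuming made-up measured (recall, precision) points and eleven standard recall levels from 0.0 to 1.0. The function name `interpolated_precision` is illustrative.

```python
# Invented measurements: (recall, precision) pairs for one query.
measured = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.57), (1.0, 0.5)]

def interpolated_precision(recall_level):
    """Largest measured precision at any recall >= the given level."""
    candidates = [p for r, p in measured if r >= recall_level]
    return max(candidates) if candidates else 0.0

standard_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
curve = [(level, interpolated_precision(level)) for level in standard_levels]

for level, precision in curve:
    print(level, precision)
```

Although the measured precision rises from 0.5 to 0.57 between recall 0.6 and 0.8, the interpolated value at 0.6 is already 0.57, so the resulting curve never increases from left to right.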

Question 18

Q

Why does high recall mean low precision and vice versa?

Answer

A

If we wish to get a higher recall, we need to retrieve more documents in order to collect the majority of the relevant ones. This lowers precision, since we also get more junk (irrelevant documents). If we wish to increase precision, we need to retrieve fewer documents, but then we miss more relevant documents, which lowers recall.

Question 19

Q

What is pooling and why do we do it?

Answer

A

Pooling makes it possible to determine the relevance of documents manually, and thus to evaluate retrieval. We create a small subset of documents that hopefully, but not necessarily, contains all the relevant documents. We then verify relevance manually within that subset and use it to define the total number of relevant documents, which is needed to calculate recall. (E.g. of the first 3000 documents, how many are relevant, and how many of them do we capture in the first 5 results?)

Question 20

Q

How do we do query expansion?

Answer

A

We add synonyms of the terms to the original query and change the term weights in the original query, in order to retrieve more relevant documents: those that the original query missed.
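
A sketch of synonym-based query expansion. The synonym table, the function `expand_query` and the weight 0.5 given to added synonyms are all assumptions made for illustration.

```python
# Hypothetical synonym table; a real system might use a thesaurus.
synonyms = {"car": ["automobile", "vehicle"], "cheap": ["inexpensive"]}

def expand_query(query, synonym_weight=0.5):
    """Return a term -> weight mapping: original terms get 1.0, synonyms less."""
    weights = {}
    for term in query.split():
        weights[term] = 1.0
        for synonym in synonyms.get(term, []):
            # Do not overwrite a weight that is already set.
            weights.setdefault(synonym, synonym_weight)
    return weights

expanded = expand_query("cheap car")
print(expanded)
# {'cheap': 1.0, 'inexpensive': 0.5, 'car': 1.0, 'automobile': 0.5, 'vehicle': 0.5}
```

Giving synonyms a lower weight than the original terms is one possible design choice; it keeps documents that match the user's own words ahead of those that match only an added term.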

Question 21

Q

How does relevance feedback work, and in particular how is explicit relevance feedback carried out?

Answer

A

Relevance feedback can be given manually, in which case it is called explicit relevance feedback: the user indicates which retrieved documents are relevant to the query. The system modifies the original query to include the terms that occur in the documents marked as relevant, and the user submits the modified query to the system.

Otherwise it can be carried out implicitly, through analysis of user behavior (links clicked, time spent).
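
A simplified sketch of the explicit case, assuming the system adds the most frequent new terms from the documents the user marked as relevant. The function `feedback_query`, the cut-off of two extra terms and the example texts are invented for illustration; real systems typically also reweight terms.

```python
from collections import Counter

def feedback_query(query_terms, relevant_docs, extra_terms=2):
    """Extend the query with the most frequent new terms from relevant documents."""
    counts = Counter(
        word
        for doc in relevant_docs
        for word in doc.split()
        if word not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(extra_terms)]
    return list(query_terms) + new_terms

# Documents the user marked as relevant for the query "jaguar".
marked_relevant = ["jaguar cat habitat", "jaguar cat diet", "big cat habitat"]
modified = feedback_query(["jaguar"], marked_relevant)
print(modified)  # ['jaguar', 'cat', 'habitat']
```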

Question 22

Q

What is document indexing?

Answer

A

Transforming unstructured data into structured data, such as an ordered list of words. Searching is faster in structured data that represents the document collection with information about which terms are present in which documents, how frequently, and at which locations in the document. This is done offline by crawlers. When you search, the search engine computes text similarity using the document index, where each word is a dimension.
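
A minimal sketch of such an index, assuming a made-up two-document collection. For every term it records the documents the term occurs in and the word positions; the frequency follows from the number of positions.

```python
from collections import defaultdict

# Invented documents standing in for crawled pages.
docs = {
    "d1": "search engines index documents",
    "d2": "crawlers fetch documents and index documents",
}

# term -> {document id -> list of word positions}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        index[word].setdefault(doc_id, []).append(position)

print(index["documents"])             # {'d1': [3], 'd2': [2, 5]}
print(len(index["documents"]["d2"]))  # 2  (frequency of the term in d2)
print(sorted(index["index"]))         # ['d1', 'd2']
```

A query term can now be looked up directly in the index instead of scanning the text of every document.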

Question 23

Q

Why is link analysis important for search engines?

Answer

A

A page that receives a lot of links from pages on the same topic is an authority on the topic. Authorities are important pages. A link is like a vote from another page: many links mean high importance, and they are not easily faked.

Question 24

Q

What is PageRank?

Answer

A

PageRank (PR) shows how important a page is in terms of the number of incoming links and the importance of the linking pages.

The authority of a link counts
Inbound links create weights -> outbound links transfer weights
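
A sketch of the idea as an iterative calculation, assuming a made-up four-page link graph, a damping factor of 0.85 and 50 iterations (none of these come from the notes). It also assumes every page has at least one outgoing link.

```python
# Invented link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def page_rank(links, damping=0.85, iterations=50):
    """Repeatedly let each page pass its rank on, split over its outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)  # weight transferred per link
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

ranks = page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C receives links from three pages and ends up with the highest rank, while D, which nobody links to, ends up with the lowest.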

Question 25

Q

What is the purpose of search engine optimization (SEO)?

Answer

A

To improve the volume and quality of traffic to a website from search engines via natural search results

Question 26

Q

What do we optimize when we do SEO? (the five building blocks of SEO)

Answer

A

Targeted keywords: Keywords and links associated with a webpage
Search friendly site: Crawler friendliness of the website
Inbound links -> impression of a popular website with fresh and user-generated content
Site authority
Mobile-device friendliness of the website

Question 27

Q

What SEO can we perform on a page?

Answer

A

Keywords specific to the content on the page
Page rank assigned to the page
Link text that describes the page

Question 28

Q

What SEO can we perform on an entire website?

Answer

A

Crawler friendly website
domain trust
link diversity
geo targeting signals

Question 29

Q

Explain targeted keywords (SEO)

Answer

A

You need to know which keywords are going to generate sales.
Check the competition, use unique keywords, consider slang and long-tail keywords.

Evaluation of keywords -> opportunity with keyword = popularity of keyword / competition for keyword
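
The opportunity ratio can be sketched as below. The keywords and the popularity and competition figures are invented; in practice they would come from a keyword research tool.

```python
# keyword -> (popularity, competition); all numbers are made up.
keywords = {
    "shoes": (100000, 5000),
    "running shoes": (20000, 800),
    "red running shoes": (900, 15),
}

def opportunity(popularity, competition):
    """Opportunity with a keyword = popularity / competition."""
    return popularity / competition

scores = {kw: opportunity(*figures) for kw, figures in keywords.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for kw in ranked:
    print(kw, scores[kw])
# red running shoes 60.0
# running shoes 25.0
# shoes 20.0
```

With these invented numbers the longest, most specific keyword has the best opportunity even though it is the least popular.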

Question 30

Q

What are long-tail keywords?

Answer

A

Long-tail keywords contain more words, which describe the information need more precisely. It is therefore more likely that the page will be found by a searcher who needs that page.

Question 31

Q

How to create a search friendly site

Answer

A

Site map
Robots exclusion protocol
Redirection of links
fix broken links
avoid frames
avoid Flash (doesn’t work on mobile devices)
page load time under 3 seconds
unique and fresh content
canonical URLs
evergreen content

Question 32

Q

What are canonical URLs?

Answer

A

A set of similar pages, each containing a reference to a preferred page in the set. This tells the search engine that only the preferred page is relevant and that the similar pages can be ignored. It avoids the problem of duplicate content and keeps the page rank from being divided between the pages.

Question 33

Q

What is evergreen content?

Answer

A

Websites need content, including a subsection that does not go out of date and that continuously attracts visitors and external links. It stays relevant, sustainable and long-lasting, so traffic grows over time.

Question 34

Q

What defines natural link structure of inbound links?

Answer

A

Inbound link anchor text varies
number of inbound links increases gradually
Good and reliable links
links are rarely reciprocal

Question 35

Q

What defines artificial link structure of inbound links?

Answer

A

inbound link anchor text is identical
inbound link count increases suddenly
links come from link farms and web rings
many links are reciprocal

Question 36

Q

What are considered good links?

Answer

A

from trusted domains, on-topic authorities, high page rank, link diversity, deep linking

Question 37

Q

What is considered a bad link?

Answer

A

a link from every page of another site, low quality sites, off-topic sites (nofollow if needed)

Question 38

Q

What types of mobile SEO can one consider?

Answer

A

Dedicated mobile & desktop (2 sites) or Responsive Web design

Question 39

Q

Why mobile SEO?

Answer

A

If your site is not mobile-friendly it will not be considered by Google. Mobile devices can’t display Flash, and pop-ups and JavaScript often work poorly on them.
Searches from mobile devices are more likely to be location-specific, and taking this into account will rank you higher.

Question 40

Q

pros and cons of dedicated mobile & desktop site

Answer

A

+ optimized compact mobile site
+ to be considered if you have a lot of mobile search visitors (like facebook)
- 2 sites to maintain
- the mobile site doesn’t accumulate authority and thus doesn’t rank well; less text to index
- the desktop site needs to recognize mobile crawlers and redirect them

Question 41

Q

What is responsive web design (adaptive web design), and what are its pros and cons?

Answer

A

Responsive web design means that the same site is used for mobile and desktop search results; it is the same site but in a different format. The site is squeezed to fit the screen of the device. The size of web page elements such as images and tables is defined in proportion to the size of the browser window; it is not defined in pixels but as a “flexible size” in “proportion-based grids”.

+ one site to maintain
+ all page authority is accumulated in one place
+ mobile users have full access to the content
+ Google's preferred method
- an optimized mobile page loads faster
- compromises the web site
- mobile-only features
- older phones do not work well

Question 42

Q

What is black hat SEO?

Answer

A

Aggressive, unethical SEO strategies that focus on search engines and not on a human audience.

Keyword spamming, stuffing -> overuse of keywords, meta tags and link text

Spamdexing: invisible or semi-visible text; a lot of keywords and links are written on the page and then covered with a picture or a Flash movie.

Question 43

Q

What is white hat SEO?

Answer

A

Refers to ethical SEO strategies: optimize the page for both humans and search engines, according to the manuals.