Information retrieval Flashcards

1
Q

What is the task of IR systems

A

Finding results that are similar to the query

2
Q

What is the difference between searching in an IR system and in a database?

A

A database query always returns exact matches, whereas an IR system retrieves documents that are similar to the query

3
Q

Different retrieval techniques dependent on search query

A

Non-textual objects - meta description
Content - bag of words
Semantic tagging - what the piece of text means
Link analysis - indicates the importance of a document by its incoming links (authority)

4
Q

Which types of retrieval models are there?

A

Boolean and vector space

5
Q

Explain the boolean retrieval model

A

Each document is a bag of words. The user writes a Boolean query that tells the search system in more detail what to search for and how. The query contains Boolean operators: AND, OR, NOT. The model can only filter documents, not sort them.
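
A minimal sketch of Boolean filtering, assuming a made-up three-document collection and a hypothetical helper `having`. Each operator maps to a set operation, and the result is an unordered set of matching documents, not a ranking.

```python
# Hypothetical mini-collection: document id -> text.
docs = {
    1: "information retrieval with boolean queries",
    2: "database systems return exact matches",
    3: "boolean logic in database queries",
}

# One set of terms per document (only presence matters for Boolean retrieval).
doc_terms = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def having(term):
    """Return the ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in doc_terms.items() if term in terms}

# boolean AND database -> set intersection
and_result = having("boolean") & having("database")
# boolean OR retrieval -> set union
or_result = having("boolean") | having("retrieval")
# queries AND NOT database -> set difference
not_result = having("queries") - having("database")

print(sorted(and_result), sorted(or_result), sorted(not_result))
# prints: [3] [1, 3] [1]
```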

6
Q

Explain the extended boolean retrieval model

A

The searcher has more control over the search process: the model considers text structure and the distance between words when it matches the query to a piece of text. No ranking, just sorting.

7
Q

Explain vector space retrieval model

A

A vector is defined by its length and direction. The coordinates alone are enough to determine a vector's length, thanks to the Pythagorean theorem.
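
As a small illustration of that last point, the length can be computed from the coordinates alone. The function name `vector_length` is just a label chosen for this sketch.

```python
import math

def vector_length(coords):
    """Length of a vector: square root of the sum of squared coordinates."""
    return math.sqrt(sum(c * c for c in coords))

print(vector_length([3, 4]))  # prints: 5.0
```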

8
Q

What is Bag of words

A

A multiset of the words in a document: each word is listed together with its frequency, regardless of where it occurs in the text

The structure of the text is lost
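
A quick sketch using Python's `collections.Counter`, assuming whitespace tokenization and lowercasing (real systems tokenize more carefully). The two example sentences mean different things but produce the same bag, which is what losing the structure looks like in practice.

```python
from collections import Counter

def bag_of_words(text):
    """Map each word to its frequency; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"], a == b)  # prints: 2 True
```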

9
Q

common retrieval models

A
  • similarity between document vectors
  • term weight (measuring importance of word)
  • evaluation of retrieval
10
Q

What is the purpose of term vector similarity

A

Finding the similarity between document vectors, for cases where the query is more than a few words.

11
Q

Explain term vector similarity between documents

A

Term-document vector space:
documents are represented as vectors in an n-dimensional space, where each dimension (axis) is a term and each vector coordinate is the weight of that term in the document. If a word is present in a document it gets the value 1 (with binary weights), so the vector sits at 1 on that axis. The direction is therefore determined by the words in the text, and the length depends on the number of words in the document (not important).

12
Q

How is similarity measured?

A
  • Common terms = straightforward: count the number of terms that q and d have in common
  • Scalar product = multiply the corresponding coordinates of the two vectors and add the products (x1x2 + y1y2). The result depends on the vector lengths, so it has to be normalized by the lengths (which reflect the number of words in each document) to be comparable.
  • Cosine similarity = the similarity between two documents calculated as a function of the angle between their term vectors.
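
The three measures can be sketched as follows, assuming documents are already term vectors over the same axes. The function names and the toy vectors are made up for illustration. Note how the scalar product grows with document length while the cosine stays the same for vectors pointing in the same direction.

```python
import math

def common_terms(q, d):
    """Number of axes (terms) on which both vectors are non-zero."""
    return sum(1 for x, y in zip(q, d) if x > 0 and y > 0)

def scalar_product(q, d):
    """Sum of the products of corresponding coordinates."""
    return sum(x * y for x, y in zip(q, d))

def cosine_similarity(q, d):
    """Scalar product divided by the product of the two vector lengths."""
    norm_q = math.sqrt(scalar_product(q, q))
    norm_d = math.sqrt(scalar_product(d, d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector has no direction
    return scalar_product(q, d) / (norm_q * norm_d)

# Axes: "cat", "dog", "fish"
q = [1, 1, 0]
d_short = [1, 1, 0]
d_long = [3, 3, 0]   # same direction as d_short, three times as long
d_other = [0, 0, 2]  # shares no terms with q

print(scalar_product(q, d_short), scalar_product(q, d_long))  # prints: 2 6
```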
13
Q

How is term weight used?

A

It measures the importance of a word: instead of a binary outcome, a higher number means a more important term.

Term frequency (TF) - how often the term appears in a document

Inverse document frequency (IDF) - how unique the term is in the collection of documents. Low IDF = not unique, high IDF = unique. Total number of documents / number of documents containing the term (often log-scaled).

Term weight = term frequency * inverse document frequency

It is high if the term is frequent in a document and unique to a small subset of documents. The weight is document and collection specific.
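
A minimal TF-IDF sketch over a made-up three-document collection. One assumption that goes beyond the card: the ratio of total documents to documents containing the term is passed through a logarithm, which is a common variant and gives weight 0 to a term that appears in every document.

```python
import math
from collections import Counter

# Hypothetical collection; each document becomes a bag of words.
docs = ["cat cat dog", "dog bird", "dog fish fish fish"]
bags = [Counter(text.split()) for text in docs]
n_docs = len(bags)

def idf(term):
    """log(total documents / documents containing the term); 0 if unseen."""
    df = sum(1 for bag in bags if term in bag)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term, bag):
    """Term weight = term frequency in this document * IDF in the collection."""
    return bag[term] * idf(term)

print(tf_idf("dog", bags[0]))            # "dog" is in every document: 0.0
print(round(tf_idf("cat", bags[0]), 3))  # frequent here, rare elsewhere
```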

14
Q

How is retrieval evaluated

A

Precision and recall

15
Q

What is precision

A

Fraction of retrieved documents that are relevant. Out of all the documents we retrieved, how many are relevant?

16
Q

What is recall

A

Fraction of relevant documents that are retrieved. Out of all the relevant documents, how many did we manage to retrieve?
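
Both measures can be sketched with sets of document ids. The ids below are invented; the function name `precision_recall` is just for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant documents exist, 2 are in both sets.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # prints: 0.5 0.4
```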

17
Q

Precision and recall curve (interpolated)

A

It works like an averaged precision-recall curve. For each standard recall level, we take the largest measured precision at any recall value equal to or larger than (further to the right of) that level and plot it on the y-axis (precision); this is the interpolated precision value. It lets us summarize many curves in one. The interpolated curve never rises as recall increases.
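
A sketch of the interpolation step, assuming the measured curve is given as (recall, precision) pairs and using made-up numbers. The standard recall levels chosen here are arbitrary.

```python
def interpolated_precision(points, levels):
    """For each recall level, take the highest precision measured at
    any recall value equal to or larger than that level."""
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result

# Hypothetical measured (recall, precision) pairs for one query.
measured = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.57), (1.0, 0.5)]
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

curve = interpolated_precision(measured, levels)
print(curve)  # prints: [1.0, 0.67, 0.57, 0.57, 0.5]
```

The measured precision dips to 0.5 at recall 0.6 and then rises again; the interpolated curve smooths that out and never increases from left to right.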

18
Q

Why does high recall mean low precision and vice versa?

A

If we want higher recall, we need to retrieve more documents in order to collect the majority of the relevant documents. This hurts precision, since we will also get more junk (irrelevant documents). If we want higher precision, we need to retrieve fewer documents, but this means we will miss more relevant documents, which hurts recall.

19
Q

What is pooling and why do we do it?

A

Pooling is used to determine the relevance of documents manually, so that retrieval can be evaluated. We create a small subset of documents that hopefully, but not necessarily, contains all the relevant documents. We can then verify relevance manually and establish the total number of relevant documents, which is needed to calculate recall. (For example: of the first 3000 documents, how many are relevant, and how many of those do we capture in the first 5 results?)

20
Q

How do we do query expansion

A

We add synonyms of the terms to the original query, or change the term weights in the original query, in order to retrieve more relevant documents: those that the original query missed.

21
Q

How does relevance feedback work, and in particular, how is explicit relevance feedback carried out?

A

Relevance feedback can be manual, in which case it is called explicit relevance feedback: the user indicates which retrieved documents are relevant to the query. The system modifies the original query by including the terms that appear in the documents marked as relevant, and the user submits the modified query to the system.

Otherwise it can be carried out implicitly, through analysis of user behavior (links clicked, time spent).
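
A rough sketch of the explicit variant, under a simplifying assumption that is not from the card: the system adds the most frequent new terms from the documents the user marked as relevant. The function `expand_query`, its `n_new_terms` parameter and the example texts are all hypothetical; real systems typically reweight terms as well.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, n_new_terms=2):
    """Append the most frequent terms from the user-marked relevant
    documents that are not already part of the query."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms

query = ["jaguar"]
marked_relevant = ["jaguar cat habitat", "jaguar cat rainforest habitat"]
expanded = expand_query(query, marked_relevant)
print(expanded)
```

The expanded query keeps the original term and picks up "cat" and "habitat", which steers the next search toward the animal and away from the car.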

22
Q

what is document indexing

A

Transforming unstructured data into structured data: an ordered list of words. It speeds up searching, because the search runs over structured data that represents the document collection, recording which terms are present in which documents, how frequently, and at which locations in the document. This is done offline, on the pages fetched by crawlers. When you query a search engine, it computes text similarity against the document index, where each word is a dimension.
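
A minimal sketch of such an index (usually called an inverted index), assuming whitespace tokenization and an invented two-document collection. For each term it stores which documents contain it and at which positions; the frequency is the number of positions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {document id: [positions of the term]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

# Hypothetical collection: document id -> text.
docs = {
    1: "search engines index documents",
    2: "crawlers fetch documents for search engines",
}
index = build_index(docs)

print(index["documents"])       # prints: {1: [3], 2: [2]}
print(sorted(index["search"]))  # documents containing "search": [1, 2]
```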

23
Q

Why is link analysis important for search engines?

A

A page that receives a lot of links from pages on the same topic is an authority on that topic. Authorities are important pages. A link is like a vote from another page: many links mean high importance, and links are not easily faked.

24
Q

What is page rank

A

PageRank (PR) shows how important a page is in terms of the number of incoming links and the importance of the linking pages.

The authority of a link counts
Inbound links create weights –> outbound links transfer weights
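
The weight transfer can be sketched as a simplified iterative PageRank. This is an illustration, not the exact algorithm any search engine runs: the damping factor 0.85, the fixed iteration count and the four-page link graph are assumptions, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its weight evenly over its outbound links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outbound links spreads its weight everywhere.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # prints: c
```

Page "a" has only one inbound link, but it comes from the important page "c", so "a" ends up ranked well above "b" and "d".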

25
Q

What is the purpose of search engine optimization (SEO)?

A

To improve the volume and quality of traffic to a website from search engines via natural search results

26
Q

What do we optimize when we do SEO? (the five building blocks of SEO)

A
  • Targeted keywords: Keywords and links associated with a webpage
  • Search friendly site: Crawler friendliness of the website
  • Inbound links –> impression of a popular website with fresh and user generated content
  • Site authority
  • Mobile-device friendliness of the website
27
Q

What SEO can we perform on a page

A
  • Keywords specific to the content on the page
  • PageRank assigned to the page
  • Link text that describes the page
28
Q

What SEO can we perform on entire website

A
  • Crawler friendly website
  • domain trust
  • link diversity
  • geo targeting signals
29
Q

Explain targeted keywords (SEO)

A

You need to know which keywords are going to generate sales.
Check the competition, use unique keywords, consider slang and long-tail keywords.

Evaluation of keywords –> opportunity with keyword = popularity of keyword / competition for keyword
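
The opportunity ratio as a tiny worked example. How popularity and competition are measured is an assumption here (monthly searches and number of competing pages), and all the numbers are invented.

```python
def keyword_opportunity(popularity, competition):
    """Opportunity = popularity of the keyword / competition for the keyword."""
    return popularity / competition

# Hypothetical (monthly searches, competing pages) per keyword.
candidates = {
    "shoes": (100000, 50000),
    "red trail running shoes": (900, 30),
}
scores = {kw: keyword_opportunity(*nums) for kw, nums in candidates.items()}
print(scores)  # prints: {'shoes': 2.0, 'red trail running shoes': 30.0}
```

With these numbers the less popular long-tail keyword is the better opportunity, because it faces far less competition.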

30
Q

What are long-tail keywords?

A

Long-tail keywords consist of more words, which describe the information need more precisely. It is therefore more likely that the page will be found by a searcher who needs that page.

31
Q

How to create a search friendly site

A
  • Site map
  • Robots exclusion protocol
  • Redirection of links
  • fix broken links
  • avoid frames
  • avoid Flash (doesn’t work on mobile devices)
  • page load time under 3 sec
  • unique and fresh content
  • canonical URLs
  • evergreen content
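
To illustrate the robots exclusion protocol item from the list above: a site publishes a `robots.txt` file telling crawlers which paths to skip. The sketch below checks a made-up file with Python's standard `urllib.robotparser`; the rules and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: all crawlers may fetch everything except /admin/.
robots_txt = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "https://example.com/products/")
blocked = parser.can_fetch("*", "https://example.com/admin/login")
print(allowed, blocked)  # prints: True False
```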
32
Q

What are canonical URLs?

A

A set of similar pages, each containing a reference to a preferred page in the set. This tells the search engine that only the preferred page is relevant and that the similar pages can be ignored. It avoids the problem of duplicate content and keeps the PageRank from being divided between the pages.

33
Q

what is Evergreen content

A

Websites need content, including a subsection that does not go out of date and continuously attracts visitors and external links. It is always relevant, sustainable and long-lasting, so traffic keeps growing over time.

34
Q

What defines natural link structure of inbound links?

A
  • Inbound link anchor text varies
  • amount of inbound links increases gradually
  • Good and reliable links
  • links are rarely reciprocal
35
Q

What defines artificial link structure of inbound links?

A
  • inbound link anchor text is identical
  • inbound link count increases suddenly
  • links come from link farms and web rings
  • many links are reciprocal
36
Q

What are considered good links?

A

from trusted domains, on-topic authorities, high page rank, link diversity, deep linking

37
Q

What is considered a bad link

A

a link from every page of another site, low quality sites, off-topic sites (nofollow if needed)

38
Q

What types of mobile SEO can one consider

A

Dedicated mobile & desktop (2 sites) or Responsive Web design

39
Q

Why mobile SEO?

A

If your site is not mobile friendly it will not be considered by Google. Mobile devices can’t display Flash, popups and JavaScript.
Searches from mobile devices are more likely to be location specific, and taking this into account will rank you higher.

40
Q

pros and cons of dedicated mobile & desktop site

A

+ optimized compact mobile site
+ to be considered if you have a lot of mobile search visitors (like facebook)
- 2 sites to maintain
- the mobile site doesn’t accumulate authority and thus does not rank well; less text to index
- the desktop site needs to recognize mobile crawlers and redirect them

41
Q

What is responsive web design (adaptive web design), and what are its pros and cons?

A

Responsive web design means that the same site is used for mobile and desktop search results: it is the same site in a different format. The site is squeezed to fit the screen of the device. The size of page elements such as images and tables is defined in proportion to the size of the browser window, not in pixels (“flexible sizes”, “proportion-based grids”).

+ one site to maintain
+ all page authority is accumulated in one place
+ mobile users have full access to the content
+ Google's preferred method
- a dedicated, optimized mobile page loads faster
- the website becomes a compromise
- no mobile-only features
- does not work well on older phones

42
Q

What is black hat SEO

A

Aggressive, unethical SEO strategies that focus on search engines and not on a human audience.

Keyword spamming/stuffing –> overuse of keywords, meta tags and link text

Spamdexing - invisible or semi-visible text: write a lot of keywords and links and cover them with a picture or a Flash movie.

43
Q

What is white hat SEO

A

Refers to ethical SEO strategies: optimize the page for both humans and search engines, following the search engines' guidelines.