Web Mining Flashcards

1
Q

What is web mining?

A

Applying data mining techniques to web data (content, structure and usage) to discover patterns.

2
Q

Is web mining the same as web searching?

A

No. Web searching retrieves pages that match a query, while web mining is about finding patterns in web data.

3
Q

What are the various types of web data?

A
Web Pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
-  Profiles
-  Registration information
4
Q

What concepts and technologies do search engines rely on?

A

Crawlers/Index

Profiles/Personalisation

5
Q

What’s a web crawler?

A

A spider: a program that traverses the hyperlinks between web pages to discover and download them, for example to build an index or to estimate the popularity of pages.

6
Q

What is the process of a web crawler?

A
1. Start with a set of seed URLs

2. Send crawlers to fetch the seed pages, extract their links, and repeat on the newly found URLs
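
The crawling process can be sketched as a breadth-first traversal from the seeds. This is only a minimal illustration: the "web" here is a hypothetical in-memory dict (`TOY_WEB`) mapping each URL to its outgoing links, standing in for the fetch-and-parse step of a real crawler, and `max_pages` is an assumed safety limit.

```python
from collections import deque

# Hypothetical toy web: each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seeds, web, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns pages in visit order."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, so nothing is fetched twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in web.get(url, []):   # stand-in for fetching and link extraction
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a.example/"], TOY_WEB)
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.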

7
Q

What is an issue with web crawling?

A

Time is wasted waiting for responses to requests.
To reduce this inefficiency, web crawlers use multiple threads and fetch many pages at once.

To avoid flooding sites with requests, web crawlers follow politeness policies, such as a delay between requests to the same server.
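
One way a politeness policy could work is a minimum delay between requests to the same host. The sketch below is an assumption-laden illustration, not how any particular crawler does it: the class name `PolitenessPolicy`, the 2-second delay, and the explicit `now` argument (used instead of a real clock so the example runs instantly) are all hypothetical.

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.next_allowed = {}   # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds to wait before fetching url at time `now`; books the slot."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start - now

policy = PolitenessPolicy(delay_seconds=2.0)
waits = [
    policy.wait_time("http://a.example/1", now=0.0),   # first request to a.example
    policy.wait_time("http://a.example/2", now=0.0),   # same host: must wait
    policy.wait_time("http://b.example/1", now=0.0),   # other host: no wait
    policy.wait_time("http://a.example/3", now=5.0),   # delay already elapsed
]
print(waits)
```

While one thread waits for its slot on a busy host, other threads can fetch from different hosts, which is how threading and politeness fit together.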

8
Q

What’s freshness?

A

Freshness is how up to date the crawler’s copy of a page is. Because web pages are constantly being added, deleted and modified, crawlers need to revisit pages to keep their stored copies fresh.

9
Q

What is focused crawling?

A

An attempt to download only those pages that are about a particular topic. It relies on the fact that pages about a topic tend to link to other pages on the same topic.

Crawlers use text classifiers to decide whether a page is on topic.

Examples: Google Scholar (search by citations), Google Patents

10
Q

What is the challenge for focused crawling?

A

Finding relevant links, that is, predicting which links lead to on-topic pages before downloading them.

11
Q

How do we do relevance prediction? What strategies do we use?

A

Define a score as the conditional probability that a page is relevant given its text content.

Parent-based: score a fetched page and extend that score to all URLs on that page

Anchor-based: score each URL based on the anchor text of the link to that URL (“semantic linkage”)
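
The two strategies can be contrasted in a short sketch. Note the assumptions: `relevance` is a toy stand-in for a trained classifier's conditional probability (it just measures the share of words found in a hypothetical `TOPIC_WORDS` vocabulary), and links are represented as `(url, anchor_text)` pairs.

```python
TOPIC_WORDS = {"crawler", "index", "search"}   # hypothetical topic vocabulary

def relevance(text):
    """Toy stand-in for P(relevant | text): share of words that are topic words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_WORDS for w in words) / len(words)

def parent_based(page_text, links):
    """Every URL on the page inherits the score of the page itself."""
    score = relevance(page_text)
    return {url: score for url, _anchor in links}

def anchor_based(links):
    """Each URL is scored from the anchor text of the link pointing to it."""
    return {url: relevance(anchor) for url, anchor in links}

page = "search engine crawler basics"
links = [("u1", "crawler index"), ("u2", "cooking recipes")]
parent_scores = parent_based(page, links)
anchor_scores = anchor_based(links)
print(parent_scores, anchor_scores)
```

Parent-based scoring cannot tell the two links apart, while anchor-based scoring separates the on-topic link from the off-topic one.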

12
Q

What is the purpose of personalization in web content mining?

A

To tune web access or content to fit the preferences of the user.

  • Editorial and hand-curated - “editor’s pick”
  • Simple aggregates (top 10, most popular) - the same for all users
  • Tailored to individual users - “we recommend for you”
13
Q

What is a utility matrix?

A

A matrix with one row per user and one column per item, holding the ratings that are known. Recommending means trying to assign a score to the items a user hasn’t seen.
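
A small sketch may make this concrete. It assumes a dict-of-dicts layout and invented users, items and ratings on a 1-5 scale; the unknown entries are the scores a recommender would have to predict.

```python
# Hypothetical ratings on a 1-5 scale; a missing key means "not yet rated".
utility = {
    "alice": {"film_a": 5, "film_b": 3},
    "bob":   {"film_a": 4},
    "carol": {"film_c": 2},
}
items = sorted({item for ratings in utility.values() for item in ratings})

def unknown_entries(utility, items):
    """(user, item) pairs with no rating: the scores a recommender must predict."""
    return [(u, i) for u in sorted(utility) for i in items if i not in utility[u]]

missing = unknown_entries(utility, items)
# Fraction of the matrix that is empty.
sparsity = len(missing) / (len(utility) * len(items))
print(missing, round(sparsity, 2))
```

Even in this tiny example more than half of the matrix is empty; real utility matrices are far sparser.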

14
Q

What is a challenge of a recommendation system?

A

Sparsity. It is hard to recommend to users who have watched or rated very little.

People often don’t give reviews.

15
Q

How do you overcome the challenge of sparsity in a recommendation system?

A

Be explicit:

  • Ask people to rate items
  • Doesn’t work well in practice, because few people bother to rate
  • How to encourage ratings? Give a reward

Be implicit:

  • Learn ratings from user actions
  • For example, a purchase implies a high rating
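
The implicit approach could look something like the sketch below. The action names and the rating each action implies (`ACTION_RATING`) are assumed values chosen for illustration, not a standard mapping.

```python
# Hypothetical mapping from user actions to implied ratings (assumed values).
ACTION_RATING = {"purchase": 5, "add_to_cart": 4, "view": 2}

def implicit_ratings(events):
    """Keep, per (user, item), the strongest rating implied by any action."""
    ratings = {}
    for user, item, action in events:
        score = ACTION_RATING.get(action)
        if score is None:
            continue   # unknown actions carry no signal in this sketch
        key = (user, item)
        ratings[key] = max(ratings.get(key, 0), score)
    return ratings

events = [
    ("alice", "book", "view"),
    ("alice", "book", "purchase"),
    ("bob", "book", "view"),
    ("bob", "pen", "share"),
]
ratings = implicit_ratings(events)
print(ratings)
```

The inferred ratings fill in utility-matrix entries without asking the user for anything.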

16
Q

What is a cold start?

A

New users have no ratings or history, so their rows in the (already sparse) utility matrix are empty and there is nothing to base a recommendation on.

17
Q

What are the main approaches to recommender systems?

A

1) Content-based

2) Collaborative

18
Q

What is content-based recommendation?

A

Build a profile of the user’s preferences from the items they liked, and recommend items whose content is similar to that profile.
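
As a rough sketch of the idea, items can be described by feature vectors and compared with a user profile. Everything here is assumed for illustration: the three genre features, the item names in `ITEMS`, averaging liked items to form the profile, and cosine similarity as the matching function.

```python
import math

# Hypothetical item features: (documentary, comedy, action) weights.
ITEMS = {
    "doc_1":    (1.0, 0.0, 0.0),
    "doc_2":    (0.9, 0.1, 0.0),
    "comedy_1": (0.0, 1.0, 0.0),
    "action_1": (0.0, 0.1, 0.9),
}

def cosine(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def user_profile(liked):
    """Average the feature vectors of the items the user liked (needs >= 1 item)."""
    vectors = [ITEMS[i] for i in liked]
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def recommend(liked, top_n=1):
    """Unseen items ranked by similarity to the user's profile."""
    profile = user_profile(liked)
    candidates = [i for i in ITEMS if i not in liked]
    ranked = sorted(candidates, key=lambda i: cosine(profile, ITEMS[i]), reverse=True)
    return ranked[:top_n]

picks = recommend(["doc_1"])
print(picks)
```

No other user's data is involved: the recommendation depends only on this user's liked items and the item features.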

19
Q

What are the advantages of the content-based approach?

A

No need for data on other users
Able to recommend to users with unique tastes
Able to recommend new & unpopular items
Able to provide explanations

20
Q

What are the cons of the content-based approach?

A
  • Finding the appropriate features is hard
  • Hard to make recommendations for new users, who have no profile yet
  • Overspecialization (too specific: you watch 1 documentary, you get many documentaries)
21
Q

What is collaborative filtering?

A

Find users whose ratings are similar to the target user’s, and recommend the items those similar users rated highly.
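
A minimal user-based sketch, under stated assumptions: each user is reduced to the set of items they rated highly (the hypothetical `LIKES` data), similarity is measured by set overlap, and only the single most similar user is consulted. Real systems usually combine several neighbours.

```python
# Hypothetical data: the items each user rated highly.
LIKES = {
    "alice": {"a", "b"},
    "bob":   {"a", "b", "c"},
    "carol": {"d", "e"},
}

def overlap(s, t):
    """Share of items the two users have in common (intersection over union)."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

def recommend(user):
    """Items liked by the most similar other user that `user` has not liked yet."""
    mine = LIKES[user]
    neighbour = max((u for u in LIKES if u != user),
                    key=lambda u: overlap(mine, LIKES[u]))
    return sorted(LIKES[neighbour] - mine)

picks = recommend("alice")
print(picks)
```

Nothing about the items themselves is used here, only who liked what.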

22
Q

How do you find similar users?

A

A couple of similarity measures:

Jaccard Similarity
Cosine Similarity
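
The two measures can give very different answers for the same pair of users, as this sketch suggests. The rating dicts `u1` and `u2` are invented, and treating a missing rating as 0 in the cosine is an assumption of this example.

```python
import math

def jaccard(ratings_a, ratings_b):
    """Overlap of the sets of rated items; ignores the rating values."""
    a, b = set(ratings_a), set(ratings_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(ratings_a, ratings_b):
    """Cosine of the rating vectors, treating missing ratings as 0."""
    shared = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[i] * ratings_b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in ratings_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ratings_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two hypothetical users who rated the same items but with opposite opinions.
u1 = {"a": 5, "b": 1}
u2 = {"a": 1, "b": 5}
j = jaccard(u1, u2)
c = cosine(u1, u2)
print(j, round(c, 3))
```

Jaccard sees the two users as identical because they rated the same items, while cosine similarity is much lower because the rating values disagree.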

23
Q

What are the pros and cons of collaborative filtering?

A

Pro
- Works for any kind of item

Con

  • Need enough users in the system to find a match
  • Hard to find users that have rated the same items
  • First rater: hard to recommend an item that hasn’t been previously rated
  • Popularity bias: popular items tend to dominate the recommendations
24
Q

How does PageRank work?

A

Ranks a page based on its backlinks, the pages that link to it.

Weighting: each backlink is weighted by the importance of the page it comes from, so a link from an important page counts for more.
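
The weighting idea can be sketched with the usual power-iteration formulation. The card does not mention a damping factor; the value 0.85, the fixed iteration count, the even redistribution of rank from pages with no outlinks, and the tiny `LINKS` graph are all assumptions of this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start from a uniform distribution
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank

# Hypothetical graph: a, b and d all link to c; c links back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c ranks highest because it has the most backlinks, and page a outranks b and d because its single backlink comes from the important page c.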