Web Mining Flashcards
What is web mining?
Applying data mining techniques to web data (content, structure and usage) to discover patterns.
Is web mining the same as web searching?
No. Web searching retrieves pages that match a query, whereas web mining is about finding patterns in web data.
Various Web-Data Types
- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
  - Profiles
  - Registration information
What concepts and technologies do search engines use?
Crawlers/Index
Profiles/Personalisation
What’s a web crawler?
A spider that traverses hyperlinks to collect web pages and build up a picture of their popularity.
What is the process of a web crawler?
1. Start with a seed
2. Send crawlers
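The two steps above can be sketched as a breadth-first traversal. This is only an illustration: `TOY_WEB`, `crawl` and the `.example` URLs are made up for the card, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs that page links to.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Start from the seeds, follow links, and never visit a URL twice."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued or fetched
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

The `seen` set is what stops the crawler from looping when pages link back to each other.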
What is an issue with web crawling?
Time is wasted waiting for responses to requests.
To reduce this inefficiency, web crawlers use threads to fetch several pages at once.
Web crawlers use politeness policies to stop them from flooding sites with requests.
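A rough sketch of how a politeness policy and parallel fetching fit together, assuming the policy is a fixed delay between requests to the same host. The function name `polite_schedule` and the 2-second delay are assumptions for illustration; the sketch only computes start times and does not open any connections or threads.

```python
from urllib.parse import urlparse

def polite_schedule(urls, delay=2.0):
    """Give each URL the earliest start time that keeps requests to the
    same host at least `delay` seconds apart. Different hosts may start
    at the same time, which is where threads save waiting time."""
    next_allowed = {}  # host -> earliest time its next request may start
    schedule = []
    for url in urls:
        host = urlparse(url).netloc
        start = next_allowed.get(host, 0.0)
        schedule.append((start, url))
        next_allowed[host] = start + delay
    return sorted(schedule)

plan = polite_schedule([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
for start, url in plan:
    print(start, url)
```

Here the two hosts are contacted in parallel at time 0, while the second request to `a.example` waits for the delay.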
What’s freshness?
Web crawlers need to revisit a page in order to maintain the freshness of a document because web pages are always being added, deleted and modified.
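One way to put a number on this, assuming freshness is measured as the fraction of stored copies that still match the live page (the card itself does not give a formula). The names `freshness`, `stored` and `live` are illustrative.

```python
def freshness(stored, live):
    """Fraction of stored pages whose copy still matches the live version.
    A page deleted from the live web counts as stale."""
    if not stored:
        return 1.0  # assumption: an empty collection has nothing stale
    fresh = sum(1 for url, content in stored.items() if live.get(url) == content)
    return fresh / len(stored)

stored = {"a": "v1", "b": "v1", "c": "v1", "d": "v1"}
live = {"a": "v1", "b": "v2", "d": "v1"}  # b was modified, c was deleted
print(freshness(stored, live))
```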
What is focused crawling?
Attempt to download only those pages that are about a particular topic. Pages about a topic tend to have links to other pages on the same topic.
Crawlers use text classifiers to decide whether a page is on topic.
Example: Google Scholar (search by citations), Google Patents
What is the challenge for focused crawling?
Finding relevant links.
How do we do relevance prediction? What strategies do we use?
Define a score as the conditional probability that a page is relevant given its text content.
Parent-based: score a fetched page and extend that score to all URLs found in the page
Anchor-based: score each URL from the anchor text of the link pointing to it: “semantic linkage”
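The two strategies can be compared in a small sketch. The `relevance` function below is a toy stand-in for the conditional probability (share of words found in a hypothetical topic vocabulary), not a real text classifier; all names and URLs are made up for illustration.

```python
TOPIC_WORDS = {"crawler", "index", "search", "ranking"}  # hypothetical topic vocabulary

def relevance(text):
    """Toy stand-in for P(relevant | text): share of words that are topic words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_WORDS for w in words) / len(words)

def parent_based(page_text, links):
    """Every outgoing URL inherits the score of the page it was found on."""
    score = relevance(page_text)
    return {url: score for url, _anchor in links}

def anchor_based(links):
    """Each URL is scored from its own anchor text only."""
    return {url: relevance(anchor) for url, anchor in links}

page_text = "search engine crawler basics"
links = [
    ("http://a.example/crawl", "crawler index"),
    ("http://a.example/cats", "cute cats"),
]
parent_scores = parent_based(page_text, links)
anchor_scores = anchor_based(links)
print(parent_scores)
print(anchor_scores)
```

Parent-based scoring gives both links the same score, while anchor-based scoring separates the on-topic link from the off-topic one.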
What is the purpose of personalization in web content mining?
To tune web access or content to fit the preferences of the user. Recommendations can be:
- Editorial and hand-curated - “editor’s pick”
- Simple aggregates (top 10, most popular) - the same for all users
- Tailored to individual users - “recommended for you”
What is a utility matrix?
A matrix of users against items that holds the known ratings. The recommender tries to assign a score to the things you haven’t seen.
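A small sketch of a utility matrix with unknown entries marked as `None`. The users, items and ratings are invented, and filling a gap with the item's average rating is only a naive baseline chosen for illustration, not a method stated on the card.

```python
users = ["ann", "bob", "cat"]
items = ["film_a", "film_b", "film_c"]

# Rows are users, columns are items, None means "not rated yet".
utility = {
    "ann": {"film_a": 5, "film_b": None, "film_c": 1},
    "bob": {"film_a": 4, "film_b": 2, "film_c": None},
    "cat": {"film_a": None, "film_b": 4, "film_c": None},
}

def item_mean(item):
    """Average of the known ratings for one item, or None if nobody rated it."""
    known = [utility[u][item] for u in users if utility[u][item] is not None]
    return sum(known) / len(known) if known else None

def predict(user, item):
    """Return the known rating, or fall back to the item's average."""
    rating = utility[user][item]
    return rating if rating is not None else item_mean(item)

print(predict("ann", "film_b"))
print(predict("cat", "film_a"))
```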
What is a challenge of a recommendation system?
Sparsity. It is hard to recommend when users have watched or rated very little.
Most people do not give reviews.
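Sparsity can be made concrete as the share of user-item pairs with no rating. The counts and ratings below are invented for illustration.

```python
# Only the known ratings are stored, keyed by (user, item).
ratings = {
    ("ann", "film_a"): 5,
    ("bob", "film_b"): 2,
    ("cat", "film_b"): 4,
}
n_users, n_items = 3, 4

# Share of the user-by-item grid that has no rating at all.
sparsity = 1 - len(ratings) / (n_users * n_items)
print(sparsity)
```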
How do you overcome the challenge of sparsity in a recommendation system?
Be explicit:
- Ask people to rate items
- Doesn’t work well in practice, as few people bother to rate
- How to encourage it? Give a reward
Be implicit:
- Learn ratings from user actions
- e.g. a purchase implies a high rating
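A sketch of the implicit approach: turn logged user actions into ratings. Only "purchase implies high rating" comes from the card; the other actions and all the numeric values in `ACTION_RATING` are assumptions for illustration.

```python
# "purchase" -> high rating is from the card; the other entries are hypothetical.
ACTION_RATING = {"purchase": 5, "add_to_cart": 4, "view": 2}

def implicit_ratings(events):
    """Map (user, item, action) events to ratings, keeping the strongest signal."""
    ratings = {}
    for user, item, action in events:
        score = ACTION_RATING.get(action)
        if score is None:
            continue  # unknown actions carry no rating
        key = (user, item)
        ratings[key] = max(ratings.get(key, 0), score)
    return ratings

events = [
    ("ann", "book", "view"),
    ("ann", "book", "purchase"),
    ("bob", "book", "view"),
    ("bob", "pen", "share"),
]
inferred = implicit_ratings(events)
print(inferred)
```

The inferred ratings can then fill cells of the utility matrix that explicit ratings left empty.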