Web Mining Flashcards
What is web mining?
Applying data mining techniques to web data (content, structure and usage).
Is web mining the same as web searching?
No. Web searching retrieves pages that match a query; web mining is about finding patterns in web data.
Various Web-Data Types
- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data (profiles, registration information)
What concepts and technologies do search engines use?
Crawlers/Index
Profiles/Personalisation
What’s a web crawler?
A program (also called a spider) that traverses hyperlinks to collect web pages and build up a picture of their popularity.
What is the process of a web crawler?
1. Start with a seed URL
2. Send crawlers out to fetch pages and follow the links they find
What is an issue with web crawling?
Time is wasted waiting for responses to requests.
To reduce this inefficiency, web crawlers use multiple threads and fetch many pages at once.
Web crawlers use politeness policies to stop them from flooding sites with requests.
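Sketch: seed, threads and politeness in one toy crawler. This is an illustration only, not how any real crawler is built. The in-memory `FAKE_WEB`, the `PolitenessPolicy` class and the delay values are all hypothetical, and a short sleep stands in for the network.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Hypothetical in-memory "web": URL -> outgoing links (no real network access).
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x", "http://a.example/1"],
    "http://b.example/x": [],
}


class PolitenessPolicy:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self._next_allowed = {}  # host -> earliest time of the next request
        self._lock = threading.Lock()

    def wait_turn(self, url):
        host = urlparse(url).netloc
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve this slot so the next request to the host waits longer.
            self._next_allowed[host] = start + self.delay_s
        time.sleep(max(0.0, start - now))


def fetch(url, policy):
    """Pretend to download a page and return the links found on it."""
    policy.wait_turn(url)
    time.sleep(0.01)  # stand-in for the time spent waiting on a response
    return FAKE_WEB.get(url, [])


def crawl(seed, workers=4, delay_s=0.02):
    """Crawl level by level from the seed, fetching each level in parallel."""
    policy = PolitenessPolicy(delay_s)
    seen = {seed}
    frontier = [seed]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            results = pool.map(lambda u: fetch(u, policy), frontier)
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen


visited = crawl("http://a.example/")
```

The threads hide the waiting time for pages on different hosts, while the policy keeps requests to the same host spaced apart.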
What’s freshness?
Freshness is how up to date the crawler’s stored copies of pages are. Web crawlers need to revisit pages to maintain freshness because web pages are always being added, deleted and modified.
What is focused crawling?
An attempt to download only those pages that are about a particular topic. Pages about a topic tend to link to other pages on the same topic, so popular pages on the topic make good starting points.
Crawlers use text classifiers to decide whether a page is on topic.
Examples: Google Scholar (search by citations), Google Patents
What is the challenge for focused crawling?
Finding relevant links.
How do we do relevance prediction? What strategies do we use?
Define the score as the conditional probability that a page is relevant given its text content.
Parent-based: score a fetched page and extend that score to all URLs on that page.
Anchor-based: score each URL based on the anchor text that points to it (“semantic linkage”).
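Sketch: the two strategies side by side. A real focused crawler would get the score from a trained text classifier; here `relevance` is a toy stand-in (the fraction of topic terms present in the text), and the topic terms, page text and URLs are made up for illustration.

```python
def relevance(text, topic_terms):
    """Toy stand-in for P(relevant | text): share of topic terms found in the text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)


def parent_based_scores(page_text, links, topic_terms):
    """Score the fetched page once and give that score to every URL on it."""
    score = relevance(page_text, topic_terms)
    return {url: score for url, _anchor in links}


def anchor_based_scores(links, topic_terms):
    """Score each URL separately from the anchor text that points to it."""
    return {url: relevance(anchor, topic_terms) for url, anchor in links}


topic = ["web", "mining", "crawler"]
page_text = "notes on web mining and how a crawler works"
links = [
    ("http://example.org/crawling", "web crawler basics"),
    ("http://example.org/cooking", "pasta recipes"),
]

parent = parent_based_scores(page_text, links, topic)
anchor = anchor_based_scores(links, topic)
```

Parent-based scoring gives both links the same score because they share a parent page; anchor-based scoring separates the on-topic link from the off-topic one.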
What is the purpose of personalization in web content mining?
To tune web access or content to fit the preferences of the user. Recommendations come in three forms:
- Editorial and hand-curated: “editor’s pick”
- Simple aggregates (top 10, most popular): the same for all users
- Tailored to individual users: “we recommend for you”
What is a utility matrix?
A matrix of users by items whose entries are the known ratings. The recommender tries to assign a score to the things you haven’t seen.
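Sketch: a tiny utility matrix with `None` for unknown ratings. The users, items and ratings are invented for illustration; the blanks are the entries a recommender would try to predict.

```python
users = ["alice", "bob", "carol"]
items = ["film_a", "film_b", "film_c", "film_d"]

# Known ratings only; every other (user, item) pair is unknown.
ratings = {
    ("alice", "film_a"): 5,
    ("alice", "film_c"): 4,
    ("bob", "film_a"): 4,
    ("bob", "film_b"): 2,
    ("carol", "film_d"): 5,
}


def build_utility_matrix(users, items, ratings):
    """Rows are users, columns are items, None marks a missing rating."""
    return [[ratings.get((u, i)) for i in items] for u in users]


def unseen_items(matrix, users, items, user):
    """Items the user has not rated: the candidates for recommendation."""
    row = matrix[users.index(user)]
    return [item for item, value in zip(items, row) if value is None]


def sparsity(matrix):
    """Fraction of entries that are missing."""
    cells = [value for row in matrix for value in row]
    return sum(value is None for value in cells) / len(cells)


matrix = build_utility_matrix(users, items, ratings)
alice_unseen = unseen_items(matrix, users, items, "alice")
missing_fraction = sparsity(matrix)
```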
What is a challenge of a recommendation system?
Sparsity. It is hard to recommend anything to a user who has watched very little.
Most people do not give reviews.
How do you overcome the challenge of sparsity in a recommendation system?
Be explicit:
- Ask people to rate items
- Doesn’t work well in practice, since few people bother; giving a reward for rating can help
Be implicit:
- Learn ratings from user actions
- For example, a purchase implies a high rating
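Sketch: turning user actions into implicit ratings. The action names and the scores in `ACTION_RATING` are assumptions chosen for illustration; the only rule taken from the card is that a purchase implies a high rating.

```python
# Hypothetical mapping from an observed action to an implied rating (1 to 5).
ACTION_RATING = {"view": 2, "add_to_cart": 3, "purchase": 5}


def implicit_ratings(events):
    """Keep the strongest signal seen for each (user, item) pair."""
    ratings = {}
    for user, item, action in events:
        score = ACTION_RATING.get(action)
        if score is None:
            continue  # unknown action: no evidence either way
        key = (user, item)
        ratings[key] = max(ratings.get(key, 0), score)
    return ratings


events = [
    ("alice", "book", "view"),
    ("alice", "book", "purchase"),
    ("bob", "book", "view"),
    ("bob", "lamp", "share"),
]
inferred = implicit_ratings(events)
```

The inferred ratings can fill cells of the utility matrix that explicit reviews would leave empty.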
What is a cold start?
A special case of the sparse utility matrix: new users have no ratings and no history, so there is nothing to base a recommendation on.
What approaches are there to recommender systems?
1) Content-based
2) Collaborative
What is content-based recommendation?
Look at the features of the items the user already likes, and recommend other items with similar features.
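Sketch: a content-based recommender using only one user’s likes. The item names and feature tags in `ITEM_FEATURES` are invented, and scoring by counting shared features is a simplifying assumption; real systems usually use weighted feature vectors.

```python
from collections import Counter

# Hypothetical item profiles: each item is described by a set of feature tags.
ITEM_FEATURES = {
    "planet_doc": {"documentary", "nature"},
    "ocean_doc": {"documentary", "nature", "ocean"},
    "space_opera": {"scifi", "action"},
    "heist_film": {"action", "crime"},
}


def build_profile(liked_items):
    """User profile: how often each feature occurs among the liked items."""
    profile = Counter()
    for item in liked_items:
        profile.update(ITEM_FEATURES[item])
    return profile


def recommend(liked_items, k=2):
    """Rank items the user has not liked yet by overlap with the profile."""
    profile = build_profile(liked_items)
    scores = {
        item: sum(profile[f] for f in features)
        for item, features in ITEM_FEATURES.items()
        if item not in liked_items
    }
    # Drop items with no overlap; break ties alphabetically.
    ranked = sorted(
        (item for item in scores if scores[item] > 0),
        key=lambda item: (-scores[item], item),
    )
    return ranked[:k]


picks = recommend(["planet_doc"])
```

No other users are involved, and the shared features double as an explanation for each pick.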
What are the advantages of the content-based approach?
- No need for data on other users
- Able to recommend to users with unique tastes
- Able to recommend new and unpopular items
- Able to provide explanations
What are the cons of the content-based approach?
- Finding the appropriate features is hard
- Hard to make recommendations for new users, who have no profile yet
- Overspecialization (too specific: you watch 1 documentary, you get many documentaries)
What is collaborative filtering?
Find users who are similar to the target user, and recommend the items those similar users rated highly.
How do you find similar users?
A couple of methods:
- Jaccard similarity
- Cosine similarity
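Sketch: both measures on two users’ ratings. The users and ratings are made up. Jaccard only looks at which items were rated, not the values; this cosine version treats a missing rating as 0, which is a common simplification rather than the only option.

```python
import math


def jaccard(a, b):
    """|items rated by both| / |items rated by either|; rating values are ignored."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine of the angle between two rating vectors (missing rating = 0)."""
    dot = sum(a[item] * b[item] for item in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


alice = {"film_a": 5, "film_b": 1, "film_c": 4}
bob = {"film_a": 4, "film_c": 5, "film_d": 2}

jaccard_sim = jaccard(alice, bob)  # 2 shared items out of 4 rated overall
cosine_sim = cosine(alice, bob)
```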
What are the pros and cons of collaborative filtering?
Pro
- Works for any kind of item
Con
- Need enough users in the system to find a match
- Hard to find users that have rated the same items
- First rater: hard to recommend an item that hasn’t been rated before
- Popularity bias: popular items tend to dominate the recommendations
How does PageRank work?
Pages are ranked by their backlinks, the pages that link to them.
Weighting: each backlink counts according to the importance of the page it comes from, so a link from an important page is worth more.
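Sketch: PageRank by repeated iteration on a four-page graph. The damping factor of 0.85 and the handling of pages without outgoing links are standard choices that the card does not mention, so treat them as assumptions; the graph itself is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to. Returns page -> rank."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page gets a small base share regardless of its backlinks.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its own importance among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing links: spread the importance evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# c has three backlinks; a has one backlink, but it comes from c.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

Page c ranks highest because it has the most backlinks. Page a has only one backlink yet ranks well above b and d, because that link comes from the important page c.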