Final Exam - Theory Flashcards
What is web mining?
Applying data mining techniques to web data (content, structure and usage) to discover patterns.
Is web mining the same as web searching?
No. Web searching retrieves pages that match a query; web mining is about finding patterns in web data.
What are the various web data types?
- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data (profiles, registration information)
What concepts and technologies do search engines use?
- Crawlers / index
- Profiles / personalisation
What’s a web crawler?
A spider: a program that traverses hyperlinks from page to page, gathering the pages and the link information used to build up a measure of page popularity.
What is the process of a web crawler?
1. Start with a seed
2. Send crawlers
What is an issue with web crawling?
Time is wasted waiting for responses to requests.
- To reduce this inefficiency, web crawlers use threads
- Web crawlers use politeness policies to stop them from flooding sites with requests
What’s freshness?
How up to date the crawler’s stored copy of a page is. Web pages are always being added, deleted and modified, so crawlers need to revisit a page to keep their copy of the document fresh.
What is focused crawling?
An attempt to download only those pages that are about a particular topic. Pages about a topic tend to link to other pages on the same topic. Crawlers use text classifiers to decide whether a page is on topic. Examples: Google Scholar (search by citations), Google Patents
What is the challenge for focused crawling?
Finding relevant links.
How do we do relevance prediction? What strategies do we use?
Define the score as the conditional probability that a page is relevant given its text content.
- Parent-based: score a fetched page and extend that score to all URLs on that page
- Anchor-based: score each URL based on the anchor text pointing to that URL (“semantic linkage”)
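The two strategies can be sketched roughly as follows. This is only an illustration: the keyword-overlap `relevance` function is a toy stand-in for a real text classifier, and the topic terms, function names and sample links are all hypothetical.

```python
import re

TOPIC_TERMS = {"crawler", "index", "search", "ranking"}  # hypothetical topic vocabulary


def relevance(text):
    """Toy stand-in for P(relevant | text): fraction of topic terms present."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)


def parent_based(page_text, links):
    """Score the fetched page once and give that score to every URL on it."""
    score = relevance(page_text)
    return {url: score for url, _anchor in links}


def anchor_based(links):
    """Score each URL separately from the anchor text pointing to it."""
    return {url: relevance(anchor) for url, anchor in links}


page_text = "A search engine uses a crawler to build its index."
links = [
    ("http://example.com/a", "crawler politeness and search ranking"),
    ("http://example.com/b", "holiday photos"),
]

parent_scores = parent_based(page_text, links)
anchor_scores = anchor_based(links)
print(parent_scores)
print(anchor_scores)
```

Parent-based scoring gives both links the same score because they sit on the same page, while anchor-based scoring can tell the on-topic link from the off-topic one.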
What is the purpose of personalization in web content mining?
Web access or content is tuned to fit the preferences of the user.
- Editorial and hand-curated: “editor’s pick”
- Simple aggregates (top 10, most popular): the same for all users
- Tailored to individual users: “we recommend for you”
What is a utility matrix?
A users-by-items matrix where each entry is the rating a user gave an item. Most entries are unknown; the recommender tries to assign a score to the things you haven’t seen.
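A tiny example may make this concrete. The users, items and ratings below are made up for illustration, and `None` is an assumed convention for "not rated yet".

```python
# Rows are users, columns are items; None marks an unknown rating.
items = ["Movie A", "Movie B", "Movie C", "Movie D"]
utility = {
    "alice": [5, None, 4, None],
    "bob": [None, 3, None, None],
    "carol": [4, None, None, 1],
}


def unknown_items(user):
    """Items the user has not rated: the blanks a recommender tries to fill."""
    return [item for item, rating in zip(items, utility[user]) if rating is None]


total = len(utility) * len(items)
known = sum(rating is not None for row in utility.values() for rating in row)
sparsity = 1 - known / total  # fraction of the matrix that is empty

print(unknown_items("bob"))
print(f"{known} of {total} ratings known, sparsity = {sparsity:.2f}")
```

Even in this small matrix more than half of the entries are blank; real utility matrices are far sparser.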
What is a challenge of a recommendation system?
Sparsity. It is hard to recommend when users haven’t watched much, and people often don’t give reviews.
How do you overcome the challenge of sparsity in a recommendation system?
Be explicit:
- Ask people to rate items
- Doesn’t work well in practice. How to encourage it? Give a reward
Be implicit:
- Learn ratings from user actions (e.g. a purchase implies a high rating)
What is a cold start?
New users have no ratings or history, so their row of the (already sparse) utility matrix is empty and there is nothing to base a recommendation on.
What approaches are there to recommender systems?
1) Content-based
2) Collaborative
What is content-based recommendation?
Look at the preferences of the user (what they have liked before) and recommend items with similar features.
What are the advantages of the content-based approach?
- No need for data on other users
- Able to recommend to users with unique tastes
- Able to recommend new and unpopular items
- Able to provide explanations
What are the cons of the content-based approach?
- Finding the appropriate features is hard
- Hard to make recommendations for new users
- Overspecialization (too specific: you watch one documentary, you get many documentaries)
What is collaborative filtering?
Find users whose ratings are similar to yours, then recommend the items they rated highly that you haven’t seen.
How do you find similar users?
A couple of methods:
- Jaccard similarity
- Cosine similarity
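Both measures can be sketched in a few lines. The sketch assumes each user is a dict mapping item to rating, that Jaccard looks only at which items were rated, and that cosine treats a missing rating as 0; the sample users are made up.

```python
import math


def jaccard(a, b):
    """Overlap of the sets of rated items; the rating values are ignored."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine of the angle between rating vectors (missing rating = 0)."""
    dot = sum(a[item] * b[item] for item in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


alice = {"A": 5, "B": 4}
bob = {"A": 4, "C": 3}

jaccard_score = jaccard(alice, bob)
cosine_score = cosine(alice, bob)
print(round(jaccard_score, 3), round(cosine_score, 3))
```

Jaccard discards how high the ratings were, whereas cosine uses the values but reads a missing rating as a low one; which trade-off is acceptable depends on the data.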
What are the pros and cons of collaborative filtering?
Pro:
- Works for any kind of item
Con:
- Need enough users in the system to find a match
- Hard to find users that have rated the same items
- First rater: hard to recommend an item that hasn’t been previously rated
- Popularity bias: popular items tend to dominate the recommendations
How does PageRank work?
Rank pages based on backlinks: the number of pages that point to a webpage.
Weighting: each backlink is weighted by the importance of the page it comes from.
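The idea of importance flowing along backlinks can be sketched as an iterative calculation. This is a simplified illustration, not the production algorithm: the damping factor of 0.85, the fixed iteration count and the four-page graph are assumptions that do not come from the card.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively share each page's rank among the pages it links to.

    `links` maps every page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C has the most backlinks and ends up ranked highest, and A ranks above B even though each has a single backlink, because A's backlink comes from the important page C.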
What methods does text mining use?
Information retrieval and pre-processing of text documents.
What tasks does text mining perform?
Text classification, text clustering and text summarization.
What is an issue with text mining vs traditional data mining?
Traditional data mining works on structured data. Text often has no real structure.
What is a Vector Space Model?
A document is represented as a “bag” of words: a vector with one dimension per word in the vocabulary, ignoring word order.
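A small sketch of turning documents into bag-of-words vectors. It assumes raw term counts as the weights (other weightings exist) and uses two made-up sentences.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


docs = ["The cat sat on the mat", "The dog sat"]

# One vector dimension per distinct word across all documents.
vocabulary = sorted({word for doc in docs for word in tokenize(doc)})


def to_vector(text):
    """Count how often each vocabulary word occurs; word order is lost."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a list of counts aligned with the vocabulary, so documents can be compared with vector measures such as cosine similarity.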
What is a problem with Vector Space Model?
There are many words in the English language, so the vectors become very high-dimensional and sparse.
How do you fix the limitations of the Vector Space Model?
- Removing the stop words (“a, the, this, that …”)
- Stemming (e.g. combining similar verb forms such as past and present tense)
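Both steps can be sketched as a small pre-processing function. The stop-word set holds only the four words named on the card, and `crude_stem` is a deliberately naive suffix stripper written for illustration; it is not a real stemming algorithm such as Porter's.

```python
import re

STOP_WORDS = {"a", "the", "this", "that"}  # only the words named on the card


def crude_stem(word):
    """Strip a few common suffixes so related verb forms collapse together."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


terms = preprocess("The user walked and this user walks")
print(terms)
```

Here "walked" and "walks" both reduce to "walk", so the vocabulary shrinks from seven tokens to three distinct terms.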