Web mining Flashcards
What 4 primary groups of data are of interest in web usage mining
usage data - server logs, navigational data, HTTP requests
content data - combination of textual and image data
structure data - inter- and intra-page linkage structure
user data - profile info, demographics, cookies, past purchases
What is a server log file
The primary data source used in web usage mining: a list of the activities (requests) recorded by a server.
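As an illustration of what one recorded activity looks like, here is a minimal sketch that parses a single log entry. It assumes the server writes the Common Log Format; the sample line, the regex name CLF and the helper parse_line are made up for this example, and real logs often carry extra fields (referrer, user agent).

```python
import re
from datetime import datetime

# Regex for the Common Log Format (assumed format; extended logs add more fields).
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec

# Hypothetical sample entry, not taken from a real server.
sample = '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326'
record = parse_line(sample)
```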
In web usage mining, what is the most basic level of abstraction (isolating the important information)
Pageview - depending on the goals of the analysis, the data needs to be transformed and aggregated at different levels of abstraction
What is a user activity record
A way to distinguish users without needing to know their identity. It shows the sequence of activities per user
What are the names of the five analysis methods for web usage mining
- classification & prediction of user transactions
- cluster analysis
- sequential patterns
- session and visitor analysis
- association and correlation analysis
What are the parts of data preparation for web usage mining
- data collection: the types of data are usage data, content data, structure data and user data
- data fusion and cleaning: merge log files, remove crawler references and irrelevant data fields
- data segmentation: depending on the goals of the analysis, the data needs to be transformed and aggregated. The variables of interest for analysis are users and their behaviors (pageviews, sessions, episodes)
- path completion: imputation of references missing from a user's path due to caching or proxies
- data integration: assembling the set of user sessions or episodes that are useful for pattern discovery
- data modelling: the data can be represented as a transaction matrix (an ordinary table) or, as an enriched representation, an n×r pageview-feature matrix
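The data modelling step above can be sketched as follows: each session becomes one row of a transaction matrix and each distinct pageview one column. This is a minimal sketch that assumes binary weights (1 if the page was visited in the session); other weightings, such as time spent on a page, are also possible. The function name and the sample sessions are hypothetical.

```python
def transaction_matrix(sessions):
    """Build a sessions-by-pageviews binary matrix from lists of visited pages."""
    pages = sorted({p for s in sessions for p in s})   # column labels
    index = {p: j for j, p in enumerate(pages)}
    matrix = [[0] * len(pages) for _ in sessions]
    for i, s in enumerate(sessions):
        for p in s:
            matrix[i][index[p]] = 1                    # binary weight (assumption)
    return pages, matrix

# Hypothetical sessions, one list of pageviews per session.
sessions = [
    ["/home", "/products"],
    ["/home", "/cart", "/home"],
    ["/products"],
]
pages, matrix = transaction_matrix(sessions)
# pages  -> ['/cart', '/home', '/products']
# matrix -> [[0, 1, 1], [1, 1, 0], [0, 0, 1]]
```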
What are the parts of pattern discovery & analysis
- Session and visitor analysis - basic statistics, e.g. the most frequently accessed pages
- Cluster analysis and visitor segmentation - clustering groups similar pages or similar users together (personalized web content, demographics)
- Association and correlation analysis - items or pages accessed together, frequent itemsets
- Sequential and navigational patterns - consider the time order as well as the itemsets; techniques are used to find inter-session patterns for predicting future visit patterns
- Classification and prediction of user transactions - by categorising items into bigger groups, one can predict user behaviour, e.g. “other people also purchased this”
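To make the association analysis item above concrete, here is a minimal sketch that counts which pairs of pages are accessed together and keeps those above a support threshold. It only looks at pairs (frequent 2-itemsets) and is not a full Apriori implementation; the function name, the sessions and the threshold are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(sessions, min_support):
    """Return {page pair: support} for pairs occurring in >= min_support of sessions."""
    counts = Counter()
    for s in sessions:
        # Each pair is counted at most once per session.
        for pair in combinations(sorted(set(s)), 2):
            counts[pair] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical sessions.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
pairs = frequent_pairs(sessions, min_support=0.5)
# pairs -> {('/cart', '/products'): 0.5, ('/home', '/products'): 0.5}
```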
What is pageview identification (data segmentation)
The most basic level of abstraction; a pageview represents a specific user event. Pageviews can be identified based on knowledge of the site domain and the page content
What is user identification (data segmentation)
Distinguishing between different users by their user activity record: the sequence of logged activities belonging to the same user
What is session identification (data segmentation)
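A minimal sketch of building user activity records when no login or cookie is available. It assumes, as a common heuristic, that the same IP address plus the same user agent means the same user; this is an assumption and it fails behind shared proxies. The field names ip, agent, time and url are hypothetical.

```python
from collections import defaultdict

def user_activity_records(entries):
    """Group log entries into one time-ordered activity sequence per user key."""
    records = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["time"]):
        # Heuristic user key (assumption): IP address + user agent.
        records[(e["ip"], e["agent"])].append(e["url"])
    return dict(records)

# Hypothetical log entries; time is in seconds.
entries = [
    {"ip": "192.0.2.1", "agent": "Firefox", "time": 20, "url": "/cart"},
    {"ip": "192.0.2.1", "agent": "Firefox", "time": 10, "url": "/home"},
    {"ip": "192.0.2.1", "agent": "Chrome", "time": 15, "url": "/home"},
]
records = user_activity_records(entries)
# records -> {('192.0.2.1', 'Firefox'): ['/home', '/cart'],
#             ('192.0.2.1', 'Chrome'): ['/home']}
```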
Sessionization: the identification of a single visit to a site. It is the process of segmenting the activity record of each user into sessions.
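A minimal sketch of one way to segment a single user's activity record into sessions: start a new session whenever the gap between two consecutive requests exceeds a timeout. The 30-minute timeout is a commonly used assumption, not something fixed by the definition, and the event data is hypothetical.

```python
def sessionize(events, timeout=1800):
    """Split (timestamp_seconds, url) events of one user into sessions by inactivity gap."""
    sessions = []
    current = []
    last_time = None
    for t, url in sorted(events):
        # A gap longer than the timeout closes the current session.
        if last_time is not None and t - last_time > timeout:
            sessions.append(current)
            current = []
        current.append(url)
        last_time = t
    if current:
        sessions.append(current)
    return sessions

# Hypothetical activity record for one user.
events = [(0, "/home"), (600, "/products"), (4000, "/home"), (4100, "/cart")]
visits = sessionize(events)
# visits -> [['/home', '/products'], ['/home', '/cart']]
```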
What is episode identification (data segmentation)
A subset of the activities performed when the user needs certain information: “a subset of a session comprised of semantically related pageviews”, for example how many times that user entered topic-related domains.
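A minimal sketch of extracting an episode from a session, assuming each page has already been labelled with a topic. The page-to-topic mapping, the function name and the sample session are hypothetical; in practice the labels would come from the site's content or structure data.

```python
def episode(session, page_topics, topic):
    """Return the pageviews of a session that belong to the given topic, in order."""
    return [p for p in session if page_topics.get(p) == topic]

# Hypothetical topic labels for pages.
page_topics = {
    "/home": "general",
    "/shoes/boots": "shoes",
    "/shoes/sneakers": "shoes",
    "/cart": "checkout",
}
session = ["/home", "/shoes/boots", "/cart", "/shoes/sneakers"]
shoe_episode = episode(session, page_topics, "shoes")
# shoe_episode -> ['/shoes/boots', '/shoes/sneakers']
```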