DM Exam Paper Qs Flashcards
Give a definition of an outlier
A data point that lies unusually far from the majority of datapoints
Give an example of an outlier in credit card transactions
A fraudulent transaction
Propose two methods that can be used to detect outliers
- Clustering
- combined computer and human inspection
- regression
- box plots
Which outlier detection method is the most reliable
Combined computer and human inspection: the computer flags suspicious values, which a human then verifies. This two-step process is more reliable than relying on an algorithm alone.
Steps of K-means algorithm
- randomly select k initial cluster centroids
- assign every item to its nearest centroid
- recompute the centroids
- Repeat from step 2 until no reassignments occur
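The four steps above can be sketched in Python. This is a minimal one-dimensional illustration, not course code; the function name and arguments are my own:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal 1-D k-means sketch: assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as its cluster mean
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step 4: stop once no reassignments occur (centroids stable)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)
```

With two well-separated groups of points and k=2, the centroids converge to the two group means regardless of the random start.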
Give three application examples of spatiotemporal data streams
- traffic monitoring
- environmental monitoring
- weather pattern analysis
- asset tracking in logistics
Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.
Summarisation and aggregation of data at different levels of granularity
Identify and discuss the major challenges in spatiotemporal data mining
High Dimensionality: it involves multiple dimensions (space and time) for each data point
- Analysing and visualising can be computationally intensive
- careful consideration for data structure
- choice of computation method (selective, partial, full)
Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.
- discuss the inputs and outputs of weather data warehouse
- Draw star schema of weather warehouse
- Use OLAP for efficient data analysis
What is the bottleneck of Apriori Algorithm
Candidate Generation
- very large candidate sets
- multiple scans of the database
Briefly explain a method to improve Apriori’s efficiency
Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Apriori Algorithm Steps
Given a Database D and a minimum support MS:
1. Initial scan of D to get the frequencies of each individual item (Candidate generation)
2. Eliminate candidates that don’t have MS
3. Construct new candidates by combining eligible candidates (resulting itemset must be only one item larger)
4. Repeat pruning and construction until all frequent itemsets are found
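A minimal Python sketch of this level-wise process (illustrative only; the full Apriori also prunes candidates whose subsets are infrequent, which is omitted here for brevity):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: count, prune, join, repeat."""
    transactions = [frozenset(t) for t in transactions]
    # step 1: initial scan for the 1-item candidates
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        # step 2: eliminate candidates below minimum support
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # step 3: join surviving itemsets into candidates one item larger
        prev = list(level)
        k = len(prev[0]) + 1 if prev else 0
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    return frequent            # step 4: loop ends when no candidates remain
```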
Give the necessary steps of the FP-growth algorithm.
- Scan dataset and construct frequency table of each item
- Eliminate items without minimum support
- Transform the transactions, reordering the frequent items in each one in descending order of frequency
- Construct tree in top-down recursive fashion
What is a data cube and how is it formed?
A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions
What is a fact table
Contains keys to each of the related dimension tables and measures
Give a definition for each of the three categories of measures that can be used for the data warehouse
Distributive: can be computed independently on partitions of the data and the partial results merged (e.g. count)
Algebraic: can be computed by an algebraic function over a bounded number of distributive results (e.g. avg = sum / count)
Holistic: require analysing the entire dataset as a whole, with no constant bound on the storage needed for a partial result (e.g. median)
Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks
Take the maximum across the chunk maxima, remove it, and repeat ten times; equivalently, keep each chunk's local top 10 and merge them, since only a local top-10 value can appear in the global top 10
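One way to sketch the chunk-merge idea in Python, assuming each chunk is simply a list of values (`top_k` is a hypothetical helper, not a data-cube API):

```python
import heapq

def top_k(chunks, k=10):
    """Top-k over a chunked cube: take each chunk's local top-k,
    then merge only those survivors instead of the full data."""
    local = [heapq.nlargest(k, chunk) for chunk in chunks]   # per-chunk top-k
    return heapq.nlargest(k, (v for part in local for v in part))
```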
Define the notions of support and confidence
rule: Y → Z
Support: probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: Conditional probability that a transaction having Y also contains Z
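These two definitions can be checked with a small sketch (a hypothetical helper over market-basket style transactions):

```python
def support_confidence(transactions, Y, Z):
    """support(Y -> Z) = P(Y u Z); confidence(Y -> Z) = P(Z | Y)."""
    Y, Z = set(Y), set(Z)
    n = len(transactions)
    n_y = sum(1 for t in transactions if Y <= set(t))          # contains Y
    n_yz = sum(1 for t in transactions if (Y | Z) <= set(t))   # contains Y and Z
    return n_yz / n, n_yz / n_y
```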
Explain why holistic measures are not desirable when designing a data warehouse.
- provide: valuable insights
- trade-offs: computational complexity, scalability, storage requirements
- less desirable in the context of a data warehouse, where performance and responsiveness are critical considerations
K-means complexity calculation
O(tkn)
t: number of iterations
k: number of centroids
n: number of data points
What kind of problem is k-means clustering and what does it mean
It is NP-hard, which means that it is at least as hard as any NP-problem, although it might, in fact, be harder
Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.
- inaccurate results
- overfitting
- data inconsistency
Define Information Retrieval
the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need
Define Precision and Recall
Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
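A small sketch of both measures over document-ID sets (hypothetical helper name):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|
       Recall    = |retrieved & relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)
```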
What is the main difference between clustering and classification
Clustering is an unsupervised learning technique which groups the data.
Classification is a supervised learning technique that predicts the class or value of unseen data
Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.
Convert the ordinal categories into numerical values while preserving the ordinal relationship. Assign integers to ordinal categories based on their order.
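The usual textbook normalisation maps a category of rank r (out of M ordered categories) to z = (r - 1) / (M - 1), so the values fall in [0, 1] while preserving the order. A minimal sketch:

```python
def ordinal_to_numeric(values, order):
    """Map ordinal categories to [0, 1] preserving their order:
       z = (rank - 1) / (M - 1), where M is the number of categories."""
    rank = {cat: i + 1 for i, cat in enumerate(order)}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]
```

The resulting values can then be treated as interval-scaled when computing dissimilarities.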
How to calculate the dissimilarity matrix for an attribute variable
- Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric.
- Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
Define the notion of a frequent itemset
An itemset whose support is at least the minimum support threshold. (By the Apriori property, any subset of a frequent itemset must also be frequent.)
The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?
Its compactness.
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies
- result is never larger than the original dataset
Describe briefly the data mining process steps
- Data preprocessing: Gathering, Cleaning, transformation, integration
- Data Analysis: Model building and evaluation
It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering
- sampling in Data Preprocessing
- gain insights in Data Analysis
Explain the main differences between data sampling and data clustering during the pre-processing step
- sampling is using representatives from the clusters
- clustering is the process of identifying the clusters
Explain how a query-driven approach is applied on heterogeneous datasets.
A wrapper is built on top of each database's meta-dictionary; the meta-dictionary resolves the inconsistencies between the datasets at query time
Give the advantages and disadvantages of an update-driven approach
ADV: faster processing, data is stored and structured
DISADV: for small or homogeneous datasets, building and maintaining the warehouse creates unnecessary overhead
Explain why an update-driven approach is preferred to a query-driven approach
When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term
Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing
when the dataset is small or consists of homogeneous sources
How to draw a snowflake schema
Start from the star schema's fact table, then normalise each dimension table into a hierarchy of sub-dimension tables linked by keys
Give a brief description of the core OLAP operations
- Roll up (drill-up): summarise data by climbing up hierarchy or dimension reduction
- Drill down (roll down): reverse of roll-up
- Slice and dice: project and select
- Pivot (rotate): reorientate the cube
One of the benefits of the FP-tree structure is Compactness. Explain why FP-growth method is compact
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: more frequent items are more likely to be shared
- never larger than the original dataset
Explain how the FP-growth method avoids the two costly problems of the Apriori algorithm
The two costly problems of huge candidate sets and repeated database scans are avoided: FP-growth generates no candidates and scans the database only twice (once to count items, once to build the FP-tree)
Can you always find an optimal clustering with k-means? Justify your answer
No: k-means converges to a local optimum that depends on the initial placement of the centroids, so a poor initialisation can miss the optimal clustering.
Illustrate the strength and weakness of k-means in comparison with a hierarchical clustering scheme (e.g. AGNES)
- initial initialisation: AGNES is robust
- cluster shapes: AGNES is more flexible
- outlier sensitivity: K-means is very sensitive
- specification of number of clusters
- k-means is less expensive
Illustrate the strength and weakness of k-means in comparison with k-medoids
- outliers: k-medoids is more robust
- cluster shapes: k-medoids does not assume spherical shapes
- computational complexity: k-medoids is higher
what is the basic methodology of Latent Semantic Indexing
Create a frequency table
Use a singular value decomposition (SVD) technique to reduce the size of the frequency table, retaining only the most significant singular values and their corresponding vectors
What does DBSCAN discover, and what does it rely on?
- clusters of arbitrary shape in spatial datasets with noise
- a density-based notion of cluster
Properties of a spatial data warehouse
Same as DW; integrated, subject-oriented, time-variant, and non-volatile
What are the two parameters in density-based clustering
Eps: max radius of the neighbourhood
MinPts: min number of points in an eps-neighbourhood of that point
List clustering applications
- Pattern recognition
- image processing
- economic science
- spatial data analysis
- WWW
What is the simplified assumption in the Naive Bayes Classifier
attributes are conditionally independent
Bayesian Theorem Formula
P(A|B) = P(B|A) P(A) / P(B)
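A worked numeric example of the formula (the scenario numbers are hypothetical, chosen only to illustrate the arithmetic):

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: a test with 90% sensitivity and a 10%
# false-positive rate, for a condition with 1% prevalence.
p_a = 0.01                          # P(A), the prior
p_b = 0.9 * p_a + 0.1 * (1 - p_a)   # P(B) by total probability
print(posterior(0.9, p_a, p_b))     # ~0.083: most positives are false
```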
Briefly explain the two methods to avoid overfitting in Decision Trees
Post-pruning: Remove branches from a “fully grown” tree - get a sequence of progressively pruned trees
Pre-pruning: Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
How is an attribute split performed using the Gini Index?
The attribute that provides the smallest Gini Split is chosen to split the node
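A small sketch of the computation: the Gini split index is the impurity of each partition weighted by its size, and the attribute minimising it wins (helper names are my own):

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(partitions):
    """Weighted Gini of a candidate split; smaller is a purer split."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)
```

A split that separates the classes perfectly scores 0, while one that leaves the classes evenly mixed scores as badly as no split at all.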
What does the Gini index measure?
The impurity of the dataset: 0 means all items belong to one class; higher values mean the classes are more mixed
How is a Decision Tree constructed using a basic algorithm (a greedy algorithm)?
In a top-down recursive divide-and-conquer manner
What is tree pruning?
Identification and removal of branches that reflect noise or outliers
Give an example of web structure mining using (a) links and (b) generalisation
a) PageRank: assignment of weights to pages using interconnections between pages
b) VWV: multi-level database representation of the web
What is Singular Value Decomposition (SVD)?
a matrix factorization method that decomposes a matrix into three other matrices: U, S, and V.
What are some major difficulties of keyword-based retrieval?
- Synonymy: A keyword does not appear anywhere in the document, even though the document is closely related to the keyword
- Polysemy: The same keyword may mean different things in different contexts
what is the strategy for multidimensional analysis of complex data objects?
- generalise the plan-base in different directions
- look for sequential patterns in the generalised plans
- derive high-level plans
what is a plan and plan mining
- a plan is a variable sequence of actions
- plan mining is extracting significant generalised (sequential) patterns from a plan-base (large collection of plans)
Why Decision Tree Induction in data mining?
- relatively faster learning speed (than other classification methods)
- convertible to simple and easy-to-understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods
List some enhancements to basic decision tree induction
- Allow continuous-valued attributes
- handle missing attribute values
- attribute construction
What is the general methodology of Association-Based Document Classification
- Extract keywords and terms by information retrieval and association analysis techniques
- Obtain the concept hierarchies, then perform classification and association mining methods
Keyword-based association analysis definition
Collect sets of keywords that occur frequently together and then find the association or correlation relationships among them
what is the step-by-step method for performing Latent Semantic Indexing
- Create a term frequency matrix
- SVD construction
- Vector Identification
- Index creation
What are the purposes of dimension tables, cardinality, and operators in generalisation-based sequence mining
- use dimension tables to generalise plan-base in a multidimensional way
- cardinality determines the right level of generalisation (level planning)
- use operators (merge + , option []) to further generalise patterns
what are the steps for mining spatial association
- rough spatial computation (as a filter)
- detailed spatial algorithm (as a refinement)
two categories of similarity queries in time-series analysis
- Whole matching
- subsequence matching
List major features of density-based clustering methods
- discover clusters of arbitrary shape
- handle noise
- one scan
- need density parameters as stopping condition
in K-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
List the Data Mining Tasks
- Problem Definition
- Data gathering and preparation
- Model building and evaluation
- Knowledge deployment
List the four Data Quality Types in order
Perfect, not perfect, inspection, soft.
What is data discretisation?
reducing the number of values for a continuous variable by dividing the range into intervals, replacing the actual values with interval labels.
List the three central tendency statistics
mean, median, midrange
List the three data dispersion statistics
quantiles, IQR, variance
Five Number Summary
min, q1, median, q3, max
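A quick sketch of the computation (using the median-of-halves quartile convention; other quartile conventions give slightly different Q1/Q3 values):

```python
def five_number_summary(values):
    """min, Q1, median, Q3, max using medians of the lower/upper halves."""
    s = sorted(values)
    n = len(s)
    def median(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2
    return s[0], median(s[:n // 2]), median(s), median(s[(n + 1) // 2:]), s[-1]
```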
How to do equal-width partitioning?
Divide the range into N intervals of equal size
Width = (Max-Min)/N
How to do equal-depth partitioning?
Divide the range into N intervals, each containing appx. the same number of objects
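Both partitioning schemes can be sketched side by side (hypothetical helpers; the equal-depth version assumes distinct values for simplicity):

```python
def equal_width_bins(values, n_bins):
    """Equal-width: each interval has width (max - min) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # clamp so the maximum value falls in the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Equal-depth: each bin holds roughly the same number of values."""
    order = sorted(values)
    depth = len(values) / n_bins
    pos = {v: i for i, v in enumerate(order)}   # assumes distinct values
    return [min(int(pos[v] / depth), n_bins - 1) for v in values]
```

Note how an outlier skews equal-width binning (most values crowd into one bin) but not equal-depth binning.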
name a method to detect redundant data
correlation-based analysis
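A minimal sketch of the idea: compute the Pearson correlation between two attributes, and treat a coefficient near ±1 as a sign that one attribute is redundant (helper name is my own):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; |r| near 1 flags redundancy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```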
Give an example of a non-parametric method for achieving numerosity reduction
Histograms: Divide the data into buckets and store average for each bucket
Clustering: partition the dataset into clusters, and one can store cluster representation only
Stratified Sampling: approximate the percentage of each class in the overall database to choose a representative subset of the data
How to use concept hierarchies for data reduction
collect and replace low level concepts (numerical age) by higher level concepts (young, old)
What is a Virtual Data Warehouse?
A set of views over operational databases
How is a DW subject-orientated?
- Organised around major subjects, such as customer, product, sales
- Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
How is a DW integrated?
- integrates multiple heterogenous data sources
- apply techniques of data cleaning and integration
How is a DW time variant?
- the data provides information from a historical perspective rather than current value data
- every key structure in the DW contains an element of time, explicitly or implicitly
How is a DW non-volatile?
once data is loaded in, it is typically not subject to frequent changes or updates. The data remains relatively stable and unchanged over time.
What is Online Transaction Processing?
- OLTP is a major task of traditional relational DB
- used by IT for day-to-day operations
What is a challenge in Weather Pattern Analysis when using a spatial data warehouse
A merged region may contain hundreds of primitive regions (polygons)
What are Polygons in the context of spatial data warehouses
Spatial Areas
What Dimensions and measurements are in the Fact Table of a Spatial Data Warehouse for mining weather pattern analysis
- Dimensions: Region_name, time, precipitation, temperature
- Measurements: region_map, area, count
What is a reasonable choice of computation method for Spatial Data Cubes
Selective computation: Only materialise spatial objects that will be accessed frequently
What is the difference between a traditional DB and a Data Warehouse
- DB used for day-to-day operations using OLTP
- DW used for data analysis using OLAP
How to generalise spatial data, and what does it require
Generalise detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage - requires the merge of a set of geographic areas by spatial operations
Structure of star schema for weather warehouse
Fact table with four dimensions and three measures:
Time: time_key, day, month…
Region: region_key, name, location, city,…
Temperature: temp_key, range, temp_value, description
Precipitation: key, range, value, description
Measures: map, area, count
Inputs and output of Weather Spatial Data Warehouse
Input:
- a map with weather probes scattered around in an area
- daily weather data
- concept hierarchies for all attributes
Output:
- a map that reveals patterns: merged (similar) regions
Method to efficiently generate candidate sets
Store candidate itemsets in a hash-tree
- leaf node of tree contains a list of itemsets and counts
- Interior node contains a hash table
- subset function: finds all the candidates contained in a transaction