DM Exam Paper Qs Flashcards
Give a definition of an outlier
A data point that lies unusually far from the majority of data points
Give an example of an outlier in credit card transactions
A fraudulent transaction
Propose two methods that can be used to detect outliers
- Clustering
- combined computer and human inspection
- regression
- box plots
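One of the methods above, box plots, can be sketched directly: a value is flagged if it falls outside the whiskers at Q1 − 1.5·IQR and Q3 + 1.5·IQR. A minimal stdlib sketch; the quartile interpolation rule used (`statistics.quantiles` with `n=4`) is an assumption, and the amounts are made up.

```python
# Box-plot (IQR) outlier detection: flag values below Q1 - 1.5*IQR
# or above Q3 + 1.5*IQR.
import statistics

def iqr_outliers(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# toy credit card amounts with one suspicious transaction
amounts = [12, 15, 14, 13, 16, 15, 14, 980]
print(iqr_outliers(amounts))  # [980]
```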
Which outlier detection method is the most reliable
Combined computer and human inspection: the computer detects suspicious values, which are then checked by a human. This two-step process is more reliable than relying on an algorithm alone.
Steps of K-means algorithm
- randomly select k initial cluster centroids
- assign every item to its nearest centroid
- recompute the centroids
- Repeat from step 2 until no reassignments occur
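The steps above can be sketched in a few lines. This is a minimal 1-D version assuming Euclidean (absolute-difference) distance and fixed rather than random initial centroids, so the run is reproducible.

```python
# K-means sketch: assign, recompute, repeat until no change.
def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # step 3: recompute centroids as cluster means
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # step 4: stop when no centroid (hence no assignment) changes
        if new == centroids:
            return new, clusters
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans([1.0, 2.0, 9.0, 10.0], [1.0, 9.0])
print(centroids)  # [1.5, 9.5]
```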
Give three application examples of spatiotemporal data streams
- traffic monitoring
- environmental monitoring
- weather pattern analysis
- asset tracking in logistics
Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.
Summarisation and aggregation of data at different levels of granularity
Identify and discuss the major challenges in spatiotemporal data mining
High Dimensionality: it involves multiple dimensions (space and time) for each data point
- Analysing and visualising can be computationally intensive
- careful consideration for data structure
- choice of computation method (selective, partial, full)
Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.
- discuss the inputs and outputs of weather data warehouse
- Draw star schema of weather warehouse
- Use OLAP for efficient data analysis
What is the bottleneck of Apriori Algorithm
Candidate Generation
- very large candidate sets
- multiple scans of the database
Briefly explain a method to improve Apriori’s efficiency
Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
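Transaction reduction can be sketched on toy data: after the k-th scan, any transaction containing no frequent k-itemset is dropped, since it cannot contribute to any larger frequent itemset. Shown here for k = 1 (frequent single items) as a simplifying assumption.

```python
# Transaction reduction after the first Apriori scan.
from collections import Counter

transactions = [{"a", "b"}, {"a", "c"}, {"d"}, {"a", "b", "c"}]
min_support = 2  # absolute support count

counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= min_support}

# keep only transactions that still contain at least one frequent item
reduced = [t for t in transactions if t & frequent]
print(reduced)  # the {"d"} transaction is gone
```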
Apriori Algorithm Steps
Given a Database D and a minimum support MS:
1. Initial scan of D to get the frequencies of each individual item (Candidate generation)
2. Eliminate candidates that don’t have MS
3. Construct new candidates by combining eligible candidates (resulting itemset must be only one item larger)
4. Repeat pruning and construction until all frequent itemsets are found
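The four steps above can be sketched compactly, assuming an absolute minimum support count and transactions given as sets of items.

```python
# Apriori sketch: scan, prune, join, repeat.
def apriori(transactions, min_support):
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # step 1: initial scan for frequencies of individual items
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = []
    while level:
        # step 2: eliminate candidates below minimum support
        survivors = [c for c in level if support(c) >= min_support]
        frequent.extend(survivors)
        # step 3: join survivors into candidates exactly one item larger
        level = sorted({a | b for a in survivors for b in survivors
                        if len(a | b) == len(a) + 1}, key=sorted)
    # step 4: the loop repeats until no new candidates remain
    return frequent

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
for itemset in apriori(txns, min_support=2):
    print(sorted(itemset))
```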
Give the necessary steps of the FP-growth algorithm.
- Scan dataset and construct frequency table of each item
- Eliminate items without minimum support
- Rewrite each transaction so that its frequent items appear in descending order of frequency
- Construct tree in top-down recursive fashion
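The first three steps above (the preprocessing that feeds tree construction) can be sketched as follows; the tie-breaking rule for equal frequencies is an assumption made for determinism.

```python
# FP-growth preprocessing: count, prune, reorder transactions.
from collections import Counter

def fp_preprocess(transactions, min_support):
    # step 1: frequency table of each item
    counts = Counter(i for t in transactions for i in t)
    # step 2: eliminate items without minimum support
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # step 3: reorder each transaction's frequent items, most frequent first
    ordered = []
    for t in transactions:
        kept = [i for i in t if i in frequent]
        kept.sort(key=lambda i: (-frequent[i], i))  # ties broken alphabetically
        ordered.append(kept)
    return ordered

txns = [["a", "b", "d"], ["b", "c"], ["a", "b", "c"], ["e"]]
print(fp_preprocess(txns, min_support=2))
# [['b', 'a'], ['b', 'c'], ['b', 'a', 'c'], []]
```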
What is a data cube and how is it formed?
A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions
What is a fact table
Contains keys to each of the related dimension tables and measures
Give a definition for each of the three categories of measures that can be used for the data warehouse
Distributive: can be computed independently on partitions of the data, with the partial results combined directly (e.g. count)
Algebraic: involve combining results from distributive functions in a structured way (e.g. avg)
Holistic: require analyzing the entire dataset as a whole due to their complexity and lack of constant bounds on storage size (e.g. median)
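The three categories can be illustrated on chunked data: count is distributive (partial counts just add up), avg is algebraic (combine a fixed-size summary of partial sums and counts), and median is holistic (the raw values of every chunk must be pooled). The chunk values are made up.

```python
# Distributive vs algebraic vs holistic measures on partitioned data.
import statistics

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# distributive: combine per-chunk counts directly
total_count = sum(len(c) for c in chunks)

# algebraic: combine fixed-size per-chunk summaries (sum, count)
total_sum = sum(sum(c) for c in chunks)
average = total_sum / total_count

# holistic: no fixed-size summary suffices; pool all raw values
median = statistics.median([v for c in chunks for v in c])

print(total_count, average, median)  # 9 5.0 5
```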
Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks
Compute the top 10 values within each chunk, then merge these local top-10 lists and take the overall top 10; the global top 10 must lie in the union of the local top 10s. (Equivalently, repeatedly take the maximum across chunks, remove it, and repeat 10 times.)
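This chunked top-k idea can be sketched with the standard library (k = 3 here to keep the toy chunks small):

```python
# Top-k over chunks: merge each chunk's local top-k, then take the
# top-k of the candidates.
import heapq

def top_k_chunked(chunks, k):
    candidates = [v for chunk in chunks for v in heapq.nlargest(k, chunk)]
    return heapq.nlargest(k, candidates)

chunks = [[3, 41, 7], [99, 5, 12], [8, 60, 2]]
print(top_k_chunked(chunks, k=3))  # [99, 60, 41]
```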
Define the notions of support and confidence
rule: Y → Z
Support: probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: Conditional probability that a transaction having Y also contains Z
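The two definitions can be computed directly on toy transactions for the rule Y → Z:

```python
# Support and confidence for the rule Y -> Z.
transactions = [{"Y", "Z"}, {"Y"}, {"Y", "Z"}, {"Z"}, {"X"}]

n = len(transactions)
both = sum(1 for t in transactions if {"Y", "Z"} <= t)
with_y = sum(1 for t in transactions if "Y" in t)

support = both / n          # P(Y and Z)
confidence = both / with_y  # P(Z | Y)
print(support, confidence)  # 0.4 0.666...
```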
Explain why holistic measures are not desirable when designing a data warehouse.
- Holistic measures provide valuable insights but carry trade-offs: computational complexity, poor scalability, and high storage requirements (no constant-size summary exists)
- This makes them less desirable in a data warehouse, where performance and responsiveness are critical
K-means complexity calculation
O(tkn)
t: number of iterations
k: number of centroids
n: number of data points
What kind of problem is k-means clustering and what does it mean
It is NP-hard, which means that it is at least as hard as any NP-problem, although it might, in fact, be harder
Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.
- inaccurate results
- overfitting
- data inconsistency
Define Information Retrieval
the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need
Define Precision and Recall
Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
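Both definitions follow from a retrieved set and the set of truly relevant documents; the document ids below are made up for illustration.

```python
# Precision and recall from retrieved vs relevant document sets.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 5}

true_positives = retrieved & relevant           # relevant AND retrieved
precision = len(true_positives) / len(retrieved)  # 2/4 = 0.5
recall = len(true_positives) / len(relevant)      # 2/3 ≈ 0.667
print(precision, recall)
```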
What is the main difference between clustering and classification
Clustering is an unsupervised learning technique which groups the data.
Classification is a supervised learning technique that predicts the class or value of unseen data
Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.
Convert the ordinal categories to ranks while preserving their order (assign integers based on position), then normalise the ranks onto a common scale such as [0, 1] so they can be treated as interval-scaled values.
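A sketch of this handling: map each category to its rank, then normalise the rank onto [0, 1]. The category names are made up.

```python
# Ordinal variable handling: rank, then normalise to [0, 1].
order = ["low", "medium", "high"]  # ordinal categories, in order
rank = {c: i + 1 for i, c in enumerate(order)}
M = len(order)

def to_interval(category):
    # rank r in {1..M} mapped to (r - 1) / (M - 1) in [0, 1]
    return (rank[category] - 1) / (M - 1)

a, b = to_interval("low"), to_interval("high")
print(abs(a - b))  # dissimilarity 1.0 between the extremes
```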
How to calculate the dissimilarity matrix for an attribute variable
- Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric.
- Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
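The two steps above can be sketched for a numeric attribute, assuming Manhattan (absolute-difference) distance as the chosen metric:

```python
# Dissimilarity matrix: element [i][j] is the distance between
# observation i and observation j.
values = [2.0, 5.0, 9.0]
n = len(values)
matrix = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
for row in matrix:
    print(row)
# [0.0, 3.0, 7.0]
# [3.0, 0.0, 4.0]
# [7.0, 4.0, 0.0]
```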
Define the notion of a frequent itemset
An itemset whose support (frequency of occurrence in the transaction set) is at least the minimum support threshold. The Apriori property follows: any subset of a frequent itemset must also be frequent.
The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?
The compactness of the FP-tree: the dataset is compressed and mined recursively, avoiding Apriori's costly candidate generation and repeated database scans.
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies
- result is never larger than the original dataset
Describe briefly the data mining process steps
- Data preprocessing: Gathering, Cleaning, transformation, integration
- Data Analysis: Model building and evaluation
It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering
- sampling in Data Preprocessing
- gain insights in Data Analysis
Explain the main differences between data sampling and data clustering during the pre-processing step
- sampling is using representatives from the clusters
- clustering is the process of identifying the clusters
Explain how a query-driven approach is applied on heterogeneous datasets.
A wrapper is built on top of each database's meta-dictionary; the meta-dictionary resolves the inconsistencies between the heterogeneous sources at query time
Give the advantages and disadvantages of an update-driven approach
ADV: faster processing, data is stored and structured
DISADV: creates unnecessary overhead when the dataset is small or the sources are homogeneous
Explain why an update-driven approach is preferred to a query-driven approach
When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term
Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing
when the dataset is small or consists of homogeneous sources
How to draw a snowflake schema
Start from the star schema and normalise each dimension table into further sub-dimension tables (e.g. a location dimension split into linked city and country tables)
Give a brief description of the core OLAP operations
- Roll up (drill-up): summarise data by climbing up hierarchy or dimension reduction
- Drill down (roll down): reverse of roll-up
- Slice and dice: project and select
- Pivot (rotate): reorientate the cube
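Roll-up and slice can be illustrated on a tiny fact table using plain dicts; the schema and sales figures are made up.

```python
# OLAP sketch: roll-up climbs the location hierarchy (city -> country),
# slice fixes one dimension value.
facts = [
    {"city": "Leeds", "country": "UK", "quarter": "Q1", "sales": 10},
    {"city": "York",  "country": "UK", "quarter": "Q1", "sales": 5},
    {"city": "Paris", "country": "FR", "quarter": "Q1", "sales": 7},
    {"city": "Leeds", "country": "UK", "quarter": "Q2", "sales": 4},
]

# roll-up: aggregate city-level sales to country level
rollup = {}
for f in facts:
    key = (f["country"], f["quarter"])
    rollup[key] = rollup.get(key, 0) + f["sales"]
print(rollup)  # {('UK', 'Q1'): 15, ('FR', 'Q1'): 7, ('UK', 'Q2'): 4}

# slice: select the sub-cube where quarter == "Q1"
q1_slice = [f for f in facts if f["quarter"] == "Q1"]
print(len(q1_slice))  # 3
```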