DM Exam Paper Qs Flashcards
Give a definition of an outlier
A data point that lies unusually far from the majority of datapoints
Give an example of an outlier in credit card transactions
A fraudulent transaction
Propose two methods that can be used to detect outliers
- Clustering
- combined computer and human inspection
- regression
- box plots
Which outlier detection method is the most reliable
Combined computer and human inspection: the computer flags suspicious values, which a human then verifies. This two-step process is more reliable than relying on an algorithm alone.
Steps of K-means algorithm
- randomly select k initial cluster centroids
- assign every item to its nearest centroid
- recompute the centroids
- Repeat from step 2 until no reassignments occur
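The four steps above can be sketched in Python. This is a minimal one-dimensional illustration, not course code; the function name and arguments are my own:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal 1-D k-means sketch: assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    for _ in range(max_iter):
        # step 2: assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # step 3: recompute each centroid as its cluster mean
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # step 4: stop once no reassignments occur (centroids stable)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)
```

With two well-separated groups of points and k=2, the centroids converge to the two group means regardless of the random start.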
Give three application examples of spatiotemporal data streams
- traffic monitoring
- environmental monitoring
- weather pattern analysis
- asset tracking in logistics
Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.
Summarisation and aggregation of data at different levels of granularity
Identify and discuss the major challenges in spatiotemporal data mining
High Dimensionality: it involves multiple dimensions (space and time) for each data point
- Analysing and visualising can be computationally intensive
- careful consideration for data structure
- choice of computation method (selective, partial, full)
Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.
- discuss the inputs and outputs of weather data warehouse
- Draw star schema of weather warehouse
- Use OLAP for efficient data analysis
What is the bottleneck of Apriori Algorithm
Candidate Generation
- very large candidate sets
- multiple scans of the database
Briefly explain a method to improve Apriori’s efficiency
Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Apriori Algorithm Steps
Given a Database D and a minimum support MS:
1. Initial scan of D to get the frequencies of each individual item (Candidate generation)
2. Eliminate candidates that don’t have MS
3. Construct new candidates by combining eligible candidates (resulting itemset must be only one item larger)
4. Repeat pruning and construction until all frequent itemsets are found
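A minimal Python sketch of this level-wise process (illustrative only; the full Apriori also prunes candidates whose subsets are infrequent, which is omitted here for brevity):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: count, prune, join, repeat."""
    transactions = [frozenset(t) for t in transactions]
    # step 1: initial scan for the 1-item candidates
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        # step 2: eliminate candidates below minimum support
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # step 3: join surviving itemsets into candidates one item larger
        prev = list(level)
        k = len(prev[0]) + 1 if prev else 0
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    return frequent            # step 4: loop ends when no candidates remain
```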
Give the necessary steps of the FP-growth algorithm.
- Scan dataset and construct frequency table of each item
- Eliminate items without minimum support
- Transform the transactions, reordering the frequent items in each one in descending order of frequency
- Construct tree in top-down recursive fashion
What is a data cube and how is it formed?
A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions
What is a fact table
Contains keys to each of the related dimension tables and measures
Give a definition for each of the three categories of measures that can be used for the data warehouse
Distributive: can be computed independently on partitions of the data and the partial results merged (e.g. count)
Algebraic: can be computed by an algebraic function over a bounded number of distributive results (e.g. avg = sum / count)
Holistic: require analysing the entire dataset as a whole, with no constant bound on the storage needed for a partial result (e.g. median)
Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks
Take the maximum across the chunk maxima, remove it, and repeat ten times; equivalently, keep each chunk's local top 10 and merge them, since only a local top-10 value can appear in the global top 10
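One way to sketch the chunk-merge idea in Python, assuming each chunk is simply a list of values (`top_k` is a hypothetical helper, not a data-cube API):

```python
import heapq

def top_k(chunks, k=10):
    """Top-k over a chunked cube: take each chunk's local top-k,
    then merge only those survivors instead of the full data."""
    local = [heapq.nlargest(k, chunk) for chunk in chunks]   # per-chunk top-k
    return heapq.nlargest(k, (v for part in local for v in part))
```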
Define the notions of support and confidence
rule: Y → Z
Support: probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: Conditional probability that a transaction having Y also contains Z
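These two definitions can be checked with a small sketch (a hypothetical helper over market-basket style transactions):

```python
def support_confidence(transactions, Y, Z):
    """support(Y -> Z) = P(Y u Z); confidence(Y -> Z) = P(Z | Y)."""
    Y, Z = set(Y), set(Z)
    n = len(transactions)
    n_y = sum(1 for t in transactions if Y <= set(t))          # contains Y
    n_yz = sum(1 for t in transactions if (Y | Z) <= set(t))   # contains Y and Z
    return n_yz / n, n_yz / n_y
```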
Explain why holistic measures are not desirable when designing a data warehouse.
- provide: valuable insights
- trade-offs: computational complexity, scalability, storage requirements
- less desirable in the context of a data warehouse, where performance and responsiveness are critical considerations
K-means complexity calculation
O(tkn)
t: number of iterations
k: number of centroids
n: number of data points
What kind of problem is k-means clustering and what does it mean
It is NP-hard, which means that it is at least as hard as any NP-problem, although it might, in fact, be harder
Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.
- inaccurate results
- overfitting
- data inconsistency
Define Information Retrieval
the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need
Define Precision and Recall
Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
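A small sketch of both measures over document-ID sets (hypothetical helper name):

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved & relevant| / |retrieved|
       Recall    = |retrieved & relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)
```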
What is the main difference between clustering and classification
Clustering is an unsupervised learning technique which groups the data.
Classification is a supervised learning technique that predicts the class or value of unseen data
Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.
Convert the ordinal categories into numerical values while preserving the ordinal relationship. Assign integers to ordinal categories based on their order.
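The usual textbook normalisation maps a category of rank r (out of M ordered categories) to z = (r - 1) / (M - 1), so the values fall in [0, 1] while preserving the order. A minimal sketch:

```python
def ordinal_to_numeric(values, order):
    """Map ordinal categories to [0, 1] preserving their order:
       z = (rank - 1) / (M - 1), where M is the number of categories."""
    rank = {cat: i + 1 for i, cat in enumerate(order)}
    M = len(order)
    return [(rank[v] - 1) / (M - 1) for v in values]
```

The resulting values can then be treated as interval-scaled when computing dissimilarities.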
How to calculate the dissimilarity matrix for an attribute variable
- Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric.
- Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
Define the notion of a frequent itemset
An itemset whose support is at least the minimum support threshold. (By the Apriori property, any subset of a frequent itemset must also be frequent.)
The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?
Its compactness.
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies
- result is never larger than the original dataset
Describe briefly the data mining process steps
- Data preprocessing: Gathering, Cleaning, transformation, integration
- Data Analysis: Model building and evaluation
It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering
- sampling in Data Preprocessing
- gain insights in Data Analysis
Explain the main differences between data sampling and data clustering during the pre-processing step
- sampling is using representatives from the clusters
- clustering is the process of identifying the clusters
Explain how a query-driven approach is applied on heterogeneous datasets.
A wrapper is built on top of each database's meta-dictionary; the meta-dictionary resolves the inconsistencies between the datasets at query time
Give the advantages and disadvantages of an update-driven approach
ADV: faster processing, data is stored and structured
DISADV: for small or homogeneous datasets, building and maintaining the warehouse creates unnecessary overhead
Explain why an update-driven approach is preferred to a query-driven approach
When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term
Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing
when the dataset is small or consists of homogeneous sources
How to draw a snowflake schema
Start from the star schema's fact table, then normalise each dimension table into a hierarchy of sub-dimension tables linked by keys
Give a brief description of the core OLAP operations
- Roll up (drill-up): summarise data by climbing up hierarchy or dimension reduction
- Drill down (roll down): reverse of roll-up
- Slice and dice: project and select
- Pivot (rotate): reorientate the cube
One of the benefits of the FP-tree structure is Compactness. Explain why FP-growth method is compact
- reduces irrelevant info: no infrequent items
- descending ordering of frequencies: more frequent items are more likely to be shared
- never larger than the original dataset
Explain how the FP-growth method avoids the two costly problems of the Apriori algorithm
The two costly problems of huge candidate sets and repeated database scans are avoided: FP-growth generates no candidates and scans the database only twice (once to count items, once to build the FP-tree)
Can you always find an optimal clustering with k-means? Justify your answer
No: k-means converges to a local optimum that depends on the initial placement of the centroids, so a poor initialisation can miss the optimal clustering.
Illustrate the strength and weakness of k-means in comparison with a hierarchical clustering scheme (e.g. AGNES)
- initial initialisation: AGNES is robust
- cluster shapes: AGNES is more flexible
- outlier sensitivity: K-means is very sensitive
- specification of number of clusters
- k-means is less expensive
Illustrate the strength and weakness of k-means in comparison with k-medoids
- outliers: k-medoids is more robust
- cluster shapes: k-medoids does not assume spherical shapes
- computational complexity: k-medoids is higher
what is the basic methodology of Latent Semantic Indexing
Create a frequency table
Use a singular value decomposition (SVD) technique to reduce the size of the frequency table, retaining only the most significant singular values and their corresponding vectors
What does DBSCAN discover, and what does it rely on?
- clusters of arbitrary shape in spatial datasets with noise
- a density-based notion of cluster
Properties of a spatial data warehouse
Same as DW; integrated, subject-oriented, time-variant, and non-volatile
What are the two parameters in density-based clustering
Eps: max radius of the neighbourhood
MinPts: min number of points in an eps-neighbourhood of that point
List clustering applications
- Pattern recognition
- image processing
- economic science
- spatial data analysis
- WWW
What is the simplified assumption in the Naive Bayes Classifier
attributes are conditionally independent
Bayesian Theorem Formula
P(A|B) = P(B|A) P(A) / P(B)
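A worked numeric example of the formula (the scenario numbers are hypothetical, chosen only to illustrate the arithmetic):

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: a test with 90% sensitivity and a 10%
# false-positive rate, for a condition with 1% prevalence.
p_a = 0.01                          # P(A), the prior
p_b = 0.9 * p_a + 0.1 * (1 - p_a)   # P(B) by total probability
print(posterior(0.9, p_a, p_b))     # ~0.083: most positives are false
```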
Briefly explain the two methods to avoid overfitting in Decision Trees
Post-pruning: Remove branches from a “fully grown” tree - get a sequence of progressively pruned trees
Pre-pruning: Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
How is an attribute split performed using the Gini Index?
The attribute that provides the smallest Gini Split is chosen to split the node
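A small sketch of the computation: the Gini split index is the impurity of each partition weighted by its size, and the attribute minimising it wins (helper names are my own):

```python
def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(partitions):
    """Weighted Gini of a candidate split; smaller is a purer split."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)
```

A split that separates the classes perfectly scores 0, while one that leaves the classes evenly mixed scores as badly as no split at all.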
What does the Gini index measure?
The impurity of the dataset: 0 means all items belong to one class; higher values mean the classes are more mixed
How is a Decision Tree constructed using a basic algorithm (a greedy algorithm)?
In a top-down recursive divide-and-conquer manner
What is tree pruning?
Identification and removal of branches that reflect noise or outliers
Give an example of web structure mining using (a) links and (b) generalisation
a) PageRank: assignment of weights to pages using interconnections between pages
b) VWV: multi-level database representation of the web
What is Singular Value Decomposition (SVD)?
a matrix factorization method that decomposes a matrix into three other matrices: U, S, and V.
What are some major difficulties of keyword-based retrieval?
- Synonymy: A keyword does not appear anywhere in the document, even though the document is closely related to the keyword
- Polysemy: The same keyword may mean different things in different contexts
what is the strategy for multidimensional analysis of complex data objects?
- generalise the plan-base in different directions
- look for sequential patterns in the generalised plans
- derive high-level plans
what is a plan and plan mining
- a plan is a variable sequence of actions
- plan mining is extracting significant generalised (sequential) patterns from a plan-base (large collection of plans)
Why Decision Tree Induction in data mining?
- relatively faster learning speed (than other classification methods)
- convertible to simple and easy-to-understand classification rules
- can use SQL queries for accessing databases
- comparable classification accuracy with other methods
List some enhancements to basic decision tree induction
- Allow continuous-valued attributes
- handle missing attribute values
- attribute construction
What is the general methodology of Association-Based Document Classification
- Extract keywords and terms by information retrieval and association analysis techniques
- Obtain the concept hierarchies, then perform classification and association mining methods
Keyword-based association analysis definition
Collect sets of keywords that occur frequently together and then find the association or correlation relationships among them
what is the step-by-step method for performing Latent Semantic Indexing
- Create a term frequency matrix
- SVD construction
- Vector Identification
- Index creation
What are the purposes of dimension tables, cardinality, and operators in generalisation-based sequence mining
- use dimension tables to generalise plan-base in a multidimensional way
- cardinality determines the right level of generalisation (level planning)
- use operators (merge + , option []) to further generalise patterns
what are the steps for mining spatial association
- rough spatial computation (as a filter)
- detailed spatial algorithm (as a refinement)
two categories of similarity queries in time-series analysis
- Whole matching
- subsequence matching
List major features of density-based clustering methods
- discover clusters of arbitrary shape
- handle noise
- one scan
- need density parameters as stopping condition
in K-means, the global optimum may be found using what techniques?
- deterministic annealing
- genetic algorithms
List the Data Mining Tasks
- Problem Definition
- Data gathering and preparation
- Model building and evaluation
- Knowledge deployment
List the four Data Quality Types in order
Perfect, not perfect, inspection, soft.
What is data discretisation?
reducing the number of values for a continuous variable by dividing the range into intervals, replacing the actual values with interval labels.
List the three central tendency statistics
mean, median, midrange
List the three data dispersion statistics
quantiles, IQR, variance
Five Number Summary
min, q1, median, q3, max
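A quick sketch of the computation (using the median-of-halves quartile convention; other quartile conventions give slightly different Q1/Q3 values):

```python
def five_number_summary(values):
    """min, Q1, median, Q3, max using medians of the lower/upper halves."""
    s = sorted(values)
    n = len(s)
    def median(xs):
        m = len(xs) // 2
        return xs[m] if len(xs) % 2 else (xs[m - 1] + xs[m]) / 2
    return s[0], median(s[:n // 2]), median(s), median(s[(n + 1) // 2:]), s[-1]
```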
How to do equal-width partitioning?
Divide the range into N intervals of equal size
Width = (Max-Min)/N
How to do equal-depth partitioning?
Divide the range into N intervals, each containing appx. the same number of objects
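Both partitioning schemes can be sketched side by side (hypothetical helpers; the equal-depth version assumes distinct values for simplicity):

```python
def equal_width_bins(values, n_bins):
    """Equal-width: each interval has width (max - min) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # clamp so the maximum value falls in the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Equal-depth: each bin holds roughly the same number of values."""
    order = sorted(values)
    depth = len(values) / n_bins
    pos = {v: i for i, v in enumerate(order)}   # assumes distinct values
    return [min(int(pos[v] / depth), n_bins - 1) for v in values]
```

Note how an outlier skews equal-width binning (most values crowd into one bin) but not equal-depth binning.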
name a method to detect redundant data
correlation-based analysis
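A minimal sketch of the idea: compute the Pearson correlation between two attributes, and treat a coefficient near ±1 as a sign that one attribute is redundant (helper name is my own):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; |r| near 1 flags redundancy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```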
Give an example of a non-parametric method for achieving numerosity reduction
Histograms: Divide the data into buckets and store average for each bucket
Clustering: partition the dataset into clusters, and one can store cluster representation only
Stratified Sampling: approximate the percentage of each class in the overall database to choose a representative subset of the data
How to use concept hierarchies for data reduction
collect and replace low level concepts (numerical age) by higher level concepts (young, old)
What is a Virtual Data Warehouse?
A set of views over operational databases
How is a DW subject-orientated?
- Organised around major subjects, such as customer, product, sales
- Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
- Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
How is a DW integrated?
- integrates multiple heterogenous data sources
- apply techniques of data cleaning and integration
How is a DW time variant?
- the data provides information from a historical perspective rather than current value data
- every key structure in the DW contains an element of time, explicitly or implicitly
How is a DW non-volatile?
once data is loaded in, it is typically not subject to frequent changes or updates. The data remains relatively stable and unchanged over time.
What is Online Transaction Processing?
- OLTP is a major task of traditional relational DB
- used by IT for day-to-day operations
What is a challenge in Weather Pattern Analysis when using a spatial data warehouse
A merged region may contain hundreds of primitive regions (polygons)
What are Polygons in the context of spatial data warehouses
Spatial Areas
What Dimensions and measurements are in the Fact Table of a Spatial Data Warehouse for mining weather pattern analysis
- Dimensions: Region_name, time, precipitation, temperature
- Measurements: region_map, area, count
What is a reasonable choice of computation method for Spatial Data Cubes
Selective computation: Only materialise spatial objects that will be accessed frequently
What is the difference between a traditional DB and a Data Warehouse
- DB used for day-to-day operations using OLTP
- DW used for data analysis using OLAP
How to generalise spatial data, and what does it require
Generalise detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage - requires the merge of a set of geographic areas by spatial operations
Structure of star schema for weather warehouse
Fact table with four dimensions and three measures:
Time: time_key, day, month…
Region: region_key, name, location, city,…
Temperature: temp_key, range, temp_value, description
Precipitation: key, range, value, description
Measures: map, area, count
Inputs and output of Weather Spatial Data Warehouse
Input:
- a map with weather probes scattered around in an area
- daily weather data
- concept hierarchies for all attributes
Output:
- a map that reveals patterns: merged (similar) regions
Method to efficiently generate candidate sets
Store candidate itemsets in a hash-tree
- leaf node of tree contains a list of itemsets and counts
- Interior node contains a hash table
- subset function: finds all the candidates contained in a transaction