DM Exam Paper Qs Flashcards

1
Q

Give a definition of an outlier

A

A data point that lies unusually far from the majority of data points

2
Q

Give an example of an outlier in credit card transactions

A

A fraudulent transaction

3
Q

Propose two methods that can be used to detect outliers

A
  • Clustering
  • Combined computer and human inspection
  • Regression
  • Box plots
4
Q

Which outlier detection method is the most reliable

A

Combined computer and human inspection is the most reliable: the computer detects the suspicious values, which are then checked by a human. This two-step process is more reliable than relying on an algorithm alone.

5
Q

Steps of K-means algorithm

A
  1. Randomly select k initial cluster centroids
  2. Assign every item to its nearest centroid
  3. Recompute the centroids
  4. Repeat from step 2 until no reassignments occur
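
The four steps above map onto a short NumPy sketch (an illustration only, not part of the card; the float array X of shape (n, d), the choice of k, and checking convergence via centroid movement are all assumptions):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. randomly select k initial cluster centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # 2. assign every item to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute the centroids (keep the old one if a cluster goes empty)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        # 4. repeat until no reassignments occur (centroids stop moving)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```
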
6
Q

Give three application examples of spatiotemporal data streams

A
  • traffic monitoring
  • environmental monitoring
  • weather pattern analysis
  • asset tracking in logistics
7
Q

Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.

A

Summarisation and aggregation of data at different levels of granularity

8
Q

Identify and discuss the major challenges in spatiotemporal data mining

A

High dimensionality: every data point involves multiple dimensions (space and time)

  • analysing and visualising the data can be computationally intensive
  • the data structure must be chosen carefully
  • the computation method must be chosen appropriately (selective, partial, or full materialisation)
9
Q

Using one application example, sketch a method to mine one kind of knowledge from such stream data efficiently.

A
  • discuss the inputs and outputs of weather data warehouse
  • Draw star schema of weather warehouse
  • Use OLAP for efficient data analysis
10
Q

What is the bottleneck of Apriori Algorithm

A

Candidate Generation
- very large candidate sets
- multiple scans of the database

11
Q

Briefly explain a method to improve Apriori’s efficiency

A

Transaction reduction: A transaction that does not contain any frequent itemsets is useless in subsequent scans
Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

12
Q

Apriori Algorithm Steps

A

Given a database D and a minimum support threshold MS:
1. Initial scan of D to get the frequency of each individual item (candidate generation)
2. Eliminate candidates that do not meet MS
3. Construct new candidates by combining the surviving candidates (each resulting itemset must be exactly one item larger)
4. Repeat pruning and construction until no more frequent itemsets can be found
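
A rough Python sketch of these steps (the transaction list and min_support are made-up examples, and the join/prune details are simplified):

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # 1. initial scan: candidate 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    # 2. eliminate candidates below minimum support
    frequent = {c for c in items if support(c) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # 3. construct candidates one item larger by joining the survivors
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        # 4. repeat pruning and construction until nothing frequent remains
        frequent = {c for c in candidates if support(c) >= min_support}
        result |= frequent
        k += 1
    return result

# Example: apriori([{'a','b'}, {'b','c'}, {'a','b','c'}, {'b'}], min_support=0.5)
```
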

13
Q

Give the necessary steps of the FP-growth algorithm.

A
  1. Scan the dataset and construct a frequency table of each item
  2. Eliminate items without minimum support
  3. Transform the transactions, ordering the frequent items in descending order of frequency
  4. Construct the FP-tree in a top-down recursive fashion by inserting each transformed transaction from the root
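
Steps 1–3 can be sketched in a few lines of Python (the transactions and min_count are invented examples; the actual FP-tree insertion of step 4 is omitted):

```python
from collections import Counter

def order_transactions(transactions, min_count):
    # 1. scan the dataset and count each item
    counts = Counter(item for t in transactions for item in t)
    # 2. eliminate items without minimum support
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # 3. keep only frequent items, ordered by descending frequency
    ordered = []
    for t in transactions:
        kept = [i for i in t if i in frequent]
        kept.sort(key=lambda i: (-frequent[i], i))  # ties broken alphabetically
        ordered.append(kept)
    return ordered  # step 4 would insert these paths into the FP-tree

# order_transactions([['f','a','c','d'], ['a','b','c'], ['f','b']], min_count=2)
```
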
14
Q

What is a data cube and how is it formed?

A

A multidimensional data model views data in the form of a data cube.
It is formed from the lattice of cuboids and allows data to be modelled and viewed in multiple dimensions

15
Q

What is a fact table

A

Contains keys to each of the related dimension tables and measures

16
Q

Give a definition for each of the three categories of measures that can be used for the data warehouse

A

Distributive: can be computed independently on partitions of the data (e.g. count)
Algebraic: involve combining results from distributive functions in a structured way (e.g. avg)
Holistic: require analysing the entire dataset as a whole due to their complexity and lack of constant bounds on storage size (e.g. median)

17
Q

Discuss how to efficiently calculate the top 10 values of a feature in a data cube that is partitioned into multiple chunks

A

Each chunk only needs to report its largest values: take the maximum over all chunks, remove it, and repeat the process 10 times.
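
A sketch of the idea (the chunk contents are made-up values): because each chunk only needs to expose its own largest values, repeating the "take the global maximum" step reduces to merging small per-chunk lists rather than scanning the whole cube:

```python
import heapq

chunks = [[3, 17, 8, 42], [5, 29, 11], [40, 2, 35, 19]]  # hypothetical chunks

# keep only the local top 10 of each chunk ...
local_tops = [heapq.nlargest(10, chunk) for chunk in chunks]
# ... then take the global maximum 10 times (i.e. merge the small lists)
top10 = heapq.nlargest(10, (v for top in local_tops for v in top))
print(top10)
```
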

18
Q

Define the notions of support and confidence

A

Rule: Y → Z
Support: probability that a transaction contains both Y and Z (i.e. Y ∪ Z)
Confidence: conditional probability that a transaction containing Y also contains Z
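
A minimal Python illustration of the two definitions (the transactions are invented example baskets):

```python
def support(transactions, itemset):
    # fraction of transactions containing every item in the itemset
    return sum(set(itemset) <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, Y, Z):
    # P(Z | Y): support of Y union Z divided by support of Y
    return support(transactions, set(Y) | set(Z)) / support(transactions, Y)

T = [{'bread', 'milk'}, {'bread', 'butter'}, {'milk'}, {'bread', 'milk', 'butter'}]
print(support(T, {'bread', 'milk'}))       # 0.5
print(confidence(T, {'bread'}, {'milk'}))  # 0.666...
```
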

19
Q

Explain why holistic measures are not desirable when designing a data warehouse.

A
  • they can provide valuable insights
  • but they involve trade-offs: computational complexity, scalability, storage requirements
  • hence they are less desirable in the context of a data warehouse, where performance and responsiveness are critical considerations
20
Q

K-means complexity calculation

A

O(tkn)
t: number of iterations
k: number of centroids
n: number of data points

21
Q

What kind of problem is k-means clustering and what does it mean

A

It is NP-hard, which means it is at least as hard as the hardest problems in NP, and possibly harder.

22
Q

Describe the possible negative effects of proceeding directly to mine the data that has not been pre-processed.

A
  • inaccurate results
  • overfitting
  • data inconsistency
23
Q

Define Information Retrieval

A

the process of obtaining relevant information from a large repository of data, typically in the form of documents or records, in response to a user’s information need

24
Q

Define Precision and Recall

A

Precision: The percentage of retrieved documents that are in fact relevant to the query
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
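
A minimal illustration of the two definitions (the retrieved and relevant document IDs are made up):

```python
def precision(retrieved, relevant):
    # fraction of retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant)

retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5}
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67
```
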

25
Q

What is the main difference between clustering and classification

A

Clustering is an unsupervised learning technique which groups the data.
Classification is a supervised learning technique that predicts the class or value of unseen data

26
Q

Describe briefly the necessary steps for handling ordinal variables when computing dissimilarity measures.

A

Convert the ordinal categories into numerical values while preserving the ordinal relationship. Assign integers to ordinal categories based on their order.
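
A small sketch of that step (the category order is an invented example; rescaling the ranks to [0, 1] is the usual textbook refinement before computing distances):

```python
order = ['low', 'medium', 'high']               # ordinal relationship
rank = {c: i + 1 for i, c in enumerate(order)}  # low=1, medium=2, high=3
M = len(order)

def to_numeric(value):
    # map a category to its rank, rescaled to [0, 1]
    return (rank[value] - 1) / (M - 1)

data = ['low', 'high', 'medium']
print([to_numeric(v) for v in data])            # [0.0, 1.0, 0.5]
```
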

27
Q

How to calculate the dissimilarity matrix for an attribute variable

A
  1. Calculate dissimilarity: for each pair of observations, calculate the dissimilarity using the chosen distance metric.
  2. Form the matrix: create a square matrix where each element represents the dissimilarity between two observations
28
Q

Define the notion of a frequent itemset

A

An itemset whose support is at least the minimum support threshold, i.e. it appears in at least that fraction of the transactions. (By the Apriori property, any subset of a frequent itemset must also be frequent.)

29
Q

The FP-growth algorithm adopts the strategy of divide-and-conquer. What is the main advantage of this strategy over the one used in the Apriori algorithm?

A

The compactness of the FP-tree it builds and then mines recursively:
- reduces irrelevant info: no infrequent items are stored
- descending ordering of frequencies means frequent items are shared
- the tree is never larger than the original dataset

30
Q

Describe briefly the data mining process steps

A
  1. Data preprocessing: gathering, cleaning, transformation, integration
  2. Data analysis: model building and evaluation
31
Q

It was claimed that the clustering can be used for both pre-processing and data analysis. Explain the difference between the two applications of clustering

A
  • in pre-processing, clustering is used for sampling (selecting representatives of the data)
  • in data analysis, clustering is used to gain insights into the structure of the data
32
Q

Explain the main differences between data sampling and data clustering during the pre-processing step

A
  • sampling selects representatives from the clusters
  • clustering is the process of identifying the clusters in the first place
33
Q

Explain how a query-driven approach is applied on heterogeneous datasets.

A

A wrapper is built on top of the databases’ meta-dictionary, and the meta-dictionary resolves the inconsistencies between the datasets when queries are answered.

34
Q

Give the advantages and disadvantages of an update-driven approach

A

ADV: faster query processing, since the data is already integrated, stored, and structured
DISADV: creates unnecessary overhead for small datasets or when the data is homogeneous

35
Q

Explain why an update-driven approach is preferred to a query-driven approach

A

When you have a large dataset of heterogeneous sources, it provides faster processing with potentially lower costs in the long term

36
Q

Describe situations where a query-driven approach is preferable to an update-driven approach in pre-processing

A

when the dataset is small or consists of homogeneous sources

37
Q

How to draw a snowflake schema

A

Start from the fact table of the star schema and extend each dimension table into its further normalised (nested) dimension tables.

38
Q

Give a brief description of the core OLAP operations

A
  • Roll up (drill-up): summarise data by climbing up hierarchy or dimension reduction
  • Drill down (roll down): reverse of roll-up
  • Slice and dice: project and select
  • Pivot (rotate): reorientate the cube
39
Q

One of the benefits of the FP-tree structure is Compactness. Explain why FP-growth method is compact

A
  • reduces irrelevant info: no infrequent items
  • descending ordering of frequencies: more frequent items are more likely to be shared
  • never larger than the original dataset
40
Q

Explain how the FP-growth method avoids the two costly problems of the Apriori algorithm

A

The two costly problems of huge candidate sets and multiple scans of the database are avoided: FP-growth generates no candidates at all and needs only two scans of the database to build the FP-tree, which is then mined without rescanning.

41
Q

Can you always find an optimal clustering with k-means? Justify your answer

A

No; k-means only converges to a local optimum, and the clustering it finds depends on the initialisation of the centroids.

42
Q

Illustrate the strength and weakness of k-means in comparison with a hierarchical clustering scheme (e.g. AGNES)

A
  • initialisation: AGNES is robust (no random initial centroids)
  • cluster shapes: AGNES is more flexible
  • outlier sensitivity: k-means is very sensitive
  • number of clusters: k-means must be given k in advance
  • computational cost: k-means is less expensive
43
Q

Illustrate the strength and weakness of k-means in comparison with k-medoids

A
  • outliers: k-medoids is more robust
  • cluster shapes: k-medoids does not assume spherical shapes
  • computational complexity: k-medoids is higher
44
Q

what is the basic methodology of Latent Semantic Indexing

A

Create a frequency table

Use a singular value decomposition (SVD) technique to reduce the size of the frequency table, then retain the most significant rows

45
Q

What does DBSCAN discover, and what does it rely on?

A
  • clusters of arbitrary shape in spatial datasets with noise
  • a density-based notion of cluster
46
Q

Properties of a spatial data warehouse

A

Same as DW; integrated, subject-oriented, time-variant, and non-volatile

47
Q

What are the two parameters in density-based clustering

A

Eps: max radius of the neighbourhood
MinPts: min number of points in an eps-neighbourhood of that point

48
Q

List clustering applications

A
  • Pattern recognition
  • image processing
  • economic science
  • spatial data analysis
  • WWW
49
Q

What is the simplified assumption in the Naive Bayes Classifier

A

attributes are conditionally independent

50
Q

Bayesian Theorem Formula

A

P(A|B) = P(B|A) P(A) / P(B)
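
A tiny worked example with made-up probabilities:

```python
# P(A) = 0.01 (prior), P(B|A) = 0.9, P(B) = 0.05  -- all invented values
p_a, p_b_given_a, p_b = 0.01, 0.9, 0.05
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.18
```
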

51
Q

Briefly explain the two methods to avoid overfitting in Decision Trees

A

Post-pruning: Remove branches from a “fully grown” tree - get a sequence of progressively pruned trees
Pre-pruning: Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold

52
Q

How is an attribute split performed using the Gini Index?

A

The attribute that provides the smallest Gini Split is chosen to split the node

53
Q

What does the Gini index measure?

A

The impurity of a dataset: 0 means all records belong to a single class (pure); larger values mean the classes are more mixed.
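
A short sketch of the Gini index and of the weighted Gini of a split, using invented class labels:

```python
def gini(labels):
    n = len(labels)
    # 1 - sum of squared class proportions; 0 means a pure node
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(left, right):
    n = len(left) + len(right)
    # weighted average of the two partitions' impurities
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ['yes', 'yes', 'no', 'no', 'no']
print(gini(labels))                                    # 0.48
print(gini_split(['yes', 'yes'], ['no', 'no', 'no']))  # 0.0 -> best possible split
```
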

54
Q

How is a Decision Tree constructed using a basic algorithm (a greedy algorithm)?

A

In a top-down recursive divide-and-conquer manner

55
Q

What is tree pruning?

A

Identification and removal of branches that reflect noise or outliers

56
Q

Give an example of web structure mining using (a) links and (b) generalisation

A

a) PageRank: assignment of weights to pages using interconnections between pages
b) VWV (Virtual Web View): a multi-level (multi-layered) database representation of the web

57
Q

What is Singular Value Decomposition (SVD)?

A

a matrix factorization method that decomposes a matrix into three other matrices: U, S, and V.

58
Q

What are some major difficulties of keyword-based retrieval?

A
  • Synonymy: A keyword does not appear anywhere in the document, even though the document is closely related to the keyword
  • Polysemy: The same keyword may mean different things in different contexts
59
Q

what is the strategy for multidimensional analysis of complex data objects?

A
  1. generalise the plan-base in different directions
  2. look for sequential patterns in the generalised plans
  3. derive high-level plans
60
Q

what is a plan and plan mining

A
  • a plan is a variable sequence of actions
  • plan mining is extracting significant generalised (sequential) patterns from a plan-base (large collection of plans)
61
Q

Why Decision Tree Induction in data mining?

A
  • relatively faster learning speed (than other classification methods)
  • convertible to simple and easy-to-understand classification rules
  • can use SQL queries for accessing databases
  • comparable classification accuracy with other methods
62
Q

List some enhancements to basic decision tree induction

A
  • Allow continuous-valued attributes
  • handle missing attribute values
  • attribute construction
63
Q

What is the general methodology of Association-Based Document Classification

A
  • Extract keywords and terms by information retrieval and association analysis techniques
  • Obtain the concept hierarchies, then perform classification and association mining methods
64
Q

Keyword-based association analysis definition

A

Collect sets of keywords that occur frequently together and then find the association or correlation relationships among them

65
Q

what is the step-by-step method for performing Latent Semantic Indexing

A
  1. Create a term frequency matrix
  2. SVD construction
  3. Vector Identification
  4. Index creation
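
A rough NumPy sketch of these steps (the term-frequency matrix and the choice of k are invented examples):

```python
import numpy as np

# 1. term frequency matrix (rows = terms, columns = documents)
A = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)

# 2. SVD construction: A = U S V^T
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# 3. vector identification: keep the k largest singular values/vectors
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# 4. index creation: documents represented in the reduced k-dimensional space
doc_vectors = (np.diag(S_k) @ Vt_k).T
print(doc_vectors.shape)  # (3 documents, 2 latent dimensions)
```
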
66
Q

What are the purposes of dimension tables, cardinality, and operators in generalisation-based sequence mining

A
  • use dimension tables to generalise plan-base in a multidimensional way
  • cardinality determines the right level of generalisation (level planning)
  • use operators (merge + , option []) to further generalise patterns
67
Q

what are the steps for mining spatial association

A
  1. rough spatial computation (as a filter)
  2. detailed spatial algorithm (as a refinement)
68
Q

two categories of similarity queries in time-series analysis

A
  • Whole matching
  • subsequence matching
69
Q

List major features of density-based clustering methods

A
  • discover clusters of arbitrary shape
  • handle noise
  • one scan
  • need density parameters as stopping condition
70
Q

in K-means, the global optimum may be found using what techniques?

A
  • deterministic annealing
  • genetic algorithms
71
Q

List the Data Mining Tasks

A
  1. Problem Definition
  2. Data gathering and preparation
  3. Model building and evaluation
  4. Knowledge deployment
72
Q

List the four Data Quality Types in order

A

Perfect, not perfect, inspection, soft.

73
Q

What is data discretisation?

A

reducing the number of values for a continuous variable by dividing the range into intervals, replacing the actual values with interval labels.

74
Q

List the three central tendency statistics

A

mean, median, midrange

75
Q

List the three data dispersion statistics

A

quantiles, IQR, variance

76
Q

Five Number Summary

A

min, q1, median, q3, max
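
A one-line NumPy illustration (the data values are arbitrary examples):

```python
import numpy as np

x = np.array([4, 8, 15, 16, 23, 42])
print(np.percentile(x, [0, 25, 50, 75, 100]))  # min, Q1, median, Q3, max
```
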

77
Q

How to do equal-width partitioning?

A

Divide the range into N intervals of equal size
Width = (Max-Min)/N
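
A small sketch with made-up values:

```python
values = [4, 8, 15, 16, 23, 42]
N = 3
width = (max(values) - min(values)) / N               # (42 - 4) / 3 ≈ 12.67
bins = [min(values) + i * width for i in range(1, N)]  # interval boundaries
labels = [sum(v > b for b in bins) for v in values]    # bin index 0..N-1
print(width, labels)                                   # 12.67 [0, 0, 0, 0, 1, 2]
```
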

78
Q

How to do equal-depth partitioning?

A

Divide the range into N intervals, each containing appx. the same number of objects
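
A small sketch with the same kind of made-up values (assumes the number of values divides evenly by N, for brevity):

```python
values = sorted([4, 8, 15, 16, 23, 42])
N = 3
depth = len(values) // N                 # objects per bucket
partitions = [values[i * depth:(i + 1) * depth] for i in range(N)]
print(partitions)                        # [[4, 8], [15, 16], [23, 42]]
```
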

79
Q

name a method to detect redundant data

A

correlation-based analysis
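
A minimal sketch (the two attribute columns are invented values): a correlation coefficient close to ±1 suggests one attribute may be redundant:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # roughly 2 * a
print(np.corrcoef(a, b)[0, 1])           # close to 1 -> likely redundant
```
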

80
Q

Give an example of a non-parametric method for achieving numerosity reduction

A

Histograms: Divide the data into buckets and store average for each bucket
Clustering: partition the dataset into clusters, and one can store cluster representation only
Stratified Sampling: approximate the percentage of each class in the overall database to choose a representative subset of the data

81
Q

How to use concept hierarchies for data reduction

A

Collect and replace low-level concepts (e.g. numerical values for age) with higher-level concepts (e.g. young, middle-aged, old)

82
Q

What is a Virtual Data Warehouse?

A

A set of views over operational databases

83
Q

How is a DW subject-orientated?

A
  • Organised around major subjects, such as customer, product, sales
  • Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
  • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
84
Q

How is a DW integrated?

A
  • integrates multiple heterogeneous data sources
  • apply techniques of data cleaning and integration
85
Q

How is a DW time variant?

A
  • the data provides information from a historical perspective rather than current value data
  • every key structure in the DW contains an element of time, explicitly or implicitly
86
Q

How is a DW non-volatile?

A

once data is loaded in, it is typically not subject to frequent changes or updates. The data remains relatively stable and unchanged over time.

87
Q

What is Online Transaction Processing?

A
  • OLTP is a major task of traditional relational DB
  • used by IT for day-to-day operations
88
Q

What is a challenge in Weather Pattern Analysis when using a spatial data warehouse

A

A merged region may contain hundreds of primitive regions (polygons)

89
Q

What are Polygons in the context of spatial data warehouses

A

Spatial Areas

90
Q

What Dimensions and measurements are in the Fact Table of a Spatial Data Warehouse for mining weather pattern analysis

A
  • Dimensions: Region_name, time, precipitation, temperature
  • Measurements: region_map, area, count
91
Q

What is a reasonable choice for choice of computation method for Spatial Data Cubes

A

Selective computation: Only materialise spatial objects that will be accessed frequently

92
Q

What is the difference between a traditional DB and a Data Warehouse

A
  • DB used for day-to-day operations using OLTP
  • DW used for data analysis using OLAP
93
Q

How to generalise spatial data, and what does it require

A
  • Generalise detailed geographic points into clustered regions, such as business, residential,
    industrial, or agricultural areas, according to land usage
  • requires the merge of a set of geographic areas by spatial operations
94
Q

Structure of star schema for weather warehouse

A

Fact table with four dimensions and three measures:
Time: time_key, day, month…
Region: region_key, name, location, city,…
Temperature: temp_key, range, temp_value, description
Precipitation: key, range, value, description

Measures: map, area, count

95
Q

Inputs and output of Weather Spatial Data Warehouse

A

Input:

  • a map with weather probes scattered around in an area
  • daily weather data
  • concept hierarchies for all attributes

Output:

  • a map that reveals patterns: merged (similar) regions
96
Q

Method to efficiently generate candidate sets

A

Store candidate itemsets in a hash-tree

  • leaf node of tree contains a list of itemsets and counts
  • Interior node contains a hash table
  • subset function: finds all the candidates contained in a transaction