Revision Flashcards

1
Q

What is a Data Cube?

A
  • Data cubes are the building blocks of the multidimensional models on which a data warehouse is based.
  • A data cube allows data to be modelled and viewed in multiple dimensions.
  • A data cube is defined by dimensions and facts.
  • Dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, if our data model related to Sales, we might have dimension tables such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  • The fact table contains measures such as Euros_sold and keys to each of the related dimension tables. A data cube is really a lattice of cuboids.
2
Q

What is a base cuboid?

A

The n-dimensional cuboid that holds the lowest level of summarization, e.g. the 4-D cuboid (time, item, location, supplier).

3
Q

What is the apex cuboid?

A

The topmost 0-D cuboid, which holds the highest level of summarization (all).

4
Q

What is a data cube in terms of cuboids?

A

A lattice of cuboids.

5
Q

What is a Fact Table?

A

A fact table is a large central table containing the bulk of the data (with no redundancy), connected to a set of smaller attendant tables (dimension tables).

6
Q

What is a distributive measure?

A

If the result derived by applying the function to n aggregate values (one per partition) is the same as that derived by applying the function to all the data without partitioning, e.g. count(), sum(), min(), max().

7
Q

What is an algebraic measure?

A

If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g. avg(), min_N(), standard_deviation().

8
Q

What is a holistic measure?

A

If there is no constant bound on the storage size needed to describe a sub-aggregate. In other words, we need to look at all the data. E.g. median(), mode(), rank()

9
Q

A distributive measure

A

An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire dataset (without partitioning), the function can be computed in a distributed manner.

10
Q

An example of a distributive measure.

A

For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the sums obtained for each subcube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions.
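
A minimal Python sketch of this property (toy data; everything here is illustrative, not from the source):

```python
# sum() is distributive: aggregating per-partition sums gives the same
# result as aggregating the unpartitioned data.
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]          # partition the data into n subsets

partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(data)      # distributive: results agree

# min() and max() distribute the same way.
assert min(min(p) for p in partitions) == min(data)
assert max(max(p) for p in partitions) == max(data)
```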

11
Q

A measure is distributive if…

A

it is obtained by applying a distributive aggregate function.

12
Q

Can distributive measures be computed efficiently?

A

Yes, because of the way the computation can be partitioned.

13
Q

An example of an algebraic measure.

A

For example, avg() can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
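
A minimal sketch (toy data; illustrative only): each partition keeps only the bounded pair (sum, count), yet the exact global average is recoverable.

```python
# avg() is algebraic: computed from two distributive aggregates, sum() and count().
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]

pairs = [(sum(p), len(p)) for p in partitions]   # bounded M = 2 arguments per partition

total = sum(s for s, _ in pairs)
count = sum(c for _, c in pairs)
assert total / count == sum(data) / len(data)    # same avg as on unpartitioned data
```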

14
Q

Another example of an algebraic measure.

A

standard_deviation()

15
Q

A measure is algebraic if…

A

it is obtained by applying an algebraic aggregate function.

16
Q

Describe a holistic function.

A

An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
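
A minimal sketch of why no bounded sub-aggregate suffices (toy data; illustrative only): combining per-partition medians does not recover the true median.

```python
# median() is holistic: the median of per-partition medians is generally
# NOT the median of the whole data set.
from statistics import median

data = [1, 2, 3, 100, 101, 102, 103]
partitions = [data[:3], data[3:]]

median_of_medians = median(median(p) for p in partitions)
print(median_of_medians)   # 51.75 (from per-partition medians 2 and 101.5)
print(median(data))        # 100 -- the true median; the two disagree
```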

17
Q

A measure is holistic if…

A

it is obtained by applying a holistic aggregate function.

18
Q

Explain why holistic measures are not desirable when designing a data warehouse.

A

Most large data cube applications require efficient computation of distributive and algebraic measures. Many efficient techniques exist for this. In contrast, it is difficult to compute holistic measures efficiently. Efficient techniques to approximate the computation of some holistic measures, however, do exist. For example, rather than computing the exact median(), techniques can be used to estimate the approximate median value for a large data set. In many cases, such techniques are sufficient to overcome the difficulties of efficient computation of holistic measures.

19
Q

What are 2 costs of the Apriori algorithm?

A

The bottleneck of Apriori is candidate generation. Two costs:

  1. Possibly huge candidate sets
  2. Requires multiple scans of the database to count support for each candidate by pattern matching
20
Q

How does the Frequent Pattern Growth approach avoid the two costly problems of Apriori?

A
  1. Compresses a large database into a compact frequent-pattern (FP) tree structure - highly condensed but complete for frequent pattern mining
  2. Avoids costly database scans
  3. Avoids candidate generation: sub-database test only
21
Q

Why is the FP growth method compact?

A
  1. Reduces irrelevant information - infrequent items are gone.
  2. Frequency descending ordering - more frequent items are more likely to be shared
  3. Can never be larger than the original database (not counting node-links and counts)
22
Q

Notion of ‘closeness’ in K-NN.

A

A measure of distance between the test object and the stored observations; the K nearest (closest) observations determine the classification. For numeric attributes, the distance is usually Euclidean.
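
A minimal k-NN sketch with Euclidean distance (toy data; function names are illustrative, not from the source):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    nearest = sorted(train, key=lambda t: euclidean(t[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]      # majority vote among the k nearest

train = [((1, 1), 'A'), ((1, 2), 'A'), ((6, 5), 'B'), ((7, 6), 'B')]
print(knn_classify(train, (2, 1)))         # 'A' -- its closest neighbours are class A
```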

23
Q

What are the 3 main steps of a Data Mining process?

A

Assuming we have defined the problem for which we are developing a DM solution, we can describe the 3 main steps of DM as:

  1. Data gathering + preparation
  2. Model building and evaluation
  3. Knowledge deployment
24
Q

Describe the tasks that need to be performed at each step of the DM process.

A
  1. Data gathering + preparation = data access, data sampling, data transformation
  2. Model building + evaluation = create model, test model, evaluate model, interpret model
  3. Knowledge deployment = Apply model, custom reports, external applications
25
Q

What is spatio-temporal data?

A

Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include:

  • discovering the evolutionary history of cities and lands
  • uncovering weather patterns
  • predicting earthquakes and hurricanes
  • determining global warming trends
26
Q

Give an example of spatio-temporal data

A

Moving-object data, i.e. data about moving objects. E.g. telemetry equipment on wildlife to analyze ecological behaviour, GPS monitors embedded in cars by mobility managers to better monitor and guide vehicles, and weather satellites and radars used by meteorologists to observe hurricanes.

27
Q

What is sequence data?

A
  1. time series
  2. symbolic sequence
  3. biological sequences
28
Q

What is a symbolic sequence?

A

A symbolic sequence consists of an ordered set of elements or events recorded with or without a concrete notion of time. E.g. customer shopping sequence data, web click streams, biological sequences, etc. Because biological sequence data carry very complicated and hidden semantic meaning and pose many challenging research issues, most investigations are conducted in the field of bioinformatics.

29
Q

What is Sequential pattern mining?

A

A sequential pattern is a frequent subsequence existing in a single sequence or a set of sequences; sequential pattern mining is the mining of the set of subsequences that are frequent in one sequence or a set of sequences. Because biological sequence data carry very complicated and hidden semantic meaning and pose many challenging research issues, most investigations on them are conducted in the field of bioinformatics, so sequential pattern mining has focused mostly on mining symbolic sequences (e.g. customer shopping sequences, web click streams).

30
Q

What is Text Mining?

A

An interdisciplinary field that draws on Information Retrieval, Data Mining, Machine Learning, Statistics & Computational Linguistics.

31
Q

List an important goal of Text Mining.

A

To derive high-quality information from text. High-quality usually refers to a combination of:

  • relevance
  • novelty
  • interestingness
32
Q

What is Information Retrieval?

A

IR is a field developed in parallel with database systems. Information is organised into (a large number of) documents.

IR problem: locating relevant documents based on user input, such as keywords or example documents.

Typical IR systems:

  1. Online library catalogue systems
  2. Online document management systems
33
Q

IR vs DB systems

A

Some DB problems are not present in IR, e.g. update, transaction management, complex objects, concurrency control, recovery.

Some IR problems are not addressed well in DBMS, e.g. unstructured documents, approximate search using keywords, and the notion of relevance.

34
Q

Database Systems vs IR.

A

Database systems include the creation, maintenance and use of databases for organizations and end-users.

DB systems:

  • highly recognized principles in data models, query languages, query processing and optimization methods, data storage, indexing and accessing methods.
  • well-known for their scalability in processing very large, relatively structured data

IR:
- IR is the science of searching for documents or information in documents (documents can be text or multimedia, and may reside on the Web)

The differences between IR and traditional DB systems are twofold:

1) IR assumes that the data under search are unstructured
2) in IR, the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems)

36
Q

What is a Data Warehouse?

A

A data warehouse integrates data from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

37
Q

What is a language model in IR?

A

The typical approaches in IR adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document’s language model is the probability density function that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.

38
Q

What is a topic model in terms of IR?

A

A topic in a set of text documents can be modelled as a probability distribution over the vocabulary. A text document, which may involve one or multiple topics, can be regarded as a mixture of multiple topic models.

By integrating IR models and DM techniques, we can find the major topics in a collection of documents, and for each document in the collection, the major topics involved.

39
Q

What is a data warehouse?

A

Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an organization’s operational databases.

  • Data warehouse systems allow for the integration of a variety of application systems.
  • They support information processing by providing a solid platform of consolidated historical data for analysis.
  • Data warehouses generalize and consolidate data in a multidimensional space.
40
Q

Define a data warehouse.

A

A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model.

  • provides online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.
41
Q

How can business analysts benefit from a data warehouse?

A

DWs provide the architecture and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

42
Q

What’s a data warehouse?

A

It is a data repository architecture - a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision-making.

43
Q

William H. Inmon’s definition of a data warehouse

A

“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”

44
Q

What are the main characteristics of a data warehouse?

A
  1. Subject-oriented
  2. Integrated
  3. Time-variant
  4. Non-volatile
45
Q

What does ‘subject-oriented’ in terms of data warehouses mean?

A

A data warehouse is organized around major subjects, e.g. customer, supplier, product, sales
It excludes data that is not relevant to the decision-making process.

46
Q

What does ‘integrated’ mean in terms of data warehouses?

A

That data come from multiple heterogeneous sources, e.g. flat files, relational databases, online transaction records. These data usually require data cleaning and integration techniques.

47
Q

What does ‘time-variant’ in terms of data warehouses mean?

A

That data are stored to provide information from a historical perspective, e.g. past 5-10 years. Also, data is typically summarized, e.g. organized per item type or per region.

48
Q

What does ‘non-volatile’ mean in terms of a data warehouses?

A

A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. All that is required for the data warehouse is the initial loading of data and the access of data.

49
Q

Differences between traditional DBs and Data Warehouses.

A

DBs || Data Warehouses

  1. OLTP vs. OLAP
  2. Day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, accounting VS. serving users or knowledge workers in the role of data analysis and decision-making
  3. Customer-oriented (used for transaction and query processing by clerks, clients, and IT professionals) VS. market-oriented (used for data analysis by knowledge workers, including managers, executives & analysts)
  4. Current data (typically too detailed to be easily used for decision-making) VS. historical data
  5. Adopts an ER model and an application-oriented DB design VS. adopts a STAR or SNOWFLAKE model and a subject-oriented database design
  6. Short, atomic transactions VS. facilities for summarization & aggregation, storing & managing information at different levels of granularity
  7. Requires concurrency control and recovery mechanisms VS. allows mostly read-only operations
50
Q

KDD

A

Knowledge discovery from data

51
Q

What is meant by OLAP processing?

A

Analysis techniques with functionalities such as summarization, aggregation and consolidation, as well as the ability to view information from different angles.

52
Q

What is Data Mining?

A
  • misnomer -> it should more appropriately have been named ‘knowledge mining from data’
  • synonym for KDD
53
Q

What are the steps involved in the KDD process?

A
  1. Data cleaning
  2. Data integration
  3. Data selection
  4. Data Transformation
  5. Data mining
  6. Pattern evaluation
  7. Knowledge representation

Data mining is just one step of the KDD process.

54
Q

Enterprise Data Warehouse

A

Collects all info about subjects spanning the entire organization.

55
Q

Data Mart

A

Subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

56
Q

Virtual Warehouse

A

A set of views over operational databases -> requires excess capacity on databases

Only some of the possible summary views may be materialized.

57
Q

Define Data Mining.

A

DM is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from large datasets.

58
Q

Data Mining

A

The process of discovering interesting patterns and knowledge from large amounts of data.

59
Q

What is DM not?

A
  1. (Deductive) query processing
  2. Expert systems or small ML/statistical programs

60
Q

What kind of data is used in DM

A
  1. DBs
  2. DW
  3. Transaction data
  4. Other kinds of data
61
Q

What is the process of constructing a Data Warehouse?

A

Data cleaning, integration, transformation, loading and periodic data refreshing.

62
Q

What is a Data Cube?

A

A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count() or sum(sales_amount).

A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

63
Q

Why are OLAP operations useful?

A

OLAP (online analytical processing) operations make use of background knowledge regarding the domain of the data being studied to allow the presentation of data at different levels of abstraction.

Such operations accommodate different user viewpoints.

64
Q

Potential Applications of DM

A

Database analysis & decision support
> market analysis & management
> risk analysis & management
> fraud detection and management

Other applications
> text mining (newsgroups, email, documents) and Web analysis
> intelligent query answering

65
Q

What types of data do we perform DM on?

A
Besides relational database data, data warehouse data, and transaction data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings. These include:

- time-related or sequence data
- data streams
- spatial data
- engineering design data
- hypertext or multimedia data
- graph and network data
- the Web
66
Q

List some DM functionalities

A
  • characterization & discrimination
  • frequent pattern mining, associations and correlations
  • classification & regression
  • cluster analysis
  • outlier analysis
67
Q

Interesting patterns represent…?

A

Knowledge

68
Q

What are frequent patterns?

A
  • Patterns that occur frequently in the data

Many kinds:

  • frequent itemsets
  • frequent subsequences (sequential patterns)
  • frequent substructures
69
Q

What is a frequent itemset?

A

A set of items that often appear together in a transactional dataset, e.g. milk and bread are frequently bought together in grocery stores by many customers

70
Q

What is frequent subsequence?

A

A subsequence that occurs frequently in a set of sequences. E.g. the pattern that customers tend to purchase a laptop first, then a digital camera, and then a memory card is a frequent sequential pattern.

71
Q

What is a frequent sub-structure?

A

A substructure can refer to different structural forms (e.g. graphs, trees, lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.

72
Q

Define support as a measure.

A

For example,

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

In this scenario, support refers to the fact that 1% of all transactions under analysis show that computer and software are purchased together.

73
Q

Define confidence as a measure.

A

For example,

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

In this case, a 50% confidence means that there is a 50% chance that, when a customer purchases a computer, they will also purchase software.

Typically, association rules are regarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
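
A minimal sketch of both measures over toy transactions (illustrative only):

```python
transactions = [
    {'computer', 'software'},
    {'computer'},
    {'bread', 'milk'},
    {'computer', 'software', 'mouse'},
]

def support(itemset):
    # fraction of all transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A u C) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({'computer', 'software'}))        # 0.5  (2 of 4 transactions)
print(confidence({'computer'}, {'software'}))   # 0.666... (2 of the 3 computer buyers)
```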

74
Q

What is a decision tree?

A

A flow chart-like structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.

75
Q

How are objects clustered or grouped in clustering?

A

Based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity: objects within a cluster are similar to each other, and are dissimilar to objects in other clusters.

76
Q

Applications of DM in Market Analysis.

A

Data sources: credit card transactions, loyalty cards, discount coupons, customer complaint calls, public lifestyle studies

Target marketing: find clusters or model customers who share the same characteristics

  • Determine customer purchasing patterns over time
  • Cross-market analysis (associations/correlations between product sales, prediction based on the association information)
77
Q

Applications of DM in Market Management

A
  1. Customer profiling (what types of customers buy what products)
  2. Identifying customer requirements
  3. Providing summary information
78
Q

Applications of DM for Corporations

A
  1. Finance planning and asset evaluation
  2. Resource planning
  3. Competition
79
Q

Applications of DM for Fraud Detection

A
  • Auto insurance
  • Money laundering
  • Medical insurance
  • Detecting inappropriate medical treatment
  • Detecting telephone fraud
  • Retail
  • Sports
  • Astronomy
  • Internet Web Surf-Aid
80
Q

What makes a pattern interesting?

A
  1. Easily understood by humans
  2. Valid on new or test data with some degree of certainty
  3. Potentially useful
  4. Novel
  5. Validates a hypothesis that the user sought to confirm
81
Q

What are some objective measures of interestingness?

A
  1. Support
  2. Confidence

82
Q

Define support.

A

Represents the % of transactions from a transaction database that the given rule satisfies:

support(X => Y) = P(X ∪ Y), where X ∪ Y is the union of the itemsets X and Y (i.e. transactions containing both X and Y)

83
Q

Define confidence.

A

Assesses the degree of certainty of the detected association:

confidence(X => Y) = P(Y|X), i.e. the probability that a transaction containing X also contains Y

84
Q

Define a data warehouse.

A

Integrates data from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

85
Q

What is Data Mining?

A

The process of discovering interesting patterns from massive amounts of data.

As a KDD process, it typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation and knowledge presentation.

86
Q

What is classification?

A

A form of data analysis that extracts models describing important data classes. Such models are called classifiers, and they predict categorical (discrete, unordered) class labels.

87
Q

Classification

A
  • predicts categorical class labels
  • classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
88
Q

Generally speaking, describe the process of classification.

A

Two-step process.

  1. Model construction - the learning phase
  2. Model usage - the classification step, where the model is used to predict the class label of new data
89
Q

Describe the first phase - the learning phase - of classification.

A
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x_1, x_2, ..., x_n), depicting n measurements made on the tuple from n database attributes, respectively, A_1, A_2, ..., A_n. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.
90
Q

What is supervised learning?

A

The class label of each training tuple is provided.

The learning of the classifier is ‘supervised’ in that it is told to which class each training tuple belongs.

91
Q

Unsupervised learning

A

The class label of each training tuple is not known. The number or set of classes to be learned may not be known in advance.

92
Q

First phase of classification

A

Learning of a mapping or function y = f(X) that can predict the associated class label y of a given tuple X.

93
Q

Second phase of classification.

A

In this step, the model is used for classification.

94
Q

Decision Tree Induction

A

The learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like structure where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.

95
Q

Internal nodes on a decision tree

A

denote a test on an attribute

96
Q

Branches of a decision tree

A

denote an outcome of a test on an attribute

97
Q

How is a decision tree used for classification?

A

Test the attribute values of the sample against the decision tree.

Given a tuple X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can be easily converted to classification rules.
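
A minimal sketch of this root-to-leaf tracing (the hand-built toy tree below is illustrative, not a learned model):

```python
tree = {
    'attribute': 'age',
    'branches': {
        'youth':       {'label': 'no'},
        'middle_aged': {'label': 'yes'},
        'senior': {
            'attribute': 'credit_rating',
            'branches': {'fair': {'label': 'yes'}, 'excellent': {'label': 'no'}},
        },
    },
}

def classify(node, x):
    while 'label' not in node:                         # internal node: test an attribute
        node = node['branches'][x[node['attribute']]]  # follow the matching branch
    return node['label']                               # leaf node: class prediction

print(classify(tree, {'age': 'senior', 'credit_rating': 'fair'}))  # 'yes'
```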

98
Q

What are some Data Preparation tasks that need to be done in Classification

A
  1. Data cleaning = preprocess data to remove noise and handle outliers
  2. Relevance analysis = feature selection / removing irrelevant or redundant attributes
  3. Data transformation = generalize or normalize data
99
Q

Evaluating Classification Methods

A
  1. Accuracy of the prediction
  2. Speed and scalability = time to construct the model, time to use the model, efficiency in disk-resident databases
  3. Robustness = handling noise and missing values
  4. Interpretability = understanding and insight provided by the model
100
Q

Why are decision tree classifiers so popular?

A
  1. Does not require any domain knowledge or parameter setting - good for exploratory knowledge discovery
  2. Can handle multidimensional data
  3. Representation of human knowledge in tree form is intuitive and generally easy to assimilate by humans
  4. Learning + classification steps of decision tree induction are simple and fast.
  5. Generally, decision trees have very good accuracy.
101
Q

What are the main steps involved in the process of generating Decision Trees.

A
  1. Attribute Selection Measure - selects the attribute that best partitions the tuples into distinct classes
  2. Tree pruning - attempts to remove branches that represent noise or outliers with the goal of improving classification accuracy on unseen data.
102
Q

Describe how DTs work.

A

Most DT algorithms follow a top-down, recursive, divide-and-conquer approach.

Start off with a training set of tuples and their associated class labels. 
The training set is recursively partitioned into smaller subsets as the tree is being built.
103
Q

Basic algorithm for Decision Tree

A
  1. Tree is constructed in a top-down recursive divide-and-conquer manner
  2. At start, all the training examples are at the root
  3. Attributes are categorical (if continuous-valued, they are discretized in advance)
  4. Tuples are partitioned recursively based on selected attributes.
  5. Test attributes are selected on the basis of a heuristic or statistical measure (e.g. IG)

Stopping conditions for partitioning:

  1. All samples for a given node belong to the same class
  2. There are no remaining attributes for further partitioning - majority voting
  3. No training tuples left
104
Q

Extracting Rules from Trees

A

Represent the knowledge in the form of IF-THEN rules

  • One rule is created for each path from the root to the leaf
  • Each attribute-value pair along a path forms a conjunction
  • The leaf node holds the class prediction
  • Rules are easier to understand for humans
105
Q

How to avoid overfitting in Decision Tree classification?

A

Two approaches:

  1. Pre-pruning = Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold.
  2. Post-pruning = Remove branches from a “fully-grown” tree - get a sequence of progressively pruned trees
106
Q

How do we determine the final tree size?

A
  1. Separate training (2/3) and testing (1/3) sets
  2. Use cross-validation
  3. Use all data for training
  4. Use minimum description length principle
107
Q

Why is Decision Tree Induction useful in Data Mining?

A
  1. Relatively faster learning speed
  2. Convertible to simple and easy to understand classification rules
  3. Can use SQL queries for accessing databases
  4. Comparable classification accuracy with other methods
108
Q

What would be the problem with using Customer_ID as the splitting criterion in a Decision Tree?

A

“Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer’s credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalise to customers we haven’t seen before (overfitting).”

109
Q

What is required for Decision Tree construction?

A
  1. Data partition
  2. Attribute list
  3. Attribute selection measure (e.g. IG)
110
Q

What do internal nodes represent?

A

Internal nodes denote a test on an attribute

111
Q

What do leaf nodes represent?

A

Leaf nodes represent class labels

112
Q

What do branches represent?

A

The outcome of a test on an attribute

113
Q

Explain the tree pruning step.

A

Tree pruning is used to identify and remove branches that represent noise or outliers. Its goal is to improve classification accuracy.

114
Q

Spatio-temporal data

A

Hurricane data, environmental data, global warming data

115
Q

What is a spatial data warehouse?

A

A spatial DW is an integrated, subject-oriented, time-variant, and non-volatile spatial data repository for data analysis and decision-making.

116
Q

Why is Spatial Data integration a big issue?

A
  1. Structure-specific formats (raster vs. vector-based, OO vs relational models, different storage and indexing, etc.).
  2. Vendor-specific formats (e.g. ESRI, MapInfo, Intergraph, etc.)
117
Q

Spatial data cube

A
  1. A multidimensional spatial database
  2. Both dimensions and measures may contain spatial components.

118
Q

What are some computation methods for spatial data cube?

A
  1. Online aggregation
  2. Pre-processing
  3. Selective materialization
119
Q

Spatial classification

A

Analyse spatial objects to derive classification schemes such as decision trees in relevance to certain spatial properties (e.g. district, highway, river)

120
Q

Spatial trend analysis

A
  1. Detect changes and trends along a spatial dimension
  2. Study the trend of non-spatial or spatial data changing with space
  3. Observe the trend of changes of the climate or vegetation with the increasing distance from the ocean.
121
Q

What is time-series data?

A

> consists of sequences of values or events changing with time
> data is recorded at regular intervals
> has characteristic time-series components (trend, cyclic, seasonal, irregular/random)

122
Q

Time series data

A

Can be illustrated as a time-series graph which describes a point moving over time

123
Q

Categories of Time-Series movements

A
  1. Long-term or trend movements
  2. Cyclic movements or cycle variations e.g. business cycles
  3. Seasonal movements or seasonal variations, e.g. almost identical patterns that a time series appears to follow during corresponding months of successive years.
  4. Irregular or random movements (sporadic changes due to chance events e.g. labour disputes or personnel changes)
124
Q

How can we estimate the trend curve?

A
  1. Freehand method - fit the curve by looking at the graph; costly and barely reliable for large-scale data mining
  2. Least-squares method - find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points
  3. Moving-average method - eliminates cyclic, seasonal and irregular patterns; loses end data; sensitive to outliers (see the sketch below)
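
A minimal moving-average sketch (toy series; illustrative only), showing the loss of end data mentioned above:

```python
def moving_average(series, window):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 9, 14, 16, 13, 18]
print(moving_average(sales, 3))
# [10.33.., 11.66.., 13.0, 14.33.., 15.66..] -- 7 points smooth down to 5
```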
125
Q

What is sequential pattern mining?

A
  • Mining of frequently occurring patterns related to time or other sequences
  • Usually concentrates on symbolic patterns, e.g. renting ‘Star Wars’, then ‘Empire Strikes Back’, then ‘Return of the Jedi’, in that order (see the sketch below)
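
A minimal sketch of counting the support of such a symbolic pattern (an ordered, not necessarily contiguous, subsequence) over toy rental sequences (illustrative only):

```python
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)   # consumes 'it', so order is respected

rentals = [
    ['Star Wars', 'Empire Strikes Back', 'Return of the Jedi'],
    ['Star Wars', 'Alien', 'Empire Strikes Back', 'Return of the Jedi'],
    ['Return of the Jedi', 'Star Wars'],
]
pattern = ['Star Wars', 'Empire Strikes Back', 'Return of the Jedi']
print(sum(is_subsequence(pattern, s) for s in rentals))  # 2 of 3 sequences
```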
126
Q

Applications of sequential pattern mining.

A
  • Target marketing & customer retention
  • Weather prediction

127
Q

Sequential pattern mining parameters

A
  1. Duration of a time sequence T - sequential pattern mining can be confined to the data within a specified duration
  2. Event-folding window W - if W = T, time-insensitive patterns are found; if W = 0, each event occurs at a distinct time instant; if 0 < W < T, sequences occurring within the same period W are folded in the analysis
  3. Time interval, int, between events in the discovered pattern - if int = 0, no interval gap is allowed, i.e. only strictly consecutive sequences are found; if min_int <= int <= max_int, patterns are separated by at least min_int time and at most max_int time; if int = c != 0, patterns must occur at an exact interval
128
Q

What is a DW?

A

A DW generalizes and consolidates data in a multidimensional space.

> Provides OLAP tools for the interactive analysis of MD data of varied granularities.

129
Q

Data Warehouse

A

A decision support database that is maintained separately from the organization’s operational database.

Supports information processing by providing a solid platform of consolidated, historical data.

130
Q

Inmon’s definition of a data warehouse:

A

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”

131
Q

What is Data Warehousing?

A

The process of constructing and using data warehouses.

132
Q

Subject-oriented in DWs

A

> organised around major subjects, e.g. customers, sales, products
> focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
> provides a simple and concise view around particular subjects by excluding data not relevant to the decision-making process

133
Q

Integrated in DWs

A

> Integrates multiple heterogeneous data sources (relational databases, flat files, on-line transaction records)
> Applies techniques of data cleaning and integration (to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources)

134
Q

Time-variant in DWs

A

> The time horizon for a DW is significantly longer than that of an operational system
> While operational DBs contain current data, DW data provides a historical perspective (e.g. the past 5-10 years)
> Every key structure in the DW contains an element of time, implicitly or explicitly

135
Q

Non-volatile in DWs

A

> A physically separate store of data transformed from the operational environment
> Operational update of data does not occur in the data warehouse environment
> A DW does not require transaction processing, recovery, or concurrency control mechanisms
> Requires only 2 operations in data accessing: initial loading of data, and access of data
> Read-only data

136
Q

DW vs. Heterogeneous DBs

A
  1. Traditional heterogeneous DB integration: build mediators/wrappers on top of the heterogeneous DBs - a query-driven approach
  2. DWs are update-driven and high-performance: data from heterogeneous sources is integrated in advance and stored in warehouses for direct access and analysis
137
Q

OLTP

A

Online Transaction Processing

138
Q

OLAP

A

Online Analytical Processing

139
Q

OLTP vs OLAP

A

> OLTP is the major task of traditional relational DBs
> Involves day-to-day operations, e.g. purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

> OLAP is the major task of DWs
> OLAP = data analysis and decision-making

140
Q

OLTP vs OLAP

A
  1. Customer vs. Market-oriented
  2. Current, detailed data vs. historical, consolidated
  3. ER + application vs STAR + subject oriented design
  4. Current, local vs. evolutionary, integrated view
  5. Update vs. Read-only with complex queries
141
Q

Why should we separate DWs from DB systems?

A

High performance for both systems.

DBMS - tuned for OLTP (access methods, indexing, concurrency control, recovery)
Warehouse - tuned for OLAP (complex OLAP queries, MD view, consolidation)

142
Q

What’s different about the data in DW and DB?

A

> Decision support requires historical data, which operational DBs do not typically have
> DW requires consolidation (aggregation, summarization) of data from various sources
> Data quality: data needs to be reconciled from different sources

143
Q

A DW is based on…

A

a multidimensional data model, which views data in the form of a data cube.

A data cube allows data to be modelled and viewed in multiple dimensions, e.g. sales.

144
Q

STAR Schema

A

A fact table in the middle connected to a set of dimension tables

145
Q

Snowflake Schema

A

A refinement of star schema where some dimensional hierarchy is normalised into a set of smaller dimension tables, forming a shape similar to a snowflake

146
Q

Fact constellation

A

Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

Fact constellations are typically used for data warehouses, as they can model multiple interrelated subjects

147
Q

How are data marts modelled?

A

Star or Snowflake schema

148
Q

Typical OLAP operations

A
  1. Roll-up - summarize data by climbing up a concept hierarchy or by dimension reduction (see the sketch below)
  2. Drill-down (roll-down) - the reverse of roll-up: from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
  3. Slice and dice - project and select
  4. Pivot (rotate) - reorient the cube; visualization; 3D to a series of 2D planes
  5. Drill-across - involves more than one fact table
  6. Drill-through - through the bottom level of the cube to its back-end relational tables using SQL
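
A minimal sketch of roll-up as re-aggregation along the time hierarchy (pandas assumed available; the table and column names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'month':   ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'euros_sold': [100, 150, 80, 120, 90],
})

# Roll-up: climb from the month level to the quarter level by re-aggregating.
by_month   = sales.groupby(['quarter', 'month'])['euros_sold'].sum()
by_quarter = sales.groupby('quarter')['euros_sold'].sum()
print(by_quarter)   # Q1: 330, Q2: 210 -- the higher-level summary
```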
149
Q

4 Views of a data warehouse

A
  1. Top-down view
  2. Data source view
  3. Data warehouse view
  4. Business query view
150
Q

2 Approaches to DW Design

A
  1. Top-down: Starts with overall design and planning (mature)
  2. Bottom-up: Starts with experiments and prototypes (rapid)
151
Q

From an SE point of view, what are the steps involved in designing a DW?

A

Planning, data collection, DW design, testing and evaluation, DW deployment

152
Q

SE point of view, DW design

A
  1. Waterfall: structured and systematic analysis at each step before proceeding to the next
  2. Spiral: rapid generation of increasingly functional systems, short turnaround time
153
Q

What’s involved in a typical data warehouse design process

A
  1. Choose the business process you want to model e.g. orders, invoices
  2. Choose the grain (atomic level of data) of the business process
  3. Choose the dimensions that will apply to each fact table record
  4. Choose the measures that will apply to each fact table record.
154
Q

OLAP Server architectures

A
  1. Relational OLAP
  2. Multidimensional OLAP
  3. Hybrid OLAP
  4. Specialised SQL servers
155
Q

Information Gain

A
  • Attribute Selection Measure
  • Based on pioneering work of Claude Shannon on information theory, which studied the value of ‘information content’ of messages

Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A log function to the base 2 is used because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a_1, a_2, ..., a_v}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. These partitions would correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., a partition may contain a collection of tuples from different classes rather than from a single class).

How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D)

In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
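
A minimal sketch of these formulas (toy tuples; the attribute and label names are illustrative):

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions in D
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    labels = [r['label'] for r in rows]
    partitions = {}                          # split D into D_j by the values of A
    for r in rows:
        partitions.setdefault(r[attr], []).append(r['label'])
    info_a = sum(len(d) / len(rows) * info(d) for d in partitions.values())
    return info(labels) - info_a             # Gain(A) = Info(D) - Info_A(D)

rows = [
    {'age': 'youth',  'label': 'no'},  {'age': 'youth',  'label': 'no'},
    {'age': 'middle', 'label': 'yes'}, {'age': 'senior', 'label': 'yes'},
]
print(gain(rows, 'age'))   # 1.0 -- 'age' perfectly separates the two classes here
```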

156
Q

Overfitting

A

A model overfits the training data when it learns anomalies and noise in the training set, so that it performs well on training data but poorly on previously unseen data. In decision trees, overfitting is addressed by pre-pruning or post-pruning.

157
Q

Prune Step of Apriori algo

A

After the join (candidate generation) step, the prune step removes any candidate k-itemset that has a (k-1)-subset which is not frequent: by the Apriori property, all subsets of a frequent itemset must also be frequent. This shrinks the candidate set before the database scan that counts supports.

158
Q

Clustering

A

The process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters.

Dissimilarity and similarity are assessed based on the attribute values describing the objects and often involve distance measures.

159
Q

Unsupervised learning

A

Clustering

  • Learning by observation, rather than learning by examples
160
Q

Requirements of Clustering

A
  1. Scalability - working well on large datasets
  2. Ability to deal with different types of attribute - not just numeric
  3. Discovery of clusters with arbitrary shape
  4. Requirements for domain knowledge to determine input parameters
  5. Ability to deal with noisy data
  6. Incremental clustering and insensitivity to input order
  7. Capability of clustering high-dimensional data
  8. Constraint-based clustering
  9. Interpretability & usability
161
Q

4 important elements of clustering

A
  1. partitioning criteria - should clustering be hierarchical, or should all clusters operate on the same level?
  2. separation of clusters - should clusters be mutually exclusive?
  3. similarity measure - determining the similarity between 2 objects by the distance between them, e.g. absolute distance
  4. clustering space - using the full data space may not be useful with high-dimensional data -> subspace clustering
162
Q

Partition-based methods

A

Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object. In other words, partitioning methods conduct one-level partitioning on data sets. The basic partitioning methods typically adopt exclusive cluster separation; that is, each object must belong to exactly one group.

163
Q

Heuristic clustering methods

A
  • k-means
  • k-medoids
  • work well for finding spherical-shaped clusters in small-to-medium-size databases
164
Q

Hierarchical methods

A
  • creates a hierarchical decomposition of the given set of data objects
  • agglomerative or divisive, based on how the decomposition is formed
  • the agglomerative approach is bottom-up: it starts with one object per cluster and successively merges objects or groups close to one another until all groups are merged into one
  • the divisive approach is top-down: it starts with all objects in the same cluster, which is iteratively split into smaller clusters until eventually each object is in its own cluster
  • hierarchical clustering can be distance-, density- or continuity-based
  • suffers from the fact that once a step (merge or split) is done, it cannot be undone
165
Q

Density-based methods

A

Their general idea is to continue growing a given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold. For example, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.

166
Q

Grid-based methods

A

Quantize the object space into a finite number of cells that form a grid structure.

167
Q

Partitioning methods

A

– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster center
– Effective for small- to medium-size data sets

168
Q

Hierarchical methods

A

– Clustering is a hierarchical decomposition (i.e., multiple levels)
– Cannot correct erroneous merges or splits
– May incorporate other techniques like microclustering or consider object “linkages”

169
Q

Density-based methods

A

– Can find arbitrarily shaped clusters
– Clusters are dense regions of objects in space that are separated by low-density regions
– Cluster density: each point must have a minimum number of points within its “neighborhood”
– May filter out outliers

170
Q

Grid-based methods

A

– Use a multiresolution grid data structure
– Fast processing time (typically independent of the number of data objects, yet dependent on grid size)

171
Q

how does k-means work?

A

First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration. All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round.
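
A minimal 1-D sketch of that loop (toy data, k = 2; illustrative only):

```python
import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)               # k initial cluster means
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                             # assignment step: nearest mean
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]   # update step
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                   # stable assignment: stop
            return clusters, centers
        centers = new_centers

clusters, centers = kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], k=2)
print(sorted(centers))    # roughly [1.5, 10.5] -- two well-separated groups
```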

172
Q

K-means strengths

A

Relatively efficient - O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations.

Relatively scalable and efficient in large datasets.

173
Q

Weaknesses of K-means

A
  1. Not guaranteed to converge to the global optimum - often terminates at a local optimum
  2. Applicable only when a mean is defined (what about categorical data?)
  3. Need to specify k in advance
  4. Unable to handle noisy data and outliers (the mean is a non-resistant measure)
  5. Not suitable for discovering clusters with non-convex shapes
174
Q

Precision

A

The % of retrieved documents that are in fact relevant to the query:

precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

175
Q

Recall

A

The percentage of documents that are relevant to the query and were, in fact, retrieved:

recall = |{relevant} ∩ {retrieved}| / |{relevant}|
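
A minimal sketch of both measures over toy document-ID sets (illustrative only):

```python
relevant  = {1, 2, 3, 4, 5}    # documents actually relevant to the query
retrieved = {3, 4, 5, 6}       # documents the IR system returned

hits = relevant & retrieved    # relevant AND retrieved

precision = len(hits) / len(retrieved)   # 3/4 = 0.75
recall    = len(hits) / len(relevant)    # 3/5 = 0.60
print(precision, recall)
```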

176
Q

Major difficulties of keyword-based retrieval

A
  1. Synonymy - a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g. a query on ‘data mining’ missing a document about ‘knowledge discovery’
  2. Polysemy - the same keyword may mean different things in different contexts
177
Q

Types of Web Mining

A
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
178
Q

Challenges of Web Mining

A
  1. Abundance problem
  2. Limited Coverage of the web
  3. Limited query interface
  4. Limited customization
179
Q

HITS

A

Hyperlink-Induced Topic Search - explores interaction between hubs and authoritative pages

180
Q

Web Usage Mining

A
  1. Mining Web log records to discover user access patterns of Web pages, in order to:
  • Target potential customers for e-commerce
  • Enhance the quality and delivery of Internet information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations