Revision Flashcards

1
Q

What is a Data Cube?

A
  • Data cubes are the building blocks of the multidimensional models on which a data warehouse is based.
  • A data cube allows data to be modelled and viewed in multiple dimensions.
  • A data cube is defined by dimensions and facts.
  • Dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, if our data model related to Sales, we might have dimension tables such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  • The fact table contains measures such as Euros_sold and keys to each of the related dimension tables. A data cube is really a lattice of cuboids.
2
Q

What is a base cuboid?

A

The n-dimensional cuboid that holds the lowest level of summarization, e.g. the 4-D cuboid (time, item, location, supplier).

3
Q

What is the apex cuboid?

A

The topmost 0-D cuboid, which holds the highest level of summarization (all).

4
Q

What is a data cube in terms of cuboids?

A

A lattice of cuboids.

5
Q

What is a Fact Table?

A

A fact table is a large central table containing the bulk of the data (with no redundancy), connected to a set of smaller attendant tables (dimension tables).

6
Q

What is a distributive measure?

A

If the result derived by applying the function to n aggregate values (one per partition) is the same as that derived by applying the function to all the data without partitioning, e.g. count(), sum(), min(), max().

7
Q

What is an algebraic measure?

A

If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g. avg(), min_N(), standard_deviation().

8
Q

What is a holistic measure?

A

If there is no constant bound on the storage size needed to describe a sub-aggregate. In other words, we need to look at all the data. E.g. median(), mode(), rank()

9
Q

A distributive measure

A

An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire dataset (without partitioning), the function can be computed in a distributed manner.

10
Q

An example of a distributive measure.

A

For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the sums obtained for each subcube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions.
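
A minimal Python sketch of this property (toy data; everything here is illustrative, not from the source):

```python
# sum() is distributive: aggregating per-partition sums gives the same
# result as aggregating the unpartitioned data.
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]          # partition the data into n subsets

partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(data)      # distributive: results agree

# min() and max() distribute the same way.
assert min(min(p) for p in partitions) == min(data)
assert max(max(p) for p in partitions) == max(data)
```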

11
Q

A measure is distributive if…

A

it is obtained by applying a distributive aggregate function.

12
Q

Can distributive measures be computed efficiently?

A

Yes, because of the way the computation can be partitioned.

13
Q

An example of an algebraic measure.

A

For example, avg() can be computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
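
A minimal sketch (toy data; illustrative only): each partition keeps only the bounded pair (sum, count), yet the exact global average is recoverable.

```python
# avg() is algebraic: computed from two distributive aggregates, sum() and count().
data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]

pairs = [(sum(p), len(p)) for p in partitions]   # bounded M = 2 arguments per partition

total = sum(s for s, _ in pairs)
count = sum(c for _, c in pairs)
assert total / count == sum(data) / len(data)    # same avg as on unpartitioned data
```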

14
Q

Another example of an algebraic measure.

A

standard_deviation()

15
Q

A measure is algebraic if…

A

it is obtained by applying an algebraic aggregate function.

16
Q

Describe a holistic function.

A

An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
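
A minimal sketch of why no bounded sub-aggregate suffices (toy data; illustrative only): combining per-partition medians does not recover the true median.

```python
# median() is holistic: the median of per-partition medians is generally
# NOT the median of the whole data set.
from statistics import median

data = [1, 2, 3, 100, 101, 102, 103]
partitions = [data[:3], data[3:]]

median_of_medians = median(median(p) for p in partitions)
print(median_of_medians)   # 51.75 (from per-partition medians 2 and 101.5)
print(median(data))        # 100 -- the true median; the two disagree
```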

17
Q

A measure is holistic if…

A

it is obtained by applying a holistic aggregate function.

18
Q

Explain why holistic measures are not desirable when designing a data warehouse.

A

Most large data cube applications require efficient computation of distributive and algebraic measures. Many efficient techniques exist for this. In contrast, it is difficult to compute holistic measures efficiently. Efficient techniques to approximate the computation of some holistic measures, however, do exist. For example, rather than computing the exact median(), techniques can be used to estimate the approximate median value for a large data set. In many cases, such techniques are sufficient to overcome the difficulties of efficient computation of holistic measures.

19
Q

What are 2 costs of the Apriori algorithm?

A

The bottleneck of Apriori is candidate generation. Two costs:

  1. Possibly huge candidate sets
  2. Requires multiple scans of the database to count support for each candidate by pattern matching
20
Q

How does the Frequent Pattern Growth approach avoid the two costly problems of Apriori?

A
  1. Compresses a large database into a compact frequent-pattern (FP) tree structure - highly condensed but complete for frequent pattern mining
  2. Avoids costly database scans
  3. Avoids candidate generation: sub-database test only
21
Q

Why is the FP growth method compact?

A
  1. Reduces irrelevant information - infrequent items are gone.
  2. Frequency descending ordering - more frequent items are more likely to be shared
  3. Can never be larger than the original database (not counting node-links and counts)
22
Q

Notion of ‘closeness’ in K-NN.

A

A measure of distance between the test object and the stored observations; the K nearest (closest) observations determine the classification. For numeric attributes, the distance is usually Euclidean.
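
A minimal k-NN sketch with Euclidean distance (toy data; function names are illustrative, not from the source):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    nearest = sorted(train, key=lambda t: euclidean(t[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]      # majority vote among the k nearest

train = [((1, 1), 'A'), ((1, 2), 'A'), ((6, 5), 'B'), ((7, 6), 'B')]
print(knn_classify(train, (2, 1)))         # 'A' -- its closest neighbours are class A
```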

23
Q

What are the 3 main steps of a Data Mining process?

A

Assuming we have defined the problem for which we are developing a DM solution, we can describe the 3 main steps of DM as:

  1. Data gathering + preparation
  2. Model building and evaluation
  3. Knowledge deployment
24
Q

Describe the tasks that need to be performed at each step of the DM process.

A
  1. Data gathering + preparation = data access, data sampling, data transformation
  2. Model building + evaluation = create model, test model, evaluate model, interpret model
  3. Knowledge deployment = Apply model, custom reports, external applications
25
Q

What is spatio-temporal data?

A

Spatiotemporal data are data that relate to both space and time. Spatiotemporal data mining refers to the process of discovering patterns and knowledge from spatiotemporal data. Typical examples of spatiotemporal data mining include:

  • discovering the evolutionary history of cities and lands
  • uncovering weather patterns
  • predicting earthquakes and hurricanes
  • determining global warming trends
26
Q

Give an example of spatio-temporal data

A

Moving-object data, i.e. data about moving objects. E.g. telemetry equipment on wildlife to analyze ecological behaviour, GPS monitors embedded in cars by mobility managers to better monitor and guide vehicles, and weather satellites and radars used by meteorologists to observe hurricanes.

27
Q

What is sequence data?

A
  1. time series
  2. symbolic sequence
  3. biological sequences
28
Q

What is a symbolic sequence?

A

A symbolic sequence consists of an ordered set of elements or events recorded with or without a concrete notion of time. E.g. customer shopping sequence data, web click streams, biological sequences, etc. Because biological sequence data carry very complicated and hidden semantic meaning and pose many challenging research issues, most investigations are conducted in the field of bioinformatics.

29
Q

What is Sequential pattern mining?

A

A sequential pattern is a frequent subsequence existing in a single sequence or a set of sequences; sequential pattern mining is the mining of the set of subsequences that are frequent in one sequence or a set of sequences. Because biological sequence data carry very complicated and hidden semantic meaning and pose many challenging research issues, most investigations on them are conducted in the field of bioinformatics, so sequential pattern mining has focused mostly on mining symbolic sequences (e.g. customer shopping sequences, web click streams).

30
Q

What is Text Mining?

A

An interdisciplinary field that draws on Information Retrieval, Data Mining, Machine Learning, Statistics & Computational Linguistics.

31
Q

List an important goal of Text Mining.

A

To derive high-quality information from text. High-quality usually refers to a combination of:

  • relevance
  • novelty
  • interestingness
32
Q

What is Information Retrieval?

A

IR is a field developed in parallel with database systems. Information is organised into (a large number of) documents.

IR problem: locating relevant documents based on user input, such as keywords or example documents.

Typical IR systems:

  1. Online library catalogue systems
  2. Online document management systems
33
Q

IR vs DB systems

A

Some DB problems are not present in IR, e.g. update, transaction management, complex objects, concurrency control, recovery.

Some IR problems are not addressed well in DBMS, e.g. unstructured documents, approximate search using keywords, and the notion of relevance.

34
Q

Database Systems vs IR.

A

Database systems include the creation, maintenance and use of databases for organizations and end-users.

DB systems:

  • highly recognized principles in data models, query languages, query processing and optimization methods, data storage, indexing and accessing methods.
  • well-known for their scalability in processing very large, relatively structured data

IR:
- IR is the science of searching for documents or information in documents (documents can be text or multimedia, and may reside on the Web)

The differences between IR and traditional DB systems are twofold:

1) IR assumes that the data under search are unstructured
2) in IR, the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems)

36
Q

What is a Data Warehouse?

A

A data warehouse integrates data from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

37
Q

What is a language model in IR?

A

The typical approaches in IR adopt probabilistic models. For example, a text document can be regarded as a bag of words, that is, a multiset of words appearing in the document. The document’s language model is the probability density function that generates the bag of words in the document. The similarity between two documents can be measured by the similarity between their corresponding language models.

38
Q

What is a topic model in terms of IR?

A

A topic in a set of text documents can be modelled as a probability distribution over the vocabulary. A text document, which may involve one or multiple topics, can be regarded as a mixture of multiple topic models.

By integrating IR models and DM techniques, we can find the major topics in a collection of documents, and for each document in the collection, the major topics involved.

39
Q

What is a data warehouse?

A

Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an organization’s operational databases.

  • Data warehouse systems allow for the integration of a variety of application systems.
  • They support information processing by providing a solid platform of consolidated historical data for analysis.
  • Data warehouses generalize and consolidate data in a multidimensional space.
40
Q

Define a data warehouse.

A

A data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model.

  • provides online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.
41
Q

How can business analysts benefit from a data warehouse?

A

DWs provide the architecture and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.

42
Q

What’s a data warehouse?

A

It is a data repository architecture - a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision-making.

43
Q

William H. Inmon’s definition of a data warehouse

A

“A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.”

44
Q

What are the main characteristics of a data warehouse?

A
  1. Subject-oriented
  2. Integrated
  3. Time-variant
  4. Non-volatile
45
Q

What does ‘subject-oriented’ in terms of data warehouses mean?

A

A data warehouse is organized around major subjects, e.g. customer, supplier, product, sales
It excludes data that is not relevant to the decision-making process.

46
Q

What does ‘integrated’ mean in terms of data warehouses?

A

That data come from multiple heterogeneous sources, e.g. flat files, relational databases, online transaction records. These data usually require data cleaning and integration techniques.

47
Q

What does ‘time-variant’ in terms of data warehouses mean?

A

That data are stored to provide information from a historical perspective, e.g. past 5-10 years. Also, data is typically summarized, e.g. organized per item type or per region.

48
Q

What does ‘non-volatile’ mean in terms of a data warehouses?

A

A data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. All that is required for the data warehouse is the initial loading of data and the access of data.

49
Q

Differences between traditional DBs and Data Warehouses.

A

DBs || Data Warehouses

  1. OLTP vs. OLAP
  2. Day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, accounting VS. serving users or knowledge workers in the role of data analysis and decision-making
  3. Customer-oriented (used for transaction and query processing by clerks, clients, and IT professionals) VS. market-oriented (used for data analysis by knowledge workers, including managers, executives & analysts)
  4. Current data (typically too detailed to be easily used for decision-making) VS. historical data
  5. Adopts an ER model and an application-oriented DB design VS. adopts a STAR or SNOWFLAKE model and a subject-oriented database design
  6. Short, atomic transactions VS. facilities for summarization & aggregation, storing & managing information at different levels of granularity
  7. Requires concurrency control and recovery mechanisms VS. allows mostly read-only operations
50
Q

KDD

A

Knowledge discovery from data

51
Q

What is meant by OLAP processing?

A

Analysis techniques with functionalities such as summarization, aggregation and consolidation, as well as the ability to view information from different angles.

52
Q

What is Data Mining?

A
  • misnomer -> it should more appropriately have been named ‘knowledge mining from data’
  • synonym for KDD
53
Q

What are the steps involved in the KDD process?

A
  1. Data cleaning
  2. Data integration
  3. Data selection
  4. Data Transformation
  5. Data mining
  6. Pattern evaluation
  7. Knowledge representation

Data mining is just one step of the KDD process.

54
Q

Enterprise Data Warehouse

A

Collects all info about subjects spanning the entire organization.

55
Q

Data Mart

A

Subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

56
Q

Virtual Warehouse

A

A set of views over operational databases -> requires excess capacity on databases

Only some of the possible summary views may be materialized.

57
Q

Define Data Mining.

A

DM is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from large datasets.

58
Q

Data Mining

A

The process of discovering interesting patterns and knowledge from large amounts of data.

59
Q

What is DM not?

A
  1. (Deductive) query processing
  2. Expert systems or small ML/statistical programs

60
Q

What kind of data is used in DM

A
  1. DBs
  2. DW
  3. Transaction data
  4. Other kinds of data
61
Q

What is the process of constructing a Data Warehouse?

A

Data cleaning, integration, transformation, loading and periodic data refreshing.

62
Q

What is a Data Cube?

A

A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count() or sum(sales_amount).

A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

63
Q

Why are OLAP operations useful?

A

OLAP (online analytical processing) operations make use of background knowledge regarding the domain of the data being studied to allow the presentation of data at different levels of abstraction.

Such operations accommodate different user viewpoints.

64
Q

Potential Applications of DM

A

Database analysis & decision support
> market analysis & management
> risk analysis & management
> fraud detection and management

Other applications
> text mining (newsgroups, email, documents) and Web analysis
> intelligent query answering

65
Q

What types of data do we perform DM on?

A
Besides relational database data, data warehouse data, and transaction data, there are many other kinds of data that have versatile forms and structures and rather different semantic meanings. These include:

- time-related or sequence data
- data streams
- spatial data
- engineering design data
- hypertext or multimedia data
- graph and network data
- the Web
66
Q

List some DM functionalities

A
  • characterization & discrimination
  • frequent pattern mining, associations and correlations
  • classification & regression
  • cluster analysis
  • outlier analysis
67
Q

Interesting patterns represent…?

A

Knowledge

68
Q

What are frequent patterns?

A
  • Patterns that occur frequently in the data

Many kinds:

  • frequent itemsets
  • frequent subsequences (sequential patterns)
  • frequent substructures
69
Q

What is a frequent itemset?

A

A set of items that often appear together in a transactional dataset, e.g. milk and bread are frequently bought together in grocery stores by many customers

70
Q

What is frequent subsequence?

A

A subsequence that occurs frequently in a set of sequences. E.g. the pattern that customers tend to purchase a laptop first, then a digital camera, and then a memory card is a frequent sequential pattern.

71
Q

What is a frequent sub-structure?

A

A substructure can refer to different structural forms (e.g. graphs, trees, lattices) that may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.

72
Q

Define support as a measure.

A

For example,

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

In this scenario, support refers to the fact that 1% of all transactions under analysis show that computer and software are purchased together.

73
Q

Define confidence as a measure.

A

For example,

buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

In this case, a 50% confidence means that there is a 50% chance that, when a customer purchases a computer, they will also purchase software.

Typically, association rules are regarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
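
A minimal sketch of both measures over toy transactions (illustrative only):

```python
transactions = [
    {'computer', 'software'},
    {'computer'},
    {'bread', 'milk'},
    {'computer', 'software', 'mouse'},
]

def support(itemset):
    # fraction of all transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A u C) / support(A)
    return support(antecedent | consequent) / support(antecedent)

print(support({'computer', 'software'}))        # 0.5  (2 of 4 transactions)
print(confidence({'computer'}, {'software'}))   # 0.666... (2 of the 3 computer buyers)
```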

74
Q

What is a decision tree?

A

A flow chart-like structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.

75
Q

How are objects clustered or grouped in clustering?

A

Based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity: objects within a cluster are similar to each other, and are dissimilar to objects in other clusters.

76
Q

Applications of DM in Market Analysis.

A

Data sources: credit card transactions, loyalty cards, discount coupons, customer complaint calls, public lifestyle studies

Target marketing: find clusters or model customers who share the same characteristics

  • Determine customer purchasing patterns over time
  • Cross-market analysis (associations/correlations between product sales, prediction based on the association information)
77
Q

Applications of DM in Market Management

A
  1. Customer profiling (what types of customers buy what products)
  2. Identifying customer requirements
  3. Providing summary information
78
Q

Applications of DM for Corporations

A
  1. Finance planning and asset evaluation
  2. Resource planning
  3. Competition
79
Q

Applications of DM for Fraud Detection

A
  • Auto insurance
  • Money laundering
  • Medical insurance
  • Detecting inappropriate medical treatment
  • Detecting telephone fraud
  • Retail
  • Sports
  • Astronomy
  • Internet Web Surf-Aid
80
Q

What makes a pattern interesting?

A
  1. Easily understood by humans
  2. Valid on new or test data with some degree of certainty
  3. Potentially useful
  4. Novel
  5. Validates a hypothesis that the user sought to confirm
81
Q

What are some objective measures of interestingness?

A
  1. Support
  2. Confidence

82
Q

Define support.

A

Represents the % of transactions from a transaction database that the given rule satisfies:

support(X => Y) = P(X ∪ Y), where X ∪ Y is the union of the itemsets X and Y (i.e. transactions containing both X and Y)

83
Q

Define confidence.

A

Assesses the degree of certainty of the detected association:

confidence(X => Y) = P(Y|X), i.e. the probability that a transaction containing X also contains Y

84
Q

Define a data warehouse.

A

Integrates data from multiple sources and various timeframes. It consolidates data in multidimensional space to form partially materialized data cubes. The data cube model not only facilitates OLAP in multidimensional databases but also promotes multidimensional data mining.

85
Q

What is Data Mining?

A

The process of discovering interesting patterns from massive amounts of data.

As a KDD process, it typically involves data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation and knowledge presentation.

86
Q

What is classification?

A

A form of data analysis that extracts models describing important data classes. Such models are called classifiers, and they predict categorical (discrete, unordered) class labels.

87
Q

Classification

A
  • predicts categorical class labels
  • classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
88
Q

Generally speaking, describe the process of classification.

A

Two-step process.

  1. Model construction - the learning phase
  2. Model usage - the classification step, where the model is used to predict the class label of new data
89
Q

Describe the first phase - the learning phase - of classification.

A
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x_1, x_2, ..., x_n), depicting n measurements made on the tuple from n database attributes, respectively, A_1, A_2, ..., A_n. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.
90
Q

What is supervised learning?

A

The class label of each training tuple is provided.

The learning of the classifier is ‘supervised’ in that it is told to which class each training tuple belongs.

91
Q

Unsupervised learning

A

The class label of each training tuple is not known. The number or set of classes to be learned may not be known in advance.

92
Q

First phase of classification

A

Learning of a mapping or function y = f(X) that can predict the associated class label y of a given tuple X.

93
Q

Second phase of classification.

A

In this step, the model is used for classification.

94
Q

Decision Tree Induction

A

The learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like structure where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label.

95
Q

Internal nodes on a decision tree

A

denote a test on an attribute

96
Q

Branches of a decision tree

A

denote an outcome of a test on an attribute

97
Q

How is a decision tree used for classification?

A

Test the attribute values of the sample against the decision tree.

Given a tuple X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can be easily converted to classification rules.
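
A minimal sketch of this root-to-leaf tracing (the hand-built toy tree below is illustrative, not a learned model):

```python
tree = {
    'attribute': 'age',
    'branches': {
        'youth':       {'label': 'no'},
        'middle_aged': {'label': 'yes'},
        'senior': {
            'attribute': 'credit_rating',
            'branches': {'fair': {'label': 'yes'}, 'excellent': {'label': 'no'}},
        },
    },
}

def classify(node, x):
    while 'label' not in node:                         # internal node: test an attribute
        node = node['branches'][x[node['attribute']]]  # follow the matching branch
    return node['label']                               # leaf node: class prediction

print(classify(tree, {'age': 'senior', 'credit_rating': 'fair'}))  # 'yes'
```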

98
Q

What are some Data Preparation tasks that need to be done in Classification

A
  1. Data cleaning = preprocess data to remove noise and handle outliers
  2. Relevance analysis = feature selection / removing irrelevant or redundant attributes
  3. Data transformation = generalize or normalize data
99
Q

Evaluating Classification Methods

A
  1. Accuracy of the prediction
  2. Speed and scalability = time to construct the model, time to use the model, efficiency in disk-resident databases
  3. Robustness = handling noise and missing values
  4. Interpretability = understanding and insight provided by the model
100
Q

Why are decision tree classifiers so popular?

A
  1. Does not require any domain knowledge or parameter setting - good for exploratory knowledge discovery
  2. Can handle multidimensional data
  3. Representation of human knowledge in tree form is intuitive and generally easy to assimilate by humans
  4. Learning + classification steps of decision tree induction are simple and fast.
  5. Generally, decision trees have very good accuracy.
101
Q

What are the main steps involved in the process of generating Decision Trees.

A
  1. Attribute Selection Measure - selects the attribute that best partitions the tuples into distinct classes
  2. Tree pruning - attempts to remove branches that represent noise or outliers with the goal of improving classification accuracy on unseen data.
102
Q

Describe how DTs work.

A

Most DT algorithms follow a top-down, recursive, divide-and-conquer approach.

Start off with a training set of tuples and their associated class labels. 
The training set is recursively partitioned into smaller subsets as the tree is being built.
103
Q

Basic algorithm for Decision Tree

A
  1. Tree is constructed in a top-down recursive divide-and-conquer manner
  2. At start, all the training examples are at the root
  3. Attributes are categorical (if continuous-valued, they are discretized in advance)
  4. Tuples are partitioned recursively based on selected attributes.
  5. Test attributes are selected on the basis of a heuristic or statistical measure (e.g. IG)

Stopping conditions for partitioning:

  1. All samples for a given node belong to the same class
  2. There are no remaining attributes for further partitioning - majority voting
  3. No training tuples left
104
Q

Extracting Rules from Trees

A

Represent the knowledge in the form of IF-THEN rules

  • One rule is created for each path from the root to the leaf
  • Each attribute-value pair along a path forms a conjunction
  • The leaf node holds the class prediction
  • Rules are easier to understand for humans
105
Q

How to avoid overfitting in Decision Tree classification?

A

Two approaches:

  1. Pre-pruning = Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold.
  2. Post-pruning = Remove branches from a “fully-grown” tree - get a sequence of progressively pruned trees
106
Q

How do we determine the final tree size?

A
  1. Separate training (2/3) and testing (1/3) sets
  2. Use cross-validation
  3. Use all data for training
  4. Use minimum description length principle
107
Q

Why is Decision Tree Induction useful in Data Mining?

A
  1. Relatively faster learning speed
  2. Convertible to simple and easy to understand classification rules
  3. Can use SQL queries for accessing databases
  4. Comparable classification accuracy with other methods
108
Q

What would be the problem with using Customer_ID as the splitting criterion in a Decision Tree?

A

“Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, suppose that one is building a decision tree for some data describing the customers of a business. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer’s credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalise to customers we haven’t seen before (overfitting).”

109
Q

What is required for Decision Tree construction?

A
  1. Data partition
  2. Attribute list
  3. Attribute selection measure (e.g. IG)
110
Q

What do internal nodes represent?

A

Internal nodes denote a test on an attribute

111
Q

What do leaf nodes represent?

A

Leaf nodes represent class labels

112
Q

What do branches represent?

A

The outcome of a test on an attribute

113
Q

Explain the tree pruning step.

A

Tree pruning is used to identify and remove branches that represent noise or outliers. Its goal is to improve classification accuracy.

114
Q

Spatio-temporal data

A

Hurricane data, environmental data, global warming data

115
Q

What is a spatial data warehouse?

A

A spatial DW is an integrated, subject-oriented, time-variant, and non-volatile spatial data repository for data analysis and decision-making.

116
Q

Why is Spatial Data integration a big issue?

A
  1. Structure-specific formats (raster vs. vector-based, OO vs relational models, different storage and indexing, etc.).
  2. Vendor-specific formats (e.g. ESRI, MapInfo, Intergraph, etc.)
117
Q

Spatial data cube

A
  1. A multidimensional spatial database
  2. Both dimensions and measures may contain spatial components.

118
Q

What are some computation methods for spatial data cube?

A
  1. Online aggregation
  2. Pre-processing
  3. Selective materialization
119
Q

Spatial classification

A

Analyse spatial objects to derive classification schemes such as decision trees in relevance to certain spatial properties (e.g. district, highway, river)

120
Q

Spatial trend analysis

A
  1. Detect changes and trends along a spatial dimension
  2. Study the trend of non-spatial or spatial data changing with space
  3. Observe the trend of changes of the climate or vegetation with the increasing distance from the ocean.
121
Q

What is time-series data?

A

> consists of sequences of values or events changing with time
> data is recorded at regular intervals
> has characteristic time-series components (trend, cyclic, seasonal, irregular/random)

122
Q

Time series data

A

Can be illustrated as a time-series graph which describes a point moving over time

123
Q

Categories of Time-Series movements

A
  1. Long-term or trend movements
  2. Cyclic movements or cycle variations e.g. business cycles
  3. Seasonal movements or seasonal variations, e.g. almost identical patterns that a time series appears to follow during corresponding months of successive years.
  4. Irregular or random movements (sporadic changes due to chance events e.g. labour disputes or personnel changes)
124
Q

How can we estimate the trend curve?

A
  1. Freehand method - fit the curve by looking at the graph; costly and barely reliable for large-scale data mining
  2. Least-squares method - find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points
  3. Moving-average method - eliminates cyclic, seasonal and irregular patterns; loses end data; sensitive to outliers (see the sketch below)
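
A minimal moving-average sketch (toy series; illustrative only), showing the loss of end data mentioned above:

```python
def moving_average(series, window):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

sales = [10, 12, 9, 14, 16, 13, 18]
print(moving_average(sales, 3))
# [10.33.., 11.66.., 13.0, 14.33.., 15.66..] -- 7 points smooth down to 5
```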
125
Q

What is sequential pattern mining?

A
  • Mining of frequently occurring patterns related to time or other sequences
  • Usually concentrates on symbolic patterns, e.g. renting ‘Star Wars’, then ‘Empire Strikes Back’, then ‘Return of the Jedi’, in that order (see the sketch below)
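
A minimal sketch of counting the support of such a symbolic pattern (an ordered, not necessarily contiguous, subsequence) over toy rental sequences (illustrative only):

```python
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)   # consumes 'it', so order is respected

rentals = [
    ['Star Wars', 'Empire Strikes Back', 'Return of the Jedi'],
    ['Star Wars', 'Alien', 'Empire Strikes Back', 'Return of the Jedi'],
    ['Return of the Jedi', 'Star Wars'],
]
pattern = ['Star Wars', 'Empire Strikes Back', 'Return of the Jedi']
print(sum(is_subsequence(pattern, s) for s in rentals))  # 2 of 3 sequences
```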
126
Q

Applications of sequential pattern mining.

A
  • Target marketing & customer retention
  • Weather prediction

127
Q

Sequential pattern mining parameters

A
  1. Duration of a time sequence T - sequential pattern mining can be confined to the data within a specified duration
  2. Event-folding window W - if W = T, time-insensitive patterns are found; if W = 0, each event occurs at a distinct time instant; if 0 < W < T, sequences occurring within the same period W are folded in the analysis
  3. Time interval, int, between events in the discovered pattern - if int = 0, no interval gap is allowed, i.e. only strictly consecutive sequences are found; if min_int <= int <= max_int, patterns are separated by at least min_int time and at most max_int time; if int = c != 0, patterns must occur at an exact interval
128
Q

What is a DW?

A

A DW generalizes and consolidates data in a multidimensional space.

> Provides OLAP tools for the interactive analysis of MD data of varied granularities.

129
Q

Data Warehouse

A

A decision support database that is maintained separately from the organization’s operational database.

Supports information processing by providing a solid platform of consolidated, historical data.

130
Q

Inmon’s definition of a data warehouse:

A

“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”

131
Q

What is Data Warehousing?

A

The process of constructing and using data warehouses.

132
Q

Subject-oriented in DWs

A

> organised around major subjects, e.g. customers, sales, products
> focuses on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
> provides a simple and concise view around particular subjects by excluding data not relevant to the decision-making process

133
Q

Integrated in DWs

A

> Integrates multiple heterogeneous data sources (relational databases, flat files, on-line transaction records)
> Applies techniques of data cleaning and integration (to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources)

134
Q

Time-variant in DWs

A

> The time horizon for a DW is significantly longer than that of an operational system
> While operational DBs contain current data, DW data provides a historical perspective (e.g. the past 5-10 years)
> Every key structure in the DW contains an element of time, implicitly or explicitly

135
Q

Non-volatile in DWs

A

> A physically separate store of data transformed from the operational environment
> Operational update of data does not occur in the data warehouse environment
> A DW does not require transaction processing, recovery, or concurrency control mechanisms
> Requires only 2 operations in data accessing: initial loading of data, and access of data
> Read-only data

136
Q

DW vs. Heterogeneous DBs

A
  1. Traditional heterogeneous DB integration: build mediators/wrappers on top of the heterogeneous DBs - a query-driven approach
  2. DWs are update-driven and high-performance: data from heterogeneous sources is integrated in advance and stored in warehouses for direct access and analysis
137
Q

OLTP

A

Online Transaction Processing

138
Q

OLAP

A

Online Analytical Processing

139
Q

OLTP vs OLAP

A

> OLTP is the major task of traditional relational DBs
> Involves day-to-day operations, e.g. purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

> OLAP is the major task of DWs
> OLAP = data analysis and decision-making

140
Q

OLTP vs OLAP

A
  1. Customer vs. Market-oriented
  2. Current, detailed data vs. historical, consolidated
  3. ER + application vs STAR + subject oriented design
  4. Current, local vs. evolutionary, integrated view
  5. Update vs. Read-only with complex queries
141
Q

Why should we separate DWs from DB systems?

A

High performance for both systems.

DBMS - tuned for OLTP (access methods, indexing, concurrency control, recovery)
Warehouse - tuned for OLAP (complex OLAP queries, MD view, consolidation)

142
Q

What’s different about the data in DW and DB?

A

> Decision support requires historical data, which operational DBs do not typically have
> DW requires consolidation (aggregation, summarization) of data from various sources
> Data quality: data needs to be reconciled from different sources

143
Q

A DW is based on…

A

a multidimensional data model, which views data in the form of a data cube.

A data cube allows data to be modelled and viewed in multiple dimensions, e.g. sales.

144
Q

STAR Schema

A

A fact table in the middle connected to a set of dimension tables

145
Q

Snowflake Schema

A

A refinement of star schema where some dimensional hierarchy is normalised into a set of smaller dimension tables, forming a shape similar to a snowflake

146
Q

Fact constellation

A

Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.

Fact constellations are typically used for data warehouses, as they can model multiple interrelated subjects

147
Q

How are data marts modelled?

A

Star or Snowflake schema

148
Q

Typical OLAP operations

A
  1. Roll-up - summarize data by climbing up a concept hierarchy or by dimension reduction (see the sketch below)
  2. Drill-down (roll-down) - the reverse of roll-up: from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
  3. Slice and dice - project and select
  4. Pivot (rotate) - reorient the cube; visualization; 3D to a series of 2D planes
  5. Drill-across - involves more than one fact table
  6. Drill-through - through the bottom level of the cube to its back-end relational tables using SQL
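
A minimal sketch of roll-up as re-aggregation along the time hierarchy (pandas assumed available; the table and column names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    'quarter': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2'],
    'month':   ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'euros_sold': [100, 150, 80, 120, 90],
})

# Roll-up: climb from the month level to the quarter level by re-aggregating.
by_month   = sales.groupby(['quarter', 'month'])['euros_sold'].sum()
by_quarter = sales.groupby('quarter')['euros_sold'].sum()
print(by_quarter)   # Q1: 330, Q2: 210 -- the higher-level summary
```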
149
Q

4 Views of a data warehouse

A
  1. Top-down view
  2. Data source view
  3. Data warehouse view
  4. Business query view
150
Q

2 Approaches to DW Design

A
  1. Top-down: Starts with overall design and planning (mature)
  2. Bottom-up: Starts with experiments and prototypes (rapid)
151
Q

From an SE point of view, what are the steps involved in designing a DW?

A

Planning, data collection, DW design, testing and evaluation, DW deployment

152
Q

SE point of view, DW design

A
  1. Waterfall: structured and systematic analysis at each step before proceeding to the next
  2. Spiral: rapid generation of increasingly functional systems, short turnaround time
153
Q

What’s involved in a typical data warehouse design process

A
  1. Choose the business process you want to model e.g. orders, invoices
  2. Choose the grain (atomic level of data) of the business process
  3. Choose the dimensions that will apply to each fact table record
  4. Choose the measures that will apply to each fact table record.
154
Q

OLAP Server architectures

A
  1. Relational OLAP
  2. Multidimensional OLAP
  3. Hybrid OLAP
  4. Specialised SQL servers
155
Q

Information Gain

A
  • Attribute Selection Measure
  • Based on pioneering work of Claude Shannon on information theory, which studied the value of ‘information content’ of messages

Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions. Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.

The expected information needed to classify a tuple in D is given by

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A log function to the base 2 is used because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a_1, a_2, ..., a_v}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. These partitions would correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., a partition may contain a collection of tuples from different classes rather than from a single class).

How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) - Info_A(D)

In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
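
A minimal sketch of these formulas (toy tuples; the attribute and label names are illustrative):

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions in D
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr):
    labels = [r['label'] for r in rows]
    partitions = {}                          # split D into D_j by the values of A
    for r in rows:
        partitions.setdefault(r[attr], []).append(r['label'])
    info_a = sum(len(d) / len(rows) * info(d) for d in partitions.values())
    return info(labels) - info_a             # Gain(A) = Info(D) - Info_A(D)

rows = [
    {'age': 'youth',  'label': 'no'},  {'age': 'youth',  'label': 'no'},
    {'age': 'middle', 'label': 'yes'}, {'age': 'senior', 'label': 'yes'},
]
print(gain(rows, 'age'))   # 1.0 -- 'age' perfectly separates the two classes here
```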

156
Q

Overfitting

A

A model overfits the training data when it learns anomalies and noise in the training set, so that it performs well on training data but poorly on previously unseen data. In decision trees, overfitting is addressed by pre-pruning or post-pruning.

157
Q

Prune Step of Apriori algo

A

After the join (candidate generation) step, the prune step removes any candidate k-itemset that has a (k-1)-subset which is not frequent: by the Apriori property, all subsets of a frequent itemset must also be frequent. This shrinks the candidate set before the database scan that counts supports.

158
Q

Clustering

A

The process of grouping a set of data objects into multiple groups or clusters so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters.

Dissimilarity and similarity are assessed based on the attribute values describing the objects and often involve distance measures.

159
Q

Unsupervised learning

A

Clustering

  • Learning by observation, rather than learning by examples
160
Q

Requirements of Clustering

A
  1. Scalability - working well on large datasets
  2. Ability to deal with different types of attribute - not just numeric
  3. Discovery of clusters with arbitrary shape
  4. Requirements for domain knowledge to determine input parameters
  5. Ability to deal with noisy data
  6. Incremental clustering and insensitivity to input order
  7. Capability of clustering high-dimensional data
  8. Constraint-based clustering
  9. Interpretability & usability
161
Q

4 important elements of clustering

A
  1. partitioning criteria - should clustering be hierarchical, or should all clusters operate on the same level?
  2. separation of clusters - should clusters be mutually exclusive?
  3. similarity measure - determining the similarity between 2 objects by the distance between them, e.g. absolute distance
  4. clustering space - using the full data space may not be useful with high-dimensional data -> subspace clustering
162
Q

Partition-based methods

A

Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it divides the data into k groups such that each group must contain at least one object. In other words, partitioning methods conduct one-level partitioning on data sets. The basic partitioning methods typically adopt exclusive cluster separation; that is, each object must belong to exactly one group.

163
Q

Heuristic clustering methods

A
  • k-means
  • k-medoids
  • work well for finding spherical-shaped clusters in small-to-medium-size databases
164
Q

Hierarchical methods

A
  • creates a hierarchical decomposition of the given set of data objects
  • agglomerative or divisive, based on how the decomposition is formed
  • the agglomerative approach is bottom-up: it starts with one object per cluster and successively merges objects or groups close to one another until all groups are merged into one
  • the divisive approach is top-down: it starts with all objects in the same cluster, which is iteratively split into smaller clusters until eventually each object is in its own cluster
  • hierarchical clustering can be distance-, density- or continuity-based
  • suffers from the fact that once a step (merge or split) is done, it cannot be undone
165
Q

Density-based methods

A

Their general idea is to continue growing a given cluster as long as the density (number of objects or data points) in the “neighborhood” exceeds some threshold. For example, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. Such a method can be used to filter out noise or outliers and discover clusters of arbitrary shape.

166
Q

Grid-based methods

A

Quantize the object space into a finite number of cells that form a grid structure.

167
Q

Partitioning methods

A

– Find mutually exclusive clusters of spherical shape
– Distance-based
– May use mean or medoid (etc.) to represent cluster center
– Effective for small- to medium-size data sets

168
Q

Hierarchical methods

A

– Clustering is a hierarchical decomposition (i.e., multiple levels)
– Cannot correct erroneous merges or splits
– May incorporate other techniques like microclustering or consider object “linkages”

169
Q

Density-based methods

A

– Can find arbitrarily shaped clusters
– Clusters are dense regions of objects in space that are separated by low-density regions
– Cluster density: each point must have a minimum number of points within its “neighborhood”
– May filter out outliers

170
Q

Grid-based methods

A

– Use a multiresolution grid data structure
– Fast processing time (typically independent of the number of data objects, yet dependent on grid size)

171
Q

how does k-means work?

A

First, it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. For each of the remaining objects, an object is assigned to the cluster to which it is the most similar, based on the Euclidean distance between the object and the cluster mean. The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new mean using the objects assigned to the cluster in the previous iteration. All the objects are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current round are the same as those formed in the previous round.
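
A minimal 1-D sketch of that loop (toy data, k = 2; illustrative only):

```python
import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)               # k initial cluster means
    while True:
        clusters = [[] for _ in range(k)]
        for p in points:                             # assignment step: nearest mean
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]   # update step
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                   # stable assignment: stop
            return clusters, centers
        centers = new_centers

clusters, centers = kmeans([1.0, 2.0, 1.5, 10.0, 11.0, 10.5], k=2)
print(sorted(centers))    # roughly [1.5, 10.5] -- two well-separated groups
```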

172
Q

K-means strengths

A

Relatively efficient - O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations.

Relatively scalable and efficient in large datasets.

173
Q

Weaknesses of K-means

A
  1. Not guaranteed to converge to the global optimum - often terminates at a local optimum
  2. Applicable only when a mean is defined (what about categorical data?)
  3. Need to specify k in advance
  4. Unable to handle noisy data and outliers (the mean is a non-resistant measure)
  5. Not suitable for discovering clusters with non-convex shapes
174
Q

Precision

A

The % of retrieved documents that are in fact relevant to the query:

precision = |{relevant} ∩ {retrieved}| / |{retrieved}|

175
Q

Recall

A

The percentage of documents that are relevant to the query and were, in fact, retrieved:

recall = |{relevant} ∩ {retrieved}| / |{relevant}|
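
A minimal sketch of both measures over toy document-ID sets (illustrative only):

```python
relevant  = {1, 2, 3, 4, 5}    # documents actually relevant to the query
retrieved = {3, 4, 5, 6}       # documents the IR system returned

hits = relevant & retrieved    # relevant AND retrieved

precision = len(hits) / len(retrieved)   # 3/4 = 0.75
recall    = len(hits) / len(relevant)    # 3/5 = 0.60
print(precision, recall)
```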

176
Q

Major difficulties of keyword-based retrieval

A
  1. Synonymy - a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g. a query on ‘data mining’ missing a document about ‘knowledge discovery’
  2. Polysemy - the same keyword may mean different things in different contexts
177
Q

Types of Web Mining

A
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
178
Q

Challenges of Web Mining

A
  1. Abundance problem
  2. Limited Coverage of the web
  3. Limited query interface
  4. Limited customization
179
Q

HITS

A

Hyperlink-Induced Topic Search - explores interaction between hubs and authoritative pages

180
Q

Web Usage Mining

A
  1. Mining Web log records to discover user access patterns of Web pages, in order to:
  • Target potential customers for e-commerce
  • Enhance the quality and delivery of Internet information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations