Data Mining Flashcards

0
Q

Anti-monotonicity

A

When an item set violates the constraint, so does any of its supersets (e.g. minimum support: if an item set is infrequent, every superset is infrequent too — the basis of Apriori pruning)

1
Q

Monotonicity

A

When an item set satisfies the constraint, so does any of its supersets

2
Q

Classification

A

Supervised learning

3
Q

Clustering

A

Unsupervised learning

4
Q

Apriori

A

Algorithm for frequent item set mining and association rule learning over transactional databases.

Identifies frequent individual items in the database and extends them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
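The level-wise extension described above can be sketched in a few lines. This is a minimal illustration, not an optimised implementation; the function name `apriori` and the set-based candidate generation are choices made here for clarity.

```python
# Minimal Apriori sketch: find all item sets occurring in at least
# min_support transactions, growing candidates one item at a time.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Start with frequent individual items.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)

    # Extend frequent item sets level by level; by anti-monotonicity,
    # only joins of frequent sets can yield new frequent candidates.
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
    return frequent
```

For example, with four transactions and a minimum support of 2, `{bread, milk}` is reported frequent while `{milk, butter}` (appearing only once) is not.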

5
Q

Naïve Bayesian Classifier

A

Makes the strong (naïve) assumption that features are conditionally independent of one another given the class.
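Under that independence assumption, the class posterior factorises into per-feature likelihoods. A tiny categorical sketch (the helper names `train`/`predict` and the toy weather data are illustrative only), with Laplace smoothing to avoid zero probabilities:

```python
from collections import Counter, defaultdict

# Categorical naive Bayes sketch: count feature values per class,
# then score each class as prior x product of smoothed likelihoods.
def train(X, y):
    classes = Counter(y)
    counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values = defaultdict(set)       # feature index -> observed values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i)][v] += 1
            values[i].add(v)
    return classes, counts, values

def predict(model, xs):
    classes, counts, values = model
    total = sum(classes.values())
    best, best_p = None, -1.0
    for c, n in classes.items():
        p = n / total  # class prior
        for i, v in enumerate(xs):
            # Laplace (add-one) smoothing over the feature's value set.
            p *= (counts[(c, i)][v] + 1) / (n + len(values[i]))
        if p > best_p:
            best, best_p = c, p
    return best
```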

6
Q

K-means clustering

A

Partitioning method

Each cluster is represented by the centre of the cluster

7
Q

Partitioning method

A

Constructing a partition of a database D of n objects into a set of k clusters such that the sum of squared distances (of each object to its cluster's representative) is minimised
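The objective being minimised can be written out directly; a small sketch (the function name `sse` is illustrative) computing the sum of squared distances of each object to its cluster centre:

```python
# Sum-of-squared-distance objective of partitioning methods:
# total over all clusters of squared distances to the cluster centre.
def sse(clusters, centres):
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cluster, c in zip(clusters, centres)
        for p in cluster
    )
```

For instance, two points at distance 1 on either side of their centre contribute 1 + 1 = 2, while a point sitting exactly on its centre contributes 0.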

8
Q

K-means steps

A
  1. Partition objects into k non-empty subsets (initial seed point arbitrarily chosen)
  2. Compute seed points as the centroids of the clusters of the current partition
  3. Assign each object to the cluster with the nearest seed point
  4. Repeat Steps 2–3; stop when assignments no longer change
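The four steps above can be sketched in pure Python (the `kmeans` helper and the fixed random seed are assumptions made here for reproducibility, not part of the card):

```python
import random

# K-means sketch: arbitrary initial seeds, recompute centroids,
# reassign objects, repeat until assignments stop changing.
def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: arbitrary seed points
    assignment = None
    while True:
        # Step 3: assign each object to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assignment == assignment:  # Step 4: stop on no new assignments
            return centroids, assignment
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
```

On two well-separated pairs of points, the loop converges in a couple of iterations with one cluster per pair.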
9
Q

PAM

A

Partitioning Around Medoids

10
Q

Partitioning Around Medoids

A

Medoids - Representative objects

  1. Starts from an initial set of medoids and iteratively replaces one of them by one of the non-medoids if doing so improves the total distance of the resulting clustering
  2. Effective for small data sets
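The swap loop above can be sketched as follows. This is a simplified first-improvement variant (the names `pam` and `total_cost`, and taking the first k objects as initial medoids, are assumptions of this sketch):

```python
# Total distance of every object to its nearest medoid.
def total_cost(points, medoids, dist):
    return sum(min(dist(p, m) for m in medoids) for p in points)

# PAM sketch: keep swapping a medoid with a non-medoid while any swap
# lowers the total distance of the resulting clustering.
def pam(points, k, dist):
    medoids = list(points[:k])  # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                # Try replacing medoid m with non-medoid p.
                candidate = [p if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids = candidate
                    improved = True
    return medoids
```

Each accepted swap strictly lowers the cost, so the loop terminates; the quadratic swap search is why PAM is effective mainly for small data sets.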
11
Q

MOLAP

A

Multidimensional OLAP

12
Q

ROLAP

A

Relational OLAP

13
Q

Multidimensional OLAP

A

The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed. This MOLAP structure is highly optimised to maximise query performance. The storage location can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition’s source data. Query response times can be decreased substantially by using aggregations. The data in the partition’s MOLAP structure is only as current as the most recent processing of the partition.

14
Q

Relational OLAP

A

The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition’s data source. Unlike the MOLAP storage mode, ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. Query response is generally slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also typically slower with ROLAP. However, ROLAP enables users to view data in real time and can save storage space when you are working with large datasets that are infrequently queried, such as purely historical data.

15
Q

DBSCAN

A

Density Based Spatial Clustering of Applications with Noise

16
Q

Data Warehouse

A
  • A decision support database that is maintained separately from the organisation’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis
17
Q

SVM Inefficiencies

A
  • dual representation of data and model during training and prediction
  • iterative nature of approximation
18
Q

DBSCAN Advantages

A
  • can find arbitrarily-shaped clusters
  • requires just 2 parameters: Epsilon & minPts (min. no. of points to form a dense region)
  • does not require no. of clusters to be specified a priori
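A compact illustration of how those two parameters drive the algorithm (the function name `dbscan` and the label convention `-1` for noise are choices of this sketch, loosely following the original Ester et al. formulation):

```python
# DBSCAN sketch over n-dimensional points with Euclidean distance:
# points with >= min_pts neighbours within eps are core points, and
# clusters grow by expanding from core points; the rest is noise.
def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (may later be absorbed as a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbours(j)
            if len(js) >= min_pts:  # j is a core point: expand from it
                queue.extend(js)
        cluster += 1
    return labels
```

Note that the number of clusters falls out of the density structure; only eps and min_pts are supplied.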
19
Q

Neural Network Advantages & Disadvantages

A

+ High tolerance to noisy data
+ Ability to classify untrained patterns
+ Well-suited for continuous-valued inputs and outputs
+ Successful on a wide array of real-world data

  • Long training time
  • Parameter choices (e.g. network topology) are subjective, typically determined empirically
  • Weak mathematical foundation: difficult to interpret the symbolic meaning behind the learned weights and the
    “hidden units” in the network
20
Q

Data warehousing steps

A

Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse