Data Mining Flashcards

0
Q

Anti-monotonicity

A

When an item set violates the constraint, so does any of its supersets (e.g. minimum support: if an item set is infrequent, every superset is infrequent too — the basis of Apriori pruning)

1
Q

Monotonicity

A

When an item set satisfies the constraint, so does any of its supersets

2
Q

Classification

A

Supervised learning

3
Q

Clustering

A

Unsupervised learning

4
Q

Apriori

A

Algorithm for frequent item set mining and association rule learning over transactional databases.

Identifies frequent individual items in the database and extends them to larger and larger item sets as long as those item sets appear sufficiently often in the database.
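The level-wise extension described above can be sketched in a few lines. This is a minimal illustration, not an optimised implementation; the function name `apriori` and the set-based candidate generation are choices made here for clarity.

```python
# Minimal Apriori sketch: find all item sets occurring in at least
# min_support transactions, growing candidates one item at a time.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Start with frequent individual items.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)

    # Extend frequent item sets level by level; by anti-monotonicity,
    # only joins of frequent sets can yield new frequent candidates.
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
    return frequent
```

For example, with four transactions and a minimum support of 2, `{bread, milk}` is reported frequent while `{milk, butter}` (appearing only once) is not.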

5
Q

Naïve Bayesian Classifier

A

Makes the strong (naïve) assumption that features are conditionally independent of one another given the class.
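Under that independence assumption, the class posterior factorises into per-feature likelihoods. A tiny categorical sketch (the helper names `train`/`predict` and the toy weather data are illustrative only), with Laplace smoothing to avoid zero probabilities:

```python
from collections import Counter, defaultdict

# Categorical naive Bayes sketch: count feature values per class,
# then score each class as prior x product of smoothed likelihoods.
def train(X, y):
    classes = Counter(y)
    counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values = defaultdict(set)       # feature index -> observed values
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i)][v] += 1
            values[i].add(v)
    return classes, counts, values

def predict(model, xs):
    classes, counts, values = model
    total = sum(classes.values())
    best, best_p = None, -1.0
    for c, n in classes.items():
        p = n / total  # class prior
        for i, v in enumerate(xs):
            # Laplace (add-one) smoothing over the feature's value set.
            p *= (counts[(c, i)][v] + 1) / (n + len(values[i]))
        if p > best_p:
            best, best_p = c, p
    return best
```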

6
Q

K-means clustering

A

Partitioning method

Each cluster is represented by the centre of the cluster

7
Q

Partitioning method

A

Constructing a partition of a database D of n objects into a set of k clusters such that the sum of squared distances (of each object to its cluster's representative) is minimised
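The objective being minimised can be written out directly; a small sketch (the function name `sse` is illustrative) computing the sum of squared distances of each object to its cluster centre:

```python
# Sum-of-squared-distance objective of partitioning methods:
# total over all clusters of squared distances to the cluster centre.
def sse(clusters, centres):
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cluster, c in zip(clusters, centres)
        for p in cluster
    )
```

For instance, two points at distance 1 on either side of their centre contribute 1 + 1 = 2, while a point sitting exactly on its centre contributes 0.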

8
Q

K-means steps

A
  1. Partition objects into k non-empty subsets (initial seed point arbitrarily chosen)
  2. Compute seed points as the centroids of the clusters of the current partition
  3. Assign each object to the cluster with the nearest seed point
  4. Repeat Steps 2–3; stop when assignments no longer change
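The four steps above can be sketched in pure Python (the `kmeans` helper and the fixed random seed are assumptions made here for reproducibility, not part of the card):

```python
import random

# K-means sketch: arbitrary initial seeds, recompute centroids,
# reassign objects, repeat until assignments stop changing.
def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: arbitrary seed points
    assignment = None
    while True:
        # Step 3: assign each object to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assignment == assignment:  # Step 4: stop on no new assignments
            return centroids, assignment
        assignment = new_assignment
        # Step 2: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(d) / len(members) for d in zip(*members))
```

On two well-separated pairs of points, the loop converges in a couple of iterations with one cluster per pair.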
9
Q

PAM

A

Partitioning Around Medoids

10
Q

Partitioning Around Medoids

A

Medoids - Representative objects

  1. Starts from an initial set of medoids and iteratively replaces one of them by one of the non-medoids if doing so improves the total distance of the resulting clustering
  2. Effective for small data sets
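The swap loop above can be sketched as follows. This is a simplified first-improvement variant (the names `pam` and `total_cost`, and taking the first k objects as initial medoids, are assumptions of this sketch):

```python
# Total distance of every object to its nearest medoid.
def total_cost(points, medoids, dist):
    return sum(min(dist(p, m) for m in medoids) for p in points)

# PAM sketch: keep swapping a medoid with a non-medoid while any swap
# lowers the total distance of the resulting clustering.
def pam(points, k, dist):
    medoids = list(points[:k])  # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                # Try replacing medoid m with non-medoid p.
                candidate = [p if x == m else x for x in medoids]
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids = candidate
                    improved = True
    return medoids
```

Each accepted swap strictly lowers the cost, so the loop terminates; the quadratic swap search is why PAM is effective mainly for small data sets.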
11
Q

MOLAP

A

Multidimensional OLAP

12
Q

ROLAP

A

Relational OLAP

13
Q

Multidimensional OLAP

A

The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed. This MOLAP structure is highly optimised to maximise query performance. The storage location can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition’s source data. Query response times can be decreased substantially by using aggregations. The data in the partition’s MOLAP structure is only as current as the most recent processing of the partition.

14
Q

Relational OLAP

A

The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition’s data source. Unlike the MOLAP storage mode, ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders. Instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. Query response is generally slower with ROLAP storage than with the MOLAP or HOLAP storage modes. Processing time is also typically slower with ROLAP. However, ROLAP enables users to view data in real time and can save storage space when you are working with large datasets that are infrequently queried, such as purely historical data.

15
Q

DBSCAN

A

Density Based Spatial Clustering of Applications with Noise

16
Q

Data Warehouse

A
  • A decision support database that is maintained separately from the organisation’s operational database
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis
17
Q

SVM Inefficiencies

A
  • dual representation of data and model during training and prediction
  • iterative nature of approximation
18
Q

DBSCAN Advantages

A
  • can find arbitrarily-shaped clusters
  • requires just 2 parameters: Epsilon & minPts (min. no. of points to form a dense region)
  • does not require no. of clusters to be specified a priori
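A compact illustration of how those two parameters drive the algorithm (the function name `dbscan` and the label convention `-1` for noise are choices of this sketch, loosely following the original Ester et al. formulation):

```python
# DBSCAN sketch over n-dimensional points with Euclidean distance:
# points with >= min_pts neighbours within eps are core points, and
# clusters grow by expanding from core points; the rest is noise.
def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)  # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise (may later be absorbed as a border point)
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbours(j)
            if len(js) >= min_pts:  # j is a core point: expand from it
                queue.extend(js)
        cluster += 1
    return labels
```

Note that the number of clusters falls out of the density structure; only eps and min_pts are supplied.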
19
Q

Neural Network Advantages & Disadvantages

A

+ High tolerance to noisy data
+ Ability to classify untrained patterns
+ Well-suited for continuous-valued inputs and outputs
+ Successful on a wide array of real-world data

  • Long training time
  • Parameter choices (e.g. network topology) are subjective, typically determined empirically
  • Weak mathematical foundation: difficult to interpret the symbolic meaning behind the learned weights and the
    “hidden units” in the network
20
Q

Data warehousing steps

A

Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse