Data mining Flashcards

1
Q

Describe the KDD-process. All five steps, and explain what is done for each individual step.

A
  1. Data selection
    - selecting appropriate data from various sources.
  2. Pre-processing
    - cleaning, removing errors, outliers etc
  3. Transformation
    - transform data into a usable format for the DM method
  4. Data mining
    - apply data mining methods
  5. Interpretation
    - use discovered knowledge to make informed decisions.

(iterative process)

2
Q

Mention all steps in CRISP-DM – short explanation of steps.

A
  1. Business understanding
    - understand how ML may benefit the organisation’s goals.
  2. Data understanding
    - an initial dataset is established and studied to see if it is suitable for further processing. Otherwise, go back a step and work iteratively.
  3. Data preparation
    - clean and preprocess the data set before modeling.
  4. Modeling
    - use the data to create models using ML-methods.
  5. Evaluation
    - assess the model against the business goals. If the model is poor, reconsider the whole project and start over. If it is good, it is time to integrate it with the software system.
  6. Deployment
    - launch the model.
3
Q

Big data and its 4 V’s. Explain each V briefly.

A
  1. Velocity
    - speed at which data is generated
  2. Volume
    - how much data
  3. Veracity
    - accuracy of data
  4. Variety
    - mix and types of data, format (structured, semi-structured etc)
4
Q

Describe causality.

A

Causality is the relationship between cause and effect: a change in A brings about a change in B.
Example: the Christmas holidays (A) => (B) more house fires and more patients with broken femurs. The causality is that B depends on A.

5
Q

What is the apriori algorithm and what does it do?

A

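Apriori is an algorithm used in Association Rule Learning. It identifies frequent itemsets, i.e. sets of items that occur together in at least a minimum share of transactions (the minimum support). Association rules can then be generated from these itemsets.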
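
The idea can be sketched in a few lines of Python. This is a minimal illustration, not an optimised implementation; the function name `apriori`, the `min_support` parameter and the shopping baskets are assumptions made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Share of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what makes Apriori efficient: an itemset can only be frequent if all of its subsets are frequent, so most candidates never need to be counted.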

6
Q

What is the formula for Precision?

A

Precision = TP / (TP + FP)

7
Q

What is the formula for Recall?

A

Recall = TP / (TP + FN)
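
As a small worked example, Precision and Recall can be computed from the counts of a confusion matrix. The labels below are made up, and returning 0.0 when a denominator is zero is a convention assumed here, not part of the formulas.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def precision(tp, fp):
    # Precision = TP / (TP + FP); 0.0 if nothing was predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Recall = TP / (TP + FN); 0.0 if there are no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```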

8
Q

What may be considered of importance when using Recall and Precision – where would you prioritize?

A

Recall is used when trying to avoid False Negatives.

Precision is used when trying to avoid False Positives.

Example: in cancer diagnosis, prioritise Recall, because missing a real case (a False Negative) is worse than raising a false alarm.
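
The trade-off can be illustrated by moving the decision threshold of a classifier. This is a sketch with made-up scores and labels; the helper `precision_recall_at` is a hypothetical name for this example.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold, then compute (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = has the condition).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

strict = precision_recall_at(scores, labels, threshold=0.7)    # few positives predicted
lenient = precision_recall_at(scores, labels, threshold=0.35)  # many positives predicted
print("strict:", strict)
print("lenient:", lenient)
```

With the strict threshold no False Positives occur, but one real case is missed. With the lenient threshold every real case is found, at the cost of one False Positive. For a diagnosis task the lenient setting would usually be preferred.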

9
Q

How are Accuracy and Error calculated?

A

Accuracy = correct predictions / total number of predictions.

Error = incorrect predictions / total number of predictions. (Error = 1 - Accuracy.)
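
A minimal sketch of both calculations, using made-up labels:

```python
def accuracy(y_true, y_pred):
    # Share of predictions that match the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def error_rate(y_true, y_pred):
    # Share of predictions that do not match the true label.
    incorrect = sum(t != p for t, p in zip(y_true, y_pred))
    return incorrect / len(y_true)

# Hypothetical labels: 3 of 5 predictions are correct.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

acc = accuracy(y_true, y_pred)
err = error_rate(y_true, y_pred)
print(acc, err)
```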

10
Q

Explain what Regression is and what it is used for.

A

Regression is a supervised, predictive learning method for predicting a numerical (continuous) value.

It can be used for risk analysis, housing market pricing, weather forecasting etc.
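
As an illustration, the simplest form of regression fits a straight line to the data with ordinary least squares. The sketch below assumes one input variable; the house sizes and prices are invented for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: house size in square metres, price in thousands.
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 100 + intercept  # predicted price of a 100 square metre house
print(slope, intercept, predicted)
```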

11
Q

Explain Clustering.

A

Clustering is an unsupervised, descriptive learning method. It groups data points so that points in the same cluster are more similar to each other than to points in other clusters. Centroid-based methods such as k-means assign each data point to its closest centroid. Note that kNN, despite the similar name, is a supervised classification method and not a clustering algorithm.
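
A minimal k-means sketch shows the centroid idea. It assumes a naive initialisation (the first k points become the starting centroids) and a fixed number of iterations; real implementations usually use a smarter initialisation and a convergence check. The points are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of k-means iterations."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```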

12
Q

What is Support, Confidence and Lift?

A

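All three are measures used in Association Rule Mining.

Support = how frequently an itemset occurs: number of transactions containing X / total number of transactions.

Confidence = how often item B is bought when item A is bought: support(A and B) / support(A).

Lift = confidence(A => B) / support(B). Lift < 1: A makes B less likely (bad rule). Lift = 1: A and B are independent. Lift > 1: A makes B more likely (good rule).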
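
The three measures can be computed directly from a list of transactions. This is a sketch using the standard definitions (support as a share of transactions, confidence as support(A and B) / support(A), lift as confidence / support(B)); the baskets are invented.

```python
def support(transactions, itemset):
    # Share of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    # Of the transactions containing A, the share that also contain B.
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    # How much more often A and B occur together than if they were independent.
    return confidence(transactions, a, b) / support(transactions, b)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]

s = support(baskets, {"bread", "butter"})
c = confidence(baskets, {"bread"}, {"butter"})
l = lift(baskets, {"bread"}, {"butter"})
print(s, c, l)
```

Here the lift for the rule bread => butter is above 1, so buying bread makes butter somewhat more likely than its baseline frequency.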

13
Q

What is DB-Scan?

A

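DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It groups densely packed data points (points with many close neighbors) into clusters and marks points in low-density regions as noise. It is robust to noise and outliers and, unlike k-means, does not need us to set the number of clusters in advance. Instead it needs a neighborhood radius (eps) and a minimum number of points (minPts). It is a beneficial model when we don’t know how many clusters we need, and the clusters do not need to be of the same size or shape. It grows each cluster iteratively, starting from a dense point and adding its neighbors.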
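
The procedure can be sketched in plain Python. This is a simplified, unoptimised version (every neighbor lookup scans all points); the parameter names `eps` and `min_pts` and the sample points are assumptions for the example, and a point counts as its own neighbor.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None = not visited yet

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        # Point i is a core point: start a new cluster and expand it.
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also a core point: keep expanding
    return labels

# Hypothetical points: two dense groups and one isolated outlier.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (5, 15)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Note that the number of clusters is never passed in; it follows from the density of the data and the two parameters.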

14
Q

What is Association Rule Mining?

A

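Association Rule Mining discovers relationships between items in a dataset, expressed as rules of the form A => B (for example, items that are often bought together). The strength and significance of each rule are evaluated with measures such as Support, Confidence and Lift.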

15
Q

Testing ML-based systems – explain which data is used and what they handle.

A

Testing is the process of trying to find errors within a model. The data is split into Training data, Validation data and Test data, for example 75%, 10% and 15%.

Training data is used to create and train the model.

Validation data is used to fine-tune parameters for a more accurate model and to evaluate the model during development.

Test data is kept separate and is only used on the model after training and tuning, to evaluate its final performance.
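
A simple way to produce the three sets is to shuffle the data once and cut it into consecutive slices. The sketch below assumes the 75/10/15 split mentioned above; the function name `split_dataset` and the fixed seed are choices made for this example.

```python
import random

def split_dataset(rows, train=0.75, validation=0.10, seed=0):
    """Shuffle rows and return (train, validation, test) lists.

    The test set receives whatever remains after train and validation.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]
    return train_set, val_set, test_set

# Hypothetical dataset of 100 rows.
data = list(range(100))
train_set, val_set, test_set = split_dataset(data)
print(len(train_set), len(val_set), len(test_set))
```

Because the three slices never overlap, no test row is seen during training or tuning.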

16
Q

What is mean imputation?

A

Mean imputation is filling in the missing (null) values of a column with the mean of that column’s observed values.
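
A minimal sketch for a single column, assuming missing values are represented as `None` (the ages are made up):

```python
def mean_impute(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to compute a mean from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Hypothetical column with two missing values.
ages = [20, None, 30, 40, None]
filled = mean_impute(ages)
print(filled)
```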

17
Q

What is supervised learning and which models are associated with it?

A

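Supervised learning is an ML method where we use labeled data. It is a predictive learning method: the model learns to predict either a category or a numerical value.

The methods used in supervised learning are “Classification” (predicting a category) and “Regression” (predicting a numerical value).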

18
Q

What is unsupervised learning and which methods are associated with it?

A

Unsupervised learning uses unlabeled data.
It is a descriptive method.

Methods used would be clustering (k-means, DBSCAN) and Association Rule Mining. Note that the kNN algorithm is a supervised classification method, not a clustering method.