Data mining Flashcards

Question 1

Q

Describe the KDD-process: name all five steps and explain what is done in each step.

Answer

A

Data selection
- selecting appropriate data from various sources.
Pre-processing
- cleaning, removing errors, outliers etc
Transformation
- transform data into a usable format for the DM method
Data mining
- apply data mining methods
Interpretation
- use discovered knowledge to make informed decisions.

(iterative process)

Question 2

Q

Mention all steps in CRISP-DM – short explanation of steps.

Answer

A

Business understanding
- understand how ML may benefit the organisation’s goals.
Data understanding
- an initial dataset is established and studied to see if it is eligible for further processing. Otherwise, go back and work iteratively.
Data preparation
- prepare the dataset (cleaning and other preprocessing) before modeling.
Modeling
- use the data to create models using ML-methods.
Evaluation
- if the model is poor, reconsider the whole project and start over. If the result is positive, it is time to integrate the model with the software system.
Deployment
- launch the model.

Question 3

Q

Big data and its 4 V’s. Explain each V briefly.

Answer

A

Velocity
- speed at which data is generated
Volume
- how much data
Veracity
- accuracy of data
Variety
- mix and types of data, format (structured, semi-structured etc)

Question 4

Q

Describe causality.

Answer

A

Causality is the relationship between cause and effect.
A good example is the Christmas holidays (A) => (B) more house fires and more patients with broken femurs. The causal relationship is that B depends on A.

Question 5

Q

What is the apriori algorithm and what does it do?

Answer

A

Used in the realm of Association rule learning. Identifies the frequent itemsets (item combinations whose support meets a minimum threshold), which are then used to generate association rules.
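
The level-by-level search that Apriori performs can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `apriori`, the `min_support` parameter and the shopping baskets are assumptions made up for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these baskets, {milk, butter} appears in only one of four transactions, so it is dropped, and the three-item set is pruned without ever being counted.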

Question 6

Q

What is the formula for Precision?

Answer

A

Precision = TP / (TP + FP)

Question 7

Q

What is the formula for Recall?

Answer

A

Recall = TP / (TP + FN)
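
As a small worked sketch, Recall (and Precision = TP / (TP + FP)) can be computed by counting TP, FP and FN from true and predicted labels. The helper names and the label lists below are hypothetical and exist only for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

tp, fp, fn = confusion_counts(y_true, y_pred)   # 3, 2, 1
print(precision(tp, fp))                        # 3 / (3 + 2) = 0.6
print(recall(tp, fn))                           # 3 / (3 + 1) = 0.75
```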

Question 8

Q

What may be considered of importance when using Recall and Precision – where would you prioritize?

Answer

A

Recall is used when trying to avoid False Negatives.

Precision is used when trying to avoid False Positives.

Example: in cancer diagnosis, prioritise Recall, because a False Negative (a missed case) is the costlier error.
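
One way to see the trade-off is to vary the decision threshold of a classifier that outputs scores. The sketch below uses made-up scores and labels; `precision_recall` is a hypothetical helper, and it assumes at least one positive prediction and one positive label so that neither division is by zero.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical model scores and true labels (1 = positive).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]

strict = precision_recall(labels, scores, threshold=0.7)   # fewer positives predicted
lenient = precision_recall(labels, scores, threshold=0.2)  # more positives predicted
print(strict)
print(lenient)
```

In this toy data the strict threshold gives no False Positives but misses one positive case, while the lenient threshold catches every positive case at the cost of two False Positives.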

Question 9

Q

How are Accuracy and Error calculated?

Answer

A

Accuracy = correct predictions / total number of predictions.

Error = incorrect predictions / total number of predictions.
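
A short worked example of both formulas, using hypothetical label lists (the function name `accuracy` is also just an illustration):

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true and predicted labels: 6 of the 8 predictions are correct.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)   # 6 / 8 = 0.75
err = 1 - acc                    # 2 / 8 = 0.25
print(acc, err)
```

Since every prediction is either correct or incorrect, Error is always 1 - Accuracy.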

Question 10

Q

Explain what Regression is and what it is used for.

Answer

A

Regression is a supervised, predictive learning method for predicting a numerical value.

Can be used for risk analysis, housing market pricing, weather forecasting etc.
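
As an illustration, the simplest form of regression fits a straight line to labeled examples with least squares. The sketch below is one possible approach, not the only regression method; `fit_line` and the housing figures are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: floor area (m^2) and price (thousands).
areas = [50, 70, 90, 110]
prices = [150, 210, 270, 330]

slope, intercept = fit_line(areas, prices)
predicted = slope * 100 + intercept   # predicted price for a 100 m^2 home
print(slope, intercept, predicted)
```

The toy prices are exactly 3 times the area, so the fitted line has slope 3 and intercept 0, and the prediction for 100 m^2 is 300.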

Question 11

Q

Explain Clustering.

Answer

A

Clustering is an unsupervised learning method and is descriptive. Centroid-based algorithms such as k-means work by assigning each data point to its closest centroid. (kNN is a supervised classification method, not a clustering algorithm.) Clustering can be used to determine which data points are most similar to each other.
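
The centroid idea can be sketched with a minimal k-means loop. This is a simplified illustration: it assumes the first k points are used as initial centroids (real implementations usually pick them randomly) and runs a fixed number of iterations; the points are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Return (centroids, clusters) after a fixed number of k-means iterations."""
    # Assumption: the first k points are the initial centroids (deterministic).
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Even though both initial centroids start inside the lower-left group, the repeated assign-and-update steps separate the two groups after a couple of iterations.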

Question 12

Q

What is Support, Confidence and Lift?

Answer

A

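All three are measures used in Association Rule Mining.

Support = how frequently an itemset occurs: transactions containing X / total number of transactions.

Confidence = how often item B occurs in transactions that contain item A: support(A and B) / support(A).

Lift = confidence(A => B) / support(B). Lift < 1: bad rule (A makes B less likely), lift = 1: A and B are independent, lift > 1: good rule (A makes B more likely).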
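
The three measures can be computed directly from a list of transactions. The sketch below uses the standard definitions (lift = confidence(A => B) / support(B)); the function names and the baskets are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often b occurs in the transactions that contain a."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    """Confidence of a => b relative to how common b is overall."""
    return confidence(transactions, a, b) / support(transactions, b)

# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
]

s = support(baskets, {"bread", "butter"})        # 2 / 4 = 0.5
c = confidence(baskets, {"bread"}, {"butter"})   # 0.5 / 0.75
l = lift(baskets, {"bread"}, {"butter"})         # (0.5 / 0.75) / 0.5
print(s, c, l)
```

Here the lift of bread => butter is above 1, so buying bread makes butter more likely than it is across all baskets.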

Question 13

Q

What is DB-Scan?

Answer

A

DB-scan (DBSCAN) stands for Density-Based Spatial Clustering of Applications with Noise. It groups data points into clusters defined by densely packed nearest neighbors. It is robust to noise and outliers and, unlike k-means, doesn’t need us to set the number of clusters in advance. This makes it a beneficial model when we don’t know how many clusters we need, and clusters do not have to be of the same size or shape. It expands clusters iteratively from dense regions.
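
A compact sketch of the density-based idea follows. It assumes the two usual parameters, a neighbourhood radius (`eps`) and a minimum number of neighbours (`min_pts`); the function name, the brute-force neighbour search and the points are illustrative choices, not a production implementation.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may still become a border point later
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also core: keep expanding
    return labels

# Hypothetical points: two dense groups and one isolated outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The number of clusters is never passed in; it falls out of the density parameters, and the isolated point is labeled as noise instead of being forced into a cluster.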

Question 14

Q

What is Association Rule Mining?

Answer

A

Used to discover relationships between items in a dataset (rules of the form A => B) and to evaluate their strength and significance.

Question 15

Q

Testing ML-based systems – explain which data is used and what they handle.

Answer

A

Testing is the process of trying to find errors within a model. The data is split into Training data (75%), Validation data (10%) and Test data (15%).

Training data is used to create the model and to train the model.

Validation data is used to fine-tune parameters for a more accurate model and to evaluate the model during tuning.

Test data is kept separate and is later used on the model to evaluate performance after training and tuning.
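
A minimal sketch of such a split, using the 75/10/15 proportions from this card. The function name `split_dataset`, the fixed seed and the stand-in dataset of 100 rows are assumptions for illustration.

```python
import random

def split_dataset(rows, train=0.75, val=0.10, seed=0):
    """Shuffle rows and split them into train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)     # fixed seed keeps the split reproducible
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * val)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]     # whatever remains (15% here)
    return train_set, val_set, test_set

# Hypothetical dataset of 100 rows, represented by their indices.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))   # 75 10 15
```

Shuffling before slicing matters: without it, any ordering in the original data (for example by date or class) would leak into the split.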

Question 16

Q

What is mean imputation?

Answer

A

Mean imputation is filling in null-values (missing values) in a dataset with the mean of the observed values in the same column.
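
A minimal sketch for a single column, assuming missing values are represented as `None` (the function name and the ages are made up):

```python
def impute_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical column with two missing values.
ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
print(filled)   # [20, 30.0, 30, 40, 30.0]
```

The mean is computed from the observed values only (20, 30 and 40), so the missing entries do not distort it.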

Question 17

Q

What is supervised learning and which models are associated with it?

Answer

A

Supervised learning is an ML-method where we use labeled data. It is a predictive learning method: the model predicts either a class label or a numerical value.

Methods used in Supervised learning are “Classification” and “Regression”.
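
As one illustration of classification on labeled data, here is a minimal k-nearest-neighbours (kNN) classifier sketch. The function name `knn_predict`, the choice of k = 3 and the labeled points are assumptions for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs. Returns the majority label
    among the k training examples closest to query."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: every example comes with its label.
labeled = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"), ((2.0, 1.0), "small"),
    ((8.0, 8.0), "large"), ((9.0, 8.5), "large"), ((8.5, 9.0), "large"),
]

print(knn_predict(labeled, (2.0, 2.0)))   # small
print(knn_predict(labeled, (8.0, 9.0)))   # large
```

The labels in the training data are what make this supervised: the prediction for a new point is taken from the known labels of its neighbours.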

Question 18

Q

What is unsupervised learning and which methods are associated with it?

Answer

A

Unsupervised learning uses unlabeled data.
It is a descriptive method.

Methods used would be clustering (k-means, DB-scan) and Association Rule Mining.

(18 cards)