Data Mining Flashcards
Data Mining
Data mining is the process of discovering patterns, correlations, trends and useful information from large sets of data using statistical, mathematical, and computational techniques
Can be used for efficiency optimization, quality control, predictive maintenance, supply chain management, customer insights, cost reduction, …
ML
Supervised learning: Here is the data set where the right answers are given for each example. Please produce more right answers
Classification
Unsupervised learning: “Here is the unlabelled data. Please find peculiarities, similarities or structures (e.g. clusters) in the data yourself.”
Association Rule Learning
Clustering
Reinforcement Learning: Learn to do something by maximising your reward
Classification
Aims to classify objects, events, or observations into predefined labels or classes.
Involves training a model on a dataset with known labels, and then using that model to predict the labels of unseen data.
KNN - K nearest neighbours
Aims to predict the correct class for the test data by calculating the distance between the test data and all training points
How to find optimal K value?
No predefined statistical methods to find the most favourable value of K
Choose a range of K values and then train the KNN with these K values
Evaluate the performance of KNN with different K values.
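As a rough sketch of this procedure (assuming scikit-learn is available, and using its built-in iris dataset and a 70/30 split purely as stand-ins), the loop below trains a KNN classifier for a range of K values and reports the accuracy of each on held-out data:

```python
# Sketch: evaluating KNN for several K values (assumes scikit-learn is installed;
# the iris dataset and the 70/30 split are arbitrary choices for illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train KNN with a range of K values and evaluate each on the held-out data.
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = knn.score(X_test, y_test)   # accuracy on data not used for training
    print(f"K={k:2d}  accuracy={acc:.3f}")
```

The K with the best held-out (or cross-validated) accuracy is then chosen.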
Naïve Bayes
A simple technique for constructing classifiers. These classifiers assign class labels to problem instances based on feature values
calculates the probability of a hypothesis (class label) given the observed data and prior knowledge.
Naïve Bayes treats all word orders the same.
Zero-frequency problem: if a feature value never appears with a class in the training data, its estimated probability is zero and it wipes out the whole product of probabilities; this is usually handled with Laplace (add-one) smoothing.
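A minimal sketch of these ideas, assuming scikit-learn and a tiny invented set of labelled sentences: MultinomialNB works on word counts (so word order is ignored), and its alpha parameter applies Laplace (add-one) smoothing to avoid the zero-frequency problem.

```python
# Sketch: Naive Bayes text classification with Laplace smoothing
# (assumes scikit-learn; the tiny corpus below is invented for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap watches", "project review meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()            # bag of words: word order is discarded
X = vec.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)     # alpha=1.0 -> Laplace (add-one) smoothing
clf.fit(X, labels)

print(clf.predict(vec.transform(["cheap meeting"])))
```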
Main Methods - Measuring Accuracy
There are many other algorithms for classification
Measuring the accuracy of a model – use the test set data (not data that was used to train the model). Accuracy is simply the number of test records correctly classified by the model.
Performance of Classification
A Confusion Matrix is an N x N matrix used for evaluating the performance of a classification model
N is the total number of classes
Accuracy measures the ability of the classifier to make correct predictions. It is the ratio of the number of correct predictions to the total number of predictions
Recall measures how many of the actual positive samples were correctly identified by the classifier.
Precision measures how many of the samples predicted as positive are actually positive.
TP: true positive, FP: false positive, TN: true negative, FN: false negative
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
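To make the formulas concrete, here is a small sketch in plain Python that computes the three metrics directly from TP, FP, TN and FN (the counts are invented for illustration):

```python
# Sketch: accuracy, precision and recall from confusion-matrix counts
# (the counts below are made up for illustration).
TP, FP, TN, FN = 40, 10, 45, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all predictions
precision = TP / (TP + FP)                   # predicted positives that are truly positive
recall    = TP / (TP + FN)                   # actual positives that were found

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```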
Association Rule Learning
It is a rule-based machine learning method for discovering interesting relations between variables in large datasets.
Aims to find patterns from items that are dependent on one another and map the connections between them
Useful in predicting behaviours and relationships between variables in a dataset
Useful for classification and discovering patterns within data.
Lift
Lift > 1 means that if A is purchased, the probability that B is also purchased increases (by a factor equal to the lift).
Lift = 1 means the occurrences of A and B are independent of each other.
Lift < 1 means the purchase of A makes the purchase of B less likely, e.g. A is a substitute for B.
In data mining and market basket analysis, lift is a measure used to quantify the importance of a rule for predicting the occurrence of an item (or set of items) in a transaction, relative to the occurrence of the item(s) without the rule. It helps to determine the strength of association between items in a dataset.
The lift of an association rule is calculated as the ratio of the observed support of the rule to the expected support of the rule if the items were independent. Mathematically:
Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
where support is measured as the fraction of transactions containing the item set.
Interpretation of lift:
- If the lift is greater than 1, the presence of X in a transaction increases the likelihood that Y is also present in the same transaction, suggesting a positive association between X and Y.
- If the lift is less than 1, the presence of X in a transaction decreases the likelihood that Y is also present in the same transaction, indicating a negative association between X and Y.
- If the lift is equal to 1, the occurrence of X and the occurrence of Y are independent, and there is no association between them.
Lift is a useful metric in market basket analysis for identifying meaningful associations between items and can help in making decisions related to product placement, cross-selling, and marketing strategies.
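The sketch below (plain Python, over a small list of transactions taken from the Apriori example later in these notes) computes support values and the lift of a rule X => Y exactly as in the formula above:

```python
# Sketch: computing the lift of the rule {bread} => {milk}
# over the small example transaction list.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"milk"}
lift = support(X | Y) / (support(X) * support(Y))
print(f"lift = {lift:.2f}")   # > 1 positive, = 1 independent, < 1 negative association
```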
How do we find an optimal rule among many rules?
Association Rule Learning is commonly implemented with three algorithms:
Apriori algorithm
Eclat algorithm
FP-growth algorithm
The Apriori algorithm uses frequent itemsets to generate association rules.
Involves finding sets of items or itemsets. An itemset can be a collection of one or more items that appear together in a transaction or dataset.
An itemset can be either a single item, also known as a 1-itemset, or a set of k items, known as a k-itemset
Let’s consider a simplified example of market basket analysis using the Apriori algorithm. Suppose we have a transaction dataset representing purchases at a grocery store:
Transaction 1: {bread, milk, eggs}
Transaction 2: {bread, butter}
Transaction 3: {milk, eggs}
Transaction 4: {bread, milk, butter}
Transaction 5: {bread, eggs}
We want to use the Apriori algorithm to identify frequent item sets and association rules from this dataset.
Step 1: Identify frequent single items:
- Support threshold = 2 (minimum support count)
Frequent single items:
- {bread} (support count 4), {milk} (3), {eggs} (3), {butter} (2)
Step 2: Generate candidate item sets of length 2 by joining the frequent single items:
Candidate item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}, {milk, butter}, {eggs, butter}
Step 3: Prune candidate item sets that contain an infrequent subset:
- All single items are frequent here, so no candidate is pruned at this stage.
Step 4: Count support for candidate item sets:
- Count support for {bread, milk}: 2
- Count support for {bread, eggs}: 2
- Count support for {bread, butter}: 2
- Count support for {milk, eggs}: 2
- Count support for {milk, butter}: 1
- Count support for {eggs, butter}: 0
Step 5: Filter candidate item sets:
- Keep {bread, milk}, {bread, eggs}, {bread, butter} and {milk, eggs} because their support counts meet the minimum support threshold; discard {milk, butter} and {eggs, butter} because theirs do not.
Frequent item sets:
- {bread, milk}, {bread, eggs}, {bread, butter}, {milk, eggs}
Step 6: Repeat the process to generate higher-order frequent item sets if necessary. The only candidate 3-itemset whose 2-item subsets are all frequent is {bread, milk, eggs}, and its support count is 1, so there are no frequent item sets of length 3 and the process stops.
Step 7: Generate association rules from the frequent item sets, for example:
- {milk} => {bread} (confidence 2/3)
- {eggs} => {bread} (confidence 2/3)
- {butter} => {bread} (confidence 2/2)
These association rules indicate that customers who purchase milk, eggs, or butter are likely to also purchase bread, and similarly for other combinations.
This example demonstrates how the Apriori algorithm can be used to extract meaningful insights from transaction data by identifying frequent item sets and association rules.
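As a companion to the worked example, here is a small self-contained sketch of the Apriori frequent-itemset search in plain Python (not a production implementation; a library such as mlxtend would normally be used). With the same five transactions and a minimum support count of 2 it reproduces the frequent item sets found above.

```python
# Sketch of the Apriori frequent-itemset search for the example transactions.
from itertools import combinations

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "eggs"},
]
MIN_SUPPORT = 2                    # minimum support count

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
# Level 1: frequent single items.
frequent = [frozenset({i}) for i in items if count(frozenset({i})) >= MIN_SUPPORT]
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", [set(s) for s in frequent])
    # Join step: candidate k-itemsets from unions of frequent (k-1)-itemsets.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep only candidates whose (k-1)-subsets are all frequent.
    candidates = [c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    # Count step: keep candidates that meet the minimum support count.
    frequent = [c for c in candidates if count(c) >= MIN_SUPPORT]
    k += 1
```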
Clustering
Aim:
To group data into classes or “clusters”
Helps in identifying patterns or structures within unlabelled data.
Simplifies complex datasets and makes them easier to analyse and interpret by grouping similar items.
Helps in decision-making by segmenting markets, customers, or products based on similarities.
K-means Clustering
A K-means clustering algorithm tries to group similar items in the form of clusters
The number of groups is represented by K
1. Select K (here we set K as 2).
2. Randomly generate 2 seeds as centroids (since K = 2).
3. Calculate the distance between each point and the 2 centroids, and assign each point to its closest centroid.
4. Calculate the mean of each cluster and use these means as the new centroids.
5. Repeat steps 3 and 4 until we reach a clustering solution where the cluster assignments no longer change.
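A minimal from-scratch sketch of these steps, assuming NumPy and using randomly generated 2-D points purely for illustration (a robust implementation would also handle empty clusters):

```python
# Sketch: the K-means loop following the steps above (assumes NumPy;
# the random 2-D points are generated only for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))      # 100 points in 2-D
K = 2                              # step 1: choose K

centroids = X[rng.choice(len(X), K, replace=False)]   # step 2: random seeds
while True:
    # Step 3: assign each point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid as the mean of its cluster.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 5: stop once the centroids (and hence the assignments) no longer change.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)
```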
Choosing K
The Elbow Method: calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of K, and choose the K at which the decrease in WSS first starts to level off (the “elbow” of the curve)
The Silhouette Method: measure how similar a point is to its own cluster compared to other clusters
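Both heuristics are easy to try with scikit-learn; the sketch below (synthetic blob data and a K range of 2 to 8, both arbitrary choices) prints the WSS, which scikit-learn exposes as inertia_, and the silhouette score for each K:

```python
# Sketch: comparing K values with the elbow (WSS) and silhouette methods.
# Assumes scikit-learn; the synthetic blobs and the K range are arbitrary.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss = km.inertia_                        # within-cluster sum of squared errors
    sil = silhouette_score(X, km.labels_)    # mean silhouette over all points
    print(f"K={k}  WSS={wss:8.1f}  silhouette={sil:.3f}")
```

Look for the K where WSS stops dropping sharply, or the K with the highest silhouette score.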
Graph Clustering
A type of unsupervised learning
About partitioning nodes in a graph into clusters based on their common characteristics.
Nodes assigned into the same cluster have more properties in common with each other than nodes in different clusters.
Why:
Essential in identifying inherent structures or patterns within complex datasets, especially for datasets that are best represented as a network or graph, such as social networks, biological networks, or transportation systems.
Good for simplifying complex networks: Large networks can be overwhelming and difficult to analyze. Graph clustering simplifies these networks by grouping similar nodes together, making the network easier to understand and analyze.
In social network analysis, graph clustering helps identify communities or groups of individuals with common interests or connections. This can be crucial for marketing, sociology studies, and understanding social dynamics
Edge Betweenness
The edge betweenness of an edge is the number of shortest paths between pairs of nodes that run through that edge
1. Calculate the betweenness of all existing edges in the network.
2. Remove the edge(s) with the highest betweenness, because the edges with the highest betweenness connect communities.
3. Recalculate the betweenness of all remaining edges.
4. Repeat steps 2 and 3 until a stopping condition is reached; this condition could be a desired number of clusters.
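A sketch of these steps using NetworkX (assuming it is installed, and using its built-in karate-club graph and a target of 2 clusters purely as illustrative choices):

```python
# Sketch: Girvan-Newman-style clustering by repeatedly removing the
# highest-betweenness edge. Assumes NetworkX; the karate-club graph and
# the target of 2 clusters are arbitrary choices for illustration.
import networkx as nx

G = nx.karate_club_graph().copy()
TARGET_CLUSTERS = 2

while nx.number_connected_components(G) < TARGET_CLUSTERS:
    # Steps 1 and 3: (re)calculate the betweenness of all remaining edges.
    betweenness = nx.edge_betweenness_centrality(G)
    # Step 2: remove the edge with the highest betweenness.
    edge = max(betweenness, key=betweenness.get)
    G.remove_edge(*edge)

for i, cluster in enumerate(nx.connected_components(G), start=1):
    print(f"cluster {i}: {sorted(cluster)}")
```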
Edge betweenness is a measure used in network analysis to quantify the importance or centrality of edges within a network. It calculates the number of shortest paths between pairs of nodes in the network that pass through a particular edge. Here’s how it works:
- Shortest Paths: First, calculate the shortest paths between all pairs of nodes in the network using algorithms such as Dijkstra’s algorithm or the Floyd-Warshall algorithm.
- Counting Paths: For each edge e in the network, count the number of shortest paths that traverse e. This involves examining all pairs of nodes and determining whether the shortest paths between them include edge e.
- Edge Betweenness Calculation: The edge betweenness of edge e is the sum, over all pairs of nodes, of the fraction of shortest paths between that pair that pass through e. It can be written as:
  E(e) = Σ_{s ≠ t ∈ V} σ_st(e) / σ_st
  Where:
  - V is the set of nodes in the network.
  - σ_st is the total number of shortest paths between node s and node t.
  - σ_st(e) is the number of shortest paths between node s and node t that pass through edge e.
- Normalization: Optionally, the edge betweenness values can be normalized by dividing each value by the total number of pairs of nodes in the network, |V|(|V|-1)/2, to make the values comparable across networks of different sizes.
- Interpretation: Higher edge betweenness values indicate that the corresponding edge is more central to the network’s connectivity. These edges often serve as bridges between different parts of the network and play a crucial role in information flow, communication, or resource transfer within the network.
Edge betweenness is commonly used in network analysis tasks such as identifying critical edges, detecting community structures, or finding key pathways in transportation or communication networks.