data_mining_flashcards

2
Q

What is data mining?

A

The process of discovering patterns and extracting useful knowledge from large datasets.

3
Q

What are the two main types of learning?

A

Supervised: Labeled data, prediction tasks (Regression, Classification). Unsupervised: Unlabeled data, pattern discovery (Clustering, Association).

4
Q

Give examples of data mining applications.

A

Healthcare, fraud detection, marketing, finance, recommendation systems.

5
Q

What are common techniques to handle missing data?

A

Deletion: drop the rows or columns that contain missing values. Imputation: fill missing values with the mean, median, or mode of the observed values.
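
A minimal sketch of both approaches in plain Python, assuming missing values are represented as None. The helper names impute and drop_missing are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

def drop_missing(rows):
    """Keep only the rows that contain no None entries."""
    return [row for row in rows if None not in row]

ages = [20, None, 40, 30, None]
mean_filled = impute(ages)             # None -> mean of 20, 40, 30
median_filled = impute(ages, median)   # None -> median of 20, 40, 30
complete_rows = drop_missing([(1, 2), (3, None), (5, 6)])
```

Dropping is simplest but discards data; imputation keeps every row at the cost of shrinking the column's variance.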

6
Q

What is normalization vs. standardization?

A

Normalization (min-max scaling): rescales values to [0, 1]. Standardization (z-score): rescales values to mean = 0 and standard deviation = 1.
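
The two rescalings can be sketched as follows. This assumes min-max scaling for normalization and the population standard deviation for standardization, and it assumes the input is not constant (otherwise the denominators are zero).

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2.0, 4.0, 6.0, 8.0]
normalized = min_max_normalize(data)
standardized = standardize(data)
```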

7
Q

What is one-hot encoding used for?

A

To convert categorical features into binary columns.
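
A small illustration in plain Python. The function name one_hot and the alphabetical column order are assumptions made for this sketch.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

categories, encoded = one_hot(["red", "green", "red", "blue"])
# categories lists the columns; each encoded row has exactly one 1
```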

8
Q

What is the purpose of EDA?

A

To explore, visualize, and summarize data to find patterns and outliers.

9
Q

What is a boxplot?

A

A visualization tool to show data spread, quartiles, and detect outliers.
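
The numbers behind a boxplot can be sketched without plotting. This assumes the common 1.5 x IQR rule for flagging outliers, which is a convention rather than something stated on the card, and uses the "inclusive" quantile method of Python's statistics module.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Compute quartiles, the interquartile range, and 1.5 x IQR outliers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 50])
```

Here the value 50 lies above the upper fence and would be drawn as an individual point beyond the whisker.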

10
Q

What does a correlation matrix tell us?

A

The direction (positive, negative, or none) and strength of the pairwise linear relationships between numerical variables.
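
A minimal sketch of a correlation matrix, assuming the Pearson coefficient and columns of equal length stored in a dictionary. The column names are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [8, 6, 4, 2]}
matrix = correlation_matrix(columns)
# x and y move together (close to +1); x and z move oppositely (close to -1)
```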

11
Q

What is the formula for linear regression?

A

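Y = b0 + b1X + ε, where b0 is the intercept, b1 is the slope, and ε is the error term (simple linear regression with one predictor).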
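
One way to estimate b0 and b1 is ordinary least squares. The sketch below assumes a single predictor and uses the closed-form solution; fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept b0 and slope b1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b0 = my - b1 * mx
    return b0, b1

# Points that lie exactly on y = 1 + 2x, so the error term is zero here
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```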

12
Q

What is logistic regression used for?

A

To predict binary outcomes: the sigmoid function maps a linear combination of the features to a probability between 0 and 1.
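
A minimal sketch of the prediction step only (not the training), assuming one feature and already-known coefficients. The values b0 = -2.0 and b1 = 1.0 and the 0.5 threshold are hypothetical.

```python
from math import exp

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(x, b0, b1):
    """Estimated probability of the positive class for one feature value."""
    return sigmoid(b0 + b1 * x)

def predict(x, b0, b1, threshold=0.5):
    """Turn the probability into a 0/1 label using a cutoff."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

label_high = predict(3.0, b0=-2.0, b1=1.0)  # linear score is +1
label_low = predict(1.0, b0=-2.0, b1=1.0)   # linear score is -1
```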

13
Q

What are common evaluation metrics for classification?

A

Accuracy, Precision, Recall, F1 Score, ROC-AUC.
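
The first four metrics follow directly from the confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1 and leaves out ROC-AUC, which needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```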

14
Q

How does KNN work?

A

It classifies a point based on the majority class of its k-nearest neighbors.
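
A minimal sketch of the voting rule, assuming Euclidean distance and a training set stored as (point, label) pairs. The toy data and the name knn_predict are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.5), "b"), ((6.5, 7.0), "b"),
]
near_a = knn_predict(train, (1.2, 1.5))
near_b = knn_predict(train, (6.4, 6.4))
```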

15
Q

Name two distance metrics used in KNN.

A

Euclidean, Manhattan.
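
Both metrics are short enough to write out. This sketch assumes points given as equal-length tuples of numbers.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

d_euclid = euclidean((0, 0), (3, 4))     # 3-4-5 right triangle
d_manhattan = manhattan((0, 0), (3, 4))  # 3 steps across plus 4 steps up
```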

16
Q

What is the impact of increasing k in KNN?

A

Larger k smooths the decision boundary, which reduces overfitting (lower variance), but a k that is too large oversmooths the boundary and underfits.

17
Q

What is entropy in decision trees?

A

A measure of impurity in a dataset.
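
Entropy here is the Shannon entropy of the class labels, H = -sum(p_i * log2(p_i)). A minimal sketch, assuming labels are given as a plain list:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

mixed = entropy(["yes", "yes", "no", "no"])   # evenly split: maximally impure
pure = entropy(["yes", "yes", "yes", "yes"])  # one class only: no impurity
```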

18
Q

What is pruning in decision trees?

A

Reducing tree size to prevent overfitting.

19
Q

What criteria are used for splitting in decision trees?

A

Gini impurity, Entropy, Information Gain.
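
These criteria can be sketched as below, assuming a split is represented as a parent label list plus a list of child label lists. Information gain is taken as the parent's entropy minus the size-weighted entropy of the children.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    weighted = sum(len(ch) / len(parent) * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]
gain_perfect = information_gain(parent, perfect_split)
gain_useless = information_gain(parent, useless_split)
```

A tree builder would evaluate candidate splits this way and keep the one with the highest gain (or the lowest weighted Gini impurity).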

20
Q

What is bias in machine learning?

A

Error from overly simplistic models (underfitting).

21
Q

What is variance in machine learning?

A

Error from sensitivity to training data (overfitting).

22
Q

What is PCA used for?

A

To reduce dimensionality while retaining as much of the data's variance as possible.

23
Q

What are eigenvectors and eigenvalues?

A

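In PCA, the eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the amount of variance along each direction.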

24
Q

How does a random forest work?

A

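It trains many decision trees on bootstrap samples of the data (bagging), typically considering a random subset of features at each split, and combines their outputs by majority vote (classification) or averaging (regression).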

25
Q

Why is a random forest better than a single decision tree?

A

Combining many decorrelated trees lowers variance, which reduces overfitting and usually improves accuracy.

26
Q

What is the goal of SVM?

A

To find the optimal hyperplane that maximizes the margin between classes.

27
Q

What are the assumptions of Naive Bayes?

A

Features are conditionally independent given the class.

28
Q

What are the steps in KMeans clustering?

A

1. Initialize k cluster centers. 2. Assign points to the nearest cluster. 3. Update cluster centers. 4. Repeat steps 2 and 3 until the assignments stop changing.

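
The steps above can be sketched in plain Python. This version assumes Euclidean distance and, for determinism, initializes the centers with the first k points; real implementations usually pick initial centers randomly or with a smarter seeding scheme.

```python
from math import dist

def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (centers, clusters)."""
    centers = list(points[:k])  # step 1: naive initialization (an assumption)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Step 3: move each center to the mean of its cluster
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # Stop once the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = kmeans(points, k=2)
```
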
29
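Q

How does GMM differ from KMeans?

A

GMM models each cluster as a Gaussian distribution and gives every point a probability of belonging to each cluster (soft assignment), while KMeans assigns every point to its single nearest centroid (hard assignment).
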
30
Q

What is the Apriori algorithm used for?

A

To identify frequent itemsets in market basket analysis.

31
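Q

What are support and confidence?

A

Support: the fraction of transactions that contain an itemset. Confidence: for a rule A → B, the fraction of transactions containing A that also contain B.
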
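
Using the standard association-rule definitions (support of an itemset is the fraction of transactions containing it; confidence of A → B is support(A and B) divided by support(A)), both measures can be sketched as follows. The shopping baskets are made-up illustration data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support_bread = support(transactions, {"bread"})
support_bread_milk = support(transactions, {"bread", "milk"})
conf_bread_to_milk = confidence(transactions, {"bread"}, {"milk"})
```
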
32
Q

What are content-based and collaborative filtering?

A

Content-based: Uses item features. Collaborative: Uses user interactions.

33
Q

What is survivorship bias?

A

Bias caused by focusing only on the cases that passed a selection process (the successes) while overlooking those that did not.

34
Q

How is NLP used in data mining?

A

Text analysis for tasks like sentiment analysis and machine translation.