data_mining_flashcards

2
Q

What is data mining?

A

The process of discovering patterns and extracting useful knowledge from large datasets.

3
Q

What are the two main types of learning?

A

Supervised: Labeled data, prediction tasks (Regression, Classification). Unsupervised: Unlabeled data, pattern discovery (Clustering, Association).

4
Q

Give examples of data mining applications.

A

Healthcare, fraud detection, marketing, finance, recommendation systems.

5
Q

What are common techniques to handle missing data?

A

Drop rows/columns with missing data. Imputation (mean, median, mode).
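
Both strategies can be sketched in a few lines. This is a minimal illustration, assuming missing values are represented as None; the helper names drop_missing and impute_mean are hypothetical, not from any library.

```python
from statistics import mean

def drop_missing(rows):
    """Keep only rows that contain no missing (None) values."""
    return [row for row in rows if None not in row]

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

rows = [(1, 2), (None, 3), (4, 5)]
kept = drop_missing(rows)          # the row with None is dropped

ages = [20, None, 40]
filled = impute_mean(ages)         # None is replaced by the mean of 20 and 40
```

Median or mode imputation would follow the same pattern with statistics.median or statistics.mode in place of mean.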

6
Q

What is normalization vs. standardization?

A

Normalization (min-max scaling): Rescales values to the range [0, 1]. Standardization (z-score): Rescales values to mean = 0 and standard deviation = 1.
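
The difference is easiest to see on a small list of numbers. The sketch below assumes the values are not all identical (otherwise both formulas divide by zero) and uses the population standard deviation; the function names are illustrative only.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

data = [2.0, 4.0, 6.0]
normalized = min_max_normalize(data)    # values now lie in [0, 1]
standardized = standardize(data)        # values now have mean 0 and std 1
```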

7
Q

What is one-hot encoding used for?

A

To convert categorical features into binary columns.
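
A minimal sketch of the idea, assuming the categories are plain strings and the column order is simply the sorted category names; one_hot is a hypothetical helper, not a library function.

```python
def one_hot(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot(["red", "green", "red", "blue"])
# Each row has exactly one 1, in the column of its category.
```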

8
Q

What is the purpose of EDA?

A

To explore, visualize, and summarize data to find patterns and outliers.

9
Q

What is a boxplot?

A

A plot that shows a variable's spread and quartiles and flags potential outliers.
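
The outlier flagging in a boxplot is commonly based on the 1.5 × IQR rule. The sketch below assumes that convention and relies on the default quartile method of statistics.quantiles; plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```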

10
Q

What does a correlation matrix tell us?

A

The strength and direction (positive, negative, or none) of the pairwise linear relationships between numerical variables.
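
As an illustration, a correlation matrix can be built from pairwise Pearson coefficients. This sketch assumes the Pearson coefficient is the measure in use and that no column is constant; the function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

cols = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [8, 6, 4, 2]}
matrix = correlation_matrix(cols)
# x and y move together (close to +1); x and z move oppositely (close to -1).
```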

11
Q

What is the formula for linear regression?

A

Y = b0 + b1X + ε, where b0 is the intercept, b1 is the slope, and ε is the error term.
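
The coefficients b0 and b1 are usually estimated by ordinary least squares. The sketch below assumes a single predictor and uses the closed-form solution; fit_line is an illustrative name.

```python
def fit_line(xs, ys):
    """Estimate intercept b0 and slope b1 by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    b0 = my - b1 * mx
    return b0, b1

# Points lying exactly on y = 1 + 2x, so the fit recovers b0 = 1 and b1 = 2.
b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```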

12
Q

What is logistic regression used for?

A

To predict the probability of a binary outcome by passing a linear combination of the features through the sigmoid function.
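
A minimal sketch of the prediction step only, assuming one feature, already-fitted coefficients, and a 0.5 decision threshold; the coefficient values are made up for illustration.

```python
from math import exp

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict(x, b0, b1, threshold=0.5):
    """Return the predicted probability and the resulting class label."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)

# Hypothetical coefficients: z = -1 + 1 * 2 = 1, so p is about 0.73.
p, label = predict(2.0, b0=-1.0, b1=1.0)
```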

13
Q

What are common evaluation metrics for classification?

A

Accuracy, Precision, Recall, F1 Score, ROC-AUC.
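
Most of these metrics follow directly from the confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1 and omits ROC-AUC, which needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```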

14
Q

How does KNN work?

A

It classifies a point based on the majority class of its k-nearest neighbors.
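
The majority vote can be sketched as follows, assuming Euclidean distance and a small in-memory training set of (point, label) pairs; knn_predict is a hypothetical helper.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _point, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"),
]
near_a = knn_predict(train, (2, 2), k=3)   # all three nearest points are "a"
near_b = knn_predict(train, (8, 9), k=3)   # two of the three nearest are "b"
```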

15
Q

Name two distance metrics used in KNN.

A

Euclidean, Manhattan.
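
For reference, the two metrics differ only in how coordinate differences are combined. A short sketch, assuming points are equal-length tuples of numbers:

```python
def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

d_euclid = euclidean((0, 0), (3, 4))   # 3-4-5 triangle
d_manhat = manhattan((0, 0), (3, 4))   # 3 + 4
```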

16
Q

What is the impact of increasing k in KNN?

A

Larger k reduces overfitting but can oversmooth boundaries.

17
Q

What is entropy in decision trees?

A

A measure of impurity in a dataset.
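
Concretely, entropy is computed from the class proportions in a node. The sketch below assumes the Shannon definition with base-2 logarithms, so a 50/50 split of two classes gives 1 bit and a pure node gives 0.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

mixed = entropy(["yes", "yes", "no", "no"])    # maximally impure for 2 classes
pure = entropy(["yes", "yes", "yes", "yes"])   # no impurity
```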

18
Q

What is pruning in decision trees?

A

Reducing tree size to prevent overfitting.

19
Q

What criteria are used for splitting in decision trees?

A

Gini impurity, Entropy, Information Gain.
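
These criteria can be compared on a toy split. The sketch assumes a binary split into a left and a right child and defines information gain as the parent's entropy minus the size-weighted entropy of the children.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
parent_gini = gini(parent)                                   # 50/50 node
gain = information_gain(parent, ["a", "a"], ["b", "b"])      # perfect split
```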

20
Q

What is bias in machine learning?

A

Error from overly simplistic models (underfitting).

21
Q

What is variance in machine learning?

A

Error from sensitivity to training data (overfitting).

22
Q

What is PCA used for?

A

To reduce dimensionality by projecting the data onto a few new axes (principal components) that retain as much of the variance as possible.

23
Q

What are eigenvectors and eigenvalues?

A

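In PCA, the eigenvectors of the covariance matrix give the directions of the principal components, and the eigenvalues give the amount of variance explained along each direction.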

24
Q

How does a random forest work?

A

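It trains many decision trees on bootstrap samples of the data (bagging), typically considering a random subset of features at each split, and combines their outputs by majority vote (classification) or averaging (regression).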

25
Q

Why is random forest better than a single decision tree?

A

Averaging many trees reduces variance (overfitting), which usually improves accuracy on unseen data.

26
Q

What is the goal of SVM?

A

To find the optimal hyperplane that maximizes the margin between classes.

27
Q

What are the assumptions of Naive Bayes?

A

Features are conditionally independent given the class.

28
Q

What are the steps in KMeans clustering?

A

1. Initialize k cluster centers. 2. Assign points to the nearest cluster. 3. Update cluster centers. 4. Repeat steps 2 and 3 until the centers stop changing.
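
The steps above can be sketched as a short loop. This is a minimal illustration, assuming Euclidean distance and caller-supplied initial centers (real implementations usually pick them randomly or with a seeding heuristic); it repeats the assign and update steps until the centers stop changing.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Run assign/update iterations from the given initial centers."""
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = kmeans(points, centers=[(1.0, 1.0), (9.0, 8.0)])
```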

29
Q

How does GMM differ from KMeans?

A

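GMM models each cluster as a Gaussian distribution and assigns points probabilistically (soft assignment), while KMeans represents each cluster by a centroid and assigns each point to exactly one cluster (hard assignment).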

30
Q

What is the Apriori algorithm used for?

A

To identify frequent itemsets in market basket analysis.

31
Q

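What are support and confidence?

A

Support: The fraction of transactions that contain an itemset. Confidence: For a rule X → Y, the fraction of transactions containing X that also contain Y.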
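
Support and confidence can both be computed from simple transaction counts. A minimal sketch, assuming transactions are Python sets of item names and the rule has the form antecedent → consequent; the basket data is made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets
```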

32
Q

What are content-based and collaborative filtering?

A

Content-based: Uses item features. Collaborative: Uses user interactions.

33
Q

What is survivorship bias?

A

Bias caused by focusing only on successful cases.

34
Q

How is NLP used in data mining?

A

Text analysis for tasks like sentiment analysis and machine translation.