data_mining_flashcards Flashcards by Adrian Iglesias Cannegieter

Front

Back

What is data mining?

The process of discovering patterns and extracting useful knowledge from large datasets.

What are the two main types of learning?

Supervised: Labeled data, prediction tasks (Regression, Classification). Unsupervised: Unlabeled data, pattern discovery (Clustering, Association).

Give examples of data mining applications

Healthcare, fraud detection, marketing, finance, recommendation systems.

What are common techniques to handle missing data?

Drop rows/columns with missing data. Imputation (mean, median, mode).
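
Both techniques can be sketched in a few lines of plain Python. This is a minimal illustration that assumes missing values are represented as `None`; the function names `impute` and `drop_missing_rows` are made up for this example and do not come from any library.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def drop_missing_rows(rows):
    """Keep only the rows that contain no None entries."""
    return [row for row in rows if all(v is not None for v in row)]

ages = [1, None, 2, 9]
print(impute(ages))            # mean imputation
print(impute(ages, "median"))  # median imputation
print(drop_missing_rows([(1, 2), (None, 3), (4, 5)]))
```

Dropping is simplest but discards data; imputation keeps every row at the cost of adding values that were never observed.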

What is normalization vs. standardization?

Normalization (min-max scaling): Rescales values to [0, 1]. Standardization (z-score): Rescales values so that mean = 0 and std = 1.
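
The two rescalings can be compared side by side. The sketch below uses only the standard library; the population standard deviation (`pstdev`) is an assumption made here, and some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; range is zero")
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values so they have mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all values are equal; standard deviation is zero")
    return [(v - mu) / sigma for v in values]

heights = [2, 4, 6]
print(min_max_normalize(heights))
print(standardize(heights))
```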

What is one-hot encoding used for?

To convert categorical features into binary columns.
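
A minimal sketch of the idea, assuming a single categorical column given as a list of strings. The helper name `one_hot_encode` is illustrative, and sorting the categories is just one way to fix the column order.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, rows = one_hot_encode(["red", "green", "red", "blue"])
print(categories)  # column order
for row in rows:
    print(row)     # exactly one 1 per row
```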

What is the purpose of EDA?

To explore, visualize, and summarize data to find patterns and outliers.

What is a boxplot?

A visualization tool to show data spread, quartiles, and detect outliers.
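
The numbers behind a boxplot can be computed without plotting. The sketch below assumes the common 1.5 x IQR rule for flagging outliers, which the card itself does not state, and picks the "inclusive" quantile method; other methods give slightly different quartiles.

```python
from statistics import quantiles

def boxplot_summary(values, whisker=1.5):
    """Return the quartiles plus the points a boxplot would flag as outliers."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range (box height)
    lower_fence = q1 - whisker * iqr   # points below this are outliers
    upper_fence = q3 + whisker * iqr   # points above this are outliers
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr, "outliers": outliers}

summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```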

What does a correlation matrix tell us?

The relationships (positive/negative/none) between numerical variables.
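
As an illustration, the sketch below builds a small correlation matrix from scratch. It assumes the Pearson coefficient, which is the usual default but is not named on the card; the column names are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Map every pair of column names to their Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

columns = {"x": [1, 2, 3, 4], "up": [2, 4, 6, 8], "down": [8, 6, 4, 2]}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, row)
```

Values near +1 indicate a positive linear relationship, values near -1 a negative one, and values near 0 no linear relationship.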

What is the formula for linear regression?

Y = b0 + b1X + ε, where b0 is the intercept, b1 is the slope, and ε is the error term.
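
For a single predictor, b0 and b1 can be estimated by ordinary least squares. The sketch below assumes that estimation method, which the card does not specify, and uses made-up data.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates of the intercept b0 and slope b1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all equal; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

b0, b1 = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```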

What is logistic regression used for?

To predict binary outcomes: a linear combination of the features is passed through the sigmoid function, which yields a probability between 0 and 1.
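
A minimal sketch of the prediction step for one feature. The coefficients `b0` and `b1` below are hypothetical values chosen for illustration, not fitted ones, and the 0.5 threshold is a common default rather than a requirement.

```python
from math import exp

def sigmoid(z):
    """Map any real number to (0, 1), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def predict_proba(x, b0, b1):
    """Probability of the positive class for a single feature x."""
    return sigmoid(b0 + b1 * x)

def predict(x, b0, b1, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

b0, b1 = -2.0, 1.0  # hypothetical coefficients
print(predict_proba(3, b0, b1), predict(3, b0, b1))
print(predict_proba(1, b0, b1), predict(1, b0, b1))
```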

What are common evaluation metrics for classification?

Accuracy, Precision, Recall, F1 Score, ROC-AUC.
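
The first four metrics follow directly from the confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1 and leaves out ROC-AUC, which needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]
print(classification_metrics(y_true, y_pred))
```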

How does KNN work?

It classifies a point based on the majority class of its k-nearest neighbors.
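
The majority vote can be sketched as follows. This illustration assumes Euclidean distance (via `math.dist`) and a brute-force search over all training points; the data and labels are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the majority class among its k nearest neighbors."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: dist(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    # On a tied vote, Counter returns the label that appeared first,
    # which here is the label of the closer neighbor.
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (1, 1)))
print(knn_predict(points, labels, (5, 4)))
```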

Name two distance metrics used in KNN.

Euclidean, Manhattan.
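
Both metrics are short to write out. The sketch below assumes points are given as equal-length tuples of numbers.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
```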

What is the impact of increasing k in KNN?

Larger k reduces overfitting but can oversmooth boundaries.

What is entropy in decision trees?

A measure of impurity in a dataset: 0 when every sample belongs to the same class, and highest when the classes are evenly mixed.
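
A minimal sketch, assuming Shannon entropy with a base-2 logarithm so the result is measured in bits.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    # p * log2(1 / p) summed over classes, with p = count / n.
    return sum((count / n) * log2(n / count)
               for count in Counter(labels).values())

print(entropy(["yes", "yes", "yes", "yes"]))  # pure node
print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed node
```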

What is pruning in decision trees?

Reducing tree size to prevent overfitting.

What criteria are used for splitting in decision trees?

Gini impurity, Entropy, Information Gain.
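
The criteria can be compared on a toy split. In this sketch information gain is taken to be the parent's entropy minus the size-weighted entropy of the child nodes; the labels are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["y", "y", "n", "n"]
perfect_split = [["y", "y"], ["n", "n"]]
useless_split = [["y", "n"], ["y", "n"]]
print(gini(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split is preferred when it lowers impurity the most, that is, when it has the highest information gain or the lowest weighted Gini impurity.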

What is bias in machine learning?

Error from overly simplistic models (underfitting).

What is variance in machine learning?

Error from sensitivity to training data (overfitting).

What is PCA used for?

To reduce dimensionality while retaining as much of the variance in the data as possible.

What are eigenvectors and eigenvalues?

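They define the principal components in PCA: the eigenvectors of the covariance matrix give the directions of the components, and the eigenvalues give the variance along each direction.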
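
For two features the eigen-decomposition has a closed form, which makes the idea easy to check by hand. The sketch below is restricted to 2-D data and uses the sample covariance; both are simplifying assumptions, and real PCA code would normally rely on a linear-algebra library.

```python
from math import hypot, sqrt

def covariance_2d(points):
    """Sample covariance of 2-D points as (var_x, cov_xy, var_y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return var_x, cov_xy, var_y

def principal_components_2d(points):
    """Eigenvalues (largest first) and unit eigenvectors of the covariance."""
    a, b, c = covariance_2d(points)
    mid = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + spread, mid - spread]
    if b != 0:
        # Solve (a - lam) * vx + b * vy = 0 with vx = b.
        raw = [(b, lam - a) for lam in eigenvalues]
    elif a >= c:
        raw = [(1.0, 0.0), (0.0, 1.0)]  # already axis-aligned
    else:
        raw = [(0.0, 1.0), (1.0, 0.0)]
    eigenvectors = [(vx / hypot(vx, vy), vy / hypot(vx, vy)) for vx, vy in raw]
    return eigenvalues, eigenvectors

values, vectors = principal_components_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(values)   # variance along each principal component
print(vectors)  # direction of each principal component
```

Keeping only the eigenvectors with the largest eigenvalues is what reduces the dimensionality.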

How does a random forest work?

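It uses bagging (bootstrap samples of the training data, plus a random subset of features at each split) to create multiple decision trees and combines their outputs by majority vote or averaging.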

Why is random forest better than a single decision tree?

Combining many trees reduces variance (overfitting) and usually improves accuracy.

What is the goal of SVM?

To find the optimal hyperplane that maximizes the margin between classes.

What are the assumptions of Naive Bayes?

Features are conditionally independent given the class.
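
Written out, the independence assumption lets the class posterior factor into one term per feature. Here y denotes the class and x_1, ..., x_n the features; the notation is chosen for this illustration.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The predicted class is the y that maximizes the right-hand side.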

What are the steps in KMeans clustering?

1. Initialize k cluster centers. 2. Assign points to the nearest cluster. 3. Update cluster centers. 4. Repeat steps 2 and 3 until the assignments stop changing.
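
The steps above can be sketched in plain Python. To keep the example deterministic, the initial centers are passed in explicitly; in practice they are usually chosen at random, so treat that as an assumption of this sketch. The iteration cap `max_iters` is also an illustrative parameter.

```python
from math import dist

def kmeans(points, initial_centers, max_iters=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]  # step 1: initialize
    clusters = [[] for _ in centers]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its assigned points.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                new_centers.append(tuple(sum(coord) / len(members)
                                         for coord in zip(*members)))
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, initial_centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```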

How does GMM differ from KMeans?

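GMM models each cluster as a Gaussian distribution and gives every point a probability of belonging to each cluster (soft assignment), while KMeans assigns each point to the single nearest centroid (hard assignment).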

What is the Apriori algorithm used for?

To identify frequent itemsets in market basket analysis.

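What are Support and Confidence?

Support: the fraction of transactions that contain an itemset. Confidence: for a rule A -> B, the fraction of transactions containing A that also contain B.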
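
Support and confidence for association rules can be computed directly from a list of transactions. The sketch below assumes each transaction is a set of item names; the basket data is made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """For the rule antecedent -> consequent: support(both) / support(antecedent)."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```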

What are content-based and collaborative filtering?

Content-based: Uses item features. Collaborative: Uses user interactions.

What is survivorship bias?

Bias caused by focusing only on successful cases.

How is NLP used in data mining?

Text analysis for tasks like sentiment analysis and machine translation.

data_mining_flashcards

(34 cards)