Data mining Flashcards

1
Q

What is data mining?

A. Extracting minerals from the earth
B. Extracting useful information from large datasets
C. Creating new databases
D. None of the above

A

B

2
Q

Which of the following is not a data mining task?

A. Classification
B. Clustering
C. Sorting
D. Association rule mining

A

C

3
Q

Which technique is used for dimensionality reduction in data mining?

A. Principal Component Analysis (PCA)
B. Linear Regression
C. Support Vector Machines (SVM)
D. K-Means Clustering

A

A

4
Q

What is association rule mining?

A. Finding patterns where one event leads to another
B. Predicting future stock prices
C. Classifying data into multiple classes
D. None of the above

A

A

5
Q

Which algorithm is used for frequent itemset generation in association rule mining?

A. Apriori
B. Decision Tree
C. k-Nearest Neighbors (k-NN)
D. Naive Bayes

A

A

6
Q

In clustering, which method requires predefining the number of clusters?

A. K-Means
B. Hierarchical Clustering
C. DBSCAN
D. Mean-Shift

A

A

7
Q

What does the acronym “CRISP-DM” stand for in data mining methodology?

A. Comprehensive Regression for Intelligent Statistical Prediction in Data Mining
B. Cross-Industry Standard Process for Data Mining
C. Critical Review of Innovative Statistical Processes in Data Mining
D. None of the above

A

B

8
Q

What is the primary goal of regression analysis in data mining?

A. Predicting categorical values
B. Predicting continuous values
C. Finding association rules
D. Classifying data

A

B

9
Q

Which data mining technique is used for anomaly detection?

A. Classification
B. Clustering
C. Outlier detection
D. Association rule mining

A

C

10
Q

Which evaluation metric is not used for assessing the performance of classification
models?

A. Accuracy
B. Mean Squared Error (MSE)
C. Precision
D. Recall

A

B

11
Q

Which algorithm is a supervised learning method used for classification?

A. K-Means
B. Apriori
C. Decision Tree
D. PCA

A

C

12
Q

What is the purpose of cross-validation in machine learning?

A. To divide data into training and testing sets
B. To reduce overfitting in models
C. To validate results using an independent dataset
D. To evaluate a model’s performance

A

D

13
Q

Which technique is used to handle missing values in a dataset?

A. Mean/Median imputation
B. Dropping rows with missing values
C. Using the mode value
D. All of the above

A

D

14
Q

In which phase of the data mining process are patterns and insights discovered?

A. Data cleaning
B. Data exploration
C. Pattern evaluation
D. Data modeling

A

C

15
Q

Which algorithm is a popular ensemble learning method?

A. Random Forest
B. K-Means
C. Linear Regression
D. Gradient Descent

A

A

17
Q

What does the term “overfitting” refer to in machine learning?

A. Model performs well on unseen data
B. Model learns noise and irrelevant details from the training data
C. Model has too few parameters
D. None of the above

A

B

18
Q

Which algorithm is used for anomaly detection in time series data?

A. DBSCAN
B. ARIMA
C. K-Nearest Neighbors (KNN)
D. AdaBoost

A

B

19
Q

Which technique is used to handle imbalanced datasets in classification?

A. Oversampling
B. Undersampling
C. SMOTE (Synthetic Minority Over-sampling Technique)
D. All of the above

A

D

19
Q

Which type of data mining task involves assigning predefined categories to items?

A. Classification
B. Clustering
C. Association rule mining
D. Regression

A

A

20
Q

Which method is used to measure the similarity between two data points in
clustering?

A. Euclidean distance
B. Manhattan distance
C. Cosine similarity
D. All of the above

A

D
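
As a rough illustration of the three measures in the options, here is a minimal Python sketch. The function names and sample points are made up for this example. Note that Euclidean and Manhattan are distances (smaller means more similar), while cosine similarity grows as the vectors point in closer directions.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

# Hypothetical sample points, chosen only for illustration.
p, q = (1.0, 0.0), (4.0, 4.0)
print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # about 0.707
```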

20
Q

Which algorithm is a type of unsupervised learning?

A. Decision Tree
B. K-Means
C. Support Vector Machine (SVM)
D. Random Forest

A

B

21
Q

Which technique is used for reducing the dimensionality of data while preserving
its structure?

A. Principal Component Analysis (PCA)
B. Singular Value Decomposition (SVD)
C. Linear Discriminant Analysis (LDA)
D. All of the above

A

D

22
Q

Which algorithm is used for collaborative filtering in recommendation systems?

A. Apriori
B. K-Means
C. Singular Value Decomposition (SVD)
D. Decision Tree

A

C

23
Q

Which evaluation metric is used for regression models?

A. Accuracy
B. F1 Score
C. Mean Absolute Error (MAE)
D. Precision

A

C
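
To make the metric concrete, the sketch below computes Mean Absolute Error, with Mean Squared Error alongside for comparison. The function names and the sample values are assumptions made for this example, not part of the card.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; penalizes large errors more."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 2.5]
y_pred = [2.5, 5.0, 4.0]
print(mean_absolute_error(y_true, y_pred))  # about 0.667
print(mean_squared_error(y_true, y_pred))   # about 0.833
```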

24
Q

Which technique is used for reducing the noise in data?

A. Outlier detection
B. Normalization
C. Feature selection
D. Smoothing

A

D

25
Q

Which algorithm is sensitive to the initialization of centroids?

A. K-Means
B. Decision Tree
C. Random Forest
D. AdaBoost

A

A

26
Q

Which data mining task is used for discovering hidden patterns in large datasets?

A. Classification
B. Clustering
C. Regression
D. Association rule mining

A

B

27
Q

Which technique is used for text mining to represent words as numerical vectors?

A. One-Hot Encoding
B. Bag-of-Words
C. TF-IDF
D. All of the above

A

D

28
Q

Which algorithm uses gradient descent for optimization in training?

A. Logistic Regression
B. K-Means
C. Decision Tree
D. Naive Bayes

A

A

29
Q

Which technique is used for selecting the most important features in a dataset?

A. Principal Component Analysis (PCA)
B. Recursive Feature Elimination (RFE)
C. Lasso Regression
D. All of the above

A

D

30
Q

Which method is used for handling class imbalance in classification problems by
giving higher weights to minority class samples?

A. Random Oversampling
B. Cost-sensitive learning
C. SMOTE (Synthetic Minority Over-sampling Technique)
D. All of the above

A

B

31
Q

Which algorithm is used for time series forecasting?

A. K-Means
B. Random Forest
C. ARIMA (Autoregressive Integrated Moving Average)
D. Support Vector Machine (SVM)

A

C

32
Q

Which evaluation metric balances precision and recall?

A. F1 Score
B. Accuracy
C. Mean Absolute Error (MAE)
D. R Squared

A

A
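
The F1 score is the harmonic mean of precision and recall. A minimal sketch, assuming the counts of true positives, false positives and false negatives are already known (the function name and the counts are illustrative only):

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: low if either precision or recall is low.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))  # 0.8 0.5 0.615
```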

33
Q

Which technique is used for reducing the number of features by combining them
into new features?

A. Feature engineering
B. Principal Component Analysis (PCA)
C. Feature selection
D. All of the above

A

B

34
Q

Which algorithm is not a type of neural network?

A. Convolutional Neural Network (CNN)
B. Recurrent Neural Network (RNN)
C. Decision Tree
D. Multilayer Perceptron (MLP)

A

C

35
Q

Which technique is used to address the curse of dimensionality?

A. Dimensionality reduction
B. Feature scaling
C. Oversampling
D. All of the above

A

A

36
Q

Which method is used to split a dataset into training and testing subsets?

A. Holdout method
B. Cross-validation
C. Bootstrapping
D. All of the above

A

D
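
As one example of the options above, k-fold cross-validation can be sketched as index bookkeeping: each sample lands in the test fold exactly once. The function name is hypothetical, and the sketch deliberately skips shuffling, which real workflows usually add.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(5, 2):
    print(train, test)
# [3, 4] [0, 1, 2]
# [0, 1, 2] [3, 4]
```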

37
Q

Which algorithm is used for sentiment analysis in natural language processing?

A. K-Means
B. Decision Tree
C. Naive Bayes
D. Support Vector Machine (SVM)

A

C

38
Q

Which technique is used to handle categorical variables in machine learning
models?

A. Label Encoding
B. One-Hot Encoding
C. Ordinal Encoding
D. All of the above

A

D
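
The difference between label encoding and one-hot encoding is easy to see in a few lines. This is a plain-Python sketch with made-up helper names. It assigns codes in sorted category order, which is one possible convention rather than a standard.

```python
def label_encode(values):
    """Map each category to an integer code (sorted order, by assumption)."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Map each category to a 0/1 vector with a single 1."""
    codes, categories = label_encode(values)
    rows = [[1 if i == code else 0 for i in range(len(categories))] for code in codes]
    return rows, categories

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
print(label_encode(colors))    # ([2, 1, 0, 1], ['blue', 'green', 'red'])
print(one_hot_encode(colors))  # ([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]], ['blue', 'green', 'red'])
```

Label codes imply an order (blue < green < red here), which suits ordinal data. One-hot vectors avoid that implied order at the cost of one column per category.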

39
Q

Which algorithm is used for reducing the variance in models by combining multiple
weak learners?

A. K-Means
B. Decision Tree
C. AdaBoost
D. Random Forest

A

D

40
Q

Which technique is used to avoid overfitting in neural networks?

A. Dropout
B. Batch Normalization
C. Weight regularization
D. All of the above

A

D

41
Q

Which evaluation metric is used for ranking models in information retrieval
systems?

A. Accuracy
B. Precision at K
C. F1 Score
D. R Squared

A

B

42
Q

Which algorithm is used for sequence prediction in time series data?

A. Apriori
B. Long Short-Term Memory (LSTM)
C. K-Means
D. Principal Component Analysis (PCA)

A

B

43
Q

Which technique is used to handle outliers in a dataset?

A. Removing outliers
B. Transformation
C. Imputation
D. All of the above

A

D

44
Q

Which algorithm is used for hyperparameter tuning in machine learning models?

A. Grid Search
B. K-Means
C. PCA
D. AdaBoost

A

A

45
Q

Which technique is used for reducing the variance in models by training on multiple
subsets of the data?

A. Bootstrap aggregating (Bagging)
B. Boosting
C. Stacking
D. All of the above

A

A

46
Q

Which method is used for handling outliers by capping the extreme values?

A. Winsorization
B. Z-score normalization
C. Min-Max scaling
D. All of the above

A

A
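
A minimal sketch of winsorization, assuming the count-based variant in which the `k` smallest and `k` largest values are replaced by their nearest remaining neighbours. The function name and data are illustrative. Percentile-based variants are also common.

```python
def winsorize(values, k=1):
    """Cap the k lowest and k highest values at the nearest retained value."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and less than half the data size")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Values are clipped, not dropped, so the sample size is unchanged.
    return [min(max(v, low), high) for v in values]

# Hypothetical data with one low and one high outlier.
data = [1, 12, 13, 14, 15, 100]
print(winsorize(data, k=1))  # [12, 12, 13, 14, 15, 15]
```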

47
Q

Which algorithm is used for natural language processing tasks such as language
translation?

A. Long Short-Term Memory (LSTM)
B. K-Means
C. Support Vector Machine (SVM)
D. Random Forest

A

A

48
Q

Which technique is used for imputing missing values based on similar instances in
the dataset?

A. Mean/Median imputation
B. K-Nearest Neighbors (KNN) imputation
C. Mode imputation
D. All of the above

A

B

49
Q

Which algorithm is used for optimizing non-linear functions in machine learning?

A. Linear Regression
B. Gradient Descent
C. K-Means
D. Naive Bayes

A

B

50
Q

Which technique is used for detecting and handling multicollinearity in regression
models?

A. Variance Inflation Factor (VIF)
B. Principal Component Analysis (PCA)
C. Feature scaling
D. All of the above

A

A

51
Q

Which method is used for combining predictions from multiple models by assigning
weights?

A. Bagging
B. Boosting
C. Stacking
D. All of the above

A

C

52
Q

Which algorithm is used for reducing the dimensionality of text data?

A. TF-IDF
B. Singular Value Decomposition (SVD)
C. Word2Vec
D. All of the above

A

D

53
Q

Which technique is used for handling time-series data with trend and seasonality?

A. ARIMA
B. Exponential Smoothing
C. Holt-Winters Method
D. All of the above

A

D

54
Q

Which algorithm is used for anomaly detection in a network intrusion detection
system?

A. K-Means
B. Support Vector Machine (SVM)
C. Isolation Forest
D. All of the above

A

C

55
Q

Which method is used for assessing the importance of variables in a decision tree?

A. Information Gain
B. Gini Impurity
C. Entropy
D. All of the above

A

D

56
Q

Which technique is used for handling skewed distributions in data?

A. Log transformation
B. Square transformation
C. Cube transformation
D. All of the above

A

A

57
Q

Which algorithm is used for time series data forecasting based on past
observations and trend?

A. Linear Regression
B. Moving Average
C. ARIMA
D. All of the above

A

C

58
Q

Which technique is used for reducing the impact of outliers in regression models?

A. Winsorization
B. Z-score normalization
C. Min-Max scaling
D. All of the above

A

A

59
Q

Which algorithm is used for reducing the number of dimensions while preserving
most of the variance in the data?

A. Principal Component Analysis (PCA)
B. Support Vector Machine (SVM)
C. Random Forest
D. All of the above

A

A

60
Q

What is the primary objective of ‘association rule mining’ in data mining?

A. Predicting numeric values
B. Finding interesting relationships between variables
C. Classifying data into predefined categories
D. Generating decision trees

A

B

61
Q

Which data mining task involves finding patterns that describe groups within the
data?

A. Classification
B. Clustering
C. Regression
D. Association

A

B

62
Q

What is the purpose of the Apriori algorithm in data mining?

A. To predict future outcomes
B. To classify data
C. To perform clustering
D. To find frequent item sets in a transaction database

A

D
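
The level-wise idea behind Apriori can be sketched in a few lines: count candidate itemsets, keep those that meet a minimum support count, then build larger candidates only from the survivors. This is a simplified illustration with hypothetical names and toy baskets, not an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into k-item candidates...
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # ...and prune any candidate that has an infrequent (k-1)-subset.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

# Hypothetical market baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_support=2)
print(result[frozenset({"bread", "milk"})])      # 2
print(frozenset({"milk", "butter"}) in result)   # False
```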

63
Q

In the context of data mining, what is ‘dimensionality reduction’?

A. Process of increasing the number of attributes in a dataset
B. Process of transforming data into a higher-dimensional space
C. Process of decreasing the number of attributes in a dataset
D. Process of removing outliers from a dataset

A

C

64
Q

Which technique in data mining is used to impute missing values in a dataset?

A. Decision Trees
B. Regression Analysis
C. Clustering
D. Association Rule Mining

A

B

65
Q

What does the term ‘outlier detection’ refer to in data mining?

A. Identifying patterns in a dataset
B. Removing noisy data from a dataset
C. Finding unusual or rare data points that deviate from the norm
D. Evaluating the performance of a model

A

C

66
Q

Which algorithm is commonly used for text mining and natural language processing
tasks?

A. K-nearest neighbors (KNN)
B. Naive Bayes
C. Hierarchical clustering
D. Support Vector Machine (SVM)

A

B

67
Q

What is the main goal of the K-means clustering algorithm?

A. Maximizing intra-cluster similarity
B. Minimizing the number of clusters
C. Minimizing intra-cluster similarity
D. Maximizing the number of iterations

A

A

68
Q

What technique is used in data mining to reduce the noise in data?

A. Outlier detection
B. Smoothing
C. Sampling
D. Dimensionality reduction

A

B

69
Q

What does the term ‘ROC curve’ represent in data mining?

A. A graphical representation of a classifier’s performance
B. A method for reducing overfitting in models
C. A technique for feature selection
D. A clustering evaluation metric

A

A