DSPJ Flashcards
CRISP-DM Framework:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment
Supervised Learning
- Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
- It makes predictions about the values of new data using known results (labels) found in historical data.
- Modelling techniques (see the sketch after this list):
- Classification: Decision Tree, Logistic Regression, Neural Networks
- Regression: Linear Regression
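A minimal supervised-learning sketch, assuming scikit-learn; the toy feature vectors and labels below are illustrative, not from the flashcards. It fits a classifier on labeled examples, then predicts a label for unseen data.

```python
# Minimal sketch, assuming scikit-learn; toy labeled data for illustration only.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]  # features of the training examples
y_train = [0, 1, 1, 0]                      # known labels from historical data

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[1, 1]]))              # inferred function applied to new data
```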
Unsupervised Learning
- Unsupervised learning is used for sense-making and not for prediction.
- It explores the properties of the data examined and identifies patterns or relationships in data.
- Common techniques in unsupervised learning are clustering and association rule mining (an association-rule sketch follows below; clustering is covered in later cards).
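A minimal association-rule-mining sketch, assuming the mlxtend library is available; the one-hot basket data is illustrative. It finds itemsets that co-occur frequently, then derives if-then rules from them.

```python
# Sketch assuming mlxtend is installed; toy transaction data for illustration.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = pd.DataFrame({
    "bread":  [True, True, False, True],
    "butter": [True, True, False, False],
    "milk":   [False, True, True, True],
})

itemsets = apriori(baskets, min_support=0.5, use_colnames=True)  # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```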
Common Feature Selection & Dimension Reduction Techniques:
* Correlation Analysis
- Using a correlation matrix, we can select features that are highly correlated with the target variable in linear regression.
- A high correlation coefficient between two independent variables indicates possible redundancy: they may be measuring the same construct (see the sketch below).
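A minimal correlation-analysis sketch with pandas; the column names x1, x2, y and their values are illustrative toy data.

```python
# Sketch assuming pandas; toy data for illustration only.
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],           # a near-copy of x1 -> likely redundant
    "y":  [1.1, 2.0, 2.9, 4.2, 5.1],  # target variable
})

corr = df.corr()                      # Pearson correlation matrix
print(corr["y"])                      # high |r| with the target -> candidate feature
print(corr.loc["x1", "x2"])           # high r between predictors -> redundancy
```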
Common Feature Selection & Dimension Reduction Techniques:
* Multicollinearity Check
- When two independent variables are highly correlated, the regression model has a multicollinearity issue. That is, the coefficient estimates may change erratically in response to small changes in the model or the data, which undermines their interpretation (see the sketch below).
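One common diagnostic for this, not named in the card itself, is the variance inflation factor (VIF); a sketch with statsmodels on simulated toy data. A common rule of thumb: VIF above roughly 5 to 10 signals problematic multicollinearity.

```python
# Sketch assuming statsmodels; the simulated data is illustrative.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)                  # independent of the others
X = np.column_stack([x1, x2, x3])

for i, name in enumerate(["x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))  # x1 and x2 will be inflated
```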
Common Feature Selection & Dimension Reduction Techniques:
* Wald Chi-Square
- It is used to check the statistical significance of the independent variables in logistic regression results (see the sketch below).
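A sketch assuming statsmodels and simulated toy data: fit a logistic regression and compute the per-coefficient Wald chi-square, which is the reported z-statistic squared (one degree of freedom).

```python
# Sketch assuming statsmodels; toy data for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))  # only the first feature matters
y = rng.binomial(1, p)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
wald_chi2 = (result.params / result.bse) ** 2  # per-coefficient Wald statistic
print(wald_chi2)
print(result.pvalues)                          # significance of each variable
```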
Common Feature Selection & Dimension Reduction Techniques:
* Factor Analysis
- Factor analysis is a statistical technique for identifying unobserved latent variables, or factors. The data should be normally distributed and the relationships between variables should be linear. To use this technique, the data must be quantitative, not qualitative. Generally, factor analysis is used to analyse survey data (a sketch follows this card).
- Quantitative data is data expressing a certain quantity, amount, or range.
- Qualitative data is data describing the attributes or properties that an object possesses.
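A minimal factor-analysis sketch with scikit-learn on simulated survey-like items; the single latent factor and the loadings below are assumptions for illustration.

```python
# Sketch assuming scikit-learn; simulated data stands in for survey responses.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))  # one unobserved (latent) factor
items = latent @ np.array([[0.9, 0.8, 0.7]]) + rng.normal(scale=0.3, size=(300, 3))

fa = FactorAnalysis(n_components=1).fit(items)
print(fa.components_)  # loadings: how strongly each item reflects the factor
```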
Common Feature Selection & Dimension Reduction Techniques:
* Principal Component Analysis
- Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a smaller set of new variables that nonetheless retains most of the sample's information (see the sketch below).
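A minimal PCA sketch with scikit-learn on simulated toy data: compress three dimensions to two components while checking how much variance is retained.

```python
# Sketch assuming scikit-learn; the simulated data is illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # third column is nearly redundant

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains
X_reduced = pca.transform(X)          # the new, smaller set of variables
```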
Common Feature Selection & Dimension Reduction Techniques:
* Latent Semantic Analysis
- A natural language processing technique for dimension reduction
- Finds co-occurrences of words via a low-rank approximation of the term-document matrix (using a mathematical technique called Singular Value Decomposition, SVD)
- LSA groups both documents that contain similar words and words that occur in a similar set of documents. The resulting patterns are used to detect latent components. Words and documents that are semantically similar will be clustered around each other (the focus is on similarity; see the sketch below).
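A minimal LSA sketch, assuming scikit-learn: build a TF-IDF term-document matrix and take a low-rank approximation with TruncatedSVD. The four-document corpus is illustrative.

```python
# Sketch assuming scikit-learn; toy corpus for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)       # term-document weight matrix
lsa = TruncatedSVD(n_components=2)  # low-rank approximation via SVD
doc_topics = lsa.fit_transform(X)   # documents in the latent space
print(doc_topics)                   # similar documents land near each other
```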
Common Feature Selection & Dimension Reduction Techniques:
* Linear Discriminant Analysis (LDA)
- Not to be confused with Latent Dirichlet Allocation (also abbreviated LDA), which is often used for topic analysis. Linear Discriminant Analysis finds a linear combination of features that characterizes or separates two or more classes of objects or events (a sketch follows this card).
- LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.
- LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.
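A minimal Linear Discriminant Analysis sketch with scikit-learn, using its bundled iris data: unlike PCA, the projection makes use of the class labels.

```python
# Sketch assuming scikit-learn; iris is its built-in demo dataset.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)       # supervised: uses class labels, unlike PCA
print(lda.explained_variance_ratio_)  # class separation captured per component
```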
A good clustering method should produce high-quality clusters with
- high intra-cluster similarity (members of a cluster are close to each other)
- low inter-cluster similarity (clusters are well separated from each other; see the sketch below)
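One common single-number summary of both criteria, not named in the card itself, is the silhouette score; a sketch with scikit-learn on simulated blobs. Scores near 1 mean compact, well-separated clusters.

```python
# Sketch assuming scikit-learn; simulated blob data for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))  # near 1 = compact, well-separated clusters
```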
There are two types of clustering techniques:
- Partitioning-based clustering
- Hierarchical-based clustering
Partitioning-based clustering
- K-means clustering: Each cluster is represented by the centre of the cluster
- K-medoids clustering: Each cluster is represented by one of the objects in the cluster
- Preferable if efficiency is important, or data sets are very large
- Given a new dataset, K-means should be tried first because it is often good enough (see the sketch below)
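A minimal K-means sketch with scikit-learn on toy 2-D points (k = 2 is assumed known here): each cluster is represented by its centroid.

```python
# Sketch assuming scikit-learn; toy points for illustration only.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the centres that represent each cluster
print(km.labels_)           # cluster assignment for each point
```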
Hierarchical clustering:
- Agglomerative clustering (bottom-up)
- Divisive Clustering (top down)
- Preferable for detailed data analysis
- Provides more information than partitioning clustering
- No single best algorithm
- Less efficient than partitioning clustering (see the sketch below)
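A minimal agglomerative (bottom-up) sketch with SciPy on toy points: the linkage matrix records every merge, which is the extra information a dendrogram would display.

```python
# Sketch assuming SciPy; toy points for illustration only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]])
Z = linkage(X, method="ward")                    # full merge history, bottom up
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```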
How to choose inputs for clustering?
In general, you should seek inputs that have these attributes:
* are meaningful to the analysis objective
* are relatively independent
* are limited in number
* have a measurement level of Interval
* have low kurtosis and skewness, at least in the training data (see the sketch below)
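A minimal screening sketch with pandas and SciPy on toy inputs (the column names are illustrative): check pairwise correlations for relative independence, and compute skewness and kurtosis for each candidate input.

```python
# Sketch assuming pandas and SciPy; toy inputs for illustration only.
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.DataFrame({
    "income": [30, 32, 35, 40, 200],  # heavy right tail -> high skewness
    "age":    [25, 30, 35, 40, 45],
})

print(df.corr())  # look for relatively independent inputs
for col in df:
    print(col, skew(df[col]), kurtosis(df[col]))  # prefer values near 0
```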