DSPJ Flashcards

1
Q

CRISP-DM Framework:

A
  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modelling
  5. Evaluation
  6. Deployment
2
Q

Supervised Learning

A
  • Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
  • It makes predictions about values of data using known results found in historical data.
  • Modelling techniques:
    • Classification
    • Decision Tree
    • Logistic Regression
    • Neural Networks
    • Linear Regression
3
Q

Unsupervised Learning

A
  • Unsupervised learning is used for sense-making and not for prediction.
  • It explores the properties of the data examined and identifies patterns or relationships in data.
  • Common techniques in unsupervised learning are clustering and association rule mining.
4
Q

Common Feature Selection & Dimension Reduction Techniques:
* Correlation Analysis

A
  • Using a correlation matrix, we can select features that are highly correlated with the target variable in linear regression.
  • A high correlation coefficient between two independent variables indicates possible redundancy, i.e. the two variables may be measuring the same construct.
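
A minimal sketch of the idea in Python with pandas; the column names (x1, x2, x3, target) and the data are made up for illustration:

```python
import pandas as pd

# Hypothetical data: x2 is nearly a linear function of x1, so the two are redundant.
df = pd.DataFrame({
    "x1":     [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2":     [2.1, 3.9, 6.2, 8.1, 9.8],
    "x3":     [5.0, 3.0, 4.0, 2.0, 1.0],
    "target": [1.2, 2.3, 2.9, 4.1, 5.2],
})

corr = df.corr()                       # Pearson correlation matrix
print(corr["target"].drop("target"))   # correlation of each feature with the target
print(corr.loc["x1", "x2"])            # high value between two features suggests redundancy
```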
5
Q

Common Feature Selection & Dimension Reduction Techniques:
* Multicollinearity Check

A
  • When two independent variables are highly correlated, the regression model will have a multicollinearity issue. That is, the model may change erratically in response to small changes in the model or the data, which affects the calculation of the coefficient estimates.
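
The card does not name a specific diagnostic; one common check is the variance inflation factor (VIF). A minimal sketch with statsmodels, on hypothetical data where x2 is almost a linear function of x1:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical features; x2 is nearly collinear with x1.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.0, 6.1, 7.9, 10.2, 12.1],
    "x3": [5.0, 3.0, 4.0, 2.0, 1.0, 6.0],
})
Xc = add_constant(X)

# A VIF above roughly 5-10 is a common rule of thumb for problematic multicollinearity.
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif)
```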
6
Q

Common Feature Selection & Dimension Reduction Techniques:
* Wald Chi-Square

A
  • It is used to check the statistical significance of independent variables in the logistic regression results.
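
A minimal sketch with statsmodels on simulated data; statsmodels reports a z statistic per coefficient, and squaring it gives the Wald chi-square with one degree of freedom:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated binary-outcome data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = (X["x1"] + 0.5 * X["x2"] + rng.normal(size=200) > 0).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)

# Wald chi-square = (coefficient / standard error)^2 = z^2
wald_chi2 = model.tvalues ** 2
print(pd.DataFrame({"wald_chi2": wald_chi2, "p_value": model.pvalues}))
```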
7
Q

Common Feature Selection & Dimension Reduction Techniques:
* Factor Analysis

A
  • Factor analysis is a statistical technique to identify unobserved latent variables or factors. The data should be normally distributed and the relationships between variables should be linear. To use this technique, the data must be quantitative and not qualitative. Generally, factor analysis is used to analyse survey data.
  • Quantitative data is data expressing a certain quantity, amount, or range.
  • Qualitative data is data describing the attributes or properties that an object possesses.
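
A minimal sketch using scikit-learn's FactorAnalysis on simulated quantitative "survey" data; the two-factor structure is an assumption made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated survey: 200 respondents answering 6 quantitative items,
# generated from 2 unobserved (latent) factors plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
items = latent @ loadings + rng.normal(scale=0.5, size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(items)     # estimated factor scores per respondent
print(fa.components_.round(2))       # estimated loadings of each item on each factor
```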
8
Q

Common Feature Selection & Dimension Reduction Techniques:
* Principal Component Analysis

A
  • Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample’s information.
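
A minimal sketch with scikit-learn, on hypothetical data where 10 columns carry roughly 3 dimensions of information; asking PCA for 95% of the variance keeps only a few components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 10 features that are linear mixtures of ~3 underlying signals.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])

pca = PCA(n_components=0.95)           # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # far fewer columns than the original 10
print(pca.explained_variance_ratio_)   # information retained by each component
```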
9
Q

Common Feature Selection & Dimension Reduction Techniques:
* Latent Semantic Analysis

A
  • A natural language processing technique for dimension reduction
  • Finds co-occurrences of words in a low-rank approximation of the term-document matrix (using a mathematical technique called Singular Value Decomposition)
  • LSA groups both documents that contain similar words, as well as words that occur in a similar set of documents. The resulting patterns are used to detect latent components. Words and documents that are semantically similar will be clustered around each other. (focus on similarity)
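
A minimal sketch of LSA with scikit-learn: TF-IDF builds the term-document matrix and TruncatedSVD gives the low-rank approximation; the toy corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus.
docs = [
    "machine learning predicts outcomes from data",
    "supervised learning uses labeled data",
    "clustering groups similar documents",
    "hierarchical clustering builds a tree of clusters",
]

X = TfidfVectorizer().fit_transform(docs)           # term-document matrix

svd = TruncatedSVD(n_components=2, random_state=0)  # low-rank approximation via SVD
doc_topics = svd.fit_transform(X)                   # documents in the latent semantic space
print(doc_topics.round(3))                          # similar documents end up close together
```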
10
Q

Common Feature Selection & Dimension Reduction Techniques:
* Linear Discriminant Analysis (LDA)

A
  • LDA finds a linear combination of features that characterizes or separates two or more classes of objects or events. (It should not be confused with Latent Dirichlet Allocation, which is often used for topic analysis.)
  • LDA is also closely related to principal component analysis (PCA) and factor analysis in that all three look for linear combinations of variables which best explain the data.
  • LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.
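
A minimal sketch with scikit-learn on the Iris data set (chosen here only because it has labelled classes); with 3 classes, LDA can produce at most 2 discriminant components:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
lda = LinearDiscriminantAnalysis(n_components=2)

# Linear combinations of the features that best separate the three classes.
X_lda = lda.fit_transform(iris.data, iris.target)
print(X_lda.shape)   # (150, 2)
```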
11
Q

A good clustering method should produce high quality clusters with

A
  • high intra-class similarity
  • low inter-class similarity
12
Q

There are two types of clustering techniques:

A
  • Partitioning-based clustering
  • Hierarchical-based clustering
13
Q

Partitioning-based clustering

A
  • K-means clustering: Each cluster is represented by the centre of the cluster
  • K-medoids clustering: Each cluster is represented by one of the objects in the cluster
  • Preferable if efficiency is important, or data sets are very large
  • Given a new dataset, K-means should be tried first because it is often good enough
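
A minimal sketch of K-means with scikit-learn on simulated data (K-medoids is not in core scikit-learn, so only K-means is shown):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Simulated data with 3 natural groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)   # scale features before distance-based clustering

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.cluster_centers_)              # each cluster is represented by its centre
```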
14
Q

Hierarchical clustering:

A
  • Agglomerative clustering (bottom-up)
  • Divisive clustering (top-down)
  • Preferable for detailed data analysis
  • Provides more information than partitioning clustering
  • No single best algorithm
  • Less efficient than partitioning clustering
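
A minimal sketch of agglomerative (bottom-up) clustering with SciPy; the Ward linkage method is an assumption, and the full merge history can be drawn as a dendrogram:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Simulated data; agglomerative clustering builds the tree bottom-up.
X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")                    # full merge history (plot with dendrogram(Z))
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```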
15
Q

How to choose inputs for clustering?

A

In general, you should seek inputs that have these attributes:
* are meaningful to the analysis objective
* are relatively independent
* are limited in number
* have a measurement level of Interval
* have low kurtosis and skewness (at least in the training data)
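
A minimal sketch of checking the last point with SciPy, on hypothetical candidate inputs where one column has a heavy right tail:

```python
import pandas as pd
from scipy.stats import kurtosis, skew

# Hypothetical candidate inputs for clustering.
df = pd.DataFrame({
    "income": [30, 32, 31, 29, 200, 28, 33, 30],   # one extreme value -> heavy right tail
    "age":    [25, 30, 35, 40, 45, 50, 55, 60],
})

for col in df.columns:
    print(col, "skewness:", round(skew(df[col]), 2),
          "kurtosis:", round(kurtosis(df[col]), 2))
# Highly skewed inputs (income here) may need a transform (e.g. log) before clustering.
```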

16
Q

The performance of logistic regression can be gauged by the following measures:

A
  • ROC curve
  • Misclassification Rate
  • Accuracy rate
  • Confusion matrix
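
A minimal sketch computing all four measures with scikit-learn on simulated data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Simulated binary classification data.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

print("Accuracy rate:", accuracy_score(y_te, pred))
print("Misclassification rate:", 1 - accuracy_score(y_te, pred))
print("Confusion matrix:\n", confusion_matrix(y_te, pred))
print("Area under ROC curve:", roc_auc_score(y_te, prob))
```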
17
Q

Decision Tree Strengths

A
  • Able to generate understandable rules.
  • Perform classification without requiring much computation.
  • Able to handle both continuous and categorical variables.
  • Provide a clear indication of which fields are most important for prediction or classification.
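
A minimal sketch illustrating the first and last strengths with scikit-learn: understandable rules via export_text and field importance via feature_importances_:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Understandable if/else rules extracted from the fitted tree.
print(export_text(tree, feature_names=list(iris.feature_names)))

# Which fields matter most for the classification.
print(dict(zip(iris.feature_names, tree.feature_importances_.round(3))))
```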
18
Q

Decision Tree Weaknesses

A
  • Less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
  • Prone to errors in classification problems with many classes and a relatively small number of training examples.
  • Can be computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights.