DSPJ Flashcards
CRISP-DM Framework:
- Business Understanding
- Data Understanding
- Data Preparation
- Modelling
- Evaluation
- Deployment
Supervised Learning
- Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.
- It makes predictions about the values of new data using known results (labels) found in historical data.
- Modelling techniques (see the sketch after this list):
- Classification: Decision Tree, Logistic Regression, Neural Networks
- Regression: Linear Regression
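A minimal supervised-learning sketch, assuming scikit-learn; the toy feature vectors and labels below are illustrative, not from the flashcards. It fits a classifier on labeled examples, then predicts a label for unseen data.

```python
# Minimal sketch, assuming scikit-learn; toy labeled data for illustration only.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [1, 1], [0, 1], [1, 0]]  # features of the training examples
y_train = [0, 1, 1, 0]                      # known labels from historical data

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[1, 1]]))              # inferred function applied to new data
```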
Unsupervised Learning
- Unsupervised learning is used for sense-making and not for prediction.
- It explores the properties of the data examined and identifies patterns or relationships in data.
- Common techniques in unsupervised learning are clustering and association rule mining (an association-rule sketch follows below; clustering is covered in later cards).
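A minimal association-rule-mining sketch, assuming the mlxtend library is available; the one-hot basket data is illustrative. It finds itemsets that co-occur frequently, then derives if-then rules from them.

```python
# Sketch assuming mlxtend is installed; toy transaction data for illustration.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = pd.DataFrame({
    "bread":  [True, True, False, True],
    "butter": [True, True, False, False],
    "milk":   [False, True, True, True],
})

itemsets = apriori(baskets, min_support=0.5, use_colnames=True)  # frequent itemsets
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```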
Common Feature Selection & Dimension Reduction Techniques:
* Correlation Analysis
- Using a correlation matrix, we can select features that are highly correlated with the target variable in linear regression.
- A high correlation coefficient between two independent variables indicates possible redundancy: they may be measuring the same construct (see the sketch below).
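A minimal correlation-analysis sketch with pandas; the column names x1, x2, y and their values are illustrative toy data.

```python
# Sketch assuming pandas; toy data for illustration only.
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],           # a near-copy of x1 -> likely redundant
    "y":  [1.1, 2.0, 2.9, 4.2, 5.1],  # target variable
})

corr = df.corr()                      # Pearson correlation matrix
print(corr["y"])                      # high |r| with the target -> candidate feature
print(corr.loc["x1", "x2"])           # high r between predictors -> redundancy
```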
Common Feature Selection & Dimension Reduction Techniques:
* Multicollinearity Check
- When two independent variables are highly correlated, the regression model has a multicollinearity issue. That is, the coefficient estimates may change erratically in response to small changes in the model or the data, which undermines their interpretation (see the sketch below).
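One common diagnostic for this, not named in the card itself, is the variance inflation factor (VIF); a sketch with statsmodels on simulated toy data. A common rule of thumb: VIF above roughly 5 to 10 signals problematic multicollinearity.

```python
# Sketch assuming statsmodels; the simulated data is illustrative.
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # nearly a copy of x1
x3 = rng.normal(size=100)                  # independent of the others
X = np.column_stack([x1, x2, x3])

for i, name in enumerate(["x1", "x2", "x3"]):
    print(name, variance_inflation_factor(X, i))  # x1 and x2 will be inflated
```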
Common Feature Selection & Dimension Reduction Techniques:
* Wald Chi-Square
- It is used to check the statistical significance of the independent variables in logistic regression results (see the sketch below).
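A sketch assuming statsmodels and simulated toy data: fit a logistic regression and compute the per-coefficient Wald chi-square, which is the reported z-statistic squared (one degree of freedom).

```python
# Sketch assuming statsmodels; toy data for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))  # only the first feature matters
y = rng.binomial(1, p)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
wald_chi2 = (result.params / result.bse) ** 2  # per-coefficient Wald statistic
print(wald_chi2)
print(result.pvalues)                          # significance of each variable
```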
Common Feature Selection & Dimension Reduction Techniques:
* Factor Analysis
- Factor analysis is a statistical technique for identifying unobserved latent variables, or factors. The data should be normally distributed and the relationships between variables should be linear. To use this technique, the data must be quantitative, not qualitative. Generally, factor analysis is used to analyse survey data (a sketch follows this card).
- Quantitative data is data expressing a certain quantity, amount, or range.
- Qualitative data is data describing the attributes or properties that an object possesses.
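A minimal factor-analysis sketch with scikit-learn on simulated survey-like items; the single latent factor and the loadings below are assumptions for illustration.

```python
# Sketch assuming scikit-learn; simulated data stands in for survey responses.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 1))  # one unobserved (latent) factor
items = latent @ np.array([[0.9, 0.8, 0.7]]) + rng.normal(scale=0.3, size=(300, 3))

fa = FactorAnalysis(n_components=1).fit(items)
print(fa.components_)  # loadings: how strongly each item reflects the factor
```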
Common Feature Selection & Dimension Reduction Techniques:
* Principal Component Analysis
- Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a smaller set of new variables that nonetheless retains most of the sample's information (see the sketch below).
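A minimal PCA sketch with scikit-learn on simulated toy data: compress three dimensions to two components while checking how much variance is retained.

```python
# Sketch assuming scikit-learn; the simulated data is illustrative.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # third column is nearly redundant

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # share of variance each component retains
X_reduced = pca.transform(X)          # the new, smaller set of variables
```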
Common Feature Selection & Dimension Reduction Techniques:
* Latent Semantic Analysis
- A natural language processing technique for dimension reduction
- Finds co-occurrences of words via a low-rank approximation of the term-document matrix (using a mathematical technique called Singular Value Decomposition, SVD)
- LSA groups both documents that contain similar words and words that occur in a similar set of documents. The resulting patterns are used to detect latent components. Words and documents that are semantically similar will be clustered around each other (the focus is on similarity; see the sketch below).
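A minimal LSA sketch, assuming scikit-learn: build a TF-IDF term-document matrix and take a low-rank approximation with TruncatedSVD. The four-document corpus is illustrative.

```python
# Sketch assuming scikit-learn; toy corpus for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)       # term-document weight matrix
lsa = TruncatedSVD(n_components=2)  # low-rank approximation via SVD
doc_topics = lsa.fit_transform(X)   # documents in the latent space
print(doc_topics)                   # similar documents land near each other
```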
Common Feature Selection & Dimension Reduction Techniques:
* Linear Discriminant Analysis (LDA)
- Not to be confused with Latent Dirichlet Allocation (also abbreviated LDA), which is often used for topic analysis. Linear Discriminant Analysis finds a linear combination of features that characterizes or separates two or more classes of objects or events (a sketch follows this card).
- LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data.
- LDA explicitly attempts to model the difference between the classes of data. PCA on the other hand does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities.
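A minimal Linear Discriminant Analysis sketch with scikit-learn, using its bundled iris data: unlike PCA, the projection makes use of the class labels.

```python
# Sketch assuming scikit-learn; iris is its built-in demo dataset.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)       # supervised: uses class labels, unlike PCA
print(lda.explained_variance_ratio_)  # class separation captured per component
```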
A good clustering method should produce high-quality clusters with
- high intra-cluster similarity (members of a cluster are close to each other)
- low inter-cluster similarity (clusters are well separated from each other; see the sketch below)
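One common single-number summary of both criteria, not named in the card itself, is the silhouette score; a sketch with scikit-learn on simulated blobs. Scores near 1 mean compact, well-separated clusters.

```python
# Sketch assuming scikit-learn; simulated blob data for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))  # near 1 = compact, well-separated clusters
```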
There are two types of clustering techniques:
- Partitioning-based clustering
- Hierarchical-based clustering
Partitioning-based clustering
- K-means clustering: Each cluster is represented by the centre of the cluster
- K-medoids clustering: Each cluster is represented by one of the objects in the cluster
- Preferable if efficiency is important, or data sets are very large
- Given a new dataset, K-means should be tried first because it is often good enough (see the sketch below)
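A minimal K-means sketch with scikit-learn on toy 2-D points (k = 2 is assumed known here): each cluster is represented by its centroid.

```python
# Sketch assuming scikit-learn; toy points for illustration only.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the centres that represent each cluster
print(km.labels_)           # cluster assignment for each point
```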
Hierarchical clustering:
- Agglomerative clustering (bottom-up)
- Divisive Clustering (top down)
- Preferable for detailed data analysis
- Provides more information than partitioning clustering
- No single best algorithm
- Less efficient than partitioning clustering (see the sketch below)
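A minimal agglomerative (bottom-up) sketch with SciPy on toy points: the linkage matrix records every merge, which is the extra information a dendrogram would display.

```python
# Sketch assuming SciPy; toy points for illustration only.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8, 9]])
Z = linkage(X, method="ward")                    # full merge history, bottom up
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```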
How to choose inputs for clustering?
In general, you should seek inputs that have these attributes:
* are meaningful to the analysis objective
* are relatively independent
* are limited in number
* have a measurement level of Interval
* have low kurtosis and skewness, at least in the training data (see the sketch below)
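A minimal screening sketch with pandas and SciPy on toy inputs (the column names are illustrative): check pairwise correlations for relative independence, and compute skewness and kurtosis for each candidate input.

```python
# Sketch assuming pandas and SciPy; toy inputs for illustration only.
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.DataFrame({
    "income": [30, 32, 35, 40, 200],  # heavy right tail -> high skewness
    "age":    [25, 30, 35, 40, 45],
})

print(df.corr())  # look for relatively independent inputs
for col in df:
    print(col, skew(df[col]), kurtosis(df[col]))  # prefer values near 0
```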