LU1 - Recap Flashcards
What are the four types of Data Analytics?
Descriptive
Diagnostic
Predictive
Prescriptive
Name 5 statistical techniques for Data Analysis
Linear regression
Classification
Resampling methods
Tree based methods
Unsupervised learning
What is Linear Regression?
Linear regression is a technique used to predict a target variable by finding the best linear relationship between the dependent and independent variables. The best fit is the line for which the sum of the distances between the fitted line and the actual observations at each data point is as small as possible (in ordinary least squares, the sum of the squared distances).
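As a minimal sketch of the "best fit" idea above, the following fits a straight line to one independent variable by ordinary least squares. The function name `fit_line` and the experience/salary numbers are made up for illustration; they are not from the course material.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data only: years of experience vs. salary (in thousands)
experience = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(experience, salary)
predicted = slope * 6 + intercept  # prediction for 6 years of experience
print(slope, intercept, predicted)
```

Because the made-up data lie exactly on a line, the fitted line passes through every point; with real data the line only minimises the total squared distance.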
What is classification?
Classification assigns categories to a collection of data in order to make more specific predictions and analysis.
Name two types of classification techniques
Logistic Regression
Discriminant Analysis
What is Logistic Regression?
A regression analysis technique used when the dependent variable is binary. It is a predictive analysis used to describe data and explain the relationship between one binary dependent variable and one or more independent variables.
Name two resampling techniques
Bootstrapping
Cross-Validation
What is bootstrapping?
It samples with replacement from the original data and treats the “not selected” data points as test samples.
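The bootstrapping card can be sketched in a few lines. This is an illustrative example only: `bootstrap_split` is a hypothetical helper name, and the data is just the numbers 0 to 9.

```python
import random


def bootstrap_split(data, seed=0):
    """Draw len(data) items with replacement; unpicked items form the test set."""
    rng = random.Random(seed)
    n = len(data)
    picked = [rng.randrange(n) for _ in range(n)]  # indices may repeat
    picked_set = set(picked)
    train = [data[i] for i in picked]
    test = [data[i] for i in range(n) if i not in picked_set]  # "not selected"
    return train, test


data = list(range(10))
train, test = bootstrap_split(data)
print("train:", train)
print("test :", test)
```

The training sample has the same size as the original data but usually contains duplicates, which is why some points are left over for testing.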
What is Cross-Validation?
This technique is used to validate model performance. The training data is divided into K parts. In each round, K-1 parts are used for training and the remaining part acts as the test set. The process is repeated K times, and the average of the K scores is taken as the performance estimate.
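The K-fold procedure described above can be sketched as follows. This is a simplified illustration: the helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) and the toy data are assumptions, and the "model" simply predicts the mean of the training targets.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on K-1 parts, score on the remaining part, average the K scores."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / k


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    """Average absolute difference between the prediction and the targets."""
    return sum(abs(y - model) for y in ys) / len(ys)


xs = list(range(6))
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_error = cross_validate(xs, ys, 3, fit_mean, mean_abs_error)
print(avg_error)
```

Here the score is an error, so lower is better; any other fit and score functions with the same signatures could be passed in instead.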
When does undersampling take place?
When samples from the majority class are removed
When does oversampling take place?
When samples from the minority class are copied
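A rough sketch of both ideas, assuming simple random resampling (other strategies exist). The function names and the labelled tuples are hypothetical and only serve as an illustration.

```python
import random


def random_oversample(minority, target_size, rng):
    """Copy randomly chosen minority samples until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority class (no duplicates)."""
    return rng.sample(majority, target_size)


rng = random.Random(0)
majority = [("neg", i) for i in range(8)]  # 8 majority-class samples
minority = [("pos", i) for i in range(2)]  # 2 minority-class samples

oversampled = random_oversample(minority, len(majority), rng)
undersampled = random_undersample(majority, len(minority), rng)
print(len(oversampled), len(undersampled))
```

Oversampling grows the minority class to match the majority, while undersampling shrinks the majority class to match the minority; either way the two classes end up the same size.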
Name 3 Unsupervised learning algorithms
Principal Component Analysis
K-Means Clustering
Hierarchical Clustering
What is Principal Component Analysis?
It represents the data with a set of linear combinations of the features that are mutually uncorrelated and have maximum variance. It also helps reveal latent interactions among the variables in an unsupervised framework.
What is Machine Learning?
Machine learning is the use of mathematical and/or statistical models to gain tailored knowledge about data in order to make predictions.
Name an unsupervised machine-learning technique
Clustering
What are Latent variable models?
Latent variable models are commonly used for data preprocessing, such as reducing the number of features in a dataset (dimensionality reduction) or decomposing the dataset into multiple components.
What is clustering?
A clustering problem is one where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
What type of learning (supervised/unsupervised) makes use of clustering?
Unsupervised
What type of learning (supervised/unsupervised) makes use of classification?
Supervised
What type of learning (supervised/unsupervised) makes use of regression?
Supervised
What type of machine learning makes use of labelled input and output data during the training phase of the machine learning lifecycle?
Supervised learning
To be able to classify new and unseen datasets and predict outcomes, what does a supervised learning model need to learn?
The relationship between input and output data
Which type of machine learning has input variables (X) and an output variable (Y)?
Supervised learning
Why do we call it supervised learning?
Because part of the approach requires human oversight: the training data has to be labelled
What is classification?
Classification is used when the output variable is categorical
Give an example of categorical data
Yes/no, male/female or true/false
What is regression?
Regression is used when the output variable is a real or continuous value
Give an example of a regression problem
Predicting salary based on work experience, or weight based on height
Give an example of an algorithm used for regression problems
Linear regression, support vector regression or a regression tree
Give an example of a classification problem
The machine needs to tell different items apart (for example apple, banana and cherry)
What type of learning does not make use of output variables?
Unsupervised
Unsupervised learning makes use of output variables (true or false)
False
What do you call the output of unsupervised learning?
Pseudo output
What is anomaly detection?
It is when machine learning automatically detects unusual data points in a dataset
What is association mining?
It identifies sets of items that frequently occur together in your dataset
What are latent variable models?
Models commonly used for data preprocessing, such as reducing the number of features in a dataset
Give a real-world example of unsupervised learning
Computer vision - object recognition
Medical imaging
Anomaly detection
Customer personas - habits
Recommendation engines
What is the difference between unsupervised and supervised learning?
Unsupervised - no output data is given
Data is not labelled
Computationally complex
Less accurate and trustworthy
Number of classes is not known
What is accuracy with regard to supervised learning?
The ability of a model to make correct predictions
What is interpretability with regard to supervised learning?
The degree to which the model allows for human understanding
Give an example of an interpretable model
Linear regression
Random forest
Give an example of a non-interpretable model
SVM (support vector machine)
LSTM (long short-term memory)
Deep learning (DL)
What is K-means clustering?
A method of vector quantization that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean
What learning category does K-means clustering fall into?
Unsupervised
Why use K-means clustering?
K-means is used to group unlabelled data by features rather than by predefined categories. The goal is to split the data into k different clusters and report the location of the centre of mass of each cluster
What does the K represent in K-means clustering?
The K represents the number of groups or categories created.
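The K-means cards above can be illustrated with a small sketch: assign each point to the nearest centre, move each centre to the mean of its points, and repeat until nothing changes. Two assumptions to note: the first k points are used as starting centres (random starting centres are more common in practice), and the function name `k_means` and the six 2-D points are made up for this example.

```python
def k_means(points, k, max_iterations=100):
    """Partition points into k clusters; return (centres, clusters)."""
    # Assumption for reproducibility: start from the first k points
    centres = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centre to the mean (centre of mass) of its cluster
        new_centres = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:  # converged: centres stopped moving
            break
        centres = new_centres
    return centres, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centres, clusters = k_means(points, k=2)
print(centres)
print(clusters)
```

With k=2 the two visibly separate groups of points end up in separate clusters, and the reported centres are the centres of mass of those groups.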
What is hierarchical clustering?
An algorithm that creates clusters with a predominant ordering from top to bottom
What does hierarchical clustering do?
Hierarchical clustering separates data into groups based on some measure of similarity
What is agglomerative hierarchical clustering?
(“Bottom-up”) Clustering starts with each observation being its own cluster. The clusters merge into larger groups as we move up the tree.
What is divisive clustering?
(“Top-down”) Clustering starts with one cluster of all observations. The cluster is split into subgroups as we move down the tree.
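A minimal sketch of the bottom-up (agglomerative) approach on one-dimensional data. It assumes single linkage, meaning the distance between two clusters is the distance between their closest members; other linkage choices exist. The function name `agglomerative` and the sample values are hypothetical.

```python
def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # every observation starts as its own cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                dist = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


groups = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], n_clusters=3)
print(groups)
```

Stopping at a chosen number of clusters corresponds to cutting the tree at one level; letting the loop run down to a single cluster would build the whole hierarchy.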