Intro Flashcards
Machine learning?
Machine learning is programming a computer to optimize a performance criterion using sample data or past experience.
Machine learning (formal definition)?
“A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.”
Suppose your email program watches which emails you do or do not mark as spam and, based on that, learns how to filter spam better. What is the task T in this setting?
1) Classifying email as spam or not.
2) Watching you label email as spam or not.
3) The number (or fraction) of emails correctly classified as spam or not.
Data preparation tasks before ML
1) Data Cleaning
2) Data Pre-processing
3) Data transformation/Normalization
4) Data Integration
5) Multidimensional data issues
Data Cleaning
Data cleansing, or data cleaning, is the process of detecting and correcting corrupt or inaccurate records in a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
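A minimal sketch of the idea, assuming hypothetical customer records with a `name` and an `age` field. It drops incomplete records, records with implausible ages, and exact duplicates. The field names and the 0 to 120 age range are illustrative assumptions, not part of the definition.

```python
def clean(records):
    """Keep only complete, plausible, non-duplicate records."""
    seen = set()
    cleaned = []
    for record in records:
        age = record.get("age")
        if age is None:
            continue  # incomplete record: age is missing
        if not 0 <= age <= 120:
            continue  # inaccurate record: age outside the assumed valid range
        key = (record["name"], age)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical dirty data.
raw_records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},  # incomplete
    {"name": "Cid", "age": -5},    # inaccurate
    {"name": "Ann", "age": 34},    # duplicate
]
cleaned_records = clean(raw_records)
print(cleaned_records)
```

Here the sketch deletes dirty records; replacing or correcting them instead is an equally valid cleaning choice.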
Data Pre-processing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Data preprocessing prepares raw data for further processing.
Examples: Data preprocessing is used in database-driven applications such as customer relationship management and in learning-based applications (such as neural networks).
Data transformation/Normalization
Data transformation is the process of converting data from one format or structure into another. It is a fundamental part of most data integration and data management tasks, such as data wrangling, data warehousing, and application integration. Normalization is one such transformation: it rescales numeric values to a common range, such as 0 to 1.
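One common form of normalization is min-max scaling, sketched below. The function name and the salary figures are hypothetical and chosen only for illustration.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


salaries = [30000, 45000, 60000, 90000]  # hypothetical annual salaries
normalized = min_max_normalize(salaries)
print(normalized)
```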
Data Integration
Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process, and includes steps such as cleansing, ETL mapping, and transformation.
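A minimal sketch of a unified view, assuming two hypothetical sources (`crm` and `billing`) that are both keyed by a customer id. Real integration also involves cleansing and schema mapping, which this sketch leaves out.

```python
def integrate(*sources):
    """Merge several {key: fields} sources into one unified {key: fields} view."""
    unified = {}
    for source in sources:
        for key, fields in source.items():
            # Create the unified record on first sight, then add this source's fields.
            unified.setdefault(key, {}).update(fields)
    return unified


# Hypothetical sources sharing a customer id as key.
crm = {101: {"name": "Ann"}, 102: {"name": "Bob"}}
billing = {101: {"balance": 250.0}, 103: {"balance": 80.0}}

unified_view = integrate(crm, billing)
print(unified_view)
```

If two sources carry the same field for the same key, the later source wins in this sketch; a real pipeline would need an explicit conflict rule.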
Multidimensional data issues
In statistics, econometrics, and related fields, multidimensional analysis (MDA) is a data analysis process that groups data into two categories: data dimensions and measurements. For example, a data set consisting of the number of wins for a single football team in each of several years is a single-dimensional (in this case, longitudinal) data set. A data set consisting of the number of wins for several football teams in a single year is also a single-dimensional (in this case, cross-sectional) data set. A data set consisting of the number of wins for several football teams over several years is a two-dimensional data set.
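The football example can be sketched with a small table keyed by two dimensions, team and year, where the number of wins is the measurement. The team names and win counts are made up for illustration.

```python
# Two-dimensional data set: (team, year) -> wins. All values are hypothetical.
wins = {
    ("Lions", 2019): 9,
    ("Lions", 2020): 11,
    ("Tigers", 2019): 7,
    ("Tigers", 2020): 12,
}

# Fix the team: a single-dimensional, longitudinal slice (one team over several years).
longitudinal = {year: w for (team, year), w in wins.items() if team == "Lions"}

# Fix the year: a single-dimensional, cross-sectional slice (several teams in one year).
cross_sectional = {team: w for (team, year), w in wins.items() if year == 2020}

print(longitudinal)
print(cross_sectional)
```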
Association rule mining
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules in databases using measures of interestingness such as support and confidence.
Market basket analysis
Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
Method of association rule mining
Apriori algorithm
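A compact sketch of the Apriori idea: find frequent itemsets level by level, building candidate k-itemsets only from frequent (k-1)-itemsets. The transactions and the `min_support` value are hypothetical, and the sketch stops at frequent itemsets without generating rules.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support is >= min_support."""
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count in how many transactions each candidate itemset occurs.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]
frequent_itemsets = apriori(baskets, min_support=0.5)
print(frequent_itemsets)
```

The prune step is what makes Apriori cheaper than checking every possible itemset: an itemset cannot be frequent if any of its subsets is infrequent.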
frequent patterns
What items are frequently purchased together in your SaveMart?
A typical association rule
Bread -> Milk, with confidence P(Milk | Bread) = 0.7
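How such a number is obtained can be sketched as follows. The confidence of Bread -> Milk is the fraction of baskets containing bread that also contain milk. The baskets below are hypothetical, so the result differs from the 0.7 on the card.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]
rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```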
Types of learning
1) Supervised
2) Unsupervised
Supervised learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.
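A minimal sketch of inferring a function from labeled input-output pairs, using a 1-nearest-neighbor rule as a stand-in for any supervised learner. The training pairs and labels are hypothetical.

```python
def fit_nearest_neighbor(examples):
    """Return a function that maps an input to the label of the closest training input."""
    def predict(x):
        _, nearest_label = min(examples, key=lambda pair: abs(pair[0] - x))
        return nearest_label
    return predict


# Labeled training data: (input, output) pairs. Values are hypothetical.
labeled_examples = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]

classify = fit_nearest_neighbor(labeled_examples)
print(classify(2.5), classify(7.0))
```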
unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.
Classification and Label Prediction
Construct models based on some training examples
Describe and distinguish classes or concepts for future prediction
Predict some unknown class labels
Typical methods of classification
Decision trees, Naïve Bayes, SVM, Neural Networks
Applications of classification
Sentiment Classification
Spam Classification
Disease Prediction
Match result prediction
Credit Card Approval
Example of classification: A credit card company receives thousands of applications for new cards. Each application contains information about the applicant: age, marital status, annual salary, outstanding debts, credit rating, etc. Problem: decide whether an application should be approved or not.
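A sketch of how such a decision could be learned from past applications, using a one-level decision tree (a decision stump) on a single feature. The salary figures and labels are hypothetical, and the sketch assumes the rule has the form "approve if salary >= threshold". A real model would use all the listed attributes.

```python
def fit_stump(samples):
    """Pick the threshold with the fewest training errors for 'value >= threshold -> True'.

    samples is a list of (feature_value, label) pairs with boolean labels.
    """
    values = sorted({v for v, _ in samples})
    best_threshold, best_errors = None, len(samples) + 1
    # Candidate thresholds are the midpoints between neighboring feature values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        errors = sum(1 for v, label in samples if (v >= threshold) != label)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


# Hypothetical past applications: (annual salary, approved?).
past_applications = [
    (20000, False), (28000, False), (35000, False),
    (52000, True), (61000, True), (90000, True),
]
salary_threshold = fit_stump(past_applications)


def approve(salary):
    """Predict the class label of a new application."""
    return salary >= salary_threshold


print(salary_threshold, approve(70000), approve(25000))
```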
Loan Application Example
classification example
Regression
Linear regression is a supervised machine learning algorithm that performs a regression task: it models a target value as a function of one or more independent variables. It is mostly used for finding relationships between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.
Regression example
Predicting the price of a house given its area.
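A minimal sketch of this example with one independent variable, fitting price = slope * area + intercept by ordinary least squares. The areas and prices are hypothetical and were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: area in square meters, price in thousands.
areas = [50, 80, 100, 120]
prices = [170, 260, 320, 380]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90 + intercept  # forecast for an unseen area of 90
print(slope, intercept, predicted_price)
```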
Clustering
Unsupervised learning
Class label is unknown
Group data to form clusters
Clustering principle
Maximize the difference between clusters (inter-class) and minimize the difference within each cluster (intra-class)
Methods of clustering
K-Means Clustering
COBWEB
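A minimal sketch of K-Means on one-dimensional points. It repeats two steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. The points and the starting centroids are hypothetical; real implementations usually pick starting centroids at random and work in several dimensions.

```python
def k_means(points, centroids, iterations=10):
    """Run a fixed number of K-Means iterations on 1-D points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data with two visible groups.
data_points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centroids, final_clusters = k_means(data_points, centroids=[1.0, 12.0])
print(final_centroids, final_clusters)
```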
Applications of clustering
Document clustering
Finding groups of similar people
Outlier Detection
In data mining, anomaly detection is the identification of rare items, events, or observations that raise suspicion by differing significantly from the majority of the data.
Outlier analysis
Also known as Anomaly Detection
An outlier is a data object that does not comply with the general behavior of the data.
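One simple way to make "does not comply with the general behavior" concrete is a z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the transaction amounts are assumptions for illustration. Many other outlier detection methods exist.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```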
Applications of outlier detection
Rare event analysis
Fraud detection
Why Machine Learning?
The explosive growth of data: from terabytes to petabytes
1 TB = 1024 GB, 1 PB = 1024 TB
Why Machine Learning? Major sources of data
Business: web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, …
Society: news, YouTube, digital cameras
The ML process: a typical view
INPUT ->> DATA PREPROCESSING ->> MACHINE LEARNING ->> POST PROCESSING ->> PATTERN, INFORMATION, KNOWLEDGE
DATA PREPROCESSING
1) DATA INTEGRATION
2) NORMALIZATION
3) FEATURE SELECTION
4) DIMENSION REDUCTION
MACHINE LEARNING
1) PATTERN DISCOVERY
2) ASSOCIATION AND CORRELATION
3) CLASSIFICATION
4) CLUSTERING
5) OUTLIER ANALYSIS
POST PROCESSING
1) PATTERN EVALUATION
2) PATTERN SELECTION
3) PATTERN INTERPRETATION
4) PATTERN VISUALIZATION