Big Data And Machine Learning Part 1 Flashcards
5 Vs of big data
Volume - quantity of data
Velocity - speed at which new data is generated
Variety - different types and formats of data
Value - usefulness of the data to the business
Veracity - accuracy and trustworthiness of the data
Types of machine learning problems
Classification
Regression
Clustering
Classification
Predict which class an observation belongs to, e.g. whether an email is spam or not
Labelled data
Fraud detection
Predict the probability that a borrower defaults
Straight line dividing different types of dots
Measured by
ROC curve and area under the curve (AUC)
Confusion matrix
Variable importance
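The first two measures can be illustrated with a minimal sketch in plain Python. The labels, scores and the 0.5 cut-off below are made-up illustration values, not part of the flashcards, and the AUC is computed by its pairwise definition rather than by tracing the ROC curve.

```python
def confusion_matrix(y_true, y_pred):
    """Return counts (tp, fp, fn, tn) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def auc(y_true, scores):
    """Area under the ROC curve: share of (positive, negative) pairs in
    which the positive case gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = spam, 0 = not spam, scores = predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

print(confusion_matrix(y_true, y_pred))
print(auc(y_true, scores))
```

The confusion matrix depends on the chosen cut-off, while the AUC summarises ranking quality across all cut-offs.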
Regression
Predict numerical value of an observation based on inputs
E.g. predict a house price
Labelled data (the known numerical outcome is the label)
Line or curve fitted through the data
Measured by
RMSE (root mean squared error)
R squared
Variable importance
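RMSE and R squared can be computed directly from their definitions. The sketch below uses made-up house prices and predictions purely for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical house prices (in thousands) and model predictions.
prices = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]

print(rmse(prices, predicted))
print(r_squared(prices, predicted))
```

RMSE is expressed in the same units as the outcome (here, thousands), whereas R squared is unit-free.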
Clustering
Group similar observations together
Customer segmentation
Group observations into circles
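One common way to group similar observations is k-means. The flashcards do not name an algorithm, so this is an assumed choice, and the points and starting centroids are hypothetical. The loop stops once an iteration no longer changes the centroids.

```python
def kmeans(points, centroids, max_iter=100):
    """Simple k-means with caller-supplied starting centroids (tuples)."""
    clusters = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # no change, so stop iterating
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers described by two features, forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])

print(centroids)
print(clusters)
```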
Types of learning styles
Supervised
Unsupervised
Reinforcement
Supervised learning
Learns how to map inputs to an outcome based on examples of inputs with known outcomes
1) Labelled data (set of input variables, outcome variable, large number of observations)
2) Data preparation (explore data, missing value treatment, outlier treatment, transformations and binning)
3) Train/test - split data into training, validation and test sets; cross-validation
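Step 3 can be sketched with the standard library alone. The function names, the 25% test share and the fixed seed are illustrative assumptions; in practice a library routine would normally be used instead.

```python
import random


def train_test_split(rows, test_share=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a share of them for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_share)
    return rows[n_test:], rows[:n_test]


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each observation is validated exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


observations = list(range(20))  # stand-in for 20 labelled observations
train, test = train_test_split(observations)
folds = list(k_fold_indices(len(train), 3))

print(len(train), len(test))
print([len(val) for _, val in folds])
```

Cross-validation is run on the training portion only, so the test set stays untouched until the final evaluation.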
Unsupervised learning
Learn about structure of data set
Computer learns to find patterns by itself
Finds structure in data
Stops when further iterations no longer improve the model
Unlabelled data
Turns unlabelled data into structured data
No direct performance measure (there are no known outcomes to compare against)
1) unlabelled data
2) data prep
3) run algorithm
Reinforcement
Learning from interaction with an environment
Actions are rewarded positively or negatively
Algorithm maximises performance by pursuing maximum reward
The agent takes actions in an environment; each action is interpreted into a reward and a representation of the state, which are fed back to the agent
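The agent-environment loop can be sketched with tabular Q-learning on a toy corridor. Both the algorithm and the environment are assumptions chosen for illustration, since the flashcards name neither. The agent starts in cell 0 and is rewarded only when it reaches cell 4.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right


def step(state, action):
    """Environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore: random action
            else:
                best = max(q[state])  # exploit: best known action, ties broken at random
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            nxt, reward, done = step(state, ACTIONS[a])
            # Feed reward and new state back into the agent's value estimates.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


q_table = train()
policy = ["right" if q[1] > q[0] else "left" for q in q_table[:-1]]
print(policy)
```

After training, the greedy policy in every non-goal cell should be to move right, towards the reward.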
Types of data
Structured data - data in a given format with predefined types (e.g. information in an Excel spreadsheet with columns)
Semi-structured data - data that combines both of the other forms but sits outside conventional data storage; it can be structured without classifying all necessary types (e.g. XML files)
Unstructured data - data returned in different types (e.g. audio, text); web searches such as Google typically return unstructured data
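The difference between the three types shows up in how the data is read. The sketch below uses made-up customer data: a CSV table stands in for structured data, an XML snippet for semi-structured data, and a free-text note for unstructured data.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns, every row has the same fields.
structured = "customer_id,balance\n1,250.0\n2,90.5\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: tagged fields, but records need not share the same fields.
semi_structured = (
    "<customers>"
    "<customer id='1'><name>Ann</name><phone>555</phone></customer>"
    "<customer id='2'><name>Bob</name></customer>"
    "</customers>"
)
root = ET.fromstring(semi_structured)
records = [
    {"id": c.get("id"), **{child.tag: child.text for child in c}}
    for c in root
]

# Unstructured: no fields at all; meaning has to be extracted from the content.
unstructured = "Customer called to say the card was declined twice last week."

print(rows)
print(records)
print(len(unstructured.split()), "words of free text")
```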
Strategy of financial institutions (FIs), application of machine learning and business benefits
Strategy - faster innovation, data-driven insights, competitive edge, cost reductions, going beyond traditional modelling
Machine learning - problem formulation, data, software, hardware
Business benefits - higher profits, cost reduction, improved models, effective and continuous risk monitoring, less error-prone processes
Definition of big data
Comes into play when conventional computational processing techniques cannot cope with the sheer amount of data