Big Data And Machine Learning Part 1 Flashcards

1
Q

5 Vs of big data

A

Volume - quantity of data
Velocity - speed at which new data is generated
Variety - different types and formats of data
Value - usefulness of the data
Veracity - accuracy of the data

2
Q

Types of machine learning problems

A

Classification

Regression

Clustering

3
Q

Classification

A
Predict which class an observation belongs to
E.g. whether an email is spam or not

Labelled data

Fraud detection

Predict the probability that a borrower defaults on a loan

Straight line dividing different types of dots

Measured by
ROC curve and area under the curve (AUC)
Confusion matrix
Variable importance
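
A minimal sketch of two of the measures above, assuming a model that outputs a score per email. The labels, scores and the 0.5 threshold are hypothetical, chosen only for illustration.

```python
# Hypothetical data: 1 = spam, 0 = not spam, plus a model score per email.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]


def confusion_matrix(y_true, scores, threshold=0.5):
    """Count true/false positives and negatives at a score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(confusion_matrix(y_true, scores))  # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(auc(y_true, scores))               # 0.9375
```

An AUC of 1.0 means every spam email is scored above every non-spam email; 0.5 is no better than random ranking.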

4
Q

Regression

A

Predict numerical value of an observation based on inputs

E.g. predict a house price

Labelled data (known numerical outcomes)

Line or curve fitted through the data

Measured by
RMSE (root mean squared error)

R squared

Variable importance
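
A minimal sketch of RMSE and R squared, assuming hypothetical house prices (in thousands) and hypothetical model predictions.

```python
import math

# Hypothetical actual and predicted house prices, in thousands.
actual = [200, 250, 300, 350]
predicted = [210, 240, 310, 340]


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the model."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


print(rmse(actual, predicted))       # 10.0
print(r_squared(actual, predicted))  # approximately 0.968
```

RMSE is in the same units as the outcome (here, thousands), while R squared is unitless.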

5
Q

Clustering

A

Group similar observations together

Customer segmentation

Group observations into circles
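
A minimal sketch of grouping similar observations, assuming k-means as the clustering algorithm (the card does not name one). The customer points and the naive "first k points" initialisation are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    """Assign each point to its nearest centroid, then move the centroids.

    Stops when an iteration no longer changes any assignment.
    """
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no improvement: assignments are stable
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids


# Hypothetical customers: (visits per month, average basket size).
customers = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
assignment, centroids = kmeans(customers, k=2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

No labels are supplied; the two customer segments emerge from the distances between observations alone.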

6
Q

Types of learning styles

A

Supervised

Unsupervised

Reinforcement

7
Q

Supervised learning

A

Learns how to map inputs to an outcome based on examples of inputs with known outcomes

1) Labelled data (set of input variables, an outcome variable, a large number of observations)
2) Data preparation (explore data, missing value treatment, outlier treatment, transformations and binning)
3) Train/test - split data into training, validation and test sets; cross-validation
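
Step 3 can be sketched as follows. This is a minimal illustration, assuming a simple shuffled hold-out split and k-fold cross-validation; the function names, the 20% test fraction and the seed are hypothetical choices.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k):
    """Yield (train_idx, validation_idx); each row is validated on exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, folds[i]


rows = list(range(10))  # stand-in for 10 labelled observations
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2

for train_idx, val_idx in k_fold_indices(len(train), k=4):
    print(val_idx)  # the rows of the training set used for validation in this fold
```

The test set is put aside once and never used for fitting; cross-validation then rotates the validation role across the remaining training rows.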

8
Q

Unsupervised learning

A

Learn about structure of data set

Computer learns to find patterns by itself

Finds structure in data
Stops when further iterations no longer improve the model

Unlabelled data

Turns unlabelled data into structured data

No performance measure against known outcomes

1) Unlabelled data
2) Data preparation
3) Run algorithm

9
Q

Reinforcement

A

Learning from interaction with an environment

Actions are rewarded positively or negatively

Algorithm maximises performance by pursuing maximum reward

An agent takes actions in an environment; each action is interpreted into a reward and a representation of the state, which are fed back to the agent
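
The agent-environment loop above can be sketched as follows. This is a minimal illustration, assuming tabular Q-learning (the card does not name an algorithm) and a hypothetical five-cell corridor in which reaching the last cell is rewarded and every other step is mildly penalised.

```python
import random

N_STATES = 5        # hypothetical corridor: cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right


def step(state, action):
    """Environment: turn the agent's action into (next_state, reward)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else -0.01
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Agent: learn action values from the rewards fed back by the environment."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            if rng.random() < epsilon:
                a = rng.randrange(2)                       # explore
            else:
                a = 0 if q[state][0] > q[state][1] else 1  # exploit
            next_state, reward = step(state, ACTIONS[a])
            done = next_state == N_STATES - 1
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q = train()
policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)  # ['right', 'right', 'right', 'right']
```

The learned policy moves right in every cell because that path collects the maximum total reward.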

10
Q

Types of data

A

Structured data - data in a given format with predefined types (e.g. an Excel spreadsheet with columns)

Semi-structured data - data that combines the other two forms; it sits outside conventional data storage but can be given structure without classifying all of the necessary types (e.g. XML files)

Unstructured data - data that comes in different types (e.g. audio, text); web searches such as Google typically return unstructured data

11
Q

Strategy of financial institutions (FIs), application of machine learning and business benefits

A

Strategy - faster innovation, data-driven insights, competitive edge, cost reductions, going beyond traditional modelling

Machine learning - problem formulation, data , software, hardware

Business benefits - higher profits, cost reduction, improved models, effective and continuous risk monitoring, less error-prone processes

12
Q

Definition of big data

A

Big data comes into play when conventional computational processing techniques cannot cope with the sheer amount of data
