Big Data And Machine Learning Part 2 Flashcards
Continuous data generation
Every financial institution generates data continuously, at intervals smaller than one second
A legacy of software systems
The typical bank has undergone a series of M&As over the past 20 years
A complex, globally spanning entity structure
Every bank offers unique products, requiring different IT systems across its regions
Typical data landscape
1) Data sources (credit cards, mortgages, savings): systems capturing information that can be bundled to add value
2) ETL (extract, transform, load): the process for moving data from the sources to central storage (see the sketch after this list)
3) Data warehouse: bundles and stores data from the different sources, building a system of record that captures metadata such as timestamps
4) Data usage: performing analytics and visualisation, or generating reports for supervisory or management purposes
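A minimal ETL sketch in Python, assuming pandas and a local SQLite file as a stand-in for the warehouse; the file names, column names, and table name are hypothetical, not from the source.

```python
import sqlite3
import pandas as pd

# Extract: read hypothetical source-system exports (file names are illustrative)
sources = {"credit_cards": "cards.csv", "mortgages": "mortgages.csv"}

conn = sqlite3.connect("warehouse.db")  # stand-in for the central data warehouse
for system, path in sources.items():
    df = pd.read_csv(path)
    # Transform: normalise column names and stamp each record with its origin
    df.columns = [c.strip().lower() for c in df.columns]
    df["source_system"] = system
    # Metadata / timestamp for the system of record
    df["loaded_at"] = pd.Timestamp.now(tz="UTC").isoformat()
    # Load: append everything into one central table
    df.to_sql("transactions", conn, if_exists="append", index=False)
conn.close()
```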
Introducing a distributed storage layer which is:
Schemaless (no predefined structure)
Durable (once data is written it should not be lost)
Capable of handling component failure (without human intervention)
Automatically rebalanced (to even out disk space throughout the cluster; a toy sketch of these properties follows this list)
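A toy, in-memory illustration of two of these properties (durability through replicated writes, rebalancing by key count); everything here, including the class and parameter names, is invented for the sketch and is far simpler than a real system such as HDFS.

```python
import random

class ToyCluster:
    """Toy key-value store: schemaless values, replicated writes, naive rebalancing."""

    def __init__(self, n_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(n_nodes)]  # each node is just a dict
        self.replicas = replicas

    def put(self, key, value):
        # Durability: write every value to `replicas` distinct nodes
        for node in random.sample(self.nodes, self.replicas):
            node[key] = value

    def get(self, key):
        # Component failure: any surviving replica can serve the read
        for node in self.nodes:
            if key in node:
                return node[key]
        raise KeyError(key)

    def rebalance(self):
        # Even out storage: move keys from the fullest node to the emptiest
        self.nodes.sort(key=len)
        while len(self.nodes[-1]) - len(self.nodes[0]) > 1:
            key, value = self.nodes[-1].popitem()
            self.nodes[0][key] = value
            self.nodes.sort(key=len)
```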
Big data today
Hadoop - open-source collection of software tools for processing and computing on multiple nodes
Apache Spark - distributed cluster-computing framework (see the sketch below)
Knowledge graph - describes and stores relations
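A minimal PySpark sketch of distributed computation, assuming pyspark is installed and a hypothetical transactions.csv extract with product and amount columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (in production this would point at a cluster)
spark = SparkSession.builder.appName("bank-analytics").getOrCreate()

# Read a hypothetical extract; Spark infers a schema and partitions the work
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Aggregate across the cluster: total and average amount per product
df.groupBy("product").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
).show()

spark.stop()
```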
Areas of concern in EBA paper
Access to information and its transformation
Cyber security risk
Market distortions caused by widespread automation
Limited data portability
Opportunities and challenges in risk management
Challenges:
Increasing model complexity and lack of explanatory insight
How to audit or understand the model in a regulatory context
Data availability and quality
Opportunities:
More granular and in-depth analytic capabilities to predict:
Default probabilities (credit risk)
Prepayment rates (lapse risk)
Money laundering and fraud detection
Algorithms
Random forest
Neural network
Random forest
Collection of N randomly generated decision trees
The goal of a decision tree is to predict the value of a target based on several input variables
1) Each tree is trained on a randomly drawn subset of the training set, a process known as bootstrap aggregating (bagging)
2) Each candidate split (node) considers its own randomly generated subset of features (see the sketch after this list)
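A short scikit-learn sketch of the above, applied to the credit-risk opportunity mentioned earlier: a random forest estimating default probabilities. The data is synthetic and the hyperparameter values are illustrative; n_estimators corresponds to the N above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a credit data set: 1 = default, 0 = no default
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N randomly generated trees; bootstrap=True gives the bagging in step 1,
# max_features the per-split random feature subset in step 2
model = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                               bootstrap=True, random_state=0)
model.fit(X_train, y_train)

# Default probabilities for the test set (column 1 = probability of class 1)
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```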
Advantages of random forest
Handles multiple data types and performs well with large data sets
Able to model non-linear relationships
Finds interactions between variables
Makes no assumptions about the distribution of the data
Neural network
Predicts, based on certain input variables, what the outcome category will be
Loosely resembles the network of neurons that make up the human brain
Consists of connected nodes, each taking in a signal and passing on a transformed signal
The network learns from past experience by modifying its internal parameters and adapting itself (see the sketch below)
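A small scikit-learn sketch of a feed-forward neural network on synthetic data; the layer sizes and iteration count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic input variables and outcome categories
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)

# Connected nodes in two hidden layers; fit() adapts the internal
# parameters (weights) from the training examples
model = make_pipeline(
    StandardScaler(),  # neural networks train better on scaled inputs
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```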
Advantages and disadvantages of neural networks
Advantages:
Able to maintain high performance without a tendency to overfit
Can detect all possible interactions between predictor variables
Able to detect complex non-linear relationships and, theoretically, to model surfaces of any shape
Disadvantages:
Black box
Computationally intensive
Hyperparameter tuning is considered an art
Data quality and governance
Data quality - checks, specifications, and requirements
Data governance - managing master data consistently, cleaning data, and adhering to policies and standards
Data lineage - recording which data sources were used and tracking user interaction with any data system (see the sketch below)
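A minimal sketch of simple data-quality checks and a lineage record in pandas; the specific checks, column names, and log format are illustrative assumptions.

```python
import json
import pandas as pd

def quality_checks(df: pd.DataFrame) -> list[str]:
    """Run simple specification checks and return any violations."""
    issues = []
    if df["customer_id"].isna().any():
        issues.append("customer_id contains nulls")
    if df.duplicated().any():
        issues.append("duplicate rows found")
    if (df["amount"] < 0).any():
        issues.append("negative amounts found")
    return issues

def record_lineage(source: str, user: str, n_rows: int) -> None:
    """Append a lineage entry: which source was read, by whom, and when."""
    entry = {"source": source, "user": user, "rows": n_rows,
             "at": pd.Timestamp.now(tz="UTC").isoformat()}
    with open("lineage.log", "a") as f:
        f.write(json.dumps(entry) + "\n")

df = pd.read_csv("transactions.csv")  # hypothetical extract
record_lineage("transactions.csv", "analyst", len(df))
print(quality_checks(df))
```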