Big Data And Machine Learning Part 2 Flashcards
Continuous data generation
Every financial institution generates data continuously, at intervals smaller than one second
A legacy of software systems
The typical bank has undergone a series of M&As over the past 20 years
A complex, globally spanning entity structure
Every bank offers unique products, requiring different IT systems across its regions
Typical data landscape
1) Data sources (credit cards, mortgages, savings): systems capturing information that can be bundled to add value
2) ETL (extract, transform, load): the process for moving data from the sources to central storage (see the sketch after this list)
3) Data warehouse: bundles and stores data from the different sources, building a system of record that captures metadata such as timestamps
4) Data usage: performing analytics and visualisation, or generating reports for supervisory or management purposes
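A minimal ETL sketch in Python, assuming pandas and a local SQLite file as a stand-in for the warehouse; the file names, column names, and table name are hypothetical, not from the source.

```python
import sqlite3
import pandas as pd

# Extract: read hypothetical source-system exports (file names are illustrative)
sources = {"credit_cards": "cards.csv", "mortgages": "mortgages.csv"}

conn = sqlite3.connect("warehouse.db")  # stand-in for the central data warehouse
for system, path in sources.items():
    df = pd.read_csv(path)
    # Transform: normalise column names and stamp each record with its origin
    df.columns = [c.strip().lower() for c in df.columns]
    df["source_system"] = system
    # Metadata / timestamp for the system of record
    df["loaded_at"] = pd.Timestamp.now(tz="UTC").isoformat()
    # Load: append everything into one central table
    df.to_sql("transactions", conn, if_exists="append", index=False)
conn.close()
```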
Introducing a distributed storage layer which is:
Schemaless (no predefined structure)
Durable (once data is written it should not be lost)
Capable of handling component failure (without human intervention)
Automatically rebalanced (to even out disk space throughout the cluster; a toy sketch of these properties follows this list)
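A toy, in-memory illustration of two of these properties (durability through replicated writes, rebalancing by key count); everything here, including the class and parameter names, is invented for the sketch and is far simpler than a real system such as HDFS.

```python
import random

class ToyCluster:
    """Toy key-value store: schemaless values, replicated writes, naive rebalancing."""

    def __init__(self, n_nodes=4, replicas=2):
        self.nodes = [dict() for _ in range(n_nodes)]  # each node is just a dict
        self.replicas = replicas

    def put(self, key, value):
        # Durability: write every value to `replicas` distinct nodes
        for node in random.sample(self.nodes, self.replicas):
            node[key] = value

    def get(self, key):
        # Component failure: any surviving replica can serve the read
        for node in self.nodes:
            if key in node:
                return node[key]
        raise KeyError(key)

    def rebalance(self):
        # Even out storage: move keys from the fullest node to the emptiest
        self.nodes.sort(key=len)
        while len(self.nodes[-1]) - len(self.nodes[0]) > 1:
            key, value = self.nodes[-1].popitem()
            self.nodes[0][key] = value
            self.nodes.sort(key=len)
```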
Big data today
Hadoop - open-source collection of software tools for processing and computing on multiple nodes
Apache Spark - distributed cluster-computing framework (see the sketch below)
Knowledge graph - describes and stores relations
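A minimal PySpark sketch of distributed computation, assuming pyspark is installed and a hypothetical transactions.csv extract with product and amount columns.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (in production this would point at a cluster)
spark = SparkSession.builder.appName("bank-analytics").getOrCreate()

# Read a hypothetical extract; Spark infers a schema and partitions the work
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Aggregate across the cluster: total and average amount per product
df.groupBy("product").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount"),
).show()

spark.stop()
```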
Areas of concern in EBA paper
Access to information and its transformation
Cyber security risk
Market distortions caused by widespread automation
Limited data portability
Opportunities and challenges in risk management
Challenges:
Increasing model complexity and lack of explanatory insight
How to audit or understand the model in a regulatory context
Data availability and quality
Opportunities:
More granular and in-depth analytic capabilities to predict:
Default probabilities (credit risk)
Prepayment rates (lapse risk)
Money laundering and fraud detection
Algorithms
Random forest
Neural network
Random forest
Collection of N randomly generated decision trees
The goal of a decision tree is to predict the value of a target based on several input variables
1) Each tree is trained on a randomly drawn subset of the training set, a process known as bootstrap aggregating (bagging)
2) Each candidate split (node) considers its own randomly generated subset of features (see the sketch after this list)
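A short scikit-learn sketch of the above, applied to the credit-risk opportunity mentioned earlier: a random forest estimating default probabilities. The data is synthetic and the hyperparameter values are illustrative; n_estimators corresponds to the N above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a credit data set: 1 = default, 0 = no default
X, y = make_classification(n_samples=5_000, n_features=10,
                           weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N randomly generated trees; bootstrap=True gives the bagging in step 1,
# max_features the per-split random feature subset in step 2
model = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                               bootstrap=True, random_state=0)
model.fit(X_train, y_train)

# Default probabilities for the test set (column 1 = probability of class 1)
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```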
Advantages of random forest
Handles multiple data types and performs well with large data sets
Able to model non-linear relationships
Finds interactions between variables
Makes no assumptions about the distribution of the data
Neural network
Predicts, based on certain input variables, what the outcome category will be
Loosely resembles the network of neurons that make up the human brain
Consists of connected nodes, each taking in a signal and passing on a transformed signal
The network learns from past experience by modifying its internal parameters and adapting itself (see the sketch below)
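A small scikit-learn sketch of a feed-forward neural network on synthetic data; the layer sizes and iteration count are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic input variables and outcome categories
X, y = make_classification(n_samples=2_000, n_features=10, random_state=0)

# Connected nodes in two hidden layers; fit() adapts the internal
# parameters (weights) from the training examples
model = make_pipeline(
    StandardScaler(),  # neural networks train better on scaled inputs
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
model.fit(X, y)
print("Training accuracy:", model.score(X, y))
```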
Advantages and disadvantages of neural networks
Advantages:
Able to maintain high performance without a tendency to overfit
Can detect all possible interactions between predictor variables
Able to detect complex non-linear relationships and, theoretically, to model surfaces of any shape
Disadvantages:
Black box
Computationally intensive
Hyperparameter tuning is considered an art
Data quality and governance
Data quality - checks, specifications, and requirements
Data governance - managing master data consistently, cleaning data, and adhering to policies and standards
Data lineage - recording which data sources were used and tracking user interaction with any data system (see the sketch below)
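A minimal sketch of simple data-quality checks and a lineage record in pandas; the specific checks, column names, and log format are illustrative assumptions.

```python
import json
import pandas as pd

def quality_checks(df: pd.DataFrame) -> list[str]:
    """Run simple specification checks and return any violations."""
    issues = []
    if df["customer_id"].isna().any():
        issues.append("customer_id contains nulls")
    if df.duplicated().any():
        issues.append("duplicate rows found")
    if (df["amount"] < 0).any():
        issues.append("negative amounts found")
    return issues

def record_lineage(source: str, user: str, n_rows: int) -> None:
    """Append a lineage entry: which source was read, by whom, and when."""
    entry = {"source": source, "user": user, "rows": n_rows,
             "at": pd.Timestamp.now(tz="UTC").isoformat()}
    with open("lineage.log", "a") as f:
        f.write(json.dumps(entry) + "\n")

df = pd.read_csv("transactions.csv")  # hypothetical extract
record_lineage("transactions.csv", "analyst", len(df))
print(quality_checks(df))
```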