Data Lakes Flashcards
What process takes the most time in producing timely insights?
A disproportionate amount of time (70%) is spent on data preparation (a code sketch follows this list):
- acquiring
- preparing
- formatting
- normalizing
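A minimal sketch of those steps with pandas (the dataset, columns, and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw extract; everything here is made up for illustration.
raw = pd.DataFrame({
    "Cust ID": [1, 2, 3],
    "Signup Date": ["2021-01-05", "2021-02-05", "2021-03-09"],
    "Spend": ["1,200", "950", "2,300"],
})

# Formatting: normalise column names to snake_case.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Preparing: coerce date strings and numeric strings into proper types.
raw["signup_date"] = pd.to_datetime(raw["signup_date"])
raw["spend"] = raw["spend"].str.replace(",", "").astype(float)

# Normalising: scale spend into [0, 1] for downstream analysis.
raw["spend_norm"] = (raw["spend"] - raw["spend"].min()) / (
    raw["spend"].max() - raw["spend"].min())
print(raw)
```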
Define Data Lakes
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured.
Purpose of Data Lakes
Aims to solve two problems:
Avoid information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
Big Data projects require a large amount of varied information. The information is so varied that it’s not clear what it is when it is received, and forcing it into something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
Different teams need different data.
The data lake aims to provide appropriate data access to the business in a cost-effective manner that also protects and governs the data.
Benefits of Data Lakes
- Suited to extremely large volumes
- Open and flexible architecture
- Integrating existing data assets, e.g. an EDW (enterprise data warehouse)
- Future expandability
- Data democratization across the business through holistic governance
- Provides different teams with different data from the same source.
The 4 categories of systems and what data they produce
Systems of Record - Context data related to the business transactions of the organization (OLTP)
Systems of Engagement - Big data about the activity of the individual using a mobile device
Systems of Automation - Big data from sensors monitoring an asset or location (IoT, Industrial IoT)
Systems of Insight - Analytics based on historical data collected from multiple sources
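These four categories could be treated as tags on incoming feeds; a minimal Python sketch (the feed names are invented):

```python
from enum import Enum

class SystemCategory(Enum):
    RECORD = "Systems of Record"          # OLTP business transactions
    ENGAGEMENT = "Systems of Engagement"  # big data on individual activity
    AUTOMATION = "Systems of Automation"  # IoT / Industrial IoT sensor data
    INSIGHT = "Systems of Insight"        # analytics over historical data

# Illustrative tagging of incoming feeds by their source category.
feeds = {
    "orders_oltp": SystemCategory.RECORD,
    "mobile_clickstream": SystemCategory.ENGAGEMENT,
    "turbine_sensors": SystemCategory.AUTOMATION,
    "churn_model_scores": SystemCategory.INSIGHT,
}
```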
Data Lakes vs. Data Marts
Data Mart
- bottled, cleansed water
Data Lake
- more natural, free water
- lake fed with data from multiple sources
- users can dive in, take samples, examine
Data Warehouse vs. Data Lake
8 differences
1) For the most part, a Data Warehouse is set up to support the standard BI reports that form a large part of the requirements the business gives to BI projects. Data Lakes enable people to do their own ad-hoc analyses (ad hoc: “created or done for a particular purpose, as necessary”).
4) A Data Warehouse’s schema is known in advance, through design, rather than being discovered as you load the data, as in a Data Lake (see the sketch after this list). Similarly, Data Warehouses usually run on fixed servers in a data centre (although Cloud is becoming popular), whereas Data Lakes may live largely in the Cloud, with computing power called upon as needed.
7) Data Lakes hold big data and are used by data scientists.
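Difference 4 is often summarised as schema-on-write (warehouse) versus schema-on-read (lake). A minimal Python sketch of schema-on-read, where structure is imposed only at analysis time (the records are invented):

```python
import json
import pandas as pd

# Schema-on-read: the lake keeps raw records exactly as they arrived;
# structure is imposed only when an analyst asks a question.
raw_records = [
    '{"user": "a1", "event": "login", "ts": "2024-05-01T09:00:00"}',
    '{"user": "a2", "event": "purchase", "amount": 19.99}',  # extra field is fine
]
df = pd.DataFrame(json.loads(r) for r in raw_records)

# The schema emerges from the data; fields absent in a record become NaN.
print(df.dtypes)
print(df[df["event"] == "login"])
```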
Data Lake vs. Data Swamp
data swamp: lots of unknowns!
unclear data…
- …location
- …origin
- …ownership
- …purity / quality
- …presence
- …protection
- …timeliness
- …reliability (of data feeds and results)
- …classification
A ‘good’ data lake has no open questions.
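One way to read “no open questions”: treat each unknown above as a required catalog field and flag entries that leave any of them blank. A minimal sketch (the field names and the entry are assumptions, not a real catalog schema):

```python
# Each "unknown" from the list above becomes a required metadata field.
REQUIRED_FIELDS = [
    "location", "origin", "ownership", "quality", "presence",
    "protection", "timeliness", "reliability", "classification",
]

def open_questions(entry: dict) -> list[str]:
    """Return the metadata fields this entry leaves unanswered."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f)]

entry = {"location": "s3://lake/raw/orders", "origin": "orders_oltp"}  # hypothetical
missing = open_questions(entry)
if missing:
    print(f"Swamp risk, unanswered: {missing}")
```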
3 Elements of Data Lakes
- Data Lake Repositories (data stores)
- Data Lake Services (operate on the data)
- Information Management and Governance (prevent Data Swamps)
Simplified Data Lake Architecture
On the left, we see data coming in from Systems of Record, Systems of Engagement and Systems of Automation.
This data lands in various data buckets or repositories.
Less structured data can be used by Data Scientists for analysis; sometimes they will transform it as well.
Some of the data will be loaded into the Data Warehouse for use by Business Analysts.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
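A minimal illustration with pandas: a toy daily series, a moving-average trend, and a naive one-step forecast (the values are invented):

```python
import pandas as pd

# Toy daily closing values, indexed in time order at equal spacing.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
series = pd.Series([100, 102, 101, 105, 107, 106, 110, 112, 111, 115], index=idx)

# Analysis: extract a trend with a 3-day moving average.
trend = series.rolling(window=3).mean()

# Naive forecast: predict the next value from the last trend level.
next_value = trend.iloc[-1]
print(f"Forecast for {idx[-1] + pd.Timedelta(days=1):%Y-%m-%d}: {next_value:.1f}")
```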
Some of the data will be loaded into an in-memory database such as SAP HANA for use by Business Analysts to do real-time or streaming analytics.
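A sketch of that routing in Python (the rules and store names are assumptions for illustration, not a product API):

```python
# Fan incoming data out to the store that suits its intended use.
def route(record: dict) -> str:
    if record.get("structured") and record.get("historical"):
        return "data_warehouse"  # BI queries by Business Analysts
    if record.get("latency") == "real-time":
        return "in_memory_db"    # streaming analytics (e.g. SAP HANA)
    return "raw_bucket"          # Data Scientists explore and transform

print(route({"structured": True, "historical": True}))  # -> data_warehouse
print(route({"latency": "real-time"}))                  # -> in_memory_db
```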
The really important message is the emphasis on Governance, enabled by Metadata, which keeps a record of (sketched below):
- Lineage (provenance)
- Age (how old the data is)
- Privacy (which policies should be applied)
- Usage history
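A sketch of such a metadata record as a Python dataclass (the field names are assumptions, not IBM’s actual catalog schema):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class GovernanceRecord:
    asset: str
    lineage: list[str]            # provenance: chain of source systems
    created: date                 # lets us derive the data's age
    privacy_policies: list[str]   # policies to apply (e.g. PII masking)
    usage_history: list[str] = field(default_factory=list)

rec = GovernanceRecord(
    asset="orders_raw",
    lineage=["orders_oltp", "landing_zone"],
    created=date(2024, 1, 1),
    privacy_policies=["pii-mask"],
)
rec.usage_history.append("2024-06-01 read by analytics_team")
```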
Users supported by (IBM’s) Data Lake
Analytics teams – Analysts, Data Scientists, etc.
Information Curator – a person who is responsible for creating, maintaining, and correcting any errors in the description of the information store in the governance catalog.
(Other roles:
- Information Owner: a person who is accountable for the correct classification and management of the information within a system or store.
- Information Steward: a person who is responsible for correcting any errors in the actual information in the information store.)
Governance, Risk and Compliance Team – responsible for Information Governance
Line of Business Teams – the business users (business analysts, managers etc.)
Data Lake Operations – the IT Run Organisation that keeps the Data Lake running and providing service to the users
The subsystems inside (IBM’s) Data Lake
Enterprise IT Data Exchange: about getting data into the Data Lake
Self-Service Access has various viewpoints:
- A user being provided with all the capabilities they need to be able to search and find the data that they need without support.
- A data owner being able to ensure that the correct level of governance is enforced on their data.
- A system administrator being notified of impending resource allocations being exceeded on a file system.
- A fraud investigation officer being provided with the tools they need to investigate suspicious activity within the reservoir.
2 types of self-service access
- one for data scientists and similarly skilled users
- one for less sophisticated business users
Catalogue: Much of the operation of the data reservoir is centered on a catalog of the information that the data reservoir is aware of. The catalog is populated using a process that is called curation.
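A minimal sketch of curation populating a catalog and of self-service search over it (the structure is an assumption for illustration):

```python
catalog: list[dict] = []

def curate(name: str, description: str, owner: str, tags: list[str]) -> None:
    """Curation: add a described, owned entry to the catalog."""
    catalog.append({"name": name, "description": description,
                    "owner": owner, "tags": tags})

def search(term: str) -> list[dict]:
    """Self-service: find data without needing IT support."""
    term = term.lower()
    return [e for e in catalog
            if term in e["name"].lower() or term in e["description"].lower()]

curate("orders_raw", "Raw order events from the OLTP system",
       owner="sales_it", tags=["orders", "oltp"])
print(search("order"))
```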
The subsystems inside (IBM’s) Data Lake - Part II
Self-service access for data scientists allows them to quickly develop analytics, often using a “sandbox” system
Accuracy in the catalog builds trust with the users
The business users can use self service to do “what-if” experiments and analysis
New sources of data can be imported to be analyzed
The subsystems inside (IBM’s) Data Lake - Part III
View from the user community - fraud
Data Lakes are useful for fraud investigation and protection development.
The business users in the fraud team can use self service to investigate cases of fraud
They would use data in the Data Lake to detect fraud, perhaps using stream analytics (see the sketch below).
The data scientists would develop new models for fraud detection
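A toy stream-analytics rule in that spirit (the threshold rule is invented and is not a real fraud model):

```python
from collections import defaultdict

# Running per-account statistics, updated as each event streams in.
state = defaultdict(lambda: {"count": 0, "mean": 0.0})

def score(account: str, amount: float, threshold: float = 5.0) -> bool:
    """Return True if this event looks suspicious; then update the stats."""
    s = state[account]
    suspicious = s["count"] >= 3 and amount > threshold * s["mean"]
    s["count"] += 1
    s["mean"] += (amount - s["mean"]) / s["count"]  # incremental mean
    return suspicious

for amt in [20.0, 25.0, 22.0, 21.0, 480.0]:
    print(amt, score("acct-1", amt))  # only the 480.0 event is flagged
```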
The compliance team can use the data in the Lake to report their compliance to regulators. (A regulator is a public authority or government agency responsible for exercising autonomous authority over some area of human activity in a regulatory or supervisory capacity, e.g. the Information Commissioner’s Office or the Financial Conduct Authority.)
Data Lakes have catalogs (like a library)
Key things: governance, lineage, metadata
Information curation - the creation of the description of the data in the data lake’s catalog.
The information owner is responsible for the accuracy of this definition. However, for enterprise IT systems, the information owner is often a senior person who delegates to an information curator. The information curator understands what the data is and can describe it in an entry in the catalog.