Data Lakes Flashcards
What process takes the most time in producing timely insights?
A disproportionate amount of time (around 70%) is spent on data preparation (a minimal sketch follows this list):
- acquiring
- preparing
- formatting
- normalizing
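A rough illustration of what "preparing, formatting, normalizing" can look like in practice, assuming pandas and a made-up raw extract; the column names and raw formats are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical raw extract with inconsistent names and string-typed values.
raw = pd.DataFrame({
    "Customer ID": ["001", "002", "003"],
    "Order Date": ["2024-01-03", "2024-01-04", "2024-01-05"],
    "Amount": ["1,200.50", "80", "15.75"],
})

prepared = (
    raw
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))            # formatting: consistent names
    .assign(
        order_date=lambda df: pd.to_datetime(df["order_date"]),               # normalizing: real dates
        amount=lambda df: df["amount"].str.replace(",", "").astype(float),     # normalizing: numeric values
    )
)
print(prepared.dtypes)
```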
Define Data Lakes
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured.
Purpose of Data Lakes
Aims to solve two problems
Avoid information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.
Big Data projects require a large amount of varied information. The information is so varied that it’s not clear what it is when it is received, and constraining it in something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.
Different teams need different data.
The data lake aims to provide the appropriate data access to the business in a cost effective manner that also protects and governs data.
Benefits of Data Lakes
- Suited to extremely large volumes
- Open and flexible architecture
- Integrating existing data assets, e.g. an EDW (enterprise data warehouse)
- Future expandability
- Data democratization across the business through holistic governance
- Provides different teams with different data from the same source.
The 4 categories of systems and what data they produce
Systems of Record - Context data related to the business transactions of the organization (OLTP)
Systems of Engagement - Big data about the activity of the individual using a mobile device
Systems of Automation - Big data from sensors monitoring an asset or location (IoT, Industrial IoT)
Systems of Insight - Analytics based on historical data collected from multiple sources
Data Lakes vs. Data Marts
Data Mart
- bottled, cleansed water
Data Lake
- more natural, free water
- lake fed with data from multiple sources
- users can dive in, take samples, examine
Data Warehouse vs. Data Lake
8 differences
1) For the most part, a Data Warehouse is set up to support the standing BI reports that form a large part of the requirements the business gives to BI projects. Data Lakes enable people to do their own ad-hoc analyses (ad hoc: “created or done for a particular purpose, as necessary”).
4) The Data Warehouse schema is known in advance, through design, rather than discovered as the data is loaded. Similarly, Data Warehouses usually run on fixed servers in a data centre (although Cloud deployment is becoming popular), whereas Data Lakes may sit largely in the Cloud, with computing power called upon as needed.
7) Data Lakes hold big data and are used mainly by data scientists.

Data Lake vs. Data Swamp
data swamp: lots of unknowns!
unclear data…
- …location
- …origin
- …ownership
- …purity / quality
- …presence
- …protection
- …timeliness
- …reliability (of data feeds and results)
- …classification
A ‘good’ data lake has no open questions (a minimal completeness check is sketched below).
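Assuming a catalog entry is just a plain dict with illustrative field names mirroring the list of unknowns above, such a check could look like this:

```python
# Governance questions a catalog entry should answer; field names are illustrative.
REQUIRED_FIELDS = [
    "location", "origin", "ownership", "quality", "presence",
    "protection", "timeliness", "reliability", "classification",
]

def open_questions(entry: dict) -> list[str]:
    """Return the governance questions still unanswered for a catalog entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]

entry = {"location": "hdfs://lake/sales/2024", "origin": "CRM extract", "ownership": "Sales IT"}
print(open_questions(entry))   # a 'good' data lake would print []
```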
3 Elements of Data Lakes
- Data Lake Repositories (data stores)
- Data Lake Services (operate on the data)
- Information Management and Governance (prevent Data Swamps)

Simplified Data Lake Architecture
On the left, we see data coming in from Systems of Record, Systems of Engagement and Systems of Automation.
Various data buckets or repositories.
Less structured data can be used by Data Scientists for analysis; sometimes they will transform it as well.
Some of the data will be loaded into the Data Warehouse for use by Business Analysts.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
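A minimal sketch of both ideas, assuming pandas and a made-up daily series: a rolling mean as a simple descriptive characteristic, and a naive forecast of the next value.

```python
import pandas as pd

# Toy daily closing values; the numbers are invented purely for illustration.
closing = pd.Series(
    [100.0, 101.5, 99.8, 102.3, 103.1, 102.7, 104.0],
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Descriptive characteristics: a rolling mean smooths noise, diff shows day-on-day change.
print(closing.rolling(window=3).mean().tail(3))
print(closing.diff().tail(3))

# Naive forecast: predict the next day as the mean of the last three observations.
forecast = closing.tail(3).mean()
print(f"naive next-day forecast: {forecast:.2f}")
```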
Some of the data will be loaded into an in-memory database such as SAP HANA for use by Business Analysts to do real-time or streaming analytics
The really important message is the emphasis on Governance, enabled by Metadata, which keeps a record of (a sketch of such a catalog record follows this list):
- Lineage (Provenance),
- Age (how old the data is),
- Privacy (what policies should be applied)
- Usage history
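A minimal sketch of such a metadata record, assuming a simple Python dataclass; the field names mirror the list above and are not taken from any real product schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CatalogEntry:
    name: str
    lineage: list[str]            # provenance: upstream sources this data was derived from
    created: date                 # used to derive the age of the data
    privacy: str                  # e.g. "Confidential" -> drives which policies apply
    usage_history: list[str] = field(default_factory=list)

    def record_usage(self, who: str, purpose: str) -> None:
        """Append an entry to the usage history each time the data is accessed."""
        self.usage_history.append(f"{who}: {purpose}")

entry = CatalogEntry("customer_orders", ["CRM", "web shop"], date(2024, 1, 1), "Confidential")
entry.record_usage("fraud team", "model training")
```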

Users supported by (IBM’s) Data Lake
Analytics teams – Analysts, Data Scientists, etc.
Information Curator – a person who is responsible for creating, maintaining, and correcting any errors in the description of the information store in the governance catalog.
Other related roles:
- Information Owner: a person who is accountable for the correct classification and management of the information within a system or store.
- Information Steward: a person who is responsible for correcting any errors in the actual information in the information store.
Governance, Risk and Compliance Team – responsible for Information Governance
Line of Business Teams – the business users (business analysts, managers etc.)
Data Lake Operations – the IT Run Organisation that keeps the Data Lake running and providing service to the users

The subsystems inside (IBM’s) Data Lake
Enterprise IT Data Exchange: about getting data into the Data Lake
Self-Service Access has various viewpoints:
- A user being provided with all the capabilities they need to be able to search and find the data that they need without support.
- A data owner being able to ensure that the correct level of governance is enforced on their data.
- A system administrator being notified of impending resource allocations being exceeded on a file system.
- A fraud investigation officer being provided with the tools they need to investigate suspicious activity within the reservoir.
2 types of self-service access
- one for data scientists and similarly skilled users
- and one for less sophisticated business users
Catalogue: Much of the operation of the data reservoir is centered on a catalog of the information that the data reservoir is aware of. The catalog is populated using a process that is called curation.

The subsystems inside (IBM’s) Data Lake - Part II
Self-service access for data scientists allows them to quickly develop analytics, often using a “sandbox” system
Accuracy in the catalog builds trust with the users
The business users can use self service to do “what-if” experiments and analysis
New sources of data can be imported to be analyzed

The subsystems inside (IBM’s) Data Lake - Part III
View from the user community - fraud
Data Lakes are useful for fraud investigation and protection development.
The business users in the fraud team can use self service to investigate cases of fraud
They would use data in the Data Lake to detect fraud (perhaps using stream analytics)
The data scientists would develop new models for fraud detection
The compliance team can use the data in the Lake to report their compliance to their regulators. A regulator is a public authority or government agency responsible for exercising autonomous authority over some area of human activity in a regulatory or supervisory capacity, e.g. the Information Commissioner’s Office or the Financial Conduct Authority.

Data Lakes have catalogs (like a library)
Key things: governance, lineage, metadata
Information curation - the creation of the description of the data in the data lake’s catalog.
The information owner is responsible for the accuracy of this definition. However, for enterprise IT systems, the information owner is often a senior person who delegates to an information curator. The information curator understands what that data is and can describe it in an entry in the catalog.

Data Lineage
The Lifecycle of Data
Achieved through metadata management.
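One way to picture lineage captured as metadata: each dataset records its direct upstream sources, and provenance is computed by walking that mapping. A minimal sketch, with made-up dataset names:

```python
# Each dataset lists the datasets it was directly derived from.
upstream = {
    "fraud_model_features": ["cleansed_transactions", "customer_profiles"],
    "cleansed_transactions": ["raw_transactions"],
    "customer_profiles": ["crm_extract"],
}

def provenance(dataset: str) -> set[str]:
    """All sources, direct and indirect, that a dataset was derived from."""
    sources = set()
    for parent in upstream.get(dataset, []):
        sources.add(parent)
        sources |= provenance(parent)
    return sources

print(provenance("fraud_model_features"))
# {'cleansed_transactions', 'raw_transactions', 'customer_profiles', 'crm_extract'}
```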

Data Governance
Governance ensures proper management and use of information.
Diagram shows
- the questions that governance should answer
- the components of information governance

10 Considerations for a well-managed and governed data lake

How information governance provides the mechanism for building trust
The Information broker is the runtime server environment for running the integration processes that move data into and out of the data reservoir and between components within the reservoir. It typically includes an extract, transform, and load (ETL) engine for moving around data.
The Code hub is used primarily to facilitate transcoding of data coming into the reservoir and data feeds flowing out. Additionally, to support analytics, the reference data can map the canonical forms to strings to make things easier for analytics users and their activities.
Staging areas are used to manage movement of data into, out of, and around the data reservoir, and to provide appropriate decoupling between systems. The implementation can include database tables, directories within Hadoop, message queues, or similar structures.
The operational governance hub provides dashboards and reports for reviewing and managing the operation of the data reservoir. Typically it is used by the following groups:
– Information owners and data stewards wanting to understand the data quality issues in the data they are responsible for that have been discovered by the data reservoir.
– Security officers interested in the types and levels of security and data protection issues that have been raised.
– The data reservoir operations team wanting to understand the overall usage and performance of the data reservoir.
Monitor - Like any piece of infrastructure, it is important to understand how the data reservoir is performing. Are there hotspots? Are you getting more usage than you expected? How are you managing your storage? The data reservoir has many monitor components deployed that record the activity in the data reservoir along with its availability, functionality, and performance. The management of any alerts that the monitors raise can be resolved using workflow.
Workflow - Successful use of a data reservoir depends on various processes involving systems, users, and administrators. For example, provisioning new data into the data reservoir might involve an information curator defining the catalog entry to describe and classify the data. An information owner must approve the classifications, and an integration developer must create the data ingestion process. Workflow coordinates the work of these people.
Guards are controls within the reservoir to enforce restrictions on access to data and related protection mechanisms. These guards can include ensuring the requesting user is authorized, data masking being applied, or certain rows of data being filtered out.
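A minimal sketch of a guard combining the three controls just named (authorisation check, data masking, row filtering); the roles, field names and filtering rule are illustrative assumptions.

```python
AUTHORISED_ROLES = {"fraud_investigator", "business_analyst"}

def guard(user_role: str, rows: list[dict]) -> list[dict]:
    if user_role not in AUTHORISED_ROLES:
        raise PermissionError("user is not authorised for this repository")
    visible = []
    for row in rows:
        if row.get("region") != "restricted":                                 # row-level filtering
            masked = dict(row)
            masked["card_number"] = "****" + masked["card_number"][-4:]       # data masking
            visible.append(masked)
    return visible

rows = [
    {"card_number": "4111111111111111", "region": "uk"},
    {"card_number": "4222222222222222", "region": "restricted"},
]
print(guard("fraud_investigator", rows))
```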

Organisations expect information governance to deliver…
- Understanding of the information they have
- Confidence to share and reuse information
- Protection from unauthorised use of information
- Monitoring of activity around the information
- Implementation of key business processes that manage information
- Tracking the provenance of information
- Management of the growth and distribution of their information.
Establishing Information Governance Policies
These are called the governance principles since they underpin all other information governance decisions.

Governance Rules
Defined for each classification for each situation
Policies translate to rules
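A minimal sketch of “policies translate to rules”: one rule per (classification, situation) pair, defaulting to the most restrictive behaviour when nothing is defined. The classifications and actions shown are assumptions.

```python
rules = {
    ("Highly Confidential", "export"): "deny",
    ("Highly Confidential", "internal analytics"): "mask personal fields",
    ("Confidential", "export"): "require information owner approval",
    ("Public", "export"): "allow",
}

def rule_for(classification: str, situation: str) -> str:
    # Default to the most restrictive behaviour when no rule is defined.
    return rules.get((classification, situation), "deny")

print(rule_for("Confidential", "export"))
```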

Data Classifications
Classification is at the heart of information governance.
- It characterizes the type, value, and cost of information, or the mechanisms that manage it.
- The design of the classification schemes is key to controlling the cost and effectiveness of the information governance program.
Key Requirements
- Reporting mechanisms
- Consumption tracking
- Security
- Privacy
- Chargeback models (used to apportion the costs of running the data lake amongst its users)
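As a sketch of a chargeback model, the lake’s running cost could be apportioned in proportion to each team’s tracked consumption; the figures below are made up.

```python
def chargeback(total_cost: float, usage_by_team: dict[str, float]) -> dict[str, float]:
    """Split the total running cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    return {team: total_cost * usage / total_usage for team, usage in usage_by_team.items()}

print(chargeback(10_000.0, {"fraud": 400.0, "marketing": 350.0, "finance": 250.0}))
# {'fraud': 4000.0, 'marketing': 3500.0, 'finance': 2500.0}
```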
Business Classifications
- Characterize information from a business perspective. This captures its value, how it is used, and the impact on the business if it is misused.
Role Classifications
- Characterize the relationship that an individual has to a particular kind of data.
Resource Classifications
- Characterize the capability of the IT infrastructure that supports the management of information. A resource’s capability is partly due to its innate functions and partly controlled by the way it has been configured.
Activity Classifications
- Help to characterize procedures, actions and automated processes.
Semantic Mapping
- Identifies the meaning of an information element. The classification scheme is a glossary of concepts from relevant subject areas (industry specific, shipped with industry models).
- The semantic classifications are defined at two levels:
- Subject area mapping
- Business term mapping
Data Privacy Classifications
- Critical, Restricted, Highly Confidential, Confidential, Public
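These levels can be treated as an ordered scale so handling rules can be compared against a threshold. The sketch below assumes that ordering (Critical most sensitive) and an illustrative masking rule.

```python
from enum import IntEnum

class Privacy(IntEnum):
    PUBLIC = 0
    CONFIDENTIAL = 1
    HIGHLY_CONFIDENTIAL = 2
    RESTRICTED = 3
    CRITICAL = 4

def needs_masking(level: Privacy) -> bool:
    # Illustrative rule: mask anything Highly Confidential or above.
    return level >= Privacy.HIGHLY_CONFIDENTIAL

print(needs_masking(Privacy.CONFIDENTIAL))   # False
```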

Information Governance in the broader context:
Obligations and Delegations
Export controls/restrictions …
Governance should help you report your compliance with laws and regulations.

3 interlocking lifecycles of information governance

Information Governance Personas
The information governance personas describe people whose full-time role is the governance and protection of information.
They oversee the programs that:
- set policies
- monitor, measure, and feed back on compliance with the policies
Chief Data Officer (CDO) – manages overall information governance
Security Officer - manages protection of personal data and research IP
Auditor – ensures that governance rules and policies are followed; will interface with external auditors

Setting Up a Governance Programme
Information governance should start small, prioritising the most important information, and then expand as it demonstrates its worth.
It must evolve with the business, be responsive and accountable, while seeking to communicate and educate people in the appropriate management of information.
Most important, it needs senior stakeholders and visible consequences for those who ignore the requirements.
Secure access to the data lake’s data
The data lake’s security is assured by this combination of business processes and technical mechanisms.

Two layer defence – Inner and Outer Ring
The Data Lake architecture has 2 layers of protection:
Outer Ring: where the identity of a person or caller is established - represented by the data lake’s services.
Inner Ring: The inner boundary encircles the data lake repositories. The repositories inside this boundary have very restricted security access so only the approved analytics, processes and services sitting inside the data lake can access the data lake repositories.
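A minimal sketch of the two rings, assuming illustrative service and user names: the outer ring establishes the caller’s identity, and the repositories behind the inner ring accept calls only from approved services.

```python
APPROVED_SERVICES = {"catalog_search", "reporting_service"}
KNOWN_USERS = {"alice", "bob"}

def outer_ring(user: str, service: str, query: str) -> str:
    if user not in KNOWN_USERS:                       # identity established at the outer ring
        raise PermissionError("unknown caller")
    return inner_ring(service, query)

def inner_ring(calling_service: str, query: str) -> str:
    if calling_service not in APPROVED_SERVICES:      # repositories trust only approved services
        raise PermissionError("direct repository access is not allowed")
    return f"results for: {query}"

print(outer_ring("alice", "catalog_search", "customer orders 2024"))
```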

Building a data lake
The data lake needs governance and change management to ensure that information is protected and managed efficiently.
The first step in creating the lake is to establish the information integration and governance components, the staging areas for integration, the catalog, and the common data standards.
Building the lake then proceeds iteratively based on the following processes:
- Governance of a data lake subject area.
- Managing an information source.
- Managing an information view.
- Enabling analytics.
- Maintaining the data lake infrastructure.
Roles within the Data Lake

Ethics for Big Data and Analytics
Context – for what purpose was the data originally surrendered? For what purpose is the data now being used? How far removed from the original context is its new use?
Consent & Choice – What are the choices given to an affected party? Do they know they are making a choice? Do they really understand what they are agreeing to? Do they really have an opportunity to decline? What alternatives are offered?
Reasonable – is the depth and breadth of the data used and the relationships derived reasonable for the application it is used for?
Substantiated – Are the sources of data used appropriate, authoritative, complete and timely for the application?
Owned – Who owns the resulting insight? What are their responsibilities towards it in terms of its protection and the obligation to act?
Fair – How equitable are the results of the application to all parties? Is everyone properly compensated?
Considered – What are the consequences of the data collection and analysis?
Access – What access to data is given to the data subject?
Accountable – How are mistakes and unintended consequences detected and repaired? Can the interested parties check the results that affect them?