LO1 LO2 Flashcards

1
Q

What is an Analytics Platform?

A

Tool to extract insight / value from data.

Integrated data platform, centralized data warehouse, ML / AI, data management platform.

2
Q

What does Cloud Analytics Platform do?

A
  • Enable access to different data sources
  • Access to comprehensive hardware resources
  • Make models deployable with seamless integration (real-time, batch, streaming), accessible via APIs
  • Merge outcomes of different disciplines toward one goal: insights.
3
Q

What does Analytics Platform do for IT?

A
  • Easy deployment
  • Governance
  • Self Service
  • Security and Compliance
  • Reliability and Performance
  • Cost Effective
  • Automation
  • Easy Migration
4
Q

Benefits of Analytics in Cloud

A
  • Less time spent on preparation → get insights faster
  • Confident decision making
  • Pay as you go - cheaper.
  • Less effort for insight generation
  • Data Combination - more data, combination of internal and external data.
  • Scalability - easy to add new computing resources, pay for what you need, often cheaper and faster than on-premise computing.
  • Security
  • Efficient model development - AutoML / AutoAI, many algorithms available; some tools evaluate multiple algorithms automatically.
  • Ability to code in any language. Easy to change architecture - less process overhead, faster than on premise.
5
Q

Roles in Analytics Deployment

A
  1. Business Sponsors - decision makers, hold finances, example: Heads of Analytics, LOB.
  2. IT Decision Makers - work on software and hardware infrastructure, focus on data sources, integration, discovery and sharing.
  3. Data Scientists - writing code, manipulating data, building models, looking for correlations. Subject matter experts in the analytics field. Their work is often affected by bad data, siloed data, and long lead times for getting data.
6
Q

Data Ecosystem PPT (People, Process, Technology)

A

People - DevOps culture, sharing, aligned incentives.

Process - automation, agility, continuous deployments.

Technology - API and Microservices, Code Pipelines.

A balance of people, processes, and technologies drives organizational change.
You need to balance the three components and maintain good relationships between them to maximize efficiency.

7
Q

Skills Across Analytics (TBNC)

A

Traditional - integration, storytelling, statistics, reporting.

BigData - structured, unstructured data, data restructuring, experimentation.

New Data Economy - ML, Cloud Analytics, change management.

Cognitive - NLP, AI, NN

8
Q

Cloud Computing Pros and Cons

A

| Pros | Cons |
| --- | --- |
| No upfront costs - pay as you go | Reliance on internet |
| Easy software updates | Reliance on vendor / service provider |
| Reduces utility costs | Data transfer - not easy |
| Easy to learn | Bandwidth - billing can also be complex |
| Ease of access | Affected if many people use it |
| Centralization of data | Security can still be an issue |
| Data recovery | Non-negotiable agreements |
| Sharing | Costs can grow with time |
| Security | Lack of full support |
| Free storage | Minimal flexibility |

9
Q

What is cloud computing?

A

Cloud Computing - a class of network-based computing that takes place over the Internet. It hides the complexity and details of the infrastructure from users and applications, and provides a simple graphical interface or API.

10
Q

Characteristics of Cloud Computing (RACS)

A
  1. Remote: Services or data are hosted remotely.
  2. Available anywhere: Services or data are available from anywhere.
  3. Commodified: pay only for what you use.
  4. Self-managing: services are provisioned automatically, without interaction with the provider.
11
Q

Cloud Computing Workload Patterns

A
  1. On and Off - batch job, over provisioned capacity is wasted.
  2. Growing Fast - keeping up with growth is difficult; deployment is complex and time-consuming.
  3. Unpredictable Bursting - unexpected jumps in demand impact performance, but you cannot overprovision for extremes.
  4. Predictable Bursting - seasonal trends, periodically increasing demand; provisioning is complex and capacity is wasted between peaks.
12
Q

Cloud Computing – Service Models (2)

A

Application Focused (SAD)

  • Services - business services, PayPal, Google Maps, Alexa etc.
  • Application - Google Apps, Microsoft Online (eliminates need for local installation)
  • Development - software used to build custom cloud apps (SalesForce)

Infrastructure Focused (PSH)

  • Platform - cloud based platforms, Amazon EC2.
  • Storage - data storage, iDisk etc.
  • Hosting - physical data centers, like IBM.
13
Q

Three Service Models (SPI)

A
  1. SaaS - application accessed online, managed not by your company but by the software provider. This relieves the organization from the constant pressure of software maintenance, infrastructure management, network security, data availability, etc. The consumer does not manage the OS, storage, or applications, but has control over the functionalities of the service. Examples: Google Apps, Zoho, HubSpot, SalesForce, Google Docs.
  2. PaaS - build and deliver custom applications on the cloud without the need to install and work with IDEs. The consumer does not manage or control the underlying cloud infrastructure, OS, or storage, but has control over deployed applications. Ex: Azure, Hadoop.
  3. IaaS - rent processing, storage, computing resources, and virtual private servers on demand and over the web. The consumer does not manage the cloud infrastructure, but can manage the OS, storage, deployed applications, databases, security, etc. Example: Amazon EC2
14
Q

IaaS Enabled / Service

A

Enabled with: Virtualization

Service: Resource Management Interface & System Monitoring Interface

15
Q

What is Virtualization?

A

Abstraction of logical resources away from physical resources.

Multiple OS share same physical hardware and provide different services.

Benefits: security, convenience, availability.

16
Q

Virtual Workspaces

A

Abstraction of an execution environment that can be made available to authorized clients - a cloud-based desktop. Ensures provision of virtual machines that users can connect to over VPN, with the applications and access they need; these machines can be properly secured. Runs on virtual machines.

17
Q

Benefits of Virtual Machines

A
  • Run systems that the physical hardware could not otherwise handle.
  • Easier to create new machines.
  • Test software on clean installs of OS.
  • More machines than are physically available.
  • Easy migration.
18
Q

Properties of Virtual Machines (MIARSE)

A
  • Manageability and Interoperability
  • Availability and Reliability
  • Scalability and Elasticity
19
Q

Two Elements of IaaS

A
  • Resource Management Interface (MSN)
  • System Monitoring Interface
20
Q

Resource Management Interface (MSN) 3 Elements

A
  • Virtual Machine - provides basic virtual machine operations, e.g. creation, suspension, termination.
  • Virtual Storage - basic virtual storage operations, e.g. space allocation, space release, data writing and reading.
  • Virtual Network - basic virtual network operations: IP allocation, domain name registration, connection, bandwidth.
21
Q

System Monitoring Interface Elements (MSN)

A
  • Virtual Machine - monitor CPU usage, memory, network loading
  • Virtual Storage - space usage, data duplication, bandwidth.
  • Virtual Network - network bandwidth, load balancing.
22
Q

PaaS Enabled and Provide Service

A

Enabled through: Runtime Environment Design - collection of software services available.

Provide Service: Programming IDE

23
Q

What services does PaaS provide?

What should it offer?

A

Programming APIs & Development tools

Users can use the Programming IDE to develop their service; it should provide the full functionality of a development environment.

Should offer a debugger, testing environment, profiler…

Offers computation, storage, communication resource operation…

24
Q

SaaS Enabling and Providing

A

Enabling technique: Web Service

Provide Services: Web Based Applications & Web Portals

25
Q

Web Based Applications Types in SaaS

A
  • General - general purpose, multimedia, instant message etc.
  • Business - CRM, market trading system, etc.
  • Scientific - simulation software.
  • Government - national medical system, public transportation system etc.
26
Q

What are Web Portals?

A

Beyond standard search engine features, portals add things like news, email, entertainment, etc. All services share the same look and feel and a single access control; otherwise each would be a separate service. Ex: Google, Yahoo.

27
Q

List 4 Cloud Deployment Models (PPCH)

A
  1. Public Cloud
    Available to general public, large group, owned by organization selling the cloud service. Openly accessible. Ex: Gmail, Google Drive
    Homogeneous, common policies, shared resources, leased infrastructure.
  2. Private Cloud
    Operated for a single organization, managed by them or third party, on premise or off premise. Limited access to just the company. Ex: HP Data Centers, Ubuntu Cloud
    Heterogeneous, customized and unique, dedicated resources, in house infrastructure.
  3. Community Cloud
    Shared by several organizations, supports community of shared needs (mission, security requirements etc). Ex:
  4. Hybrid Cloud
    Two or more clouds (private, community, or public) that are independent but grouped together, sharing information between them. Benefit: store very sensitive information on the private cloud while meeting legal requirements.
28
Q

Oracle Autonomous Database

A

An autonomous database uses machine learning to automate database tuning, security, backups, updates, and other routine management tasks traditionally performed by DBAs. Unlike a conventional database, an autonomous database performs all these tasks and more without human intervention.

  • Complete infrastructure automation
  • Complete Database Automation
29
Q

Two Elements of Oracle Autonomous DB

A
  • Data Warehouse
  • Transaction Processing

30
Q

Three Features of OADB

A
  • Self-driving
    All database management, monitoring is automated.
  • Self-securing
    Built-in capabilities protect against both external attacks and malicious internal users.
  • Self-repairing
    This can prevent downtime, including unplanned maintenance.
31
Q

Autonomous Data Warehouse

A

A fully automated DWH, preconfigured with columnar format, partitioning, and large joins to simplify and accelerate database provisioning, extracting, loading, and transforming data; running sophisticated reports; generating predictions; and creating machine learning models.

Data scientists, business analysts, and non-experts can rapidly, easily, and cost-effectively discover business insights using data of any size and type.

  • eliminates nearly all manual administrative tasks.
  • automates common tasks like backup, configuration, and patching.
  • comprehensive data and privacy protection
32
Q

Three Tiers of Cloud Analytics

A
  1. Raw - Ingest the data as it comes. Data Engineers. → Shared data lake
    All data should land from source in raw format. Can come from IOT, streaming, images, comments, or anything. Keys do not matter here, data should not be harmonized - take anything you might use one day. Projects will fail if they take too long to organize data here without knowing what it is for.
  2. Optimized - Data Scientist, Analyst. Enriched, compressed, optimized for specific use case. Every analytical use case should have its own flow of data, even if this means data repeats. This way we have different ways of looking at data from multiple perspectives based on business purpose. Data is transformed from raw format and optimized, aggregated from raw to specific business question.
  3. Cache - optimized for access pattern. DevOps Engineers. Cache some of the data into a dedicated data store, like a database. Data here is much more expensive and is used for actual use cases.
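The three tiers can be sketched in plain Python (the records, business question, and key format are invented for illustration, not part of any specific platform):

```python
# Tier 1 (Raw): events land exactly as they arrive, from any source,
# with no harmonization
raw = [
    {"source": "web", "user": "a", "amount": 10},
    {"source": "iot", "user": "a", "amount": 5},
    {"source": "web", "user": "b", "amount": 7},
]

# Tier 2 (Optimized): transform raw data for one specific business
# question -- total spend per user -- even if other use cases re-read
# the same raw data with their own flow
optimized = {}
for event in raw:
    optimized[event["user"]] = optimized.get(event["user"], 0) + event["amount"]

# Tier 3 (Cache): shape the result for the access pattern, e.g. keyed
# lookups in a dedicated store (a plain dict stands in for a database)
cache = {f"spend:{user}": total for user, total in optimized.items()}
```

Each tier is derived from the one before it, so a new business question only needs a new Tier 2/3 flow over the same raw landing zone.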
33
Q

Flywheel of AI (4 Stages)

A
  1. Feed business use cases that can benefit from data and ML.
  2. Adding more data sources for solving above cases.
  3. Improving data analysis and data science for analytics and ML.
  4. Improving production level and usability.
34
Q

4 Roles in a Good Team of Cloud Analytics

A
  1. Product Managers - business owner focuses on problem solving.
  2. Data Engineers - what data can we use. Building data pipelines, data lakes.
  3. Data Scientists - how we can make good models from it. Sizing
  4. DevOps - practices, tools, and a cultural philosophy that automate and integrate the processes between software development and IT teams. DevOps engineers should focus on building and optimizing the deployment pipelines.
35
Q

AWS Reference Architecture

A

AWS Reference Architecture

Tier 1 - Raw

  1. Replication of transactional databases with Data Migration Service. Can connect to various RDBMS and replicate the data to S3.
  2. Or upload files using SFTP to S3. There are many ways to design the shared data lake tier.

Tier 2 - Optimized for business question

Many environments that allow teams to experiment with data and analytics, each team can pull any data they need for their use. Vital to have good governance for access and security.

36
Q

What is Glue?

A

Managed ETL Service on AWS

37
Q

What is SageMaker

A

ML service. Data scientists can build ML models, then deploy them into a production-ready environment, integrated Jupyter instance, common ML algorithms.

38
Q

What is AWS Lambda

A

Runs code in response to events and automatically manages the computing resources required by that code. Runs code only when needed, scales automatically; pay only for what you consume.
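A minimal sketch of a Lambda-style handler in Python (the event shape and the returned message are invented for illustration; AWS invokes whichever function you register as the handler, passing the triggering event and a runtime context object):

```python
import json

def lambda_handler(event, context):
    # 'event' carries the trigger payload (here, an assumed {"name": ...});
    # 'context' carries runtime metadata and is unused in this sketch
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello {name}"}),
    }
```

There is no server to manage: you only supply the function, and the platform provisions compute per invocation.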

39
Q

What is Kinesis

A

Real-time service for stream processing.

40
Q

What is Aurora

A

Relational database service on AWS.

41
Q

What is QuickSight

A

ML powered BI tool, lets you create interactive BI dashboards that include Machine Learning-powered insights.

42
Q

Azure Reference Architecture 3 Apps

A

Azure Data Factory
Azure Databricks
Azure Synapse Analytics

43
Q

What is Azure Data Factory?

A

Managed, serverless data integration solution for ingesting, preparing, and transforming data at scale. Visually integrate data sources and connectors, and build ETL and ELT processes code-free.

44
Q

What is Azure Synapse Analytics

A

Analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Lets you ingest, explore, prepare, manage, query, and serve data for immediate BI and machine learning needs.

45
Q

What is Azure Databricks

A

Cloud-based data engineering tool used for processing and transforming massive quantities of data and exploring the data through machine learning models.

46
Q

3 Tiers of DWH Traditional Structure

A
  • Bottom Tier - database server; extracts data from many sources.
  • Middle Tier - OLAP server; transforms the data structure for analysis and complex queries.
  • Top Tier - client layer; tools used for high-level data analysis, querying, reporting, and data mining.
47
Q

DWH Traditional Models - 3 Types of DWH

A
  • Virtual Data Warehouse - separate databases that can be queried together, as if on one DWH.
  • Data mart - serves a specific line of business for reporting and analysis, aggregated from sources relevant to that area.
  • Enterprise Data Warehouse - aggregated data from entire organization, data integrated from all business units - heart of information.
48
Q

What is Massively Parallel Processing (MPP)?

A

Divides a single computing operation to execute simultaneously across multiple processors.
This division of labor facilitates faster storage and analysis.
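As an illustrative sketch (not tied to any specific MPP product), Python's standard multiprocessing module can mimic the idea: one logical operation is split into chunks that run in separate worker processes, and the partial results are then combined:

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process handles one slice of the data independently
    return sum(chunk)

def mpp_sum(data, workers=4):
    # Split the single logical operation (a sum) into one chunk per worker,
    # run the chunks in parallel, then combine the partial results
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

Real MPP warehouses apply the same split-execute-combine pattern across nodes, not just local processes.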

49
Q

What is Vectorized processing?

A

Takes advantage of the recent and revolutionary computer chip designs. Makes more efficient use of modern CPUs by changing the data orientation (from rows to columns) to get more out of the CPU cache and deep instruction pipelines.
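A rough illustration in Python with NumPy (the library and data are my choice, not named by the card): storing a field as one contiguous column lets a single operation sweep over all values at once instead of iterating record by record:

```python
import numpy as np

n = 100_000

# Row orientation: one record at a time, each field touched individually
rows = [(i, 10.0, 2) for i in range(n)]
total_loop = sum(price * qty for _, price, qty in rows)

# Column orientation: each field is a contiguous array, so one vectorized
# multiply-and-sum streams through the CPU cache and SIMD units
price = np.full(n, 10.0)
qty = np.full(n, 2)
total_vec = float((price * qty).sum())
```

Both compute the same total; the columnar version is what vectorized engines exploit for throughput.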

50
Q

What are Solid state drives (SSDs)

A

SSDs store data on flash memory chips, accelerating data storage, retrieval, and analysis.
A solution that takes advantage of SSDs can deliver significantly better performance.

51
Q

DWH Cloud Three Categories

A
  1. Traditional DWH deployed on cloud: uses original code base, needs IT, no need to buy or install hardware and software, backups needed.
  2. Traditional DWH hosted and managed in the cloud by a third party: hosted in data center managed by vendor, customer specifies in advance how much space and resources are needed.
  3. SaaS DWH - vendor delivers complete DWH (all software, hardware, DBA), customers only pay for storage and resources, scales up and down on demand.
52
Q

Cloud DWH Benefits

A
  • Customer Experience - better monitoring, product improvements, better analysis - better decision making for customers.
  • Quality assurance - can be used to monitor customer service issues and react sooner.
  • Operational efficiency - see where cost reduction is possible, streamline processes.
  • Innovation - use new data to innovate and grow.
53
Q

When can Cloud DWH do more?

A
  • Exploration - resources needed for large datasets cannot be predicted, so elastic scalability helps.
  • Ad Hoc Analysis - answering a single, specific business question at any time. Dynamic elasticity provides the ability to perform these queries without slowing down other workloads.
  • Event-driven analytics - constant data demand and continual updates; elasticity is needed to handle variations in data.
  • Embedded analytics - build analytics into business applications in the cloud. Many users query the applications, workloads can be high, and elasticity is needed.
54
Q

DWH Criteria

A
  • Integrates data in one place
  • Supports existing tools and expertise
  • Saves money - how long the company will need to wait before it can start to capitalize?
  • Provides recovery
  • Secures data (confidential and integrity) + MFA. Industry standard end-to-end.
  • Role based control - users only see what they should see.
  • Provider must perform periodic security testing, should not affect performance.
55
Q

On Premise DWH. Cons?

A
  • Hardware - servers, additional storage devices, data center space to house the hardware, a high-speed network to access the data, and the power and redundant power supplies needed to keep the system up and running.
  • Software (licensing) - money in software licensing fees for data warehouse software and add-on packages.
  • Administration - needs specialized information technology (IT) personnel to deploy and maintain the system.
56
Q

The cloud data warehouse should scale in three ways: Which ones?

A

Storage: scalable, easily adjusting the amount of storage to meet changing needs.

Compute - resources for processing data should easily scale up or down, at any time, as the number and intensity of the workloads change.

Users and workloads (concurrency): scale out to support more users and workloads without negatively impacting performance.

57
Q

Steps to choose DWH?

A
  1. Evaluate needs
  2. Migrate or fresh start
  3. Measures of Success
  4. Evaluate solutions
  5. Total Cost of Ownership
  6. Proof of Concept (POC)
58
Q

DWH – Amazon Redshift. How does it work?

A

Requires computing resources to be provisioned and set up in the form of clusters.
Each node has its own CPU, storage, and RAM.
A leader node compiles queries and transfers them to compute nodes, which execute the queries.
On each node, data is stored in chunks, called slices.
Uses columnar storage.

59
Q

DWH – Google BigQuery. How does it work?

A

Google dynamically manages the allocation of machine resources

Lets clients load data from Google Cloud Storage and other readable data sources

Stream data, which allows developers to add data to the data warehouse in real-time, row-by-row, as it becomes available.

  • **Dremel** uses massively parallel querying to scan data; the Colossus file management system distributes files into chunks among many computing resources named nodes, which are grouped into clusters.
  • Dremel uses a columnar data structure, similar to Redshift.
60
Q

Snowflake

A

Built on a cluster architecture and delivered as a service, integrates data ingestion, storage, and analysis into one system, but separates storage and compute to enable rapid scale and more efficient resource utilization.

61
Q

Microsoft Azure. What is it?

A

SQL data warehouse that combines SQL relational databases with massively parallel processing that allows users to run complex queries. Users can easily integrate data storage, transmission, and processing services into automated data pipelines.

62
Q

Cloud DWH - Challenges

A
  • Loading data to cloud data warehouses
    Requires setting up, testing, and maintaining an ETL process.
  • Updates, inserts, and deletions
    Can be tricky and must be done carefully to prevent degradation in query performance
  • Semi-structured data
    Difficult to deal with - needs to be normalized into a relational database format, which requires automation.
  • Cluster optimization
    Data sets, or different types of queries might require a different setup
    Need to continually revisit and tweak setup
  • Query optimization
    Optimize queries so that the data warehouse can perform as expected
  • Backup and recovery
    Not easy to set up and require monitoring and close attention.