Big Data & Cloud Module II Flashcards
What is data driven innovation?
The use of data and analytics to foster new products, processes and markets.
What are analytics? How do they differ from advanced analytics and augmented analytics?
Analytics covers the business intelligence and other analyses performed on data.
Advanced analytics includes data mining and machine learning for the semi-autonomous examination of data to discover deeper insights.
Augmented analytics is the use of technologies such as AI to assist with data preparation, insight generation and insight explanation.
What is the difference between database and data warehouse?
Database: Designed for real-time transactional processing.
Data Warehouse: Designed for historical data analysis and decision-making.
Difference between OLTP and OLAP:
- OLTP (OnLine Transaction Processing): captures, stores and processes data from transactions in real time
- OLAP (OnLine Analytical Processing): uses complex queries to analyze aggregated historical data from OLTP systems
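A minimal sketch of the two workloads above, using Python's built-in sqlite3 module. The table, columns and sample values are hypothetical, and a real setup would normally keep the OLTP database and the OLAP store as separate systems rather than one in-memory database.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

# OLTP-style work: many small writes, each recording one transaction
# at the moment it happens.
orders = [("north", 120.0), ("south", 80.0), ("north", 45.5), ("south", 20.0)]
for region, amount in orders:
    with conn:  # commits (or rolls back) each insert as its own transaction
        conn.execute(
            "INSERT INTO sales (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP-style work: one read-only query that aggregates the accumulated history.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
totals = dict(rows)
print(totals)
```

The inserts touch one row at a time, while the final query scans the whole table to produce a summary per region.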
What is a Data Lake? Difference with DW?
A Data Lake stores all of an organization’s data in its raw, native format, whether structured, semi-structured or unstructured. A Data Warehouse contains structured data that has been cleaned and processed, ready for strategic analysis.
What is data provenance and why is it important?
Data provenance is the documentation of where the data comes from and the processes and methodologies by which it has been produced.
It is important for ensuring data quality, governance, compliance, understanding, integration, reproducibility, error detection, decision-making, and accountability. It enhances data trustworthiness, enables effective data management, and supports reliable and meaningful data analysis.
What is data profiling?
Data profiling is a technique for discovering and investigating data quality issues, such as duplication, lack of consistency, and lack of accuracy.
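A small sketch of what a profiling pass might check, in plain Python. The sample records and the profile function are hypothetical illustrations, not part of any specific profiling tool; it only counts exact duplicate rows, missing values, and columns whose text values differ only by letter case.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "country": "IT"},
    {"id": 2, "email": "b@example.com", "country": "it"},
    {"id": 3, "email": None, "country": "FR"},
    {"id": 1, "email": "a@example.com", "country": "IT"},
]


def profile(rows):
    """Return simple quality indicators for a list of dict records."""
    # Duplication: count rows that repeat an earlier row exactly.
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in counts.values() if n > 1)

    # Missing values: count None entries per column.
    columns = list(rows[0].keys())
    missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}

    # Consistency: flag columns where values collide once lowercased.
    inconsistent = []
    for c in columns:
        values = {r[c] for r in rows if isinstance(r[c], str)}
        if len({v.lower() for v in values}) < len(values):
            inconsistent.append(c)

    return {
        "duplicates": duplicates,
        "missing": missing,
        "inconsistent_case": inconsistent,
    }


report = profile(records)
print(report)
```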
What is data versioning and why is it important?
Data versioning tracks the changes made to dynamic data, that is, data that does not stay static over time. It makes it possible to refer back to, compare, or restore an earlier state of the data.
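One way to picture data versioning is a store that keeps every snapshot under an identifier derived from its content. The sketch below is an assumption-laden illustration: the VersionedDataset class and its commit and checkout methods are hypothetical names, and real tools typically store differences rather than full copies.

```python
import hashlib
import json


class VersionedDataset:
    """Keeps every snapshot of a dataset, keyed by a hash of its content."""

    def __init__(self):
        self.snapshots = {}  # version id -> stored copy of the data
        self.history = []    # version ids in commit order

    def commit(self, data):
        # Serialize deterministically so equal data gives an equal version id.
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        version = hashlib.sha256(payload).hexdigest()[:12]
        # Only record a new entry when the data actually changed.
        if not self.history or self.history[-1] != version:
            self.snapshots[version] = json.loads(payload)
            self.history.append(version)
        return version

    def checkout(self, version):
        # Return the data exactly as it was at that version.
        return self.snapshots[version]


store = VersionedDataset()
v1 = store.commit([{"id": 1, "price": 10}])
v2 = store.commit([{"id": 1, "price": 12}])
print(store.history)
print(store.checkout(v1))
```

Because the earlier snapshot is still retrievable by its identifier, an analysis run against the first version can be reproduced after the price has changed.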
What is the data lakehouse? What is the difference between data fabric and data mesh?
The data lakehouse is a newer data architecture that combines the flexibility and cost efficiency of the data lake with the data management and ACID properties of the data warehouse, supporting both BI and ML.
The data fabric is an architecture and set of data services that provide consistent capabilities across different endpoints, spanning multiple cloud environments.
The data mesh is a decentralized data architecture that organizes data by business domain, such as marketing, sales or customer service. Data mesh makes the data discoverable, widely accessible, secure and interoperable.
They are design concepts, not things:
- They are not mutually exclusive
- They are architectural frameworks, not architectures
  - The frameworks must be adapted and customized to your needs, data, processes, and terminology
What is MOSES?
The Moses Data Platform is a comprehensive and scalable data management and analytics platform. It provides a centralized solution for storing, processing, analyzing, and visualizing data in various organizations.
What is the difference between DevOps and DataOps?
DevOps: a methodology that emphasizes collaboration, automation, and integration between development (Dev) and operations (Ops) teams. It aims to streamline the software development lifecycle.
DataOps: DataOps is a methodology that applies DevOps principles to data management processes. It focuses on streamlining and automating data operations, including data integration, data quality, data governance, and data analytics. DataOps aims to accelerate the delivery of high-quality data for analytics and decision-making by promoting collaboration, automation, and monitoring across data teams.
Why go cloud?
Scalability: Cloud platforms offer virtually unlimited scalability, allowing you to easily handle large volumes of data and accommodate fluctuating workloads without the need for upfront infrastructure investments.
Cost Efficiency: Cloud services operate on a pay-as-you-go model, enabling you to optimize costs by only paying for the resources and storage you actually use. It eliminates the need for maintaining and managing expensive on-premises infrastructure.
Flexibility and Agility: Cloud platforms provide a wide range of big data tools and services that can be rapidly provisioned, allowing you to quickly experiment, develop, and deploy big data solutions. It enables faster time to market and agility in adapting to changing business requirements.
Global Data Availability: Cloud providers have data centers located across the globe, enabling you to store and process data closer to your users or specific regions. It improves data accessibility, reduces latency, and supports global operations.
Elasticity: Cloud platforms offer elastic resources, allowing you to dynamically scale up or down based on demand. This flexibility ensures optimal performance during peak usage periods and cost savings during periods of lower demand.
Data Security and Compliance: Cloud providers offer robust security measures, encryption, and compliance certifications, ensuring the protection and privacy of your big data. They often have dedicated teams focused on security, reducing the burden on your organization.
Advanced Analytics and AI: Cloud platforms provide a wide array of tools and services for performing advanced analytics, machine learning, and AI on big data. They offer pre-built algorithms, scalable computing power, and integration with other cloud services.
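The Elasticity and Cost Efficiency points above can be made concrete with a little arithmetic. The sketch below is a simplified illustration: the instances_needed function, the capacity of 100 requests per second per instance, and the hourly load figures are all hypothetical, and real autoscaling policies also account for startup time and pricing tiers.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance=100, min_instances=1):
    """Smallest number of instances that can serve the given load."""
    required = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, required)


# Hypothetical load (requests per second) for five consecutive hours.
hourly_load = [50, 120, 900, 300, 40]

# Elastic provisioning: resize every hour to match demand.
plan = [instances_needed(load) for load in hourly_load]
elastic_instance_hours = sum(plan)

# Fixed provisioning: size for the peak hour and keep it running all the time.
fixed_instance_hours = max(plan) * len(hourly_load)

print(plan, elastic_instance_hours, fixed_instance_hours)
```

Under these assumed numbers, scaling with demand uses far fewer instance-hours than sizing permanently for the peak, which is the saving a pay-as-you-go model passes on.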
Types of cloud:
- Public: accessible to anyone willing to pay
- Private: accessible only to people inside a single institution
- Hybrid: a combination of public and private clouds