Big Data - Week 6 Flashcards
Enterprise Data
Operational data of an organisation
Typically managed by enterprise information systems that cover both mainstream tasks and analyses over them.
Traditionally, analytics over operational data uses data warehouses, with data extracted and reorganised to support reporting.
Tasks tend to run over relational databases.
Huge market, but not the only market
External data
Data that an organisation doesn’t own.
E.g. open data, web data, sensor data, market data, social media data, …
More voluminous, more diverse and more rapidly changing than internal data. There may also be new sources of internal data, e.g. from sensors.
As well as data, the web provides access to potentially huge user communities, leading to highly variable application demands.
Distributed database management system
Data is stored in several databases, but can be queried from any of several places
The DDBMS takes responsibility for evaluating queries efficiently over the distributed sources.
May also provide distributed transactions, through two-phase commit.
If data is already in different databases we are actually doing data integration
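The two-phase commit mentioned above can be sketched roughly as follows. This is a minimal in-memory illustration, not how any particular DDBMS implements it: the `Participant` class, its `can_commit` flag and the `two_phase_commit` function are hypothetical names, and real systems also need logging, timeouts and failure recovery.

```python
class Participant:
    """Hypothetical database node taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: stands in for "local work succeeded"
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the local changes can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit everywhere only if all voted yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    # A single "no" vote aborts the transaction on every participant.
    for p in participants:
        p.abort()
    return "aborted"
```

The point of the two phases is that no participant commits until all of them have promised that they can.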
Horizontal Partitioning
Storing different rows of a table in separate tables
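A small sketch of horizontal partitioning, using Python's built-in sqlite3 module with a single in-memory database standing in for several sites. The table names, the `region` column and the `insert_customer` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Both partitions share the same columns; each holds a different subset of rows.
conn.executescript("""
    CREATE TABLE customers_eu (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE customers_us (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
""")


def insert_customer(conn, cust_id, name, region):
    # Route each row to exactly one partition based on its region value.
    table = "customers_eu" if region == "EU" else "customers_us"
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (cust_id, name, region))


insert_customer(conn, 1, "Ada", "EU")
insert_customer(conn, 2, "Grace", "US")
insert_customer(conn, 3, "Alan", "EU")

eu_rows = conn.execute("SELECT id, name FROM customers_eu ORDER BY id").fetchall()
us_rows = conn.execute("SELECT id, name FROM customers_us ORDER BY id").fetchall()
```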
Vertical Partitioning
Storing different columns of data in different tables
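A small sketch of vertical partitioning, again using sqlite3 with one in-memory database standing in for separate sites. The `employee_core` and `employee_pay` tables are hypothetical. Each partition repeats the key so that full rows can be rebuilt with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each partition keeps the key (id) plus a different subset of the columns.
conn.executescript("""
    CREATE TABLE employee_core (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee_pay  (id INTEGER PRIMARY KEY, salary INTEGER);
    INSERT INTO employee_core VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO employee_pay  VALUES (1, 50000), (2, 60000);
""")

# Joining on the shared key reconstructs the original, unpartitioned rows.
full_rows = conn.execute("""
    SELECT c.id, c.name, p.salary
    FROM employee_core AS c
    JOIN employee_pay AS p ON c.id = p.id
    ORDER BY c.id
""").fetchall()
```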
Global Schema
A schema that unites the local schemas of partitions. (my own definition)
May hide the location of the data
Can be built from local schemas using SQL Views
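As a rough illustration of building a global schema with an SQL view, the sketch below defines a view over two local tables. The `orders_2023` and `orders_2024` tables and the `orders` view are made-up names, and a single SQLite database stands in for what would be separate sites in a real distributed system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two local schemas, standing in for two partitions or sites.
    CREATE TABLE orders_2023 (id INTEGER, total REAL);
    CREATE TABLE orders_2024 (id INTEGER, total REAL);
    INSERT INTO orders_2023 VALUES (1, 10.0);
    INSERT INTO orders_2024 VALUES (2, 25.0);

    -- The global schema: one view that hides where each row is stored.
    CREATE VIEW orders AS
        SELECT id, total FROM orders_2023
        UNION ALL
        SELECT id, total FROM orders_2024;
""")

# Queries are written against the view, not against the individual partitions.
all_orders = conn.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
```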
Parallel databases
Data is stored and queried across several machines, for performance
Different architectures exist but “shared nothing” is most common
Can be used to speed up online transaction processing (OLTP) and online analytical processing (OLAP) tasks
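The "shared nothing" idea can be sketched as follows: rows are hash-partitioned across nodes, each node aggregates only its own rows, and the partial results are merged. This is a simplified single-process illustration; the function names and the `product`/`amount` fields are assumptions, and the per-node work runs sequentially here where a real parallel database would run it concurrently on separate machines.

```python
from collections import defaultdict


def hash_partition(rows, key, n_nodes):
    """Assign each row to exactly one node by hashing its partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes


def local_sum(rows):
    """Work done independently on one node, using only that node's rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["product"]] += row["amount"]
    return totals


def parallel_sum(rows, n_nodes=3):
    # Each node computes a partial aggregate over its own partition.
    partials = [local_sum(part) for part in hash_partition(rows, "product", n_nodes)]
    # A final step merges the partial results into the overall answer.
    merged = defaultdict(int)
    for partial in partials:
        for product, total in partial.items():
            merged[product] += total
    return dict(merged)


sales = [
    {"product": "apple", "amount": 3},
    {"product": "pear", "amount": 5},
    {"product": "apple", "amount": 4},
]
result = parallel_sum(sales)
```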
Data Integration
The process of combining data from existing sources in a way that reduces heterogeneities.
May leave the data where it is (using views) or copy the data from sources.
Where the data is copied, ETL (Extract, Transform and Load) workflows are often used.
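A minimal sketch of an ETL workflow, assuming a hypothetical in-memory source and an SQLite database acting as the target. The `source_rows` data, the `people` table and the three function names are invented for illustration. Real workflows usually read from several heterogeneous sources.

```python
import sqlite3

# Hypothetical source records with inconsistent formatting.
source_rows = [
    {"name": " ada lovelace ", "country": "uk"},
    {"name": "Grace Hopper", "country": "US"},
]


def extract():
    # Extract: read the raw records from the source.
    return list(source_rows)


def transform(rows):
    # Transform: reduce heterogeneity by normalising names and country codes.
    return [(r["name"].strip().title(), r["country"].upper()) for r in rows]


def load(conn, rows):
    # Load: copy the cleaned records into the target database.
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
loaded = warehouse.execute("SELECT name, country FROM people ORDER BY name").fetchall()
```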
Features provided by relational model
Data Independence
Declarative Querying
Explicit Schema
Transactional Guarantees
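Three of these features (explicit schema, declarative querying and transactional guarantees) can be seen together in a short sqlite3 sketch. The `accounts` table and the `transfer` function are hypothetical. The `CHECK` constraint is part of the explicit schema, and the failed transfer is rolled back as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Explicit schema: column types and a constraint are declared up front.
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so neither update is kept.
        return False


ok = transfer(conn, 1, 2, 30)       # valid transfer
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
# Declarative querying: the SELECT states what is wanted, not how to find it.
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```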
Things traditional relational architectures struggle to handle
Batch - offline processing of huge data sets, such as web crawls or logs
Interactive - online processing of customer requests at web scale, such as for shopping or gaming
Streaming - Data arriving at speed in real time, often in need of a timely response
The 4 big V’s of Big Data
Volume - extreme scale, perhaps defined as at a level that requires parallel/distributed computing
Variety - The data may be in different forms (structured, semi-structured, unstructured, …)
Velocity - The data may be arriving (or changing) quickly
Veracity - The data may be of variable quality
The cloud is increasingly where data management is taking place
Existing relational databases are migrating to the cloud.
NoSQL databases are often intrinsically elastic
ETL often involves multiple cloud resources
DBaaS
Database-as-a-service provides cloud hosting of database systems
DBaaS supported platforms
DBaaS supports the full range of database platforms:
Established relational vendors - migrating customers from on-premise to the cloud
NoSQL vendors - for web-scale, simple-request applications that require elasticity
Cloud data warehouses, such as Snowflake
DBaaS associated tooling
Matillion for ETL in the cloud
Database migration tools, on-premise to cloud