Big Data - Week 6 Flashcards
Enterprise Data
Operational data of an organisation
Typically managed by enterprise information systems that cover both mainstream tasks and analyses over them.
Traditionally, analytics over operational data uses data warehouses, with data extracted and reorganised to support reporting.
Tasks tend to run over relational databases.
Huge market, but not the only market
External data
Data that an organisation doesn’t own.
E.g. open data, web data, sensor data, mark data, social media data, …
More voluminous, more diverse, more rapidly changing than internal data, may be new sources of internal data, e.g. from sensors
As well as data, the web provides access to potentially huge user communities, leading to highly variable application demands
Distributed database management system
Data is stored in several databases, but can be queried from any of several places
The DDBMS takes responsibility for evaluating queries efficiently over the distributed sources.
May also provide distributed transactions, through two-phase commit.
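A toy sketch of the two-phase commit idea (the class and function names are illustrative, not any real DDBMS API): the coordinator first asks every participant to prepare and vote, and only commits if the vote is unanimous.

```python
# Toy two-phase commit: phase 1 gathers votes, phase 2 commits or aborts.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes (prepared) or no (aborted).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: every participant must vote yes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return "committed"
    # Any no vote means everyone aborts.
    for p in participants:
        p.abort()
    return "aborted"
```

A real protocol also has to handle participants that crash after voting yes, which is why the coordinator logs its decision durably.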
If data is already in different databases we are actually doing data integration
Horizontal Partitioning
Storing different rows of a database in separate tables
Vertical Partitioning
Storing different columns of data in different tables
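Both kinds of partitioning can be sketched over a row-oriented table (the table and column names below are made up for illustration): horizontal partitioning splits by row, vertical partitioning splits by column while keeping the key in each partition.

```python
# A tiny table as a list of dicts (illustrative data).
rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "York"},
    {"id": 3, "name": "Cat", "city": "Hull"},
]

# Horizontal partitioning: different rows go to different partitions,
# here by hashing the primary key across two partitions.
horizontal = {0: [], 1: []}
for row in rows:
    horizontal[row["id"] % 2].append(row)

# Vertical partitioning: different columns go to different partitions,
# each keeping the key so the full row can be rejoined later.
names = [{"id": r["id"], "name": r["name"]} for r in rows]
cities = [{"id": r["id"], "city": r["city"]} for r in rows]
```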
Global Schema
A schema that unites the local schemas of the partitions. (my own definition)
May hide the location of the data
Can be built from local schemas using SQL Views
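A minimal sketch of building a global schema from local schemas with a SQL view, using sqlite3 from the Python standard library (the table and view names are illustrative): the view unions two horizontal partitions, so queries never mention where each row lives.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Two local schemas: the same customers table, horizontally partitioned.
    CREATE TABLE customers_uk (id INTEGER, name TEXT);
    CREATE TABLE customers_us (id INTEGER, name TEXT);
    INSERT INTO customers_uk VALUES (1, 'Ann');
    INSERT INTO customers_us VALUES (2, 'Bob');

    -- Global schema as a view: hides which partition each row lives in.
    CREATE VIEW customers AS
        SELECT id, name FROM customers_uk
        UNION ALL
        SELECT id, name FROM customers_us;
""")

# Queries go against the global view, not the partitions.
all_rows = db.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```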
Parallel databases
Data is stored and queried across several machines, for performance
Different architectures exist but “shared nothing” is most common
Can be used to speed up online transaction processing (OLTP) and online analytical processing (OLAP) tasks
Data Integration
The process of combining data from existing sources in a way that reduces heterogeneities.
May leave the data where it is (using views) or copy the data from sources.
Where the data is copied, ETL (Extract, Transform and Load) workflows are often used
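A tiny ETL sketch, with all names and the cleaning rule made up for illustration: extract parses CSV text from a source, transform cleans and filters it, and load appends it to a target standing in for a warehouse table.

```python
import csv
import io

# Extract: raw CSV from a source system (note the messy whitespace).
source = "id,amount\n1, 10\n2,  -5\n3, 30\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    # Clean whitespace, cast types, and drop invalid (negative) amounts.
    out = []
    for r in records:
        amount = int(r["amount"].strip())
        if amount >= 0:
            out.append({"id": int(r["id"]), "amount": amount})
    return out

# Load: append into the target "warehouse" table.
warehouse = []
def load(records):
    warehouse.extend(records)

load(transform(extract(source)))
```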
Features provided by relational model
Data Independence
Declarative Querying
Explicit Schema
Transactional Guarantees
Things relational architectures can’t handle
Batch - offline processing of huge data sets, such as web crawls or logs
Interactive - online processing of customer requests at web scale, such as for shopping or gaming
Streaming - Data arriving at speed in real time, often in need of a timely response
The 4 big V’s of Big Data
Volume - extreme scale, perhaps defined as at a level that requires parallel/distributed computing
Variety - The data may be in different forms (structured, semi-structured, unstructured, …)
Velocity - The data may be arriving (or changing) quickly
Veracity - The data may be of variable quality
The cloud is increasingly where data management is taking place
Existing, relational, databases are migrating.
NoSQL databases are often intrinsically elastic
ETL often involves multiple cloud resources
DBaaS
Database-as-a-service provides cloud hosting of database systems
DBaaS supported platforms
DBaaS supports the full range of database platforms:
Established relational vendors - migrating customers from on-premise to clouds
NoSQL vendors - for web-scale simple-requests applications that require elasticity
Cloud data warehouses, such as Snowflake
DBaaS associated tooling
Matillion for ETL in the cloud
Database migration tools, on-premise to cloud
Distributed data challenges / Kleppman issues for the development of data-intensive applications
Reliability - Ability to continue working when things go wrong
Scalability - Ability to cope with increased load
Maintainability - making it easy for operations teams and accommodating evolvability
Reliability
Ability to continue working when things go wrong
Hardware faults: machine, network, disks;
Likely this needs redundancy:
- replicated data, ability to fail over or accommodate missing nodes
Software errors:
Can it keep running if software on a node crashes or hangs? Will data be lost / corrupted? What will the user experience?
MapReduce and NoSQL databases are designed to keep working through faults
Scalability
Ability to cope with increased load (e.g. in terms of numbers of requests or amounts of data)
Describing the performance of a system
Median / mean response time, but these don't tell you how bad things can get
Percentiles: e.g. the 95th percentile is the response time that 95% of requests fall within.
Distribution
Typically used to support both reliability and scalability
Partitioning - spreading data across multiple machines to allow parallel processing (and thus scalability)
Replication - Keeping multiple copies of data, to support increased availability and scalability
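The two techniques combine naturally; a sketch of hash partitioning with one replica per key (node names and replication scheme are illustrative): each key hashes to a primary node for scalability, and a copy goes to the next node round-robin for availability.

```python
import zlib

NODES = ["node0", "node1", "node2"]

def placement(key, replicas=2):
    # Deterministic hash (crc32) so a key always maps to the same primary.
    primary = zlib.crc32(str(key).encode()) % len(NODES)
    # Copies go on the following nodes round-robin: losing any single
    # node still leaves a copy elsewhere (reliability/availability),
    # while different keys spread over different primaries (scalability).
    return [NODES[(primary + i) % len(NODES)] for i in range(replicas)]
```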