Big Data - Week 6 Flashcards
Enterprise Data
Operational data of an organisation
Typically managed by enterprise information systems that cover both mainstream tasks and analyses over them.
Traditionally, analytics over operational data uses data warehouses, with data extracted and reorganised to support reporting.
Tasks tend to run over relational databases.
Huge market, but not the only market
External data
Data that an organisation doesn’t own.
E.g. open data, web data, sensor data, mark data, social media data, …
More voluminous, more diverse, more rapidly changing than internal data, may be new sources of internal data, e.g. from sensors
As well as data, the web provides access to potentially huge user communities, leading to highly variable application demands
Distributed database management system
Data is stored in several databases, but can be queried from any of several places
The DDBMS takes responsibility for evaluating queries efficiently over the distributed sources.
May also provide distributed transactions, through two-phase commit.
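A toy sketch of the two-phase commit idea (the class and function names are illustrative, not any real DDBMS API): the coordinator first asks every participant to prepare and vote, and only commits if the vote is unanimous.

```python
# Toy two-phase commit: phase 1 gathers votes, phase 2 commits or aborts.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes (prepared) or no (aborted).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: every participant must vote yes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return "committed"
    # Any no vote means everyone aborts.
    for p in participants:
        p.abort()
    return "aborted"
```

A real protocol also has to handle participants that crash after voting yes, which is why the coordinator logs its decision durably.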
If data is already in different databases we are actually doing data integration
Horizontal Partitioning
Storing different rows of a database in separate tables
Vertical Partitioning
Storing different columns of data in different tables
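Both kinds of partitioning can be sketched over a row-oriented table (the table and column names below are made up for illustration): horizontal partitioning splits by row, vertical partitioning splits by column while keeping the key in each partition.

```python
# A tiny table as a list of dicts (illustrative data).
rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "York"},
    {"id": 3, "name": "Cat", "city": "Hull"},
]

# Horizontal partitioning: different rows go to different partitions,
# here by hashing the primary key across two partitions.
horizontal = {0: [], 1: []}
for row in rows:
    horizontal[row["id"] % 2].append(row)

# Vertical partitioning: different columns go to different partitions,
# each keeping the key so the full row can be rejoined later.
names = [{"id": r["id"], "name": r["name"]} for r in rows]
cities = [{"id": r["id"], "city": r["city"]} for r in rows]
```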
Global Schema
A schema that unites the local schemas of the partitions. (my own definition)
May hide the location of the data
Can be built from local schemas using SQL Views
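A minimal sketch of building a global schema from local schemas with a SQL view, using sqlite3 from the Python standard library (the table and view names are illustrative): the view unions two horizontal partitions, so queries never mention where each row lives.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Two local schemas: the same customers table, horizontally partitioned.
    CREATE TABLE customers_uk (id INTEGER, name TEXT);
    CREATE TABLE customers_us (id INTEGER, name TEXT);
    INSERT INTO customers_uk VALUES (1, 'Ann');
    INSERT INTO customers_us VALUES (2, 'Bob');

    -- Global schema as a view: hides which partition each row lives in.
    CREATE VIEW customers AS
        SELECT id, name FROM customers_uk
        UNION ALL
        SELECT id, name FROM customers_us;
""")

# Queries go against the global view, not the partitions.
all_rows = db.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```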
Parallel databases
Data is stored and queried across several machines, for performance
Different architectures exist but “shared nothing” is most common
Can be used to speed up online transaction processing (OLTP) and online analytical processing (OLAP) tasks
Data Integration
The process of combining data from existing sources in a way that reduces heterogeneities.
May leave the data where it is (using views) or copy the data from sources.
Where the data is copied, ETL (Extract, Transform and Load) workflows are often used
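A tiny ETL sketch, with all names and the cleaning rule made up for illustration: extract parses CSV text from a source, transform cleans and filters it, and load appends it to a target standing in for a warehouse table.

```python
import csv
import io

# Extract: raw CSV from a source system (note the messy whitespace).
source = "id,amount\n1, 10\n2,  -5\n3, 30\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    # Clean whitespace, cast types, and drop invalid (negative) amounts.
    out = []
    for r in records:
        amount = int(r["amount"].strip())
        if amount >= 0:
            out.append({"id": int(r["id"]), "amount": amount})
    return out

# Load: append into the target "warehouse" table.
warehouse = []
def load(records):
    warehouse.extend(records)

load(transform(extract(source)))
```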
Features provided by relational model
Data Independence
Declarative Querying
Explicit Schema
Transactional Guarantees
Things relational architectures can’t handle
Batch - offline processing of huge data sets, such as web crawls or logs
Interactive - online processing of customer requests at web scale, such as for shopping or gaming
Streaming - Data arriving at speed in real time, often in need of a timely response
The 4 big V’s of Big Data
Volume - extreme scale, perhaps defined as at a level that requires parallel/distributed computing
Variety - The data may be in different forms (structured, semi-structured, unstructured, …)
Velocity - The data may be arriving (or changing) quickly
Veracity - The data may be of variable quality
The cloud is increasingly where data management is taking place
Existing, relational, databases are migrating.
NoSQL databases are often intrinsically elastic
ETL often involves multiple cloud resources
DBaaS
Database-as-a-service provides cloud hosting of database systems
DBaaS supported platforms
DBaaS supports the full range of database platforms:
Established relational vendors - migrating customers from on-premise to clouds
NoSQL vendors - for web-scale simple-requests applications that require elasticity
Cloud data warehouses, such as Snowflake
DBaaS associated tooling
Matillion for ETL in the cloud
Database migration tools, on-premise to cloud
Distributed data challenges / Kleppman issues for the development of data-intensive applications
Reliability - Ability to continue working when things go wrong
Scalability - Ability to cope with increased load
Maintainability - making it easy for operations teams and accommodating evolvability
Reliability
Ability to continue working when things go wrong
Hardware faults: machine, network, disks;
Likely this needs redundancy:
- replicated data, ability to fail over or accommodate missing nodes
Software errors:
Can it keep running if software on a node crashes or hangs? Will data be lost / corrupted? What will the user experience?
MapReduce and NoSQL databases are designed to keep working through faults
Scalability
Ability to cope with increased load (e.g. in terms of numbers of requests or amounts of data)
Describing the performance of a system
Median / mean response time, but these don't tell you how bad things can get
Percentiles: e.g. the 95th percentile is the response time that 95% of requests fall within.
Distribution
Typically used to support both reliability and scalability
Partitioning - spreading data across multiple machines to allow parallel processing (and thus scalability)
Replication - Keeping multiple copies of data, to support increased availability and scalability
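The two techniques combine naturally; a sketch of hash partitioning with one replica per key (node names and replication scheme are illustrative): each key hashes to a primary node for scalability, and a copy goes to the next node round-robin for availability.

```python
import zlib

NODES = ["node0", "node1", "node2"]

def placement(key, replicas=2):
    # Deterministic hash (crc32) so a key always maps to the same primary.
    primary = zlib.crc32(str(key).encode()) % len(NODES)
    # Copies go on the following nodes round-robin: losing any single
    # node still leaves a copy elsewhere (reliability/availability),
    # while different keys spread over different primaries (scalability).
    return [NODES[(primary + i) % len(NODES)] for i in range(replicas)]
```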