Big Data - Week 6 Flashcards

1
Q

Enterprise Data

A

Operational data of an organisation

Typically managed by enterprise information systems that cover both mainstream tasks and analyses over them.

Traditionally analytics over operational data uses data warehouses, with data extracted and reorganised to support reporting.

Tasks tend to run over relational databases.

Huge market, but not the only market

2
Q

External data

A

Data that an organisation doesn’t own.

E.g. open data, web data, sensor data, market data, social media data, …

More voluminous, more diverse and more rapidly changing than internal data. There may also be new sources of internal data, e.g. from sensors.

As well as data, the web provides access to potentially huge user communities, leading to highly variable application demands

3
Q

Distributed database management system

A

Data is stored in several databases, but can be queried from any of several places

The DDBMS takes responsibility for evaluating queries efficiently over the distributed sources.

May also provide distributed transactions, through two-phase commit.

If data is already in different databases we are actually doing data integration
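
A minimal sketch of the two-phase commit idea mentioned above. The Participant class and its method names are hypothetical, and a real protocol also needs durable logging, timeouts and recovery from coordinator failure, none of which is shown here.

```python
class Participant:
    """Hypothetical participant: one local database taking part in the transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (voting): ask every participant to prepare.
    votes = [p.prepare() for p in participants]
    # Phase 2 (completion): commit only if every vote was yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single "no" vote aborts the transaction on every participant, which is what makes the outcome all-or-nothing.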

4
Q

Horizontal Partitioning

A

Storing different rows of a table in separate tables
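
A small sketch of the idea, assuming rows are plain dictionaries and the split is by the value of one column. The customers data and the partition_rows helper are made up for illustration.

```python
customers = [
    {"id": 1, "name": "Ada", "region": "EU"},
    {"id": 2, "name": "Bo", "region": "US"},
    {"id": 3, "name": "Cy", "region": "EU"},
]


def partition_rows(rows, key):
    """Group whole rows into separate 'tables' by the value of one column."""
    tables = {}
    for row in rows:
        tables.setdefault(row[key], []).append(row)
    return tables


# Every row lands in exactly one partition and keeps all of its columns.
by_region = partition_rows(customers, "region")
```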

5
Q

Vertical Partitioning

A

Storing different columns of data in different tables
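
A small sketch of the idea, assuming rows are plain dictionaries. The data and the project and rejoin helpers are made up for illustration. Each fragment repeats the key column so the original rows can be rebuilt with a join.

```python
customers = [
    {"id": 1, "name": "Ada", "email": "ada@example.org", "notes": "long text"},
    {"id": 2, "name": "Bo", "email": "bo@example.org", "notes": "more text"},
]


def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


# Each vertical fragment repeats the key (id) so rows can be rejoined.
core = project(customers, ["id", "name"])
details = project(customers, ["id", "email", "notes"])


def rejoin(left, right, key="id"):
    """Rebuild full rows by joining the fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]
```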

6
Q

Global Schema

A

A schema that unites the local schemas of the partitions. (my own definition)

May hide the location of the data

Can be built from local schemas using SQL Views
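
A sketch of building a global schema with a view, using Python's sqlite3 module. The table and view names are hypothetical, and both local tables sit in one in-memory database purely for illustration. In a real distributed setup they would live on different sites.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_eu (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers_us (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers_eu VALUES (1, 'Ada'), (3, 'Cy');
    INSERT INTO customers_us VALUES (2, 'Bo');

    -- The view is the global schema: callers query "customers" and
    -- never need to know which local table holds a given row.
    CREATE VIEW customers AS
        SELECT id, name FROM customers_eu
        UNION ALL
        SELECT id, name FROM customers_us;
""")

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```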

7
Q

Parallel databases

A

Data is stored and queried across several machines, for performance

Different architectures exist but “shared nothing” is most common

Can be used to speed up online transaction processing (OLTP) and online analytical processing (OLAP) tasks
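
A rough sketch of the "shared nothing" style of query evaluation: each node aggregates only its own rows, and only the small partial results are combined. Threads stand in for separate machines here, which is an assumption made to keep the example runnable. The data and function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the rows held by one node.
partitions = [[5, 12, 7], [3, 20], [9, 1, 4, 6]]


def local_aggregate(rows):
    """Work done independently on one node: a partial sum and count."""
    return sum(rows), len(rows)


def parallel_average(parts):
    # Run the local step on every partition at once, then merge the partials.
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(local_aggregate, parts))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count
```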

8
Q

Data Integration

A

The process of combining data from existing sources in a way that reduces heterogeneities.

May leave the data where it is (using views) or copy the data from sources.

Where the data is copied, ETL (Extract, Transform and Load) workflows are often used
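
A toy ETL sketch, assuming two hypothetical sources that disagree on column names and date formats. All names and values are made up. The transform step is where the heterogeneities are removed before loading into a single target table.

```python
import sqlite3

# Extract: rows as they might arrive from two hypothetical sources
# that use different column names and date formats.
source_a = [{"cust": "Ada", "amount": "10.50", "date": "2024-01-05"}]
source_b = [{"customer_name": "Bo", "total": 7, "day": "05/01/2024"}]


def transform_a(row):
    return (row["cust"], float(row["amount"]), row["date"])


def transform_b(row):
    # Convert DD/MM/YYYY to ISO YYYY-MM-DD so both sources agree.
    day, month, year = row["day"].split("/")
    return (row["customer_name"], float(row["total"]), f"{year}-{month}-{day}")


def run_etl(conn):
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL, sale_date TEXT)")
    rows = [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]
    # Load: write the unified rows into the target table.
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
run_etl(conn)
```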

9
Q

Features provided by relational model

A

Data Independence
Declarative Querying
Explicit Schema
Transactional Guarantees

10
Q

Things relational architectures can’t handle

A

Batch - offline processing of huge data sets, such as web crawls or logs

Interactive - online processing of customer requests at web scale, such as for shopping or gaming

Streaming - Data arriving at speed in real time, often in need of a timely response

11
Q

The 4 big V’s of Big Data

A

Volume - extreme scale, perhaps defined as at a level that requires parallel/distributed computing

Variety - The data may be in different forms (structured, semi-structured, unstructured, …)

Velocity - The data may be arriving (or changing) quickly

Veracity - The data may be of variable quality

12
Q

The cloud is increasingly where data management is taking place

A

Existing relational databases are migrating to the cloud.

NoSQL databases are often intrinsically elastic

ETL often involves multiple cloud resources

13
Q

DBaaS

A

Database-as-a-service provides cloud hosting of database systems

14
Q

DBaaS supported platforms

A

DBaaS supports the full range of database platforms:

Established relational vendors - migrating customers from on-premises to the cloud

NoSQL vendors - for web-scale simple-requests applications that require elasticity

Cloud data warehouses, such as Snowflake

15
Q

DBaaS associated tooling

A

Matillion for ETL in the cloud

Database migration tools, on-premise to cloud

16
Q

Distributed data challenges / Kleppmann issues for the development of data-intensive applications

A

Reliability - Ability to continue working when things go wrong

Scalability - Ability to cope with increased load

Maintainability - making the system easy for operations teams to run and easy to evolve over time

17
Q

Reliability

A

Ability to continue working when things go wrong

Hardware faults: machine, network, disks;
Likely this needs redundancy:
- replicated data, ability to fail over or accommodate missing nodes

Software errors:
Can it keep running if software on a node crashes or hangs? Will data be lost / corrupted? What will the user experience?

MapReduce and NoSQL databases are designed to keep working through faults
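
A minimal sketch of failing over between replicated copies of the data. The Replica class and read_with_failover are hypothetical names, not taken from any particular system, and a node failure is simulated with a flag.

```python
class Replica:
    """Hypothetical node holding a full copy of the data."""

    def __init__(self, data, up=True):
        self.data = data
        self.up = up

    def get(self, key):
        if not self.up:
            raise ConnectionError("node unavailable")
        return self.data[key]


def read_with_failover(replicas, key):
    # Try each copy in turn; a single failed node does not fail the read.
    for replica in replicas:
        try:
            return replica.get(key)
        except ConnectionError:
            continue
    raise ConnectionError("all replicas unavailable")
```

The read only fails when every copy is down, which is why redundancy improves reliability.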

18
Q

Scalability

A

Ability to cope with increased load (e.g. in terms of numbers of requests or amounts of data)

19
Q

Describing the performance of a system

A

Median / mean response time, but these don't tell you how bad things can get

Percentiles, e.g. the 95th percentile is the response time that 95% of requests complete within.
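
A short sketch comparing these measures on made-up response times, using the nearest-rank definition of a percentile (other definitions interpolate and can give slightly different values). The sample data and the percentile helper are assumptions for illustration.

```python
import statistics

# Hypothetical response times in milliseconds; one request is very slow.
times_ms = [20, 22, 25, 21, 23, 24, 26, 22, 21, 900]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * pct // 100))  # ceiling division
    return ordered[rank - 1]


mean = statistics.mean(times_ms)
median = statistics.median(times_ms)
p95 = percentile(times_ms, 95)
```

Here the median stays near the typical request, the mean is pulled up by the single slow request, and the 95th percentile exposes the slow tail directly.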

20
Q

Distribution

A

Typically used to support both reliability and scalability

Partitioning - spreading data across multiple machines to allow parallel processing (and thus scalability)

Replication - Keeping multiple copies of data, to support increased availability and scalability
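
A sketch combining the two ideas, assuming simple hash partitioning with each key also copied to the next node in the list. The node names and the owners and place helpers are hypothetical. Real systems use more careful placement schemes.

```python
import zlib

NODES = ["node0", "node1", "node2"]  # hypothetical machine names


def owners(key, replicas=2):
    """Return the nodes that store a key: a primary plus its successors."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    primary = zlib.crc32(key.encode("utf-8")) % len(NODES)
    return [NODES[(primary + i) % len(NODES)] for i in range(replicas)]


def place(keys, replicas=2):
    """Partitioning spreads keys over nodes; replication stores each key more than once."""
    storage = {node: set() for node in NODES}
    for key in keys:
        for node in owners(key, replicas):
            storage[node].add(key)
    return storage
```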