Week 1: Big Data Analytics Flashcards

1
Q

Human-generated Data

A

Examples include social media content, email messages, documents, etc.

2
Q

Machine-generated Data

A

Examples include DBMS logs, sensor readings, network traces, etc.

3
Q

Structured Data

A

Conforms to a data model or schema. Can be managed by a DBMS. Usually in dataset/table format. Examples include banking transactions and electronic health records.

4
Q

Unstructured Data

A

Doesn’t conform to a data model or schema. Data types include textual and binary data. Can be stored as BLOBs (Binary Large Objects) in a DBMS or in NoSQL databases. Examples include tweets and video files.

5
Q

Semi-structured Data

A

Non-relational data with a certain level of structure or consistency. Can be hierarchical or graph-based. Examples include spreadsheets, XML data, sensor data, and JSON data.

6
Q

JSON

A

An open standard format that uses human-readable text to store and transmit data objects made up of attribute-value pairs. Used in MongoDB, which exposes documents as JSON and stores them internally as BSON.
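
As a minimal sketch of what "attribute-value pairs in human-readable text" looks like in practice, the snippet below round-trips a record through Python's standard json module. The record and its field names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical record: attribute-value pairs, including a list and a nested object.
record = {
    "name": "Ada",
    "age": 36,
    "tags": ["admin", "dev"],
    "address": {"city": "London"},
}

text = json.dumps(record, sort_keys=True)  # Python object -> JSON text
restored = json.loads(text)                # JSON text -> Python object

print(text)
```

The nesting in "address" is what makes JSON a typical example of semi-structured data: there is structure, but no fixed table schema.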

7
Q

Metadata

A

It provides information about a dataset’s characteristics and structure. Examples include XML tags for the author and the creation date of a document. In Linux, common file metadata includes size, permissions, creation date, access date, inode number, and file type. It can be accessed with the “ls -la” and “stat” commands.
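
The same file metadata can also be read programmatically. The sketch below uses Python's os.stat on a temporary file; the describe helper and the dictionary keys are hypothetical names introduced for this example, and the exact fields available can vary by operating system and filesystem.

```python
import os
import stat
import tempfile

def describe(path):
    """Return a few metadata fields for a file, similar to what `stat` prints."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "permissions": stat.filemode(info.st_mode),  # e.g. a string like '-rw-------'
        "inode": info.st_ino,
        "modified": info.st_mtime,   # seconds since the epoch
        "accessed": info.st_atime,   # seconds since the epoch
        "is_regular_file": stat.S_ISREG(info.st_mode),
    }

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

meta = describe(tmp_path)
os.remove(tmp_path)
print(meta)
```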

8
Q

Big Data Characteristic: Volume

A

Indicates data quantity, which is large and ever-growing. Specialised technologies are needed to store and process large volumes of data.

9
Q

Big Data Characteristic: Velocity

A

Indicates data speed, which may be high. High-velocity data can change quickly. Velocity affects elasticity (sensitivity to changes in other variables) and the time available for data processing.

10
Q

Big Data Characteristic: Variety

A

Indicates that big data can come in multiple formats and types. Some big data has special requirements for integration, especially for how the data is joined and combined. Variety can also affect transformation, processing, and storage requirements.

11
Q

Big Data Characteristic: Veracity

A

Indicates the level of bias, noise, and abnormalities in big data. Removing noise and invalid values is therefore essential. How this cleaning is done varies with the requirements.

12
Q

Big Data Characteristic: Value

A

Indicates the utility and usefulness of the data. For example, if it takes 3 days to predict the price of a stock, there’s no room for day trading.

13
Q

Analytics Goal: Descriptive

A

Focuses on what happened, based on past data presented in a summarised form.

14
Q

Analytics Goal: Diagnostic

A

Focuses on why something happened based on past data.

15
Q

Analytics Goal: Predictive

A

Focuses on what is likely to happen based on existing data.

16
Q

Analytics Goal: Prescriptive

A

Focuses on what can be done to make something happen based on existing data.

17
Q

Major Computational Task: Basic Statistics

A

Statistically summarising data. Popular measures include mean, median, variance, count, top-N, distinct values, etc. The goal is descriptive.
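
The measures listed above can be computed with Python's standard library alone. This is a small sketch on made-up numbers; the variable names and the sample values are assumptions for illustration.

```python
import statistics
from collections import Counter

# Hypothetical sample data.
values = [4, 8, 15, 16, 23, 42, 8, 15, 8]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.variance(values),   # sample variance (n - 1 denominator)
    "distinct": sorted(set(values)),
    "top_2": Counter(values).most_common(2),   # the two most frequent values with counts
}

print(summary)
```

On data too large for one machine's memory, the same measures need streaming or distributed variants, but the definitions stay the same.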

18
Q

Major Computational Task: Linear Algebraic Computation

A

The result is a model describing the data or a smaller dataset built from the data. The goal is descriptive, diagnostic, and predictive.
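
The card does not name a specific technique, so as one possible illustration of "a model describing the data", the sketch below fits a straight line by ordinary least squares. The fit_line helper and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(slope, intercept)
```

The two fitted numbers summarise the five points (descriptive) and can be used to estimate y for an unseen x (predictive).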

19
Q

Major Computational Task: Generalised N-body Problem

A

This kind of problem involves finding similarities between data points in the dataset. Examples include clustering and classification. Challenges include high dimensionality. The goal is diagnostic, predictive, and prescriptive.
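
A simple instance of similarity-based computation is nearest-neighbour classification: a new point takes the label of the most similar known point. The sketch below assumes Euclidean distance as the similarity measure; the function name, points, and labels are hypothetical.

```python
import math

def nearest_neighbour(query, labelled_points):
    """Return the label of the training point closest to `query` (1-nearest-neighbour)."""
    closest = min(labelled_points, key=lambda item: math.dist(query, item[0]))
    return closest[1]

# Hypothetical labelled 2-D points.
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

print(nearest_neighbour((2.0, 1.0), training))
print(nearest_neighbour((7.0, 9.0), training))
```

Every query is compared with every stored point, which is why these problems become expensive as the number of points and dimensions grows.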

20
Q

Major Computational Task: Graph-theoretic Computations

A

These computations involve data in graph form. Example tasks include searching for nodes and finding the shortest paths. Challenges include high interconnectivity. The goal is diagnostic, predictive, and prescriptive.
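
As a sketch of the shortest-path task, the code below runs breadth-first search on a small unweighted graph stored as an adjacency list. The graph and the function name are hypothetical; weighted graphs would need a different algorithm, such as Dijkstra's.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph; returns a list of nodes or None."""
    previous = {start: None}   # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical directed graph as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))
```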

21
Q

Major Computational Task: Optimisation

A

This involves finding the set of parameters for which the selected objective function is optimised (minimised or maximised). Can be used to find optimal models and validate findings. The goal is prescriptive.
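
One common way to search for such parameters is gradient descent. The sketch below minimises the hypothetical objective f(x) = (x - 3)^2, whose gradient is 2(x - 3); the learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to approach a local minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Objective f(x) = (x - 3)^2 has its minimum at x = 3.
best_x = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(best_x)
```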

22
Q

Major Computational Task: Integration

A

This involves computing high-dimensional integrals of functions. The goal is predictive and prescriptive.
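
High-dimensional integrals are often estimated by random sampling rather than on a grid. The sketch below uses plain Monte Carlo integration over the unit hypercube; the integrand, the sample count, and the seed are assumptions made for this example. For the sum of squared coordinates in 5 dimensions, the exact value is 5/3.

```python
import random

def monte_carlo_integral(f, dimensions, samples=20_000, seed=0):
    """Estimate the integral of f over the unit hypercube [0, 1]^dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        point = [rng.random() for _ in range(dimensions)]
        total += f(point)
    # The hypercube has volume 1, so the sample mean is the estimate.
    return total / samples

def sum_of_squares(point):
    return sum(x * x for x in point)

estimate = monte_carlo_integral(sum_of_squares, dimensions=5)
print(estimate)  # close to 5/3, with some sampling error
```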

23
Q

Major Computational Task: Alignment Problems

A

This involves determining whether two entities are the same. Examples include finding synonyms in text and checking whether the same entity is present in multiple images. The goal is predictive and prescriptive.

24
Q

Data Configuration Type: Default

A

In this type, the dataset is stored in RAM.

25
Q

Data Configuration Type: Streaming

A

The data arrives in a stream, with a part/window being stored.
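
A minimal sketch of keeping only a window of a stream: the generator below holds the most recent readings in a fixed-size deque and emits their running mean. The function name, window size, and readings are hypothetical.

```python
from collections import deque

def windowed_means(stream, window_size):
    """Yield the mean of the most recent `window_size` readings as each one arrives."""
    window = deque(maxlen=window_size)   # older readings are discarded automatically
    for reading in stream:
        window.append(reading)
        yield sum(window) / len(window)

# Hypothetical stream of sensor readings.
means = list(windowed_means(iter([10, 20, 30, 40, 50]), window_size=3))
print(means)
```

Only the window is ever held in memory, so the stream itself can be unbounded.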

26
Q

Data Configuration Type: Distributed

A

The data is distributed over multiple machines, in RAM and/or disk.

27
Q

Data Configuration Type: Multi-threaded

A

The data is stored in one machine, and multiple processors share the RAM of the machine.
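
The sketch below illustrates the shared-memory idea with Python threads: every worker reads slices of the same in-memory list and the partial results are combined. The names and chunk size are hypothetical. Note that in standard CPython, threads share memory but CPU-bound work like this does not necessarily run in parallel, so treat it as an illustration of the configuration rather than a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))   # one in-memory dataset shared by every thread

def partial_sum(bounds):
    """Sum one slice of the shared list."""
    start, stop = bounds
    return sum(data[start:stop])

# Split the index range into four chunks, one per worker.
chunks = [(i, i + 250) for i in range(0, len(data), 250)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)
```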