Week 1: Big Data Analytics Flashcards by Henry Cao

Human-generated Data

Examples include social media content, email messages, documents, etc.

Machine-generated Data

Examples include DBMS logs, sensor readings, network traces, etc.

Structured Data

Conforms to a data model or schema. Can be managed by a DBMS. Usually in dataset/table format. Examples include banking transactions and electronic health records.

Unstructured Data

Doesn’t conform to a data model or schema. Data types include textual and binary data. Can be stored as BLOBs (Binary Large Objects) in a DBMS or in NoSQL databases. Examples include tweets and video files.

Semi-structured Data

Non-relational data with a certain level of structure or consistency. Can be hierarchical or graph-based. Examples include spreadsheets, XML data, sensor data, and JSON data.

JSON

An open standard format that uses human-readable text to transmit data objects made up of attribute-value pairs. Used, for example, by MongoDB, which stores JSON-like documents.
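
A minimal sketch of the attribute-value idea, using Python's standard json module. The record and its field names are made up for illustration.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"user": "alice", "followers": 42, "tags": ["data", "analytics"]}

text = json.dumps(record, sort_keys=True)  # Python object -> JSON text
restored = json.loads(text)                # JSON text -> Python object

print(text)
# {"followers": 42, "tags": ["data", "analytics"], "user": "alice"}
```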

Metadata

Provides information about a dataset’s characteristics and structure. Examples include XML tags for the author and creation date of a document. In Linux, common file metadata includes size, permissions, creation date, access date, inode number, and file type. It can be viewed with the “ls -la” and “stat” commands.
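
The same file metadata can also be read programmatically. A small sketch using Python's os.stat; the describe helper and the temporary demo file are assumptions made so the example is self-contained.

```python
import os
import stat
import tempfile

def describe(path):
    """Return a few metadata fields for the file at `path` (hypothetical helper)."""
    info = os.stat(path)
    return {
        "size_bytes": info.st_size,
        "permissions": stat.filemode(info.st_mode),  # e.g. "-rw-------"
        "inode": info.st_ino,
        "last_access": info.st_atime,
        "last_modified": info.st_mtime,
    }

# Create a throwaway 5-byte file to inspect.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    demo_path = handle.name

meta = describe(demo_path)
print(meta)
os.remove(demo_path)
```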

Big Data Characteristic: Volume

Indicates data quantity, which is large and ever-growing. Specialised technologies are needed to store and process large volumes of data.

Big Data Characteristic: Velocity

Indicates data speed, which might be high. High-velocity data can change quickly. Velocity can impact elasticity (sensitivity to changes in other variables) and the time available for data processing.

Big Data Characteristic: Variety

Indicates that big data can come in multiple formats and types. Variety can impose special integration requirements, especially for joining and combining data. It also affects transformation, processing, and storage requirements.

Big Data Characteristic: Veracity

Indicates the level of bias, noise, and abnormalities in big data. As such, removing noise and invalid values is essential. This process can vary based on different requirements.

Big Data Characteristic: Value

Indicates the utility and usefulness of the data. For example, if it takes 3 days to predict the price of a stock, there’s no room for day trading.

Analytics Goal: Descriptive

Focuses on what happened, based on past data presented in a summarised form.

Analytics Goal: Diagnostic

Focuses on why something happened based on past data.

Analytics Goal: Predictive

Focuses on what is likely to happen based on existing data.

Analytics Goal: Prescriptive

Focuses on what can be done to make something happen based on existing data.

Major Computational Task: Basic Statistics

Statistically summarising data. Popular measures include mean, median, variance, count, top-N, distinct values, etc. The goal is descriptive.
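
A short sketch of these measures using only the Python standard library. The sample values are invented, and top-N is taken here to mean the N largest values.

```python
import heapq
import statistics

values = [4, 8, 8, 15, 16, 23, 42]  # illustrative sample

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.pvariance(values),  # population variance
    "top_3": heapq.nlargest(3, values),
    "distinct": sorted(set(values)),
}

for name, result in summary.items():
    print(name, result)
```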

Major Computational Task: Linear Algebraic Computation

The result is a model describing the data or a smaller dataset built from the data. The goal is descriptive, diagnostic, and predictive.
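
One possible instance, chosen here only for illustration, is a least-squares line fit: the model is a slope and an intercept that describe the data. The fit_line function and the sample points are assumptions, not part of the deck.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)  # 2.0 1.0
```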

Major Computational Task: Generalised N-body Problem

This kind of problem involves finding similarities between data points in the dataset. Examples include clustering and classification. Challenges include high dimensionality. The goal is diagnostic, predictive, and prescriptive.
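
A minimal sketch of similarity between data points, using Euclidean distance to find a nearest neighbour. The nearest_neighbour helper and the points are hypothetical; real workloads would need indexing to cope with many points and dimensions.

```python
import math

def nearest_neighbour(query, points):
    """Return the point closest to `query` by Euclidean distance."""
    return min(points, key=lambda p: math.dist(query, p))

points = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
closest = nearest_neighbour((0.9, 1.2), points)
print(closest)  # (1.0, 1.0)
```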

Major Computational Task: Graph-theoretic Computations

These computations involve data in graph form. Example tasks include searching for nodes and finding the shortest paths. Challenges include high interconnectivity. The goal is diagnostic, predictive, and prescriptive.
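
As an example of the shortest-path task, here is a breadth-first search sketch for an unweighted graph. The adjacency-list graph and the shortest_path name are illustrative assumptions.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Hypothetical directed graph as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
route = shortest_path(graph, "A", "E")
print(route)  # ['A', 'B', 'D', 'E']
```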

Major Computational Task: Optimisation

This involves finding the set of parameters that minimises or maximises the selected objective function. Can be used to find optimal models and validate findings. The goal is prescriptive.
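
A small sketch of the idea using gradient descent, one of many possible optimisation methods. The objective f(x) = (x - 3)^2, the learning rate, and the step count are all assumptions picked for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Objective f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
best_x = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(best_x, 6))  # 3.0
```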

Major Computational Task: Integration

This involves computing high-dimensional integrals of functions. The goal is predictive and prescriptive.
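
One common way to approximate such integrals is Monte Carlo sampling, sketched below over the unit hypercube. The function name, the integrand, the sample count, and the seed are assumptions for illustration.

```python
import random

def monte_carlo_integral(func, dimensions, samples=20_000, seed=0):
    """Estimate the integral of `func` over the unit hypercube [0, 1]^dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        point = [rng.random() for _ in range(dimensions)]
        total += func(point)
    return total / samples  # the unit hypercube has volume 1

# The integral of x1 + ... + x5 over [0, 1]^5 is exactly 2.5.
estimate = monte_carlo_integral(sum, dimensions=5)
print(estimate)
```

The estimate is random, so it only approaches 2.5 as the number of samples grows.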

Major Computational Task: Alignment Problems

This involves determining whether two entities are the same. Examples include finding synonyms in text, and seeing if the same entity is present in multiple images. The goal is predictive and prescriptive.

Data Configuration Type: Default

In this type, the dataset is stored in RAM.

Data Configuration Type: Streaming

The data arrives in a stream, with a part/window being stored.
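
A minimal sketch of keeping only a window of a stream in memory, here to compute a running mean. The windowed_means generator and the sample stream are hypothetical.

```python
from collections import deque

def windowed_means(stream, window_size):
    """Yield the mean of the most recent `window_size` values as each one arrives."""
    window = deque(maxlen=window_size)  # older values are dropped automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

means = list(windowed_means([1, 2, 3, 4, 5], window_size=3))
print(means)  # [1.0, 1.5, 2.0, 3.0, 4.0]
```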

Data Configuration Type: Distributed

The data is distributed over multiple machines, in RAM and/or disk.

Data Configuration Type: Multi-threaded

The data is stored in one machine, and multiple processors share the RAM of the machine.
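
A rough sketch of several workers reading one in-memory dataset, using Python threads. The dataset, chunk boundaries, and partial_sum helper are assumptions. In standard CPython the global interpreter lock limits the speed-up for CPU-bound work like this, so the example shows the shared-memory layout rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))  # one dataset held in this machine's RAM

def partial_sum(bounds):
    """Sum one slice of the shared list; every thread reads the same object."""
    start, stop = bounds
    return sum(data[start:stop])

chunks = [(0, 250), (250, 500), (500, 750), (750, 1_000)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 499500
```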

(27 cards)