Week 1: Big Data Analytics Flashcards
Human-generated Data
Examples include social media content, email messages, documents, etc.
Machine-generated Data
Examples include DBMS logs, sensor readings, network traces, etc.
Structured Data
Sticks to a data model or schema. Can be managed by DBMS. Usually in dataset/table format. Examples include banking transactions, electronic health records.
Unstructured Data
Doesn’t stick to a data model or schema. Data types include textual and binary data. Can be stored as BLOBs (Binary Large Objects) in a DBMS or in NoSQL databases. Examples include tweets, video files.
Semi-structured Data
Non-relational data with a certain level of structure or consistency. Can be hierarchical or graph-based. Examples include spreadsheets, XML data, sensor data, and JSON data.
JSON
An open standard format that uses human-readable text to transmit data objects made up of attribute-value pairs. MongoDB stores its documents in a JSON-like format.
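As a small illustration of attribute-value pairs, the sketch below uses Python's standard `json` module to turn an object into JSON text and back. The record and its field names are made up for illustration.

```python
import json

# Hypothetical record; the field names are illustrative only.
record = {"user": "alice", "followers": 120, "tags": ["data", "analytics"]}

# Serialise the object to human-readable JSON text.
text = json.dumps(record, indent=2)
print(text)

# Parse the text back into a Python dictionary.
restored = json.loads(text)
print(restored["tags"][0])
```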
Metadata
It provides information about a dataset’s characteristics and structure. Examples include XML tags for the author and the creation date of a document. In Linux, common file metadata includes size, permissions, creation date, access date, inode number, file type, etc. It can be viewed with the “ls -la” and “stat” commands.
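The same kind of file metadata can also be read from a program. This is a minimal sketch using Python's standard `os.stat`; the temporary file and the dictionary keys are assumptions made so the example is self-contained.

```python
import os
import stat
import tempfile
import time

# Create a throwaway file so the example does not depend on existing paths.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

info = os.stat(path)  # file metadata, similar to what `stat` reports

metadata = {
    "size_bytes": info.st_size,
    "permissions": stat.filemode(info.st_mode),
    "inode": info.st_ino,
    "last_access": time.ctime(info.st_atime),
    "last_modified": time.ctime(info.st_mtime),
    "is_regular_file": stat.S_ISREG(info.st_mode),
}
for key, value in metadata.items():
    print(f"{key}: {value}")

os.remove(path)  # clean up the temporary file
```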
Big Data Characteristic: Volume
Indicates data quantity, which is large and ever-growing. Specialised technologies are needed to store and process large volumes of data.
Big Data Characteristic: Velocity
Indicates data speed, which might be high. High-velocity data can change quickly. Velocity can impact elasticity (sensitivity to changes in other variables) and the time available for data processing.
Big Data Characteristic: Variety
Indicates that big data can come in multiple formats and types. Some big data has special integration requirements, especially for joining and combining datasets. Variety also affects transformation, processing, and storage requirements.
Big Data Characteristic: Veracity
Indicates the level of bias, noise, and abnormalities in big data. Removing noise and invalid values is therefore essential. This process can vary based on different requirements.
Big Data Characteristic: Value
Indicates the utility and usefulness of the data. For example, if it takes 3 days to predict the price of a stock, there’s no room for day trading.
Analytics Goal: Descriptive
Focuses on what happened, based on past data presented in a summarised form.
Analytics Goal: Diagnostic
Focuses on why something happened based on past data.
Analytics Goal: Predictive
Focuses on what is likely to happen based on existing data.
Analytics Goal: Prescriptive
Focuses on what can be done to make something happen based on existing data.
Major Computational Task: Basic Statistics
Statistically summarising data. Popular measures include mean, median, variance, count, top-N, distinct values, etc. The goal is descriptive.
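The measures listed above can be computed with Python's standard library. This is a small sketch; the sample values are invented.

```python
import statistics
from collections import Counter

values = [4, 8, 15, 16, 23, 42, 8, 4, 8]  # made-up sample data

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.variance(values),  # sample variance
    "distinct": sorted(set(values)),
    "top_2_largest": sorted(values, reverse=True)[:2],
    "top_2_most_common": Counter(values).most_common(2),
}
for name, value in summary.items():
    print(f"{name}: {value}")
```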
Major Computational Task: Linear Algebraic Computation
The result is a model describing the data or a smaller dataset built from the data. The goal is descriptive, diagnostic, and predictive.
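One common instance of this task is fitting a straight line by least squares, where the result is a two-parameter model of the data. The sketch below is only an illustration of that idea; the function name `fit_line` and the data points are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up points lying on y = 2x + 1

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```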
Major Computational Task: Generalised N-body Problem
This kind of problem involves finding similarities between data points in the dataset. Examples include clustering and classification. Challenges include high dimensionality. The goal is diagnostic, predictive, and prescriptive.
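A very small example of similarity-based classification is the nearest-neighbour rule: a new point takes the label of the closest known point. This is a sketch with invented two-dimensional points and labels, not a scalable implementation.

```python
import math

# Made-up labelled points: (coordinates, label).
labelled = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((5.5, 4.5), "b"),
]

def classify(query):
    """Return the label of the labelled point nearest to `query`."""
    _, label = min(labelled, key=lambda item: math.dist(item[0], query))
    return label

print(classify((1.1, 1.0)))
print(classify((4.8, 5.1)))
```

Comparing every query against every point is what makes this kind of task expensive as the dataset and the number of dimensions grow.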
Major Computational Task: Graph-theoretic Computations
These computations involve data in graph form. Example tasks include searching for nodes and finding the shortest paths. Challenges include high interconnectivity. The goal is diagnostic, predictive, and prescriptive.
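For an unweighted graph, a shortest path can be found with breadth-first search. The sketch below assumes the graph is stored as a dictionary of adjacency lists; the node names are invented.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back from the goal to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None

# Made-up directed graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(graph, "A", "E"))
```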
Major Computational Task: Optimisation
This involves finding the set of parameters that minimises or maximises a selected objective function. Can be used to find optimal models and validate findings. The goal is prescriptive.
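Gradient descent is one widely used way to search for such parameters. The sketch below minimises a simple made-up objective, f(x) = (x - 3)^2; the learning rate and step count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient to approach a minimum."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

# Objective f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
best_x = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(best_x)
```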
Major Computational Task: Integration
This involves computing high-dimensional integrals of functions. The goal is predictive and prescriptive.
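Monte Carlo sampling is one common way to approximate integrals in many dimensions. The sketch below estimates the integral of a made-up function over the unit hypercube; the function, dimension count, and sample size are assumptions chosen for illustration.

```python
import random

def monte_carlo_integral(f, dimensions, samples, seed=0):
    """Estimate the integral of f over the unit hypercube [0, 1]^dimensions."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(samples):
        point = [rng.random() for _ in range(dimensions)]
        total += f(point)
    # The hypercube has volume 1, so the integral equals the average value.
    return total / samples

def sum_of_squares(point):
    return sum(x * x for x in point)

# The exact integral is 5/3, so the estimate should land close to 1.667.
estimate = monte_carlo_integral(sum_of_squares, dimensions=5, samples=20_000)
print(estimate)
```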
Major Computational Task: Alignment Problems
This involves determining whether two entities are the same. Examples include finding synonyms in text, and seeing if the same entity is present in multiple images. The goal is predictive and prescriptive.
Data Configuration Type: Default
In this type, the dataset is stored in RAM.
Data Configuration Type: Streaming
The data arrives in a stream, with only a part (window) of it stored at any time.
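A sliding window can be sketched with a fixed-size `deque`: only the most recent values are kept, and older ones are dropped automatically. The readings and the window size below are invented.

```python
from collections import deque

def windowed_means(stream, window_size):
    """Yield the mean of the most recent `window_size` values seen so far."""
    window = deque(maxlen=window_size)  # old values fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

readings = [10, 20, 30, 40, 50]  # made-up sensor readings
means = list(windowed_means(readings, window_size=3))
print(means)
```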
Data Configuration Type: Distributed
The data is distributed over multiple machines, in RAM and/or disk.
Data Configuration Type: Multi-threaded
The data is stored on one machine, and multiple processors share the RAM of the machine.
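The shared-memory idea can be sketched with threads that all read from the same in-memory list. This is only an illustration of the configuration: in CPython the global interpreter lock limits how much such CPU-bound threads actually run in parallel. The chunk boundaries are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000))  # one in-memory dataset visible to every thread

def partial_sum(bounds):
    """Sum one slice of the shared list."""
    start, stop = bounds
    return sum(data[start:stop])

chunks = [(0, 250), (250, 500), (500, 750), (750, 1_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each worker handles one chunk; the partial results are then combined.
    total = sum(pool.map(partial_sum, chunks))

print(total)
```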