Week 1: Big Data Analytics Flashcards
Human-generated Data
Examples include social media content, email messages, documents, etc.
Machine-generated Data
Examples include DBMS logs, sensor readings, network traces, etc.
Structured Data
Conforms to a data model or schema and can be managed by a DBMS. Usually stored in table (dataset) format. Examples include banking transactions and electronic health records.
Unstructured Data
Doesn’t conform to a data model or schema. Data types include textual and binary data. Can be stored as BLOBs (Binary Large Objects) in a DBMS or in NoSQL databases. Examples include tweets and video files.
Semi-structured Data
Non-relational data with a certain level of structure or consistency. Can be hierarchical or graph-based. Examples include spreadsheets, XML data, sensor data, and JSON data.
JSON
An open standard format that uses human-readable text to transmit data objects made up of attribute-value pairs. Used in MongoDB, which stores documents in a JSON-like format.
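To make the attribute-value idea concrete, here is a minimal sketch using Python's standard json module. The record and its field names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical record; the field names are illustrative, not from any real dataset.
record = {"user": "alice", "followers": 120, "tags": ["data", "analytics"]}

# Serialise the object to human-readable JSON text.
text = json.dumps(record)

# Parse the text back into a Python dictionary.
restored = json.loads(text)

print(text)
print(restored["tags"][0])
```

The round trip shows that the same attribute-value pairs survive serialisation, which is what makes JSON convenient for exchanging data between systems.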
Metadata
Provides information about a dataset’s characteristics and structure. Examples include XML tags for the author and creation date of a document. In Linux, common file metadata includes size, permissions, creation date, access date, inode number, and file type. It can be viewed with the “ls -la” and “stat” commands.
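File metadata can also be read programmatically. The sketch below uses Python's os.stat on a throwaway temporary file (an assumption made so the example does not depend on any existing path) and prints a few of the fields listed above. Creation time is left out because os.stat does not expose it in a portable way.

```python
import os
import stat
import tempfile
import time

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    path = tmp.name

info = os.stat(path)

# Collect a few common metadata fields into a dictionary.
metadata = {
    "size_bytes": info.st_size,
    "permissions": stat.filemode(info.st_mode),   # e.g. "-rw-------"
    "inode": info.st_ino,
    "last_access": time.ctime(info.st_atime),
    "last_modified": time.ctime(info.st_mtime),
    "is_regular_file": stat.S_ISREG(info.st_mode),
}

for key, value in metadata.items():
    print(f"{key}: {value}")

# Clean up the temporary file.
os.remove(path)
```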
Big Data Characteristic: Volume
Indicates data quantity, which is large and ever-growing. Specialised technologies are needed to store and process large volumes of data.
Big Data Characteristic: Velocity
Indicates data speed, which might be high. High-velocity data can change quickly. Velocity can impact elasticity (sensitivity to changes in other variables) and the time available for data processing.
Big Data Characteristic: Variety
Indicates that big data can come in multiple formats and types. Some big data has special integration requirements, especially for joining and combining the data. Variety can also affect transformation, processing, and storage requirements.
Big Data Characteristic: Veracity
Indicates the level of bias, noise, and abnormalities in big data. As such, removing noise and invalid values is essential. This process can vary based on different requirements.
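As a rough illustration of removing noise and invalid values, the sketch below filters a list of sensor readings. The readings and the plausible range are hypothetical assumptions. Real cleaning rules depend on the requirements of the task.

```python
import math

# Hypothetical temperature readings in degrees Celsius.
# None and NaN stand for invalid values; 250.0 and -80.0 stand for noise.
readings = [21.4, 21.9, None, 22.1, float("nan"), 250.0, 21.7, -80.0]

# Assumed plausible range for this sensor; a real project would set its own.
VALID_RANGE = (-40.0, 60.0)


def is_valid(value):
    """Return True if the reading is present, numeric, and within range."""
    if value is None or math.isnan(value):
        return False
    low, high = VALID_RANGE
    return low <= value <= high


cleaned = [r for r in readings if is_valid(r)]
print(cleaned)
```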
Big Data Characteristic: Value
Indicates the utility and usefulness of the data. For example, if it takes 3 days to predict the price of a stock, the prediction is of no use for day trading.
Analytics Goal: Descriptive
Focuses on what happened, based on past data presented in a summarised form.
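A summarised view of past data can be as simple as a handful of statistics. The sketch below computes them with Python's standard statistics module. The daily sales figures are hypothetical.

```python
import statistics

# Hypothetical daily sales for one past week.
daily_sales = [120, 135, 128, 150, 142, 138, 160]

# Summarise what happened using basic descriptive statistics.
summary = {
    "count": len(daily_sales),
    "total": sum(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```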
Analytics Goal: Diagnostic
Focuses on why something happened, based on past data.
Analytics Goal: Predictive
Focuses on what is likely to happen based on existing data.
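One very simple way to estimate what is likely to happen is to fit a straight-line trend to existing data and extend it. The sketch below does this with ordinary least squares written out by hand. The demand figures are hypothetical and deliberately perfectly linear so the result is easy to check. Real predictive models are usually far more involved.

```python
# Hypothetical weekly demand observed so far.
weeks = [1, 2, 3, 4, 5]
demand = [10.0, 12.0, 14.0, 16.0, 18.0]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(demand) / n

# Ordinary least squares estimates for slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, demand)) / sum(
    (x - mean_x) ** 2 for x in weeks
)
intercept = mean_y - slope * mean_x


def predict(week):
    """Extend the fitted trend line to a given week."""
    return intercept + slope * week


print(predict(6))
```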