Data Processing Lifecycle Flashcards
What are the five stages of the data processing lifecycle?
- Data ingestion and integration
- Data Processing
- Data Storage
- Data Analysis
- Reporting
What does the first stage of the data processing lifecycle involve?
- Collecting data from a variety of sources, transforming it (if needed) to match the target storage, and then loading it into that storage
- When data is integrated from multiple sources we have to be aware of data heterogeneity
What types of data make up heterogeneous data?
- XML files
- JSON Files
- Weblogs and Cookies
- SQL Queries
- Flat files
Why is heterogeneous data increasing?
The rise of new technologies is increasing the amount and variety of data produced
What is data heterogeneity?
- Data made up of different types, sources, structures or formats.
- May be in a structured, unstructured or semi-structured format
What is structured data?
- Conforms to a well-defined schema - schema on write
- Often tabular (rows are datapoints and columns are attributes)
Where is structured data stored?
- Relational Databases
- Data Warehouses
- Legacy Data Systems
What is semi-structured data?
- The schema is not completely defined by a data model – no schema on write
- May be in the format of: HTML files, XML files or JSON files
- The size, order and contents of the elements can be different
- Often used with IoT devices
What formats might semi-structured data be in and where might they be used?
- HTML files
- XML files
- JSON files
- Often used in IoT devices
What is unstructured data?
- No formal description of schema - no schema on write
- Human readable, requires pre-processing for a computer to extract information.
What types of file formats make up unstructured data?
- Text files
- Image Files
- Video files
- Audio files
What are data ingestion and integration frameworks?
- Often carried out as a single technical solution
- Have been used in data warehouses for a long time
- Uses the ETL process
What is the ETL process?
- Extract: collects raw data, often in a structured format
- Transform: processes the data into a format matching the end destination
- Load: stores the transformed data into its new storage location
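A minimal sketch of the ETL steps in Python, assuming a hypothetical CSV source and SQLite target (the file, table, and column names are illustrative, not taken from the course material):

```python
import csv
import sqlite3

def extract(path):
    # Extract: collect raw rows from a CSV source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean the data to match the target schema
    return [
        {"name": r["name"].strip().title(), "age": int(r["age"])}
        for r in rows
        if r.get("age")  # drop rows with a missing age
    ]

def load(rows, db_path="warehouse.db"):
    # Load: store the transformed rows in the target storage
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (:name, :age)", rows)
    conn.commit()
    conn.close()

load(transform(extract("source.csv")))
```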
What might be carried out during the transformation process of ETL?
- Data cleaning
- Data enrichment
- Feature engineering
Where is ETL traditionally used?
Batch processing
What other approaches can be used instead of ETL?
- IoT Hubs
- Digital Twins
- Data Pipeline Orchestrators
- Bulk import tools
- Data Streaming Platforms
What are the requirements for a data integration tool?
- Different protocols to support data collection from different sources
- Support for integration processes across different hardware and operating systems
- Scalability and adaptability
- Integrated capabilities for transformation operations (including fundamental and complex transformations)
- Security mechanisms to protect the data in the pipeline
- Visualisation of data flow (not necessary, but offered by many tools)
What is data variety?
The diversity of data types
What is data veracity?
The level of trust in the collected data
What is meant by data velocity?
The speed of data generation and its movement
What are challenges with data integration and ingestion?
- The increasing variety and veracity of data
- Processing large amounts of data with high velocity
- New requirements due to increased interest in new technologies
- Cybersecurity – data needs to be secure, trusted and accountable
- Encryption to protect data during transport
How do data processing frameworks work?
- Distribute storage and processing over several nodes
- Transform data in several steps
- Can efficiently store, access and process large amounts of data
How is the transformation process generally modelled in a data processing framework?
As a Directed Acyclic Graph
How are Directed Acyclic Graphs used in data processing?
- Each stage has an input and an output
- Inputs can be used for multiple tasks so dependencies are clearly defined and there are no feedback loops
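As an illustration, a pipeline's tasks and dependencies can be written as a dictionary and executed in topological order (the task names here are invented for the example):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it depends on;
# no feedback loops are allowed, so the graph is acyclic.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "enrich": {"ingest"},
    "aggregate": {"clean", "enrich"},
    "report": {"aggregate"},
}

# A scheduler can run tasks in this dependency-respecting order.
print(list(TopologicalSorter(pipeline).static_order()))
# e.g. ['ingest', 'clean', 'enrich', 'aggregate', 'report']
```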
What are the components of a Directed Acyclic Graph used in data processing?
- Orchestrator: supervises task execution
- Scheduler: coordinates tasks
- Executor: runs the tasks
- Metadata component: monitors the state of the pipeline
What is batch processing?
- Raw data are collected at given time intervals (hourly, daily, weekly)
- Data is transformed and loaded into centralised storage
- Process can be automated to run without human intervention
What is stream processing?
- Real-time processing
- Shortens processing intervals into smaller windows (typically sub-second)
- May use an event driven approach where data are processed as soon as they are available
What is the event driven approach in stream processing?
- Often implemented as a publish-subscribe architecture
- Data-producing sources notify the stream processor of new data/events
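A minimal in-memory sketch of the publish-subscribe idea; a real deployment would use a broker such as Kafka or an IoT/Event Hub, and the class and topic names below are invented for illustration:

```python
from collections import defaultdict

class MessageBus:
    """Toy publish-subscribe broker: producers publish events to a topic,
    and subscribers are notified as soon as new data is available."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)  # the stream processor reacts immediately

bus = MessageBus()
bus.subscribe("sensor/temperature", lambda e: print("process:", e))
bus.publish("sensor/temperature", {"value": 21.5, "unit": "C"})
```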
What is the difference between batch processing and stream processing?
- Batch data are bound (we know the size and number of data entries before processing)
- Stream data are unbound (size and number of data is unknown).
What data batch size does batch processing handle?
Large data batches
What data batch size does stream processing handle?
Individual records or microbatches
What is the latency of batch and stream processing?
- Batch processing is high latency
- Stream processing is low latency
What type of tasks is batch processing used for?
Complex data processing
What types of tasks is stream processing used for?
Time sensitive data or quick insights
What is a Lambda architecture?
- A data processing design pattern that supports both batch and stream processing to handle massive amounts of data
- Made up of a batch, speed and serving layer
How does a Lambda architecture work?
- Abstracts the underlying implementation.
- Enables complex analysis with batch processing and quick insights through stream processing
What is Kappa Architecture and how does it work?
- No separate batch layer
- Batch and speed layers are integrated into a single processing layer which is capable of both batch and stream processing
What are the benefits of using Kappa architecture?
- Easier to implement than lambda architecture
- Simplified architecture
- Easier to set up and maintain
- More secure as there is a smaller attack surface
Is it more efficient to set up your own processing architecture or use a managed service?
- More resource efficient to use a managed service than setting up, maintaining and administering your own service.
- AWS, Azure and Google Cloud Platform have the biggest market share.
- Apache also offers solutions through Hadoop
What services does Azure offer for batch processing?
- Azure Synapse – advanced analytics
- Azure Data Lake Analytics – analytical services for Azure Data Lake
- HDInsight – for Hadoop
- Databricks – large scale analytics based on Spark
What are examples of use cases for stream processing?
- Fraud detection
- Social media analysis
- Log monitoring
- Customer analysis
What services does Azure offer for stream processing?
- HDInsight with Spark Streaming or Storm
- Databricks
- Azure Stream Analytics
- Azure Functions
- Azure App Service WebJobs
- IoT and Event Hubs
What do you need to consider when choosing a stream processing solution?
- Which programming languages are supported
- Programming paradigm (declarative or imperative)
- Pricing model
- Available connectors.
What are the different pricing models for stream processing?
- Per Streaming Unit
- Per Active Cluster Hour
- Per Executed Function
What are data sources?
Systems that generate data
What are data sinks?
- Storage systems designed to receive data from different sources
- Used for processing, analysis or retrieval
What components are available for lambda architecture on AWS?
- Kinesis Firehose for data ingestion
- S3 (Amazon Simple Storage Service) - cloud based distributed file system
- AWS Glue for batch processing
- Athena: serverless interactive query service for data in S3
- Amazon QuickSight: reporting tool
What is primary data storage?
Holding data in memory or the CPU during program execution
What is secondary data storage?
Storing data in hardware devices so that the data is available for future use.
What is physical secondary storage?
Hard Disk Drives, USB Flash Drives, SD Cards, Solid State Drives or other physical media used as direct attached storage (a storage drive that is directly connected to a computer).
What is cloud storage?
- Storage solutions deployed over a network
- An infrastructure of interconnected servers
Why is cloud storage suited for data intensive applications?
- Can store large amounts of data
- Distribute data across the physical storage of individual machines
What are the two ways stored files can be encoded?
Text files or Binary files. The file content of both is a sequence of bits.
What are text files?
- Human readable files
- The bits represent characters
- Can be stored as plain text (txt) or rich text (rtf)
- May also be a CSV table
What are binary files?
- Computer readable files
- A sequence of bytes, optimised for computer access
- Better at data compression – faster access and smaller files
- Images, audio, video files (or any file that needs translation by a computer program to read the content)
What is data serialisation?
The process of translating text data into binary data
What is data de-serialisation?
The process of translating binary data into text data
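For example, Python's built-in json and pickle modules illustrate both directions; this is a generic sketch, not tied to any particular storage system:

```python
import json
import pickle

record = {"sensor": "t1", "readings": [21.5, 21.7]}

# Serialisation: in-memory/text data -> a form suitable for storage or transport
as_text = json.dumps(record)       # human-readable text form
as_binary = pickle.dumps(record)   # compact binary form

# De-serialisation: stored bytes/text -> usable data again
print(json.loads(as_text))
print(pickle.loads(as_binary))
```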
What are the five fundamental storage categories?
- Files systems
- Data Lakes
- Data Warehouses
- Relational Databases
- NoSQL databases
What are local file systems?
- Generally store unstructured data objects
- Hierarchy of drives, directories, subdirectories and files
- Generally focus on sharing and access aspects of small datasets rather than large distributed datasets
What are cloud-based file systems?
A collaborative platform to store and share files remotely
What features do cloud-based file systems offer?
- Automated backups
- Synchronisation of data
- User-specific file or folder access
- Versioning
- Data security services
- User friendly interface
What are examples of cloud based file systems?
- Sync
- pCloud
- iceDrive
- Google Drive
- Sharepoint
- OneDrive
- Dropbox
What is failover management?
The ability to switch automatically and seamlessly to a reliable backup system
What are examples of managed cloud file systems?
- Amazon S3: distributed storage for objects, with files and metadata stored in buckets
- Amazon Elastic File System: used for shared file storage of Elastic Cloud Compute instances (which are cloud-based AWS virtual machines)
- Google Cloud Platform provides Cloud Storage as a managed service for text and binary objects (uses the same technical approach as HDFS)
- Azure Blob Storage: data stored in containers and organised into virtual folders.
- Azure File Storage: mountable file system providing shared access
- Azure Queues: stream and batch processing
- Azure Table Storage: column-oriented NoSQL database
Why is Azure Blob Storage easily converted to an Azure Data Lake?
- Data is stored in containers
- Organised into virtual folders
- Internally the data is stored as flat storage
What are Data Lakes?
- Large storage repositories for raw and unstructured data
- Data is usually integrated from multiple departments and from various data sources
- Data are stored with metadata to support different types of analysis
- Generally use different stages for different processing levels.
How does Databricks’ Delta Lake process data in stages?
- Bronze stage: holds raw and untreated data
- Silver stage: Data are refined and preprocessed
- Gold stage: Data are entirely prepared and usable for analysis (such as training machine learning models)
Which cloud providers offer a Data Lake service built on their distributed storage system?
- Azure Data Lake Store: built on Azure Blob Storage
- AWS Data Lake Solution: built on S3
- Google offers a Data Lake built on Cloud Storage
What are relational databases?
- Relational Database Management System (RDBMS)
- Traditional solution for structured data
- Schema based model, data stored in tables
- Enforce a schema on write
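A small sqlite3 sketch of the schema-on-write idea: the table structure is defined before any data is inserted (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: tables, columns and types are fixed up front
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT
    )
""")

# Rows are records, columns are attributes
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Ada Lovelace", "ada@example.com"))

for row in conn.execute("SELECT id, name, email FROM customers"):
    print(row)
```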
How does an RDBMS guarantee reliable transactions?
- Uses ACID properties
- Referential integrity
- Uses data normalisation
What does data normalisation in an RDBMS enable?
Enables SQL to perform efficient queries
What do rows contain in an RDBMS?
Records
What do columns contain in an RDBMS?
Attributes
What are ACID properties?
- Atomicity
- Consistency
- Isolation
- Durability
What is data normalisation?
Organising data into separate, related tables so that fields and records stay consistent and redundancy is reduced
What are examples of commercial RDBMS?
- Microsoft SQL Server
- Oracle Database
What are examples of open-source RDBMS?
- MariaDB
- PostgreSQL
- MySQL
- SQLite
What is Database as a Service (DBaaS)?
- Managed solutions
- Partition and distribute the data across clusters
What are the benefits of Database as a Service (DBaaS)?
- Automatically scale to the workload
- Perform automated backups, security audits and patches.
What are examples of Database as a Service solutions?
- Cloud SQL (Google)
- Relational Database Service (AWS)
- Aurora (AWS)
- Azure SQL Database
What are data warehouses?
- Store structured data
- Similar to relational databases
- Integrate data from various sources, aggregating it in a clear homogeneous format so that the data can be used in analysis
What are examples of data warehouse solutions?
- AWS Redshift
- Microsoft Azure Synapse
- Google Cloud BigQuery
- Snowflake
What are NoSQL databases?
- Do not enforce a schema on write - infer a schema on read
- More flexible than RDBMS
What categories of data can be stored in a NoSQL database?
- Structured data
- Semi-structured data
- Unstructured data
What are the disadvantages of using NoSQL databases?
- Can’t be sure that data follows a specific structure
- Can be less efficient than RDBMS, especially when it comes to joining data
What is the key difference between a RDBMS and NoSQL database?
An RDBMS enforces referential integrity and a fixed schema, whereas a NoSQL database uses a flexible schema.
What are the different types of NoSQL Databases?
- Key-value oriented
- Document oriented
- Column oriented
- Graph oriented
What are key value oriented NoSQL databases?
- Similar to dictionaries, each value is mapped to a key
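Conceptually this works like a Python dictionary; a minimal sketch using a plain dict (a real deployment would use a store such as Redis, whose commands differ):

```python
# Key-value store in miniature: each value is looked up by a unique key
session_store = {}

# PUT: store a user's session data under a session key
session_store["session:42"] = {"user": "alice", "cart": ["book", "pen"]}

# GET: retrieve the value by its key (no joins, no fixed schema)
print(session_store.get("session:42"))

# DELETE: remove the entry when the session expires
session_store.pop("session:42", None)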
What are examples of key value oriented database solutions?
- Redis
- Memcached
- etcd
- Riak KV
- LevelDB
- Amazon Simple DB
What might key value oriented databases be used for?
- Storing user profiles and session information in web apps
- Storing contents of shopping carts in e-commerce
- Storing product details for e-commerce
- Structural information for system maintenance (eg IP forwarding tables)
- IoT readings
What are document oriented databases?
- Use a key-value approach, but store objects in collections of key-value pairs
- Allow nested data structures
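A hedged sketch using pymongo, assuming a MongoDB server is reachable on localhost; the database, collection, and field names are invented for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # a collection of documents

# Each document is a nested set of key-value pairs (stored as BSON)
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "country": "UK"},              # nested structure
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})

# Query by a nested field without a fixed table schema
print(orders.find_one({"customer.name": "Ada"}))
```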
What file types can document oriented databases hold?
- XML
- JSON
- BSON
What are JSON and BSON files?
- JSON: JavaScript Object Notation file
- BSON: a binary JSON file
What are examples of document oriented database solutions?
- MongoDB
- Couchbase
- CouchDB
- Google Cloud Firestore
- RethinkDB
- Amazon DocumentDB
What are column oriented databases?
- Also called wide column stores
- Store records by column, rather than by row
- Columns can be grouped by similar access patterns into column families
What are the benefits of column oriented databases?
- Optimised for analysis involving frequent aggregation
- Allows for direct access to columns
- Efficient data compression
What might a table in a column oriented database be referred to as?
A Keyspace
What are examples of column oriented database solutions?
- Cassandra
- HBase
- Microsoft Azure Table Storage
- Amazon Keyspaces
What are graph oriented databases?
- Designed for heavily interconnected data
- Nodes represent entities and edges represent the relationships between entities
- Uses directional connections to show relationships
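The underlying idea can be sketched with a plain adjacency structure in Python; a real graph database such as Neo4j would use a query language like Cypher instead, and the node and relationship names below are made up:

```python
# Nodes are entities; directed edges carry the relationship type
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("alice", "LIKES", "post_17"),
]

def neighbours(node, relation):
    """Traverse outgoing edges of a given relationship type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# Who does alice follow, and who do they follow in turn?
for friend in neighbours("alice", "FOLLOWS"):
    print(friend, "->", neighbours(friend, "FOLLOWS"))
```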
What might graph oriented databases be used for?
- Topological maps
- Routing systems (ie GPS)
- Social networks
What are examples of graph oriented database solutions?
- Neo4j (most popular)
- JanusGraph
- TigerGraph
- NebulaGraph
What are multi-model databases?
- Some databases provide multi model APIs
- Many databases can be categorised with a main model, but also provide a secondary model
How does Azure CosmosDB provide a multi-model approach?
- Provides APIs for classic SQL, MongoDB, Cassandra or Gremlin (with a graph based approach)
- Supports multiple data models such as document oriented, column oriented, key value oriented and graph oriented.
What database solutions have a multi-model approach?
- Amazon DynamoDB
- MarkLogic
- Aerospike
- Google Cloud BigTable
- Ignite
- ArangoDB
- OrientDB
- Apache Drill
- Amazon Neptune
- Apache Druid
- GraphDB
What is the purpose of data analysis?
- Gaining insights into the data, extracting information and knowledge from it
What are the different types of data analysis?
- Descriptive
- Prescriptive
- Predictive
What is descriptive data analysis?
- The explanation of past or present events
- Uses statistical analysis of historical data
- Data driven models can be used for more detailed analysis such as root-cause analysis
What is predictive data analysis?
- Predicts future events, e.g. stock prices or customer churn
- Data-driven models are constructed from historical data to learn underlying patterns; these models project past patterns into the future to predict future occurrences, e.g. future stock prices based on past stock prices
What is prescriptive data analysis?
- Investigates the outcomes of different scenarios from models
- Recommends decisions leading to the most favourable predicted future event
- Used in climate impact research
What are some categories of data analysis?
- Machine Learning
- Deep Learning
- Time series analysis
What is the difference between artificial intelligence, machine learning and deep learning?
Artificial intelligence is the wider field of technology, and machine learning is a part of this. Deep learning is a specific field of machine learning.
What is machine learning?
- Automatic extraction of informative patterns, without explicit instructions of how to carry this out.
- Creates data driven models and is often carried out on tabular and structured data stored in a data lake or warehouse
What are the components of machine learning?
- Rows represent observations and are called samples or data points
- Columns represent attributes and are called features
- The "to be predicted" column is called the label
What are the two types of machine learning?
- Supervised learning
- Unsupervised learning
What is a common approach to machine learning?
- Using a subset of the data to train the model
- Using the remaining data to test the quality of the model’s predictions
- This process is called data partitioning into a training and testing set
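A minimal scikit-learn sketch of this partitioning step on a synthetic dataset (scikit-learn is assumed to be installed; the 80/20 split ratio is just an example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data: rows are samples, columns are features, y is the label
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Partition into a training set (80%) and a testing set (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Test the quality of the model's predictions on data it has never seen
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```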
What is supervised learning?
- Samples with known labels are used to train a model to predict the label for unseen samples
- Labels are the explicit assignment of information to each record
- The model learns the relationship between the features and the target variable using the labelled dataset
What happens during the learning phase in machine learning?
- Called the training phase
- The model is constructed from the training data
- Once the model is constructed it can be used to predict target variables through inference.
What are classification and regression tasks?
- Classification: predicting discrete categories (classes), e.g. spam filtering
- Regression: predicting numeric values, e.g. tomorrow's air temperature
Name some classification algorithms
- Logistic regression
- Decision trees
- Random forest
- Support Vector Machines (SVM)
- Naïve Bayes
- K Nearest Neighbours (KNN)
- Gradient boosting
Name some regression algorithms
- Linear regression
- Ridge
- Lasso
- Elastic Net Regression
- Decision Trees
- Random Forest
- Support Vector Regression (SVR)
What is unsupervised learning?
- Uses algorithms to discover underlying patterns in unlabelled data
What are common techniques for unsupervised learning?
- Clustering
- Anomaly detection
- Dimensionality reduction
What is clustering in unsupervised learning?
- Assigning data points to clusters that are not known before the analysis
What algorithms are used in clustering?
- K-Means
- Hierarchical Clustering
- DBSCAN
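A short scikit-learn sketch of clustering unlabelled data with K-Means; the number of clusters and the synthetic data are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: only feature values, no target column
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means assigns each data point to one of k clusters discovered from the data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])       # cluster assignment per data point
print(kmeans.cluster_centers_)   # coordinates of the learned cluster centres
```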
What is anomaly detection in unsupervised learning?
- Data points are assigned to either a "regular" or an "anomalous" cluster
What might anomaly detection be used for?
- Fraud detection
- Intrusion detection
What algorithms are used in anomaly detection?
- Local Outlier Factor (LOF)
- One-Class SVMs
- Isolation Forests
What is dimensionality reduction in unsupervised learning?
- Transforming datasets to reduce the number of columns while preserving the information as much as possible
- Used to make large datasets with many columns easier to understand
What algorithms are used in dimensionality reduction?
- Principal Component Analysis (PCA)
- t-SNE
- Linear Discriminant Analysis (LDA)
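A brief scikit-learn sketch reducing a many-column dataset to two components with PCA (the component count is an arbitrary example choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 feature columns

# Reduce to 2 columns while preserving as much information (variance) as possible
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```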
What is Deep Learning?
- Uses neural networks containing hidden layers – learning is "deep" when a number of hidden layers are present
- There is no defined or fixed number of hidden layers that must be present
- Highly parallel
How does Deep Learning work?
- Uses the backpropagation algorithm to train neural networks to find suitable weights
What hardware can be used to efficiently train deep learning models?
GPUs due to their multiple processing units
What are examples of deep learning models?
*Artificial Neural Networks
*Convolutional Neural Networks
What are Artificial Neural Networks?
- Inspired by biological neural networks
- Made up of nodes and connections
- Positive weights indicate strong connections between nodes
- Negative weights are used to discourage connections
- Basic unit is a perceptron
How do simple linear perceptrons work?
- Multiple weighted inputs are added together
- Outputs are activated if a threshold is met
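A minimal NumPy sketch of that forward pass; the weights, bias, and inputs are arbitrary values chosen for illustration:

```python
import numpy as np

def perceptron(inputs, weights, bias, threshold=0.0):
    # Multiple weighted inputs are added together...
    weighted_sum = np.dot(inputs, weights) + bias
    # ...and the output activates (fires) only if the threshold is met
    return 1 if weighted_sum >= threshold else 0

x = np.array([0.5, 1.0, -0.2])      # input features
w = np.array([0.8, -0.3, 0.5])      # positive and negative connection weights
print(perceptron(x, w, bias=0.1))   # -> 0 or 1
```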
How can simple linear perceptrons be expanded?
- Concatenating multiple layers
- Using certain activation functions
What are multi-layer perceptrons?
- More complex
- Used to solve problems that are not linearly separable
- Additional layers are called hidden layers
What are Convolutional Neural Networks?
- Performant solutions for object recognition in images
- If a CNN has multiple layers it can perform automatic extraction of informative features
What is Reinforcement Learning?
- Not strictly deep learning
- Used for optimised decision making
- Rewards agents for interacting with the simulation environment
- Can use deep learning algorithms as well as other algorithms
What is Transfer Learning?
- Re-training an existing model to match a particular use case
- The last few layers of the model are removed and retrained with the new data
When would you use a Deep Learning algorithm?
- Situations with a lot of training data or non-tabular data such as images or videos
- This is because of the advances in image processing, natural language processing and automatic feature learning
- If data is scarce or strictly tabular a different algorithm may be a better choice
What is Time Series Analysis?
- Analysing data indexed by time
- Learns patterns in historical data to forecast future values
- The time index is important for data management and should be treated as such in the data system
What technologies can be used for Time Series Analysis?
- Holt-Winters Smoothing
- Autoregressive Moving Average (ARMA) models
- Extensions like ARIMA, SARIMA and SARIMAX models
- Ensemble models like TBATS
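A hedged statsmodels sketch of fitting an ARIMA model and forecasting a few steps ahead; the series values and the (p, d, q) order are arbitrary example choices:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A small time-indexed series (in practice, historical sensor or sales data)
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# Fit an ARIMA(1, 1, 1) model to learn patterns in the historical data
model = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next three time steps
print(model.forecast(steps=3))
```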
How are Time Series Analysis and Machine Learning similar?
- Machine Learning uses multiple features to predict labels
- Time Series Analysis uses a single time-indexed variable to predict future values
- Automated feature extraction libraries for Time Series Analysis have been adapted for use in machine learning.
What is MLOps?
- Machine Learning Operations
- Frameworks used to streamline machine learning applications
What is CRISP-DM?
- Cross-Industry Standard Process for Data Mining
- Canonical framework for development of machine learning models
- Requires collaboration between data management and machine learning teams
What are the stages of the CRISP-DM model?
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
What is Data Reporting?
- The final phase of the data processing lifecycle
- Shows insights from aggregated information
- Uses Business Intelligence (BI)
How is data reporting monitored?
- Metrics
- KPIs
What output does data reporting produce?
- Visualisations
- Dashboards
- Text reports
What is the objective of data reporting?
Develop understanding and improve decisions based on information rather than intuition or traditional rules
What is Business Intelligence?
- Processes and tools for data analysis to gain actionable insights about an organisation’s operations
- Looks at internal and external sources of data to reflect the organisation’s reality
What can Business Intelligence be applied to?
- Data-driven decisions
- Business decisions
What insights can BI offer?
- Decision making based on evidence
- Real time analytics
- Details about processes
- Discovery of new business opportunities
- Control planning commitments
- Monitoring using KPIs
What are the key features of BI tools?
- Designed to be easy to use
- Address data integration and analysis
- Offer a GUI for data integration to create ETL processes
- Supports creation of dashboards and interactive reports
- Focus on visual representation and interactive data exploration
What are examples of available BI tools?
- Tableau
- Microsoft Power BI
- Qlik Sense