Data Processing Lifecycle Flashcards

Section 1

1
Q

What are the five stages of the data processing lifecycle?

A
  • Data ingestion and integration
  • Data Processing
  • Data Storage
  • Data Analysis
  • Reporting
2
Q

What does the first stage of the data processing lifecycle involve?

A
  • Collecting data from a variety of sources, transforming the data (if needed) to match the target storage, and then storing it in the target storage
  • When data is integrated from multiple sources, we have to be aware of data heterogeneity
3
Q

What types of data make up heterogeneous data?

A
  • XML files
  • JSON Files
  • Weblogs and Cookies
  • SQL Queries
  • Flat files
4
Q

Why is heterogeneous data increasing?

A

The rise in technology has increased the amount of data produced

5
Q

What is data heterogeneity?

A
  • Data made up of different types, sources, structures or formats.
  • May be in a structured, unstructured or semi-structured format
6
Q

What is structured data?

A
  • Conforms to a well-defined schema - schema on write
  • Often tabular (rows are datapoints and columns are attributes)
7
Q

Where is structured data stored?

A
  • Relational Databases
  • Data Warehouses
  • Legacy Data Systems
8
Q

What is semi-structured data?

A
  • Schema is not completely defined by a data model - no schema on write
  • The size, order and contents of the elements can be different
  • Often used with IoT devices
9
Q

What formats might semi-structured data be in and where might they be used?

A
  • HTML files
  • XML files
  • JSON files
  • Often used in IoT devices
10
Q

What is unstructured data?

A
  • No formal description of schema - no schema on write
  • Human readable, requires pre-processing for a computer to extract information.
11
Q

What types of file formats make up unstructured data?

A
  • Text files
  • Image Files
  • Video Files
  • Audio Files
12
Q

What are data ingestion and integration frameworks?

A
  • Often carried out as a single technical solution
  • Have been used in data warehouses for a long time
  • Uses the ETL process
13
Q

What is the ETL process?

A
  • Extract: collects raw data, often in a structured format
  • Transform: processes the data into a format matching the end destination
  • Load: stores the transformed data into its new storage location
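A minimal sketch of an ETL step in Python, assuming a hypothetical sales.csv source and a local SQLite database as the target (file, table and column names are illustrative):

```python
import sqlite3
import pandas as pd

# Extract: collect raw data from a source (hypothetical CSV file)
raw = pd.read_csv("sales.csv")

# Transform: clean and reshape the data to match the target schema
raw = raw.dropna(subset=["order_id"])                 # data cleaning
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["total"] = raw["quantity"] * raw["unit_price"]    # simple feature engineering

# Load: store the transformed data in the target storage
with sqlite3.connect("warehouse.db") as conn:
    raw.to_sql("sales", conn, if_exists="append", index=False)
```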
14
Q

What might be carried out during the transformation process of ETL?

A
  • Data cleaning
  • Data enrichment
  • Feature engineering
15
Q

Where is ETL traditionally used?

A

Batch processing

16
Q

What other approaches can be used instead of ETL?

A
  • IoT Hubs
  • Digital Twins
  • Data Pipeline Orchestrators
  • Bulk import tools
  • Data Streaming Platforms
17
Q

What are the requirements for a data integration tool?

A
  • Different protocols to support data collection from different sources
  • Support for integration processes across different hardware and operating systems
  • Scalability and adaptability
  • Integrated capabilities for transformation operations (including fundamental and complex transformations)
  • Security mechanisms to protect the data in the pipeline
  • Visualisation of data flow (not necessary, but offered by many tools)
18
Q

What is data variety?

A

The diversity of data types

19
Q

What is data veracity?

A

The level of trust in the collected data

20
Q

What is meant by data velocity?

A

The speed of data generation and its movement

21
Q

What are challenges with data integration and ingestion?

A
  • The increasing variety and veracity of data
  • Processing large amounts of data with high velocity
  • New requirements due to increased interest in new technologies
  • Cybersecurity - data needs to be secure, trusted and accountable
  • Encryption to protect data during transport
22
Q

How do data processing frameworks work?

A
  • Distribute storage and processing over several nodes
  • Transform data in several steps
  • Can efficiently store, access and process large amounts of data
23
Q

How is the transformation process generally modelled in a data processing framework?

A

As a Directed Acyclic Graph

24
Q

How are Directed Acyclic Graphs used in data processing?

A
  • Each stage has an input and an output
  • Inputs can be used for multiple tasks so dependencies are clearly defined and there are no feedback loops
25
Q

What are the components of a Directed Acyclic Graph used in data processing?

A
  • Orchestrator: supervises task execution
  • Scheduler: coordinates tasks
  • Executor: runs the tasks
  • Metadata component: monitors the state of the pipeline
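To illustrate how tasks and dependencies form a DAG, here is a minimal sketch using Python's standard-library graphlib; the task names are hypothetical and no particular orchestrator is implied:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it needs
pipeline = {
    "extract": set(),
    "clean":   {"extract"},
    "enrich":  {"extract"},          # one output reused by multiple tasks
    "report":  {"clean", "enrich"},  # dependencies are explicit, no feedback loops
}

# A scheduler may run tasks in any order consistent with the dependencies
print(list(TopologicalSorter(pipeline).static_order()))
# e.g. ['extract', 'clean', 'enrich', 'report']
```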
26
Q

What is batch processing?

A
  • Raw data are collected at given time intervals (hourly, daily, weekly)
  • Data is transformed and loaded into centralised storage
  • Process can be automated to run without human intervention
27
Q

What is stream processing?

A
  • Real-time processing
  • Shortens processing intervals into smaller windows (typically sub-second)
  • May use an event driven approach where data are processed as soon as they are available
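A toy sketch of windowed stream processing in plain Python; the event source and the one-second window size are assumptions made only for illustration:

```python
import time
from collections import deque

def process_window(events):
    # Placeholder analysis over one small window/micro-batch
    print(f"processed {len(events)} events, sum={sum(events)}")

window, window_start = deque(), time.monotonic()
for event in range(10):                # stand-in for an unbounded event source
    window.append(event)
    time.sleep(0.3)
    if time.monotonic() - window_start >= 1.0:   # ~1-second processing window
        process_window(window)
        window, window_start = deque(), time.monotonic()
if window:
    process_window(window)             # flush the final partial window
```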
28
Q

What is the event driven approach in stream processing?

A
  • Often implemented as a publish-subscribe architecture
  • Data-producing sources notify the stream processor of new data/events
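A minimal publish-subscribe sketch in plain Python: a producer notifies the broker of a new event and subscribed processors run as soon as the data arrive (class, topic and event names are illustrative):

```python
class Broker:
    def __init__(self):
        self.subscribers = {}                        # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        for callback in self.subscribers.get(topic, []):
            callback(event)                          # event-driven: handled on arrival

broker = Broker()
broker.subscribe("sensor/temperature", lambda e: print("stream processor got", e))
broker.publish("sensor/temperature", {"value": 21.5})   # producer announces new data
```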
29
Q

What is the difference between batch processing and stream processing?

A
  • Batch data are bounded (we know the size and number of data entries before processing)
  • Stream data are unbounded (the size and number of data entries are unknown).
30
Q

What data batch size does batch processing handle?

A

Large data batches

31
Q

What data batch size does stream processing handle?

A

Individual records or microbatches

32
Q

What is the latency of batch and stream processing?

A
  • Batch processing is high latency
  • Stream processing is low latency
33
Q

What type of tasks is batch processing used for?

A

Complex data processing

34
Q

What types of tasks is stream processing used for?

A

Time sensitive data or quick insights

35
Q

What is a Lambda architecture?

A
  • A data processing design pattern that supports both batch and stream processing to handle massive amounts of data
  • Made up of a batch, speed and serving layer
36
Q

How does a Lambda architecture work?

A
  • Abstracts the underlying implementation.
  • Enables complex analysis with batch processing and quick insights through stream processing
37
Q

What is Kappa Architecture and how does it work?

A
  • No serving layer
  • Batch and speed layers are integrated into a single processing layer which is capable of both batch and stream processing
38
Q

What are the benefits of using Kappa architecture?

A
  • Easier to implement than lambda architecture
  • Simplified architecture
  • Easier to set up and maintain
  • More secure as there is a smaller attack surface.
39
Q

Is it more efficient to set up your own processing architecture or use a managed service?

A
  • More resource efficient to use a managed service than setting up, maintaining and administering your own service.
  • AWS, Azure and Google Cloud Platform have the biggest market share.
  • Apache also offers solutions through Hadoop
40
Q

What services does Azure offer for batch processing?

A
  • Azure Synapse - advanced analytics
  • Azure Data Lake Analytics - analytical services for Azure Data Lake
  • HDInsight - for Hadoop
  • Databricks - large scale analytics based on Spark
41
Q

What are examples of use cases for stream processing?

A
  • Fraud detection
  • Social media analysis
  • Log monitoring
  • Customer analysis
42
Q

What services does Azure offer for stream processing?

A
  • HDInsight with Spark or Storm
  • Databricks
  • Azure Stream Analytics
  • Azure Functions
  • Azure App Service WebJobs
  • IoT and Event Hubs
43
Q

What do you need to consider when choosing a stream processing solution?

A
  • Which programming languages are supported
  • Programming paradigm (declarative or imperative)
  • Pricing model
  • Available connectors.
44
Q

What are the different pricing models for stream processing?

A
  • Per Streaming Unit
  • Per Active Cluster Hour
  • Per Executed Function
45
Q

What are data sources?

A

Systems that generate data

46
Q

What are data sinks?

A
  • Storage systems designed to receive data from different sources
  • Used for processing, analysis or retrieval
47
Q

What components are available for lambda architecture on AWS?

A
  • Kinesis Firehose for data ingestion
  • S3 (Amazon Simple Storage Service) - cloud based distributed file system
  • AWS Glue for batch processing
  • Athena: serverless SQL query service (acts as the warehouse/query layer)
  • Amazon QuickSight: reporting tool
48
Q

What is primary data storage?

A

Holding data in memory or the CPU during program execution

49
Q

What is secondary data storage?

A

Storing data in hardware devices so that the data is available for future use.

50
Q

What is physical secondary storage?

A

Hard Disk Drives, USB Flash Drives, SD Cards, Solid State Drives or other physical media used as direct attached storage (a storage drive that is directly connected to a computer).

51
Q

What is cloud storage?

A
  • Storage solutions deployed over a network
  • An infrastructure of interconnected servers
52
Q

Why is cloud storage suited for data intensive applications?

A
  • Can store large amounts of data
  • Distribute data across the physical storage of individual machines
53
Q

What are the two ways stored files can be encoded?

A

Text files or Binary files. The file content of both is a sequence of bits.

54
Q

What are text files?

A
  • Human readable files
  • The bits represent characters
  • Can be stored as plain text (txt) or rich text (rtf)
  • May also be a CSV table
55
Q

What are binary files?

A
  • Computer readable files
  • A sequence of bytes, optimised for computer access
  • Better at data compression - faster access and smaller files
  • Images, audio, video files (or any file that needs translation by a computer program to read the content)
56
Q

What is data serialisation?

A

The process of translating text data into binary data

57
Q

What is data de-serialisation?

A

The process of translating binary data into text data
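A small sketch of both directions in Python, using JSON text and raw bytes purely as illustrative formats:

```python
import json

record = {"sensor": "t1", "value": 21.5}

# Serialisation: text data -> binary data (bytes) for storage or transport
text = json.dumps(record)
binary = text.encode("utf-8")
with open("record.bin", "wb") as f:
    f.write(binary)

# De-serialisation: binary data -> text data the program can use again
with open("record.bin", "rb") as f:
    restored = json.loads(f.read().decode("utf-8"))
print(restored == record)   # True
```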

58
Q

What are the five fundamental storage categories?

A
  • File systems
  • Data Lakes
  • Data Warehouses
  • Relational Databases
  • NoSQL databases
59
Q

What are local file systems?

A
  • Generally store unstructured data objects
  • Hierarchy of drives, directories, subdirectories and files
  • Generally focus on sharing and access aspects of small datasets rather than large distributed datasets
60
Q

What are cloud-based file systems?

A

A collaborative platform to store and share files remotely

61
Q

What features do cloud-based file systems offer?

A
  • Automated backups
  • Synchronisation of data
  • User-specific file or folder access
  • Versioning
  • Data security services
  • User friendly interface
62
Q

What are examples of cloud based file systems?

A
  • Sync
  • pCloud
  • iceDrive
  • Google Drive
  • Sharepoint
  • OneDrive
  • Dropbox
63
Q

What is failover management?

A

The ability to switch automatically and seamlessly to a reliable backup system

64
Q

What are examples of managed cloud file systems?

A
  • Amazon S3: distributed storage for objects, with files and metadata stored in buckets
  • Amazon Elastic File System: used for shared file storage of Elastic Cloud Compute instances (which are cloud-based AWS virtual machines)
  • Google Cloud Platform provides Cloud Storage as a managed service for text and binary objects (uses the same technical approach as HDFS)
  • Azure Blob Storage: data stored in containers and organised into virtual folders.
  • Azure File Storage: mountable file system providing shared access
  • Azure Queues: stream and batch processing
  • Azure Table Storage: column oriented NoSQL database
65
Q

Why is Azure Blob Storage easily converted to an Azure Data Lake?

A
  • Data is stored in containers
  • Organised into virtual folders
  • Internally the data is stored as flat storage
66
Q

What are Data Lakes?

A
  • Large storage repositories for raw and unstructured data
  • Data is usually integrated from multiple departments and from various data sources
  • Data are stored with metadata to support different types of analysis
  • Generally use different stages for different processing levels.
67
Q

How does Databricks’ Delta Lake process data in stages?

A
  • Bronze stage: holds raw and untreated data
  • Silver stage: Data are refined and preprocessed
  • Gold stage: Data are entirely prepared and usable for analysis (such as training machine learning models)
68
Q

Which cloud providers offer a Data Lake service built on their distributed storage system?

A
  • Azure Data Lake Store: built on Azure Blob Storage
  • AWS Data Lake Solution: built on S3
  • Google offers a Data Lake built on Cloud Storage
69
Q

What are relational databases?

A
  • Relational Database Management System (RDBMS)
  • Traditional solution for structured data
  • Schema-based model, data stored in tables
  • Enforce a schema on write
70
Q

How does an RDBMS guarantee reliable transactions?

A
  • Uses ACID properties
  • Referential integrity
  • Uses data normalisation
71
Q

What does data normalisation in a RDBMS enable?

A

Enables SQL to perform efficient queries

72
Q

What do rows contain in a RDBMS?

A

Records

73
Q

What do columns contain in a RDBMS?

A

Attributes

74
Q

What are ACID properties?

A
  • Atomicity
  • Consistency
  • Isolation
  • Durability
75
Q

What is data normalisation?

A

Organising data so that it is represented consistently across all fields and records, reducing redundancy

76
Q

What are examples of commercial RDBMS?

A
  • Microsoft SQL Server
  • Oracle Database
77
Q

What are examples of open-source RDBMS?

A
  • MariaDB
  • PostgreSQL
  • MySQL
  • SQLite
78
Q

What is Database as a Service (DBaaS)?

A
  • Managed solutions
  • Partition and distribute the data across clusters
79
Q

What are the benefits of Database as a Service (DBaaS)?

A
  • Automatically scale to the workload
  • Perform automated backups, security audits and patches.
80
Q

What are examples of Database as a Service solutions?

A
  • Cloud SQL (Google)
  • Relational Database Service (AWS)
  • Aurora (AWS)
  • Azure SQL Database
81
Q

What are data warehouses?

A
  • Store structured data
  • Similar to relational databases
  • Integrate data from various sources, aggregating it in a clear homogeneous format so that the data can be used in analysis
82
Q

What are examples of data warehouse solutions?

A
  • AWS Redshift
  • Microsoft Azure Synapse
  • Google Cloud BigQuery
  • Snowflake
83
Q

What are NoSQL databases?

A
  • Do not enforce a schema on write - infer a schema on read
  • More flexible than RDBMS
84
Q

What categories of data can be stored in a NoSQL database?

A
  • Structured data
  • Semi-structured data
  • Unstructured data
85
Q

What are the disadvantages of using NoSQL databases?

A
  • Can’t be sure that data follows a specific structure
  • Can be less efficient than RDBMS, especially when it comes to joining data
86
Q

What is the key difference between a RDBMS and NoSQL database?

A

An RDBMS enforces referential integrity and a fixed schema, whereas a NoSQL database uses a flexible schema.

87
Q

What are the different types of NoSQL Databases?

A
  • Key-value oriented
  • Document oriented
  • Column oriented
  • Graph oriented
88
Q

What are key value oriented NoSQL databases?

A
  • Similar to dictionaries, each value is mapped to a key
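A tiny in-memory sketch of the key-value model; the session data shown are hypothetical, and a real deployment would use a store such as Redis:

```python
# Key-value model: every value is addressed by a unique key, like a dictionary
store = {}

# Put: map a key to a value (e.g. a web session or shopping cart)
store["session:42"] = {"user": "alice", "cart": ["book", "pen"]}

# Get: retrieve directly by key - no joins, no fixed schema
print(store.get("session:42"))
print(store.get("session:99"))   # None for a missing key
```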
89
Q

What are examples of key value oriented database solutions?

A
  • Redis
  • Memcached
  • etcd
  • Riak KV
  • LevelDB
  • Amazon Simple DB
90
Q

What might key value oriented databases be used for?

A
  • Storing user profiles and session information in web apps
  • Storing contents of shopping carts in e-commerce
  • Storing product details for e-commerce
  • Structural information for system maintenance (eg IP forwarding tables)
  • IoT readings
91
Q

What are document oriented databases?

A
  • Use a key-value approach, but store objects in collections of key-value pairs
  • Allow nested data structures
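A sketch of the document model using plain Python dicts/JSON rather than any particular database; the customer documents are made up for illustration:

```python
import json

# A "collection" of documents; note the nested address structure
customers = [
    {"_id": 1, "name": "Alice", "address": {"city": "Leeds", "postcode": "LS1"}},
    {"_id": 2, "name": "Bob",   "address": {"city": "York"}},   # fields may differ per document
]

# A simple query over the collection (a document database would index this)
in_leeds = [c for c in customers if c["address"].get("city") == "Leeds"]
print(json.dumps(in_leeds, indent=2))
```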
92
Q

What file types can document oriented databases hold?

A
  • XML
  • JSON
  • BSON
93
Q

What are JSON and BSON files?

A
  • JSON: JavaScript Object Notation file
  • BSON: a binary JSON file
94
Q

What are examples of document oriented database solutions?

A
  • MongoDB
  • Couchbase
  • CouchDB
  • Google Cloud Firestore
  • Rethink DB
  • Amazon Document DB
95
Q

What are column oriented databases?

A
  • Also called wide column stores
  • Store records by column, rather than by row
  • Columns can be grouped by similar access patterns into column families
96
Q

What are the benefits of column oriented databases?

A
  • Optimised for analysis involving frequent aggregation
  • Allows for direct access to columns
  • Efficient data compression
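A sketch contrasting row- and column-oriented layouts in plain Python; aggregating a single column only has to touch that column's values:

```python
# Row-oriented: each record is stored together
rows = [
    {"id": 1, "country": "UK", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 25.0},
]

# Column-oriented: each column is stored together (and can be grouped into families)
columns = {
    "id":      [1, 2],
    "country": ["UK", "DE"],
    "amount":  [10.0, 25.0],
}

# A frequent aggregation reads only the relevant column
print(sum(columns["amount"]))   # direct access to one column
```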
97
Q

What might a table in a column oriented database be referred to as?

A

A Keyspace

98
Q

What are examples of column oriented database solutions?

A
  • Cassandra
  • Hbase
  • Microsoft Azure Table Storage
  • Amazon Keyspaces
99
Q

What are graph oriented databases?

A
  • Designed for heavily interconnected data
  • Nodes represent entities and edges represent the relationships between entities
  • Uses directional connections to show relationships
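A small sketch of the graph model, assuming the networkx package is installed; the social-network nodes and edge labels are illustrative:

```python
import networkx as nx

g = nx.DiGraph()                                   # directional connections
g.add_edge("Alice", "Bob", relation="follows")     # nodes = entities, edges = relationships
g.add_edge("Bob", "Carol", relation="follows")
g.add_edge("Alice", "Carol", relation="likes")

print(list(g.successors("Alice")))                 # entities Alice is connected to
print(nx.shortest_path(g, "Alice", "Carol"))       # path queries suit routing and social graphs
```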
100
Q

What might graph oriented databases be used for?

A
  • Topological maps
  • Routing systems (ie GPS)
  • Social networks
101
Q

What are examples of graph oriented database solutions?

A
  • Neo4j (most popular)
  • JanusGraph
  • TigerGraph
  • NebulaGraph
102
Q

What are multi-model databases?

A
  • Some databases provide multi model APIs
  • Many databases can be categorised with a main model, but also provide a secondary model
103
Q

How does Azure CosmosDB provide a multi-model approach?

A
  • Provides APIs for classic SQL, MongoDB, Cassandra or Gremlin (with a graph based approach)
  • Supports multiple data models such as document oriented, column oriented, key value oriented and graph oriented.
104
Q

What database solutions have a multi-model approach?

A
  • Amazon DynamoDB
  • MarkLogic
  • Aerospike
  • Google Cloud BigTable
  • Ignite
  • ArangoDB
  • OrientDB
  • Apache Drill
  • Amazon Neptune
  • Apache Druid
  • Graph DB.
105
Q

What is the purpose of data analysis?

A
  • Gaining insights into the data, extracting information and knowledge from it
106
Q

What are the different types of data analysis?

A

  • Descriptive
  • Prescriptive
  • Predictive

107
Q

What is descriptive data analysis?

A
  • The explanation of past or present events
  • Uses statistical analysis of historical data
  • Data driven models can be used for more detailed analysis such as root-cause analysis
108
Q

What is predictive data analysis?

A
  • Predicts future events, e.g. stock prices or customer churn
  • Data-driven models are constructed from historical data to learn underlying patterns. These models project past patterns into the future to predict future occurrences, e.g. stock prices based on past stock prices
109
Q

What is prescriptive data analysis?

A
  • Investigates the outcomes of different scenarios from models
  • Recommends decisions leading to the most favourable predicted future event
  • Used in climate impact research
110
Q

What are some categories of data analysis?

A
  • Machine Learning
  • Deep Learning
  • Time series analysis
111
Q

What is the difference between artificial intelligence, machine learning and deep learning?

A

Artificial intelligence is the wider field of technology, and machine learning is a part of this. Deep learning is a specific field of machine learning.

112
Q

What is machine learning?

A
  • Automatic extraction of informative patterns, without explicit instructions of how to carry this out.
  • Creates data driven models and is often carried out on tabular and structured data stored in a data lake or warehouse
113
Q

What are the components of machine learning?

A
  • Rows represent observations and are called samples or data points
  • Columns represent attributes and are called features
  • The “to be predicted” column is called the label
114
Q

What are the two types of machine learning?

A
  • Supervised learning
  • Unsupervised learning
115
Q

What is a common approach to machine learning?

A
  • Using a subset of the data to train the model
  • Using the remaining data to test the quality of the model’s predictions
  • This process is called data partitioning into a training and testing set
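A minimal sketch of data partitioning with scikit-learn; the feature matrix and labels are randomly generated placeholders:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)             # 100 samples, 4 features (placeholder data)
y = np.random.randint(0, 2, size=100)  # labels

# Hold out 20% of the samples to test the quality of the model's predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)     # (80, 4) (20, 4)
```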
116
Q

What is supervised learning?

A
  • Samples with known labels are used to train a model to predict the label for unseen samples
  • Labels are the explicit assignment of information to each record
  • The model learns the relationship between the features and the target variable using the labelled dataset
117
Q

What happens during the learning phase in machine learning?

A
  • Called the training phase
  • The model is constructed (fitted to the training data)
  • Once the model is constructed it can be used to predict target variables through inference.
118
Q

What are classification and regression tasks?

A
  • Classification: predicting discrete categories from past data, e.g. spam filtering (see the sketch below)
  • Regression: algorithms to predict numbers, e.g. tomorrow's air temperature
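A hedged sketch of one classification and one regression model with scikit-learn, trained on synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

X = np.random.rand(200, 3)

# Classification: predict a discrete class label (e.g. spam / not spam)
y_class = (X[:, 0] > 0.5).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict(X[:2]))

# Regression: predict a continuous number (e.g. tomorrow's air temperature)
y_reg = 3 * X[:, 0] + np.random.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:2]))
```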
119
Q

Name some classification algorithms

A
  • Logistic regression
  • Decision trees
  • Random forest
  • Support Vector Machines (SVM)
  • Naïve Bayes
  • K Nearest Neighbours (KNN)
  • Gradient boosting
120
Q

Name some regression algorithms

A
  • Linear regression
  • Ridge
  • Lasso
  • Elastic Net Regression
  • Decision Trees
  • Random Forest
  • Support Vector Regression (SVR)
121
Q

What is unsupervised learning?

A
  • Uses algorithms to discover underlying patterns in unlabelled data
122
Q

What are common techniques for unsupervised learning?

A
  • Clustering
  • Anomaly detection
  • Dimensionality reduction
123
Q

What is clustering in unsupervised learning?

A
  • Assigning data points to clusters that are not known before the analysis
124
Q

What algorithms are used in clustering?

A
  • K-Means
  • Hierarchical Clustering
  • DBSCAN
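A short K-Means sketch with scikit-learn on synthetic, unlabelled data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: two loose groups of points
X = np.vstack([np.random.normal(0, 1, (50, 2)),
               np.random.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])        # cluster assignments discovered without labels
print(kmeans.cluster_centers_)
```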
125
Q

What is anomaly detection in unsupervised learning?

A
  • Data points are assigned to either a “regular” or “irregular” (anomalous) cluster
126
Q

What might anomaly detection be used for?

A
  • Fraud detection
  • Intrusion detection
127
Q

What algorithms are used in anomaly detection?

A
  • Local Outlier Factor (LOF)
  • One-Class SVMs
  • Isolation Forests
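A brief Isolation Forest sketch with scikit-learn; the injected outlier is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.normal(0, 1, (200, 2))
X = np.vstack([X, [[8.0, 8.0]]])        # one obvious anomaly

model = IsolationForest(random_state=0).fit(X)
print(model.predict(X[-3:]))            # -1 marks anomalous points, 1 marks regular ones
```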
128
Q

What is dimensionality reduction in unsupervised learning?

A
  • Transforming datasets to reduce the number of columns while preserving the information as much as possible
  • Used to make large datasets with many columns easier to understand
129
Q

What algorithms are used in dimensionality reduction?

A
  • Principal Component Analysis (PCA)
  • t-SNE
  • Linear Discriminant Analysis (LDA)
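A minimal PCA sketch with scikit-learn, reducing placeholder 10-column data to 2 components:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # 100 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # same rows, fewer columns
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # how much information each component preserves
```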
130
Q

What is Deep Learning?

A
  • Neural networks that contain hidden layers - deep learning refers to networks with a number of hidden layers
  • There is no defined or fixed number of hidden layers that must be present
  • Highly parallel
131
Q

How does Deep Learning work?

A
  • Uses the backpropagation algorithm to train neural networks to find suitable weights
132
Q

What hardware can be used to efficiently train deep learning models?

A

GPUs due to their multiple processing units

133
Q

What are examples of deep learning models?

A

  • Artificial Neural Networks
  • Convolutional Neural Networks

134
Q

What are Artificial Neural Networks?

A
  • Inspired by biological neural networks
  • Made up of nodes and connections
  • Positive weights indicate strong connections between nodes
  • Negative weights are used to discourage connections
  • Basic unit is a perceptron
135
Q

How do simple linear perceptrons work?

A
  • Multiple weighted inputs are added together
  • Outputs are activated if a threshold is met
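A sketch of a simple linear perceptron's forward pass in NumPy; the weights, bias and threshold are arbitrary illustrative values:

```python
import numpy as np

def perceptron(inputs, weights, bias, threshold=0.0):
    # Multiple weighted inputs are added together...
    weighted_sum = np.dot(inputs, weights) + bias
    # ...and the output activates only if the threshold is met
    return 1 if weighted_sum >= threshold else 0

x = np.array([0.5, 0.2, 0.9])
w = np.array([0.4, -0.6, 0.3])   # positive weights encourage, negative weights discourage
print(perceptron(x, w, bias=0.1))
```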
136
Q

How can simple linear perceptrons be expanded?

A
  • Concatenating multiple layers
  • Using certain activation functions
137
Q

What are multi-layer perceptrons?

A
  • More complex
  • Used to solve nonlinearly separable problems
  • Additional layers are called hidden layers
138
Q

What are Convolutional Neural Networks?

A
  • Performant solutions for object recognition in images
  • If a CNN has multiple layers it can perform automatic extraction of informative features
139
Q

What is Reinforcement Learning?

A
  • Not strictly deep learning
  • Used for optimised decision making
  • Rewards agents for interacting with the simulation environment
  • Can use deep learning algorithms as well as other algorithms
140
Q

What is Transfer Learning?

A
  • Re-training an existing model to match a particular use case
  • The last few layers of the model are removed and retrained with the new data
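A hedged PyTorch sketch of the idea, assuming a recent torchvision is installed: load a pretrained network, freeze its early layers and replace the final layer for a new, hypothetical 5-class task:

```python
import torch.nn as nn
from torchvision import models

# Existing model trained on a large dataset (ImageNet weights)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the existing layers so only the new layer is retrained
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer to match the new use case (5 classes assumed here)
model.fc = nn.Linear(model.fc.in_features, 5)
# The new layer's parameters are trainable; fine-tune on the new data as usual
```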
141
Q

When would you use a Deep Learning algorithm?

A
  • Situations with a lot of training data or non-tabular data such as images or videos
  • This is because of the advances in image processing, natural language processing and automatic feature learning
  • If data is scarce or strictly tabular a different algorithm may be a better choice
142
Q

What is Time Series Analysis?

A
  • Analysing data indexed by time
  • Learns patterns in historical data to forecast future values
  • The time index is important for data management and should be treated as such in the data system
143
Q

What technologies can be used for Time Series Analysis?

A
  • Holt-Winters Smoothing
  • Autoregressive Moving Average (ARMA) models
  • Extensions like ARIMA, SARIMA and SARIMAX models
  • Ensemble models like TBATS
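A compact forecasting sketch with an ARIMA model, assuming statsmodels is installed; the time-indexed series is synthetic:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily series indexed by time
idx = pd.date_range("2024-01-01", periods=100, freq="D")
series = pd.Series(np.cumsum(np.random.normal(size=100)), index=idx)

model = ARIMA(series, order=(1, 1, 1)).fit()   # learn patterns in the historical data
print(model.forecast(steps=7))                 # forecast the next 7 values
```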
144
Q

How are Time Series Analysis and Machine Learning similar?

A
  • Machine Learning uses multiple features to predict labels
  • Time Series Analysis uses one time-indexed variable to predict future values
  • Automated feature extraction libraries for Time Series Analysis have been adapted for use in machine learning.
145
Q

What is MLOps?

A
  • Machine Learning Operations
  • Frameworks used to streamline machine learning applications
146
Q

What is CRISP-DM?

A
  • Cross-Industry Standard Process for Data Mining
  • Canonical framework for development of machine learning models
  • Requires collaboration between data management and machine learning teams
147
Q

What are the stages of the CRISP-DM model?

A
  • Business understanding
  • Data understanding
  • Data preparation
  • Modelling
  • Evaluation
  • Deployment
148
Q

What is Data Reporting?

A
  • The final phase of the data processing lifecycle
  • Shows insights from aggregated information
  • Uses Business Intelligence (BI)
149
Q

How is data reporting monitored?

A
  • Metrics
  • KPIs
150
Q

What output does data reporting produce?

A
  • Visualisations
  • Dashboards
  • Text reports
151
Q

What is the objective of data reporting?

A

Develop understanding and improve decisions based on information rather than intuition or traditional rules

152
Q

What is Business Intelligence?

A
  • Processes and tools for data analysis to gain actionable insights about an organisation’s operations
  • Looks at internal and external sources of data to reflect the organisation’s reality
153
Q

What can Business Intelligence be applied to?

A
  • Data-driven decisions
  • Business decisions
154
Q

What insights can BI offer?

A
  • Decision making based on evidence
  • Real time analytics
  • Details about processes
  • Discovery of new business opportunities
  • Control planning commitments
  • Monitoring using KPIs
155
Q

What are the key features of BI tools?

A
  • Designed to be easy to use
  • Address data integration and analysis
  • Offer a GUI for data integration to create ETL processes
  • Supports creation of dashboards and interactive reports
  • Focus on visual representation and interactive data exploration
156
Q

What are examples of available BI tools?

A
  • Tableau
  • Microsoft Power BI
  • Qlik Sense