Data Processing Lifecycle Flashcards
What are the five stages of the data processing lifecycle?
- Data ingestion and integration
- Data Processing
- Data Storage
- Data Analysis
- Reporting
What does the first stage of the data processing lifecycle involve?
- Collecting data from a variety of sources, transforming it (if needed) to match the target storage, and then loading it into that storage
- When data is integrated from multiple sources we have to be aware of data heterogeneity
What types of data make up heterogeneous data?
- XML files
- JSON Files
- Weblogs and Cookies
- SQL Queries
- Flat files
Why is heterogeneous data increasing?
The rise of new technologies is increasing the amount and variety of data produced
What is data heterogeneity?
- Data made up of different types, sources, structures or formats.
- May be in a structured, unstructured or semi-structured format
What is structured data?
- Conforms to a well-defined schema - schema on write
- Often tabular (rows are datapoints and columns are attributes)
Where is structured data stored?
- Relational Databases
- Data Warehouses
- Legacy Data Systems
What is semi-structured data?
- The schema is not completely defined by a data model – no schema on write
- May be in the format of: HTML files, XML files or JSON files
- The size, order and contents of the elements can be different
- Often used with IoT devices
What formats might semi-structured data be in and where might they be used?
- HTML files
- XML files
- JSON files
- Often used in IoT devices
What is unstructured data?
- No formal description of schema - no schema on write
- Human readable, requires pre-processing for a computer to extract information.
What types of file formats make up unstructured data?
- Text files
- Image Files
- Video files
- Audio files
What are data ingestion and integration frameworks?
- Often carried out as a single technical solution
- Have been used in data warehouses for a long time
- Uses the ETL process
What is the ETL process?
- Extract: collects raw data, often in a structured format
- Transform: processes the data into a format matching the end destination
- Load: stores the transformed data into its new storage location
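A minimal sketch of the ETL steps in Python, assuming a hypothetical CSV source and SQLite target (the file, table, and column names are illustrative, not taken from the course material):

```python
import csv
import sqlite3

def extract(path):
    # Extract: collect raw rows from a CSV source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: clean the data to match the target schema
    return [
        {"name": r["name"].strip().title(), "age": int(r["age"])}
        for r in rows
        if r.get("age")  # drop rows with a missing age
    ]

def load(rows, db_path="warehouse.db"):
    # Load: store the transformed rows in the target storage
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (:name, :age)", rows)
    conn.commit()
    conn.close()

load(transform(extract("source.csv")))
```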
What might be carried out during the transformation process of ETL?
- Data cleaning
- Data enrichment
- Feature engineering
Where is ETL traditionally used?
Batch processing
What other approaches can be used instead of ETL?
- IoT Hubs
- Digital Twins
- Data Pipeline Orchestrators
- Bulk import tools
- Data Streaming Platforms
What are the requirements for a data integration tool?
- Different protocols to support data collection from different sources
- Support for integration processes across different hardware and operating systems
- Scalability and adaptability
- Integrated capabilities for transformation operations (including fundamental and complex transformations)
- Security mechanisms to protect the data in the pipeline
- Visualisation of data flow (not necessary, but offered by many tools)
What is data variety?
The diversity of data types
What is data veracity?
The level of trust in the collected data
What is meant by data velocity?
The speed of data generation and its movement
What are challenges with data integration and ingestion?
- The increasing variety and veracity of data
- Processing large amounts of data with high velocity
- New requirements due to increased interest in new technologies
- Cybersecurity – data needs to be secure, trusted and accountable
- Encryption to protect data during transport
How do data processing frameworks work?
- Distribute storage and processing over several nodes
- Transform data in several steps
- Can efficiently store, access and process large amounts of data
How is the transformation process generally modelled in a data processing framework?
As a Directed Acyclic Graph
How are Directed Acyclic Graphs used in data processing?
- Each stage has an input and an output
- Inputs can be used for multiple tasks so dependencies are clearly defined and there are no feedback loops
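As an illustration, a pipeline's tasks and dependencies can be written as a dictionary and executed in topological order (the task names here are invented for the example):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks whose output it depends on;
# no feedback loops are allowed, so the graph is acyclic.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "enrich": {"ingest"},
    "aggregate": {"clean", "enrich"},
    "report": {"aggregate"},
}

# A scheduler can run tasks in this dependency-respecting order.
print(list(TopologicalSorter(pipeline).static_order()))
# e.g. ['ingest', 'clean', 'enrich', 'aggregate', 'report']
```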
What are the components of a Directed Acyclic Graph used in data processing?
- Orchestrator: supervises task execution
- Scheduler: coordinates tasks
- Executor: runs the tasks
- Metadata component: monitors the state of the pipeline
What is batch processing?
- Raw data are collected at given time intervals (hourly, daily, weekly)
- Data is transformed and loaded into centralised storage
- Process can be automated to run without human intervention
What is stream processing?
- Real-time processing
- Shortens processing intervals into smaller windows (typically sub-second)
- May use an event driven approach where data are processed as soon as they are available
What is the event driven approach in stream processing?
- Often implemented as a publish-subscribe architecture
- Data-producing sources notify the stream processor of new data/events
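A minimal in-memory sketch of the publish-subscribe idea; a real deployment would use a broker such as Kafka or an IoT/Event Hub, and the class and topic names below are invented for illustration:

```python
from collections import defaultdict

class MessageBus:
    """Toy publish-subscribe broker: producers publish events to a topic,
    and subscribers are notified as soon as new data is available."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)  # the stream processor reacts immediately

bus = MessageBus()
bus.subscribe("sensor/temperature", lambda e: print("process:", e))
bus.publish("sensor/temperature", {"value": 21.5, "unit": "C"})
```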
What is the difference between batch processing and stream processing?
- Batch data are bound (we know the size and number of data entries before processing)
- Stream data are unbound (size and number of data is unknown).
What data batch size does batch processing handle?
Large data batches
What data batch size does stream processing handle?
Individual records or microbatches
What is the latency of batch and stream processing?
- Batch processing is high latency
- Stream processing is low latency
What type of tasks is batch processing used for?
Complex data processing
What types of tasks is stream processing used for?
Time sensitive data or quick insights
What is a Lambda architecture?
- A data processing design pattern that supports both batch and stream processing to handle massive amounts of data
- Made up of a batch, speed and serving layer
How does a Lambda architecture work?
- Abstracts the underlying implementation.
- Enables complex analysis with batch processing and quick insights through stream processing
What is Kappa Architecture and how does it work?
- No separate batch layer
- Batch and speed layers are integrated into a single processing layer which is capable of both batch and stream processing
What are the benefits of using Kappa architecture?
- Easier to implement than lambda architecture
- Simplified architecture
- Easier to set up and maintain
- More secure as there is a smaller attack surface
Is it more efficient to set up your own processing architecture or use a managed service?
- More resource efficient to use a managed service than setting up, maintaining and administering your own service.
- AWS, Azure and Google Cloud Platform have the biggest market share.
- Apache also offers solutions through Hadoop
What services does Azure offer for batch processing?
- Azure Synapse – advanced analytics
- Azure Data Lake Analytics – analytical services for Azure Data Lake
- HDInsight – for Hadoop
- Databricks – large scale analytics based on Spark
What are examples of use cases for stream processing?
- Fraud detection
- Social media analysis
- Log monitoring
- Customer analysis
What services does Azure offer for stream processing?
- HDInsight with Spark Streaming or Storm
- Databricks
- Azure Stream Analytics
- Azure Functions
- Azure App Service WebJobs
- IoT and Event Hubs
What do you need to consider when choosing a stream processing solution?
- Which programming languages are supported
- Programming paradigm (declarative or imperative)
- Pricing model
- Available connectors.
What are the different pricing models for stream processing?
- Per Streaming Unit
- Per Active Cluster Hour
- Per Executed Function
What are data sources?
Systems that generate data
What are data sinks?
- Storage systems designed to receive data from different sources
- Used for processing, analysis or retrieval
What components are available for lambda architecture on AWS?
- Kinesis Firehose for data ingestion
- S3 (Amazon Simple Storage Service) - cloud based distributed file system
- AWS Glue for batch processing
- Athena: serverless interactive query service for data in S3
- Amazon QuickSight: reporting tool
What is primary data storage?
Holding data in memory or the CPU during program execution
What is secondary data storage?
Storing data in hardware devices so that the data is available for future use.
What is physical secondary storage?
Hard Disk Drives, USB Flash Drives, SD Cards, Solid State Drives or other physical media used as direct attached storage (a storage drive that is directly connected to a computer).
What is cloud storage?
- Storage solutions deployed over a network
- An infrastructure of interconnected servers
Why is cloud storage suited for data intensive applications?
- Can store large amounts of data
- Distribute data across the physical storage of individual machines
What are the two ways stored files can be encoded?
Text files or Binary files. The file content of both is a sequence of bits.
What are text files?
- Human readable files
- The bits represent characters
- Can be stored as plain text (txt) or rich text (rtf)
- May also be a CSV table
What are binary files?
- Computer readable files
- A sequence of bytes, optimised for computer access
- Better at data compression – faster access and smaller files
- Images, audio, video files (or any file that needs translation by a computer program to read the content)
What is data serialisation?
The process of translating text data into binary data
What is data de-serialisation?
The process of translating binary data into text data
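For example, Python's built-in json and pickle modules illustrate both directions; this is a generic sketch, not tied to any particular storage system:

```python
import json
import pickle

record = {"sensor": "t1", "readings": [21.5, 21.7]}

# Serialisation: in-memory/text data -> a form suitable for storage or transport
as_text = json.dumps(record)       # human-readable text form
as_binary = pickle.dumps(record)   # compact binary form

# De-serialisation: stored bytes/text -> usable data again
print(json.loads(as_text))
print(pickle.loads(as_binary))
```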
What are the five fundamental storage categories?
- Files systems
- Data Lakes
- Data Warehouses
- Relational Databases
- NoSQL databases
What are local file systems?
- Generally store unstructured data objects
- Hierarchy of drives, directories, subdirectories and files
- Generally focus on sharing and access aspects of small datasets rather than large distributed datasets
What are cloud-based file systems?
A collaborative platform to store and share files remotely
What features do cloud-based file systems offer?
- Automated backups
- Synchronisation of data
- User-specific file or folder access
- Versioning
- Data security services
- User friendly interface
What are examples of cloud based file systems?
- Sync
- pCloud
- iceDrive
- Google Drive
- Sharepoint
- OneDrive
- Dropbox
What is failover management?
The ability to switch automatically and seamlessly to a reliable backup system
What are examples of managed cloud file systems?
- Amazon S3: distributed storage for objects, with files and metadata stored in buckets
- Amazon Elastic File System: used for shared file storage of Elastic Cloud Compute instances (which are cloud-based AWS virtual machines)
- Google Cloud Platform provides Cloud Storage as a managed service for text and binary objects (uses the same technical approach as HDFS)
- Azure Blob Storage: data stored in containers and organised into virtual folders.
- Azure File Storage: mountable file system providing shared access
- Azure Queues: stream and batch processing
- Azure Table Storage: column-oriented NoSQL database
Why is Azure Blob Storage easily converted to an Azure Data Lake?
- Data is stored in containers
- Organised into virtual folders
- Internally the data is stored as flat storage
What are Data Lakes?
- Large storage repositories for raw and unstructured data
- Data is usually integrated from multiple departments and from various data sources
- Data are stored with metadata to support different types of analysis
- Generally use different stages for different processing levels.
How does Databricks’ Delta Lake process data in stages?
- Bronze stage: holds raw and untreated data
- Silver stage: Data are refined and preprocessed
- Gold stage: Data are entirely prepared and usable for analysis (such as training machine learning models)
Which cloud providers offer a Data Lake service built on their distributed storage system?
- Azure Data Lake Store: built on Azure Blob Storage
- AWS Data Lake Solution: built on S3
- Google offers a Data Lake built on Cloud Storage
What are relational databases?
- Relational Database Management System (RDBMS)
- Traditional solution for structured data
- Schema based model, data stored in tables
- Enforce a schema on write
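A small sqlite3 sketch of the schema-on-write idea: the table structure is defined before any data is inserted (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema on write: tables, columns and types are fixed up front
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        email TEXT
    )
""")

# Rows are records, columns are attributes
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Ada Lovelace", "ada@example.com"))

for row in conn.execute("SELECT id, name, email FROM customers"):
    print(row)
```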
How does an RDBMS guarantee reliable transactions?
- Uses ACID properties
- Referential integrity
- Uses data normalisation
What does data normalisation in an RDBMS enable?
Enables SQL to perform efficient queries
What do rows contain in an RDBMS?
Records
What do columns contain in an RDBMS?
Attributes
What are ACID properties?
- Atomicity
- Consistency
- Isolation
- Durability
What is data normalisation?
Organising data into separate, related tables so that fields and records stay consistent and redundancy is reduced
What are examples of commercial RDBMS?
- Microsoft SQL Server
- Oracle Database
What are examples of open-source RDBMS?
- MariaDB
- PostgreSQL
- MySQL
- SQLite
What is Database as a Service (DBaaS)?
- Managed solutions
- Partition and distribute the data across clusters
What are the benefits of Database as a Service (DBaaS)?
- Automatically scale to the workload
- Perform automated backups, security audits and patches.
What are examples of Database as a Service solutions?
- Cloud SQL (Google)
- Relational Database Service (AWS)
- Aurora (AWS)
- Azure SQL Database
What are data warehouses?
- Store structured data
- Similar to relational databases
- Integrate data from various sources, aggregating it in a clear homogeneous format so that the data can be used in analysis
What are examples of data warehouse solutions?
- AWS Redshift
- Microsoft Azure Synapse
- Google Cloud BigQuery
- Snowflake
What are NoSQL databases?
- Do not enforce a schema on write - infer a schema on read
- More flexible than RDBMS
What categories of data can be stored in a NoSQL database?
- Structured data
- Semi-structured data
- Unstructured data
What are the disadvantages of using NoSQL databases?
- Can’t be sure that data follows a specific structure
- Can be less efficient than RDBMS, especially when it comes to joining data
What is the key difference between a RDBMS and NoSQL database?
An RDBMS enforces referential integrity and a fixed schema, whereas a NoSQL database uses a flexible schema.
What are the different types of NoSQL Databases?
- Key-value oriented
- Document oriented
- Column oriented
- Graph oriented
What are key value oriented NoSQL databases?
- Similar to dictionaries, each value is mapped to a key
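Conceptually this works like a Python dictionary; a minimal sketch using a plain dict (a real deployment would use a store such as Redis, whose commands differ):

```python
# Key-value store in miniature: each value is looked up by a unique key
session_store = {}

# PUT: store a user's session data under a session key
session_store["session:42"] = {"user": "alice", "cart": ["book", "pen"]}

# GET: retrieve the value by its key (no joins, no fixed schema)
print(session_store.get("session:42"))

# DELETE: remove the entry when the session expires
session_store.pop("session:42", None)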
What are examples of key value oriented database solutions?
- Redis
- Memcached
- etcd
- Riak KV
- LevelDB
- Amazon Simple DB
What might key value oriented databases be used for?
- Storing user profiles and session information in web apps
- Storing contents of shopping carts in e-commerce
- Storing product details for e-commerce
- Structural information for system maintenance (eg IP forwarding tables)
- IoT readings
What are document oriented databases?
- Use a key-value approach, but store objects in collections of key-value pairs
- Allow nested data structures
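A hedged sketch using pymongo, assuming a MongoDB server is reachable on localhost; the database, collection, and field names are invented for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # a collection of documents

# Each document is a nested set of key-value pairs (stored as BSON)
orders.insert_one({
    "order_id": 1001,
    "customer": {"name": "Ada", "country": "UK"},              # nested structure
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
})

# Query by a nested field without a fixed table schema
print(orders.find_one({"customer.name": "Ada"}))
```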
What file types can document oriented databases hold?
- XML
- JSON
- BSON
What are JSON and BSON files?
- JSON: JavaScript Object Notation file
- BSON: a binary JSON file
What are examples of document oriented database solutions?
- MongoDB
- Couchbase
- CouchDB
- Google Cloud Firestore
- RethinkDB
- Amazon DocumentDB
What are column oriented databases?
- Also called wide column stores
- Store records by column, rather than by row
- Columns can be grouped by similar access patterns into column families
What are the benefits of column oriented databases?
- Optimised for analysis involving frequent aggregation
- Allows for direct access to columns
- Efficient data compression
What might a table in a column oriented database be referred to as?
A Keyspace
What are examples of column oriented database solutions?
- Cassandra
- HBase
- Microsoft Azure Table Storage
- Amazon Keyspaces
What are graph oriented databases?
- Designed for heavily interconnected data
- Nodes represent entities and edges represent the relationships between entities
- Uses directional connections to show relationships
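The underlying idea can be sketched with a plain adjacency structure in Python; a real graph database such as Neo4j would use a query language like Cypher instead, and the node and relationship names below are made up:

```python
# Nodes are entities; directed edges carry the relationship type
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("alice", "LIKES", "post_17"),
]

def neighbours(node, relation):
    """Traverse outgoing edges of a given relationship type."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

# Who does alice follow, and who do they follow in turn?
for friend in neighbours("alice", "FOLLOWS"):
    print(friend, "->", neighbours(friend, "FOLLOWS"))
```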
What might graph oriented databases be used for?
- Topological maps
- Routing systems (ie GPS)
- Social networks
What are examples of graph oriented database solutions?
- Neo4j (most popular)
- JanusGraph
- TigerGraph
- NebulaGraph
What are multi-model databases?
- Some databases provide multi model APIs
- Many databases can be categorised with a main model, but also provide a secondary model
How does Azure CosmosDB provide a multi-model approach?
- Provides APIs for classic SQL, MongoDB, Cassandra or Gremlin (with a graph based approach)
- Supports multiple data models such as document oriented, column oriented, key value oriented and graph oriented.
What database solutions have a multi-model approach?
- Amazon DynamoDB
- MarkLogic
- Aerospike
- Google Cloud BigTable
- Ignite
- ArangoDB
- OrientDB
- Apache Drill
- Amazon Neptune
- Apache Druid
- GraphDB
What is the purpose of data analysis?
- Gaining insights into the data, extracting information and knowledge from it
What are the different types of data analysis?
- Descriptive
- Prescriptive
- Predictive
What is descriptive data analysis?
- The explanation of past or present events
- Uses statistical analysis of historical data
- Data driven models can be used for more detailed analysis such as root-cause analysis
What is predictive data analysis?
- Predicts future events, e.g. stock prices or customer churn
- Data-driven models are constructed from historical data to learn underlying patterns; these models project past patterns into the future to predict future occurrences, e.g. future stock prices based on past stock prices
What is prescriptive data analysis?
- Investigates the outcomes of different scenarios from models
- Recommends decisions leading to the most favourable predicted future event
- Used in climate impact research
What are some categories of data analysis?
- Machine Learning
- Deep Learning
- Time series analysis
What is the difference between artificial intelligence, machine learning and deep learning?
Artificial intelligence is the wider field of technology, and machine learning is a part of this. Deep learning is a specific field of machine learning.
What is machine learning?
- Automatic extraction of informative patterns, without explicit instructions of how to carry this out.
- Creates data driven models and is often carried out on tabular and structured data stored in a data lake or warehouse
What are the components of machine learning?
- Rows represent observations and are called samples or data points
- Columns represent attributes and are called features
- The "to be predicted" column is called the label
What are the two types of machine learning?
- Supervised learning
- Unsupervised learning
What is a common approach to machine learning?
- Using a subset of the data to train the model
- Using the remaining data to test the quality of the model’s predictions
- This process is called data partitioning into a training and testing set
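A minimal scikit-learn sketch of this partitioning step on a synthetic dataset (scikit-learn is assumed to be installed; the 80/20 split ratio is just an example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic tabular data: rows are samples, columns are features, y is the label
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Partition into a training set (80%) and a testing set (20%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Test the quality of the model's predictions on data it has never seen
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```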
What is supervised learning?
- Samples with known labels are used to train a model to predict the label for unseen samples
- Labels are the explicit assignment of information to each record
- The model learns the relationship between the features and the target variable using the labelled dataset
What happens during the learning phase in machine learning?
- Called the training phase
- The model is constructed from the training data
- Once the model is constructed it can be used to predict target variables through inference.
What are classification and regression tasks?
- Classification: predicting discrete categories (classes), e.g. spam filtering
- Regression: predicting numeric values, e.g. tomorrow's air temperature
Name some classification algorithms
- Logistic regression
- Decision trees
- Random forest
- Support Vector Machines (SVM)
- Naïve Bayes
- K Nearest Neighbours (KNN)
- Gradient boosting
Name some regression algorithms
- Linear regression
- Ridge
- Lasso
- Elastic Net Regression
- Decision Trees
- Random Forest
- Support Vector Regression (SVR)
What is unsupervised learning?
- Uses algorithms to discover underlying patterns in unlabelled data
What are common techniques for unsupervised learning?
- Clustering
- Anomaly detection
- Dimensionality reduction
What is clustering in unsupervised learning?
- Assigning data points to clusters that are not known before the analysis
What algorithms are used in clustering?
- K-Means
- Hierarchical Clustering
- DBSCAN
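A short scikit-learn sketch of clustering unlabelled data with K-Means; the number of clusters and the synthetic data are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled data: only feature values, no target column
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means assigns each data point to one of k clusters discovered from the data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])       # cluster assignment per data point
print(kmeans.cluster_centers_)   # coordinates of the learned cluster centres
```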
What is anomaly detection in unsupervised learning?
- Data points are assigned to either a "regular" or an "anomalous" cluster
What might anomaly detection be used for?
- Fraud detection
- Intrusion detection
What algorithms are used in anomaly detection?
- Local Outlier Factor (LOF)
- One-Class SVMs
- Isolation Forests
What is dimensionality reduction in unsupervised learning?
- Transforming datasets to reduce the number of columns while preserving the information as much as possible
- Used to make large datasets with many columns easier to understand
What algorithms are used in dimensionality reduction?
- Principal Component Analysis (PCA)
- t-SNE
- Linear Discriminant Analysis (LDA)
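A brief scikit-learn sketch reducing a many-column dataset to two components with PCA (the component count is an arbitrary example choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 4 feature columns

# Reduce to 2 columns while preserving as much information (variance) as possible
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component
```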
What is Deep Learning?
- Uses neural networks containing hidden layers – learning is "deep" when a number of hidden layers are present
- There is no defined or fixed number of hidden layers that must be present
- Highly parallel
How does Deep Learning work?
- Uses the backpropagation algorithm to train neural networks to find suitable weights
What hardware can be used to efficiently train deep learning models?
GPUs due to their multiple processing units
What are examples of deep learning models?
*Artificial Neural Networks
*Convolutional Neural Networks
What are Artificial Neural Networks?
- Inspired by biological neural networks
- Made up of nodes and connections
- Positive weights indicate strong connections between nodes
- Negative weights are used to discourage connections
- Basic unit is a perceptron
How do simple linear perceptrons work?
- Multiple weighted inputs are added together
- Outputs are activated if a threshold is met
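A minimal NumPy sketch of that forward pass; the weights, bias, and inputs are arbitrary values chosen for illustration:

```python
import numpy as np

def perceptron(inputs, weights, bias, threshold=0.0):
    # Multiple weighted inputs are added together...
    weighted_sum = np.dot(inputs, weights) + bias
    # ...and the output activates (fires) only if the threshold is met
    return 1 if weighted_sum >= threshold else 0

x = np.array([0.5, 1.0, -0.2])      # input features
w = np.array([0.8, -0.3, 0.5])      # positive and negative connection weights
print(perceptron(x, w, bias=0.1))   # -> 0 or 1
```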
How can simple linear perceptrons be expanded?
- Concatenating multiple layers
- Using certain activation functions
What are multi-layer perceptrons?
- More complex
- Used to solve problems that are not linearly separable
- Additional layers are called hidden layers
What are Convolutional Neural Networks?
- Performant solutions for object recognition in images
- If a CNN has multiple layers it can perform automatic extraction of informative features
What is Reinforcement Learning?
- Not strictly deep learning
- Used for optimised decision making
- Rewards agents for interacting with the simulation environment
- Can use deep learning algorithms as well as other algorithms
What is Transfer Learning?
- Re-training an existing model to match a particular use case
- The last few layers of the model are removed and retrained with the new data
When would you use a Deep Learning algorithm?
- Situations with a lot of training data or non-tabular data such as images or videos
- This is because of the advances in image processing, natural language processing and automatic feature learning
- If data is scarce or strictly tabular a different algorithm may be a better choice
What is Time Series Analysis?
- Analysing data indexed by time
- Learns patterns in historical data to forecast future values
- The time index is important for data management and should be treated as such in the data system
What technologies can be used for Time Series Analysis?
- Holt-Winters Smoothing
- Autoregressive Moving Average (ARMA) models
- Extensions like ARIMA, SARIMA and SARIMAX models
- Ensemble models like TBATS
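A hedged statsmodels sketch of fitting an ARIMA model and forecasting a few steps ahead; the series values and the (p, d, q) order are arbitrary example choices:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A small time-indexed series (in practice, historical sensor or sales data)
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

# Fit an ARIMA(1, 1, 1) model to learn patterns in the historical data
model = ARIMA(series, order=(1, 1, 1)).fit()

# Forecast the next three time steps
print(model.forecast(steps=3))
```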
How are Time Series Analysis and Machine Learning similar?
- Machine Learning uses multiple features to predict labels
- Time Series Analysis uses a single time-indexed variable to predict future values
- Automated feature extraction libraries for Time Series Analysis have been adapted for use in machine learning.
What is MLOps?
- Machine Learning Operations
- Frameworks used to streamline machine learning applications
What is CRISP-DM?
- Cross-Industry Standard Process for Data Mining
- Canonical framework for development of machine learning models
- Requires collaboration between data management and machine learning teams
What are the stages of the CRISP-DM model?
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
What is Data Reporting?
- The final phase of the data processing lifecycle
- Shows insights from aggregated information
- Uses Business Intelligence (BI)
How is data reporting monitored?
- Metrics
- KPIs
What output does data reporting produce?
- Visualisations
- Dashboards
- Text reports
What is the objective of data reporting?
Develop understanding and improve decisions based on information rather than intuition or traditional rules
What is Business Intelligence?
- Processes and tools for data analysis to gain actionable insights about an organisation’s operations
- Looks at internal and external sources of data to reflect the organisation’s reality
What can Business Intelligence be applied to?
- Data-driven decisions
- Business decisions
What insights can BI offer?
- Decision making based on evidence
- Real time analytics
- Details about processes
- Discovery of new business opportunities
- Control planning commitments
- Monitoring using KPIs
What are the key features of BI tools?
- Designed to be easy to use
- Address data integration and analysis
- Offer a GUI for data integration to create ETL processes
- Supports creation of dashboards and interactive reports
- Focus on visual representation and interactive data exploration
What are examples of available BI tools?
- Tableau
- Microsoft Power BI
- Qlik Sense