Course 1: Introduction to Data Engineering Flashcards
Entities that form a modern data ecosystem
1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data
Roles and Responsibilities of Data Engineers
1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data in data repositories
Data Engineer Competencies
1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases
Roles and Responsibilities of Data Analysts
1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings
Data Analyst Competencies
1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills
Roles and Responsibilities of Data Scientist
1 analyze data for actionable insights
2 build machine learning models or deep learning models
Data Scientist Competencies
1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge
Roles and Responsibilities of Business Analysts
1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions
Roles and Responsibilities of BI Analysts
1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions
List tasks in typical data engineering lifecycle
1 collect data: by extracting, integrating, organizing data from disparate sources
2 process data: cleaning, transforming, prepping
3 store data: ensuring reliability and availability
Needs for collecting data
1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories
Needs for processing data
1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding data quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines
Needs for storing data
1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 develop interfaces and dashboards to present data
6 ensure checks and balances are in place for secure, rights-based access
Elements of data engineering ecosystem
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
structured data with examples
objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems
semi-structured data with examples
has some organizational properties but lacks a rigid schema — emails, binary executables, TCP/IP packets, zipped files
unstructured data with examples
does not have an easily identifiable structure and cannot be organized in a database of rows and columns — web pages, social media feeds, images, audio files, PDFs
standard file formats
1 delimited text - .CSV or .TSV
2 microsoft excel open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 javascript object notation - .JSON
delimited text file
1 store data as text
2 each value separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV
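A minimal sketch of reading a delimited file with Python's standard csv module; the file name data.tsv and its columns are made up for illustration:

```python
import csv

# Read a tab-separated file; the delimiter argument sets the boundary character.
# "data.tsv" and its column names are hypothetical.
with open("data.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row)  # each row is a dict keyed by the header line
```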
microsoft excel file format
1 spreadsheet
2 open file format, meaning it is accessible to most other applications
3 can use and save all functions available in excel
4 secure format, meaning it cannot save malicious code
extensible markup language file format
1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems
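A short sketch of parsing XML with Python's standard library; the catalog/book tags are hypothetical:

```python
import xml.etree.ElementTree as ET

# Parse a small XML document from a string; the <catalog>/<book> structure is invented.
doc = ET.fromstring("""
<catalog>
  <book id="1"><title>Data Engineering</title></book>
  <book id="2"><title>ETL Basics</title></book>
</catalog>
""")

for book in doc.findall("book"):
    print(book.get("id"), book.findtext("title"))
```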
portable document file format
1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device
javascript object notation file format
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any programming language
4 easy to use
5 compatible with a wide array of browsers
6 one of the best tools for sharing data
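A quick illustration of writing and reading JSON with Python's standard json module; the record fields are invented:

```python
import json

# Serialize a Python dict to JSON text and parse it back.
record = {"id": 42, "name": "sensor-7", "readings": [1.2, 3.4]}
text = json.dumps(record)        # Python -> JSON string
parsed = json.loads(text)        # JSON string -> Python
print(parsed["readings"][0])     # 1.2
```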
common sources of data
1 relational databases 2 flat files and XML databases 3 APIs and web services 4 web scraping 5 data streams and feeds
relational database examples
1 microsoft SQL server
2 oracle
3 MySQL
4 IBM db2
APIs and web services
1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in the form of web requests from users or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
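A hedged sketch of requesting data from a web API with the third-party requests package; the URL, query parameter, and response shape are assumptions, not a real service:

```python
import requests  # third-party: pip install requests

# GET a JSON resource from a REST endpoint; everything about the endpoint is hypothetical.
resp = requests.get(
    "https://api.example.com/v1/quotes",
    params={"symbol": "IBM"},
    timeout=10,
)
resp.raise_for_status()   # surface HTTP errors instead of failing silently
data = resp.json()        # parse the JSON body
print(data)
```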
popular examples of APIs
twitter and facebook for tweets and posts
stock market APIs
data lookup and validation
web scraping or screen scraping
1 download specific data based on defined parameters
2 can extract text, contact info, images, videos, product items, etc.
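A minimal scraping sketch using requests plus BeautifulSoup (both third-party packages); the URL and CSS class are hypothetical, and real scraping should respect a site's robots.txt and terms of use:

```python
import requests                    # third-party: pip install requests beautifulsoup4
from bs4 import BeautifulSoup

# Fetch a page and pull out product names; the URL and ".product-name" class are invented.
html = requests.get("https://shop.example.com/widgets", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
for item in soup.select(".product-name"):
    print(item.get_text(strip=True))
```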
popular uses of web scraping or screen scraping
1 providing price comparisons by collecting product details from retailer eCommerce websites
2 generating sales leads through public data sources
3 extracting data from posts and authors on various forums
4 collecting training and testing datasets for machine learning models
data streams and feeds
aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts
popular data stream examples
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
popular data stream technologies
1 kafka
2 apache spark
3 apache storm
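A minimal consumer sketch using the third-party kafka-python package; the topic name and broker address are assumptions:

```python
from kafka import KafkaConsumer   # third-party: pip install kafka-python

# Consume messages from a Kafka topic; "retail-transactions" and the broker are hypothetical.
consumer = KafkaConsumer(
    "retail-transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)          # raw bytes of each record
```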
RSS (really simple syndication) feeds
capturing updated data from online forums and news sites where data is refreshed on ongoing basis
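A short sketch of reading an RSS feed with the third-party feedparser package; the feed URL is hypothetical:

```python
import feedparser                  # third-party: pip install feedparser

# Pull the latest entries from an RSS feed; the URL is invented for illustration.
feed = feedparser.parse("https://news.example.com/rss")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```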
types of languages with usage description
1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks
typical operations performed by shell scripts
1 file manipulation
2 program execution
3 system admin tasks
4 installation scripts for complex programs
5 executing routine backups
6 running batches
what is PowerShell and what is it used for?
- cross-platform automation tool and configuration framework by microsoft optimized for working with structured data
- data mining, building GUIs, creating charts, dashboards, and interactive reports
metadata
data that provides info about other data
3 main types of metadata with description
1 technical - defines data structures in repositories or platforms
2 process - processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways
metadata management
includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise
why is metadata management important?
helps in understanding both business context and data lineage, which improves data governance
data repository
general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting
database
collection of data designed for input, storage, search, and modification
DBMS (database management system)
set of programs that creates and maintains the database
relational database (RDBMS) and difference from flat files
data organized in tabular format with rows and columns, following a well-defined structure and schema; optimized for data operations and querying, unlike flat files
non-relational databases (NoSQL)
built for speed, flexibility, and scale, making it possible to store data in a schema-less fashion
data warehouse
central repository for info from disparate sources consolidated through ETL process that enables analytics and BI
big data stores
distributed computational and storage infrastructure to store, scale, and process very large data sets
popular cloud relational databases services
1 amazon relational database service (RDS)
2 google cloud SQL
3 IBM Db2 on cloud
4 oracle database cloud service
5 SQL Azure
advantages of relational databases
1 create meaningful info by joining tables (see the sketch below)
2 flexibility
3 reduced redundancy
4 ease of backup and disaster recovery
5 ACID compliance
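A small sketch of advantage 1 using Python's built-in sqlite3 module; the customers/orders tables are invented to show how a join turns isolated rows into meaningful information:

```python
import sqlite3

# In-memory SQLite database; the tables and rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 1, 12.00), (12, 2, 40.00);
""")

# Joining the two tables answers a question neither table can answer alone.
for name, total in con.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(name, total)
```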
ACID (Atomicity, Consistency, Isolation, Durability) compliance
data in database remains accurate, consistent, reliable despite failures
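A minimal illustration of atomicity (the A in ACID) with sqlite3; the accounts table and the simulated failure are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100.0), ("b", 0.0)])
con.commit()

# Atomicity: both updates succeed together or neither is applied.
try:
    with con:  # the connection as context manager commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transfer")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except RuntimeError:
    pass

# The debit was rolled back, so the data stays consistent despite the failure.
print(con.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone())  # (100.0,)
```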
limitations of relational databases
1 does not work well with semi-structured or unstructured data
data warehouse typical architecture 3 tiers
1 bottom tier - database servers that extract data from various sources
2 middle tier - OLAP server that allows users to process and analyze info coming from multiple database servers
3 top tier - client front end with tools and apps used for querying, reporting, and analyzing
popularly used data warehouses
1 teradata enterprise data warehouse
2 oracle exadata
3 IBM Db2 warehouse on cloud
4 IBM netezza performance server
5 amazon redshift
6 BigQuery by google
7 cloudera enterprise data hub
8 snowflake cloud data warehouse
data mart
sub-section of data warehouse built specifically for a business function or community of users
types of data marts
dependent, independent, hybrid
dependent data mart
sub-section of data warehouse, offers analytical capabilities for restricted area of the data warehouse therefore providing isolated security and performance
independent data mart
created from sources other than enterprise data warehouse, like internal operating systems or external data
hybrid data mart
combine inputs from enterprise data warehouse, internal systems, and external data
data lake
data repository that can store large amounts of any type of data in its native (raw) format
benefits of data lakes
1 can store all types of data
2 can scale based on storage capacity
3 saves time of defining structures, schemas, and transformations
4 can repurpose data in different ways for many use cases
considerations for choice of data repository
1 types of data
2 schema of data
3 performance
4 whether data is at rest or streaming
5 data encryption needs
6 volume
7 storage requirements
8 frequency of access
9 organization's policies
data extraction types with description and tools
1 batch processing - data is moved in large chunks from source to target system at scheduled intervals - tools: Stitch, Blendo
2 stream processing - data is moved in real-time and transformed in transit - tools: Apache Samza, Apache Storm, Apache Kafka
types of loading in ETL process with descriptions
1 initial - populating all data in repository
2 incremental - applying ongoing updates and mods periodically
3 full refresh - erasing contents of one or more tables and reloading with fresh data
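A minimal sketch of an incremental load with sqlite3; the table names, columns, and watermark convention are assumptions:

```python
import sqlite3

# Incremental loading: copy only rows newer than the previous load's watermark.
# src_events/tgt_events and the updated_at column are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
    CREATE TABLE tgt_events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
    INSERT INTO src_events VALUES (1, 'a', '2024-01-01'), (2, 'b', '2024-02-01');
""")

def incremental_load(watermark):
    rows = con.execute(
        "SELECT * FROM src_events WHERE updated_at > ?", (watermark,)
    ).fetchall()
    con.executemany("INSERT OR REPLACE INTO tgt_events VALUES (?, ?, ?)", rows)
    return max((r[2] for r in rows), default=watermark)  # new watermark

print(incremental_load("2024-01-15"))  # loads only row 2; prints 2024-02-01
```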
popular ETL tools
1 IBM InfoSphere
2 AWS Glue
3 Improvado
4 Skyvia
5 Hevo
6 Informatica PowerCenter
advantages of ELT process
1 processing large sets of unstructured and non-relational data
2 shortened cycle between extraction and delivery
3 can ingest data immediately as available
4 greater flexibility for exploratory analytics
data integration
discipline of the practices, architectural techniques, and tools that allow orgs to ingest, transform, combine, and provision data across various data types
big data
dynamic, large, and disparate volumes of data being created by people, tools, and machines
elements of big data
velocity, volume, variety, veracity, value
big data velocity
speed at which data accumulates
big data volume
scale of the data
big data variety
diversity of the data
big data veracity
quality and origin of data and conformity to facts and accuracy
big data value
ability and need to turn data into value
3 open source big data technologies
hadoop, hive, apache spark
hadoop
collection of tools that provides distributed storage and processing of big data
hive
data warehouse for data query and analysis built on top of hadoop
spark
distributed data analytics framework designed to perform complex data analytics in real-time
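A minimal PySpark sketch of distributed analysis; the input path and column name are hypothetical:

```python
from pyspark.sql import SparkSession   # third-party: pip install pyspark

# Count events per type; "events.json" and "event_type" are invented for illustration.
spark = SparkSession.builder.appName("event-counts").getOrCreate()
df = spark.read.json("events.json")
df.groupBy("event_type").count().show()
spark.stop()
```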
hadoop benefits
1 better real-time, data-driven decisions
2 improved data access and analysis
3 data offload and consolidation
4 main hadoop components
1 hadoop common - shared utilities that support the other modules
2 hadoop distributed file system (HDFS) - storage system for big data that runs on clusters of commodity hardware connected through a network
3 MapReduce - programming model for processing large datasets in parallel
4 YARN - resource manager that schedules jobs across the cluster
HDFS benefits
1 fast recovery from hardware failures
2 access to streaming data because of high throughput rates
3 accommodation of large datasets because it can scale to hundreds of nodes in single cluster
4 portability across multiple hardware platforms and compatibility with multiple operating systems
hive benefits
1 data warehousing tasks such as ETL, reporting, and data analysis
2 easy access to data via SQL
data platform layers
1 collection
2 storage and integration
3 processing
4 analysis and user interface
5 data pipeline
data collection layer
1 connect to sources
2 transfer data in streaming, batch, or both
3 maintain metadata of collection
data collection layer tools
google cloud dataflow
IBM streams
IBM streaming analytics on cloud
amazon kinesis
apache kafka
data storage layer
1 store data for processing
2 transform and merge extracted data, logically or physically
3 make data available for processing in streaming or batch modes
data storage tools
IBM Db2
microsoft SQL server
MySQL
oracle database
postgreSQL
data processing layer
1 read data from storage and apply transformations
2 support popular querying tools and programming languages
3 scale to meet the processing demands of a growing dataset
primary considerations for designing a data store
1 type of data 2 volume 3 intended use 4 storage 5 privacy, security, and governance
scalability
capability to handle growth in the amount of data, workloads, and users
normalization of the database
process of efficiently organizing data in a database
throughput and latency
rate at which info can be read from and written to the storage and the time it takes to access a specific location
Facets of security in data lifecycle management
physical infrastructure
network
application
data
3 components to creating an effective strategy for info security (known as CIA triad)
1 Confidentiality - through controlling unauthorized access
2 Integrity - through validating that your resources are trustworthy
3 Availability - ensuring users have access to resources when they need them
popular data exchange platforms
aws data exchange
crunchbase
lotame
snowflake
data exchange platforms
facilitate the exchange of data while ensuring security and governance are maintained
importing data process
combining data from different sources to provide a combined view and a single interface from which you can query and manipulate the data
data wrangling
iterative process that involves data exploration, transformation, validation, and making data available
transformation tasks with definition
1 structuring - actions that change the form and schema of your data
2 normalization/denormalization - normalizing cleans the database of unused data and reduces redundancy and inconsistency; denormalizing combines data from multiple tables into single tables for faster querying
3 cleaning - fix irregularities in data (see the cleaning sketch below)
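A small cleaning sketch with the third-party pandas package; the dataset and its irregularities are invented:

```python
import pandas as pd                 # third-party: pip install pandas

# Fix common irregularities: inconsistent casing, stray whitespace, wrong types, missing keys.
df = pd.DataFrame({"city": [" Toronto", "toronto", None], "sales": ["100", "20", "20"]})
df["city"] = df["city"].str.strip().str.title()    # normalize whitespace and casing
df["sales"] = pd.to_numeric(df["sales"])           # enforce a numeric type
df = df.dropna(subset=["city"]).drop_duplicates()  # drop missing keys and duplicates
print(df)
```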
types of performance threats to data pipelines
scalability
app failures
scheduled jobs not starting on schedule
tool incompatibilities
performance metrics for a data pipeline with definitions
1 latency - time it takes for a service to fulfill a request
2 failures - rate at which a service fails
3 resource utilization
4 traffic - number of user requests received in a given period
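A tiny sketch of measuring latency around a pipeline step with Python's standard library; do_work is a stand-in for real processing:

```python
import time

# Time a single pipeline step; do_work is a hypothetical placeholder for real work.
def do_work():
    time.sleep(0.05)   # pretend processing

start = time.perf_counter()
do_work()
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")
```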
steps to troubleshoot performance issues in data pipeline
1 collect as much info as possible
2 check if working with all the right versions of software
3 check the logs and metrics to isolate whether issue is related to infrastructure, data, software, or combo
performance metrics for a database
1 system outages
2 capacity utilization
3 application slowdown
4 performance of queries
5 conflicting activities executed by multiple users submitting requests at the same time
6 batch activities
capacity planning
process of determining the optimal hardware and software resources required for performance
database monitoring tools
take frequent snapshots of the performance indicators of a database
application performance management tools
help measure and monitor the performance of applications by tracking request response time and error messages and the amount of resources being utilized by each process
query performance monitoring tools
gather stats about query throughput, execution performance, resource utilization, and patterns
pseudonymization
de-identification process where personally identifiable info is replaced with artificial identifiers so data can't be traced back to someone's identity
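A minimal pseudonymization sketch using Python's hashlib; the salt and record fields are hypothetical, and a production scheme would need proper key management and a re-identification policy:

```python
import hashlib

# Replace a direct identifier with a keyed hash so records can still be linked
# internally but not traced back to the person. The salt is a made-up secret.
SECRET_SALT = b"rotate-me-regularly"

def pseudonymize(email: str) -> str:
    return hashlib.sha256(SECRET_SALT + email.encode()).hexdigest()[:16]

record = {"email": "jane@example.com", "purchase": "laptop"}
record["email"] = pseudonymize(record["email"])
print(record)   # the identifier is now an artificial one
```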
data erasure
software-based method of permanently clearing data from a system by overwriting
DataOps
collaborative management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers