Course 1: Introduction to Data Engineering Flashcards
Entities that form a modern data ecosystem
1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data
Roles and Responsibilities of Data Engineers
1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data repositories
Data Engineer Competencies
1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases
Roles and Responsibilities of Data Analysts
1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings
Data Analyst Competencies
1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills
Roles and Responsibilities of Data Scientists
1 analyze data for actionable insights
2 build machine learning models or deep learning models
Data Scientist Competencies
1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge
Roles and Responsibilities of Business Analysts
1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions
Roles and Responsibilities of BI Analysts
1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions
List tasks in typical data engineering lifecycle
1 collect data: extracting, integrating, and organizing data from disparate sources
2 process data: cleaning, transforming, and prepping
3 store data: for reliability and availability
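Example (a minimal sketch, not a production pipeline): the three lifecycle steps above as three small functions. The sample records, the `people` table, and the in-memory SQLite store are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw inputs standing in for two disparate sources.
CSV_SOURCE = "id,name,city\n1, Ada ,london\n2,Grace,\n"
API_SOURCE = [{"id": 3, "name": "Linus", "city": "Helsinki"}]


def collect():
    """Collect: extract rows from both sources into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(API_SOURCE)
    return rows


def process(rows):
    """Process: trim whitespace, normalize types, drop incomplete rows."""
    cleaned = []
    for row in rows:
        name = str(row["name"]).strip()
        city = str(row["city"]).strip().title()
        if not name or not city:
            continue  # skip records missing a required field
        cleaned.append((int(row["id"]), name, city))
    return cleaned


def store(records, conn):
    """Store: load the prepared records into a repository (SQLite here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people "
        "(id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
store(process(collect()), conn)
stored = conn.execute("SELECT id, name, city FROM people ORDER BY id").fetchall()
print(stored)
```

The record with the empty city is dropped during processing, so only complete rows reach the store.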
Needs for collecting data
1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories
Needs for processing data
1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding, quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines
Needs for storing data
1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 provide interfaces and dashboards to present data
6 ensure measures/checks and balances in place for secure and rights-based access
Elements of data engineering ecosystem
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
structured data with examples
objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems
semi-structured data with examples
some organizational properties but lacks a rigid schema; examples: emails, binary executables, TCP/IP packets, zipped files
unstructured data with examples
does not have easily identifiable structure and cannot be organized in database of rows and columns — web pages, social media feeds, images, audio files, pdfs
standard file formats
1 delimited text - .CSV
2 Microsoft Excel Open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 JavaScript object notation - .JSON
delimited text file
1 store data as text
2 each value is separated by a delimiter: one or more characters that act as a boundary between values
3 .CSV or .TSV
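Example (a small sketch using Python's standard `csv` module; the sample records are made up): the same data read from comma-delimited and tab-delimited text. The only thing that changes is the delimiter passed to the reader.

```python
import csv
import io

# Hypothetical sample data: the same records in comma- and tab-delimited form.
# In the CSV version, a value containing the delimiter must be quoted.
csv_text = 'name,qty\nwidget,3\n"bolt, hex",10\n'
tsv_text = "name\tqty\nwidget\t3\nbolt, hex\t10\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")
print(csv_rows == tsv_rows)
```

All values come back as text; converting `qty` to a number is left to the processing step.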
microsoft excel file format
1 spreadsheet
2 open file format meaning accessible to other apps
3 can use and save all functions available in excel
4 relatively secure format: .XLSX files cannot save macros, so they cannot carry malicious macro code
extensible markup language file format
1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems
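Example (a minimal sketch with the standard-library `xml.etree.ElementTree` parser; the `orders` document and its tag names are invented for illustration): the tags describe the data they hold, which is what "self-descriptive" means in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names label the values.
xml_text = """
<orders>
  <order id="1001">
    <customer>Ada</customer>
    <total currency="USD">25.50</total>
  </order>
  <order id="1002">
    <customer>Grace</customer>
    <total currency="EUR">40.00</total>
  </order>
</orders>
""".strip()

root = ET.fromstring(xml_text)
orders = [
    {
        "id": order.get("id"),                        # attribute on <order>
        "customer": order.findtext("customer"),       # text of a child element
        "total": float(order.findtext("total")),
        "currency": order.find("total").get("currency"),
    }
    for order in root.findall("order")
]
print(orders)
```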
portable document file format
1 developed by Adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device
javascript object notation file format
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any programming language
4 easy to use
5 compatible with a wide range of browsers
6 considered one of the best tools for sharing data
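Example (a short sketch with Python's standard `json` module; the sensor record is hypothetical): a structure is serialized to JSON text, the form in which it would travel over the web, and then parsed back.

```python
import json

# Hypothetical record to transmit.
record = {"sensor": "t-01", "readings": [21.5, 22.0], "active": True, "note": None}

# Serialize to text; sort_keys makes the output order predictable.
payload = json.dumps(record, sort_keys=True)

# Any JSON-aware language could parse this text; here Python reads it back.
restored = json.loads(payload)
print(payload)
print(restored == record)
```

Python's `True` and `None` become JSON's `true` and `null` in the text, which is what keeps the format language independent.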
common sources of data
1 relational databases
2 flat files and XML databases
3 APIs and web services
4 web scraping
5 data streams and feeds
relational database examples
1 Microsoft SQL Server
2 Oracle
3 MySQL
4 IBM Db2
APIs and web services
1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in form of user web requests or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
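Example (a sketch only, built on the standard-library `http.server` and `http.client`; the `/price/<symbol>` route and the `PRICES` data are hypothetical): a tiny web service listens for a request on the local machine and returns JSON, and a client then calls it.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical data the service exposes.
PRICES = {"ACME": 12.5, "INIT": 99.0}


class PriceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route of the form /price/<symbol>; anything else gets a 404.
        symbol = self.path.rsplit("/", 1)[-1].upper()
        if self.path.startswith("/price/") and symbol in PRICES:
            status, body = 200, {"symbol": symbol, "price": PRICES[symbol]}
        else:
            status, body = 404, {"error": "unknown symbol"}
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_price(symbol):
    """Start the service on a free local port, request one price, shut down."""
    server = HTTPServer(("127.0.0.1", 0), PriceHandler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", f"/price/{symbol}")
        response = conn.getresponse()
        result = response.status, json.loads(response.read().decode("utf-8"))
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()


status, body = fetch_price("acme")
print(status, body)
```

A real API would run continuously and serve many clients; starting and stopping the server inside one function is only a convenience for a self-contained demo.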
popular examples of APIs
Twitter and Facebook APIs for tweets and posts
stock market APIs
data lookup and validation
web scraping or screen scraping
1 download specific data based on defined parameters
2 can extract text, contact info, images, videos, product items, etc.
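Example (a minimal sketch with the standard-library `html.parser`; the page markup and the `product`, `name`, and `price` class names are assumptions): the defined parameters here are "list items with class `product`", and only their name and price are extracted. A real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page markup; a real scraper would download this first.
PAGE = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">$24.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">$31.50</span></li>
  <li class="ad"><span class="name">Sponsored</span></li>
</ul>
"""


class ProductParser(HTMLParser):
    """Collect name and price from <li class="product"> items only."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._in_product = False
        self._field = None  # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li":
            self._in_product = css_class == "product"
            if self._in_product:
                self.products.append({})
        elif tag == "span" and self._in_product and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_product = False


parser = ProductParser()
parser.feed(PAGE)
products = parser.products
print(products)
```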
popular uses of web scraping or screen scraping
1 providing pricing comps by collecting product details from retailer eCommerce websites
2 generating sales leads through public data
3 extracting data from posts and authors on various forums
4 collecting training and testing datasets for machine learning models
data streams and feeds
aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts
popular data stream examples
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
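Example (a sketch of the idea in plain Python rather than any particular streaming platform; `sensor_stream` and its readings are made up): a stream is consumed one event at a time, and an aggregate is updated as each event arrives instead of after the whole dataset is loaded.

```python
from collections import deque


def sensor_stream():
    """Hypothetical stand-in for an unbounded feed of (timestamp, value)."""
    readings = [20.0, 21.0, 23.0, 22.0, 30.0]
    for second, value in enumerate(readings):
        yield second, value


def moving_average(stream, window=3):
    """Yield a running average over the last `window` readings."""
    recent = deque(maxlen=window)  # old readings fall off automatically
    for timestamp, value in stream:
        recent.append(value)
        yield timestamp, sum(recent) / len(recent)


averages = list(moving_average(sensor_stream()))
for timestamp, average in averages:
    print(timestamp, round(average, 2))
```

Only the last three readings are kept in memory, which is why this pattern still works when the feed never ends.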
popular data stream technologies
1 Apache Kafka
2 Apache Spark Streaming
3 Apache Storm
RSS (really simple syndication) feeds
capturing updated data from online forums and news sites where data is refreshed on ongoing basis
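Example (a minimal sketch using the standard-library XML parser; the feed text and the `new_items` helper are hypothetical): an RSS feed is an XML document of `item` entries, so capturing updates amounts to re-reading the feed and keeping only the links not seen before.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real client would fetch this from the site.
RSS_TEXT = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""


def new_items(rss_text, seen_links):
    """Return items whose link has not been seen, and remember them."""
    channel = ET.fromstring(rss_text).find("channel")
    fresh = []
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": item.findtext("title"), "link": link})
    return fresh


seen = set()
first_poll = new_items(RSS_TEXT, seen)   # everything is new
second_poll = new_items(RSS_TEXT, seen)  # nothing changed since last poll
print(len(first_poll), len(second_poll))
```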
types of languages with usage description
1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks
typical operations performed by shell scripts
1 file manipulation
2 program execution
3 system admin tasks
4 installation for complex programs
5 executing routine backups
6 running batches
what is PowerShell and what is it used for?
- cross-platform automation tool and configuration framework by Microsoft, optimized for working with structured data
- data mining, building GUIs, creating charts, dashboards, and interactive reports
metadata
data that provides info about other data
3 main types of metadata with description
1 technical - defines data structures in repositories or platforms
2 process - monitors the processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways
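Example (a small sketch using SQLite from Python's standard library; the `customers` table is invented): technical metadata can be read straight from a repository. Here the database is asked to describe a table's structure rather than return its rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER PRIMARY KEY, email TEXT NOT NULL, signup_date TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [
    {
        "name": name,
        "type": col_type,
        "required": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, col_type, notnull, _default, pk in conn.execute(
        "PRAGMA table_info(customers)"
    )
]
for column in columns:
    print(column)
```

The table holds no data at all, yet the structure is fully described: that description is the technical metadata.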
metadata management
includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise
why is metadata management important?
helps in understanding both the business context and the lineage of data, which in turn improves data governance
data repository
general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting
database
collection of data designed for input, storage, search, and modification
DBMS (database management system)
set of programs that creates and maintains the database
relational database (RDBMS) and difference from flat files
data organized into tables of rows and columns following a well-defined structure and schema. unlike flat files, optimized for data operations and querying
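Example (a minimal sketch with SQLite; the `customers` and `orders` tables and their rows are assumptions): two related tables are combined and summarized in one declarative query, the kind of operation a flat file would need custom code for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
    """
)

# Join the tables on the shared key and total each customer's orders.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(totals)
```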