Course 1: Introduction to Data Engineering Flashcards
Entities that form a modern data ecosystem
1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data
Roles and Responsibilities of Data Engineers
1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data in data repositories
Data Engineer Competencies
1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases
Roles and Responsibilities of Data Analysts
1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings
Data Analyst Competencies
1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills
Roles and Responsibilities of Data Scientist
1 analyze data for actionable insights
2 build machine learning models or deep learning models
Data Scientist Competencies
1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge
Roles and Responsibilities of Business Analysts
1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions
Roles and Responsibilities of BI Analysts
1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions
List tasks in typical data engineering lifecycle
1 collect data: by extracting, integrating, organizing data from disparate sources
2 process data: cleaning, transforming, prepping
3 store data: ensuring reliability and availability
Needs for collecting data
1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories
Needs for processing data
1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding data quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines
Needs for storing data
1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 develop interfaces and dashboards to present data
6 ensure checks and balances are in place for secure, rights-based access
Elements of data engineering ecosystem
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
structured data with examples
objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems
semi-structured data with examples
has some organizational properties but lacks a rigid schema — emails, binary executables, TCP/IP packets, zipped files
unstructured data with examples
does not have an easily identifiable structure and cannot be organized in a database of rows and columns — web pages, social media feeds, images, audio files, PDFs
standard file formats
1 delimited text - .CSV or .TSV
2 microsoft excel open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 javascript object notation - .JSON
delimited text file
1 store data as text
2 each value separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV
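A minimal sketch of reading a delimited file with Python's standard csv module; the file name data.tsv and its columns are made up for illustration:

```python
import csv

# Read a tab-separated file; the delimiter argument sets the boundary character.
# "data.tsv" and its column names are hypothetical.
with open("data.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row)  # each row is a dict keyed by the header line
```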
microsoft excel file format
1 spreadsheet
2 open file format, meaning it is accessible to most other applications
3 can use and save all functions available in excel
4 secure format, meaning it cannot save malicious code
extensible markup language file format
1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems
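A short sketch of parsing XML with Python's standard library; the catalog/book tags are hypothetical:

```python
import xml.etree.ElementTree as ET

# Parse a small XML document from a string; the <catalog>/<book> structure is invented.
doc = ET.fromstring("""
<catalog>
  <book id="1"><title>Data Engineering</title></book>
  <book id="2"><title>ETL Basics</title></book>
</catalog>
""")

for book in doc.findall("book"):
    print(book.get("id"), book.findtext("title"))
```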
portable document file format
1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device
javascript object notation file format
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any programming language
4 easy to use
5 compatible with a wide array of browsers
6 one of the best tools for sharing data
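A quick illustration of writing and reading JSON with Python's standard json module; the record fields are invented:

```python
import json

# Serialize a Python dict to JSON text and parse it back.
record = {"id": 42, "name": "sensor-7", "readings": [1.2, 3.4]}
text = json.dumps(record)        # Python -> JSON string
parsed = json.loads(text)        # JSON string -> Python
print(parsed["readings"][0])     # 1.2
```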
common sources of data
1 relational databases 2 flat files and XML databases 3 APIs and web services 4 web scraping 5 data streams and feeds
relational database examples
1 microsoft SQL server
2 oracle
3 MySQL
4 IBM db2
APIs and web services
1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in the form of web requests from users or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
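A hedged sketch of requesting data from a web API with the third-party requests package; the URL, query parameter, and response shape are assumptions, not a real service:

```python
import requests  # third-party: pip install requests

# GET a JSON resource from a REST endpoint; everything about the endpoint is hypothetical.
resp = requests.get(
    "https://api.example.com/v1/quotes",
    params={"symbol": "IBM"},
    timeout=10,
)
resp.raise_for_status()   # surface HTTP errors instead of failing silently
data = resp.json()        # parse the JSON body
print(data)
```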
popular examples of APIs
twitter and facebook for tweets and posts
stock market APIs
data lookup and validation
web scraping or screen scraping
1 download specific data based on defined parameters
2 can extract text, contact info, images, videos, product items, etc.
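A minimal scraping sketch using requests plus BeautifulSoup (both third-party packages); the URL and CSS class are hypothetical, and real scraping should respect a site's robots.txt and terms of use:

```python
import requests                    # third-party: pip install requests beautifulsoup4
from bs4 import BeautifulSoup

# Fetch a page and pull out product names; the URL and ".product-name" class are invented.
html = requests.get("https://shop.example.com/widgets", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
for item in soup.select(".product-name"):
    print(item.get_text(strip=True))
```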
popular uses of web scraping or screen scraping
1 providing price comparisons by collecting product details from retailer eCommerce websites
2 generating sales leads through public data sources
3 extracting data from posts and authors on various forums
4 collecting training and testing datasets for machine learning models
data streams and feeds
aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts
popular data stream examples
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
popular data stream technologies
1 kafka
2 apache spark
3 apache storm
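A minimal consumer sketch using the third-party kafka-python package; the topic name and broker address are assumptions:

```python
from kafka import KafkaConsumer   # third-party: pip install kafka-python

# Consume messages from a Kafka topic; "retail-transactions" and the broker are hypothetical.
consumer = KafkaConsumer(
    "retail-transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)          # raw bytes of each record
```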
RSS (really simple syndication) feeds
capturing updated data from online forums and news sites where data is refreshed on ongoing basis
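A short sketch of reading an RSS feed with the third-party feedparser package; the feed URL is hypothetical:

```python
import feedparser                  # third-party: pip install feedparser

# Pull the latest entries from an RSS feed; the URL is invented for illustration.
feed = feedparser.parse("https://news.example.com/rss")
for entry in feed.entries[:5]:
    print(entry.title, entry.link)
```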
types of languages with usage description
1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks
typical operations performed by shell scripts
1 file manipulation
2 program execution
3 system admin tasks
4 installation scripts for complex programs
5 executing routine backups
6 running batches
what is PowerShell and what is it used for?
- cross-platform automation tool and configuration framework by microsoft optimized for working with structured data
- data mining, building GUIs, creating charts, dashboards, and interactive reports
metadata
data that provides info about other data
3 main types of metadata with description
1 technical - defines data structures in repositories or platforms
2 process - processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways
metadata management
includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise
why is metadata management important?
helps in understanding both business context and data lineage, which improves data governance
data repository
general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting
database
collection of data designed for input, storage, search, and modification
DBMS (database management system)
set of programs that creates and maintains the database
relational database (RDBMS) and difference from flat files
data organized in tabular format with rows and columns, following a well-defined structure and schema; optimized for data operations and querying, unlike flat files
non-relational databases (NoSQL)
built for speed, flexibility, and scale, making it possible to store data in a schema-less fashion
data warehouse
central repository for info from disparate sources consolidated through ETL process that enables analytics and BI
big data stores
distributed computational and storage infrastructure to store, scale, and process very large data sets
popular cloud relational databases services
1 amazon relational database service (RDS)
2 google cloud SQL
3 IBM Db2 on cloud
4 oracle database cloud service
5 SQL Azure
advantages of relational databases
1 create meaningful info by joining tables (see the sketch below)
2 flexibility
3 reduced redundancy
4 ease of backup and disaster recovery
5 ACID compliance
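A small sketch of advantage 1 using Python's built-in sqlite3 module; the customers/orders tables are invented to show how a join turns isolated rows into meaningful information:

```python
import sqlite3

# In-memory SQLite database; the tables and rows are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 99.50), (11, 1, 12.00), (12, 2, 40.00);
""")

# Joining the two tables answers a question neither table can answer alone.
for name, total in con.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(name, total)
```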
ACID (Atomicity, Consistency, Isolation, Durability) compliance
data in database remains accurate, consistent, reliable despite failures
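A minimal illustration of atomicity (the A in ACID) with sqlite3; the accounts table and the simulated failure are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100.0), ("b", 0.0)])
con.commit()

# Atomicity: both updates succeed together or neither is applied.
try:
    with con:  # the connection as context manager commits on success, rolls back on error
        con.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transfer")
        con.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
except RuntimeError:
    pass

# The debit was rolled back, so the data stays consistent despite the failure.
print(con.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone())  # (100.0,)
```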
limitations of relational databases
1 does not work well with semi-structured or unstructured data
data warehouse typical architecture 3 tiers
1 bottom tier - database servers that extract data from various sources
2 middle tier - OLAP server that allows users to process and analyze info coming from multiple database servers
3 top tier - client front end with tools and apps used for querying, reporting, and analyzing
popularly used data warehouses
1 teradata enterprise data warehouse
2 oracle exadata
3 IBM Db2 warehouse on cloud
4 IBM netezza performance server
5 amazon redshift
6 BigQuery by google
7 cloudera enterprise data hub
8 snowflake cloud data warehouse
data mart
sub-section of data warehouse built specifically for a business function or community of users
types of data marts
dependent, independent, hybrid
dependent data mart
sub-section of data warehouse, offers analytical capabilities for restricted area of the data warehouse therefore providing isolated security and performance
independent data mart
created from sources other than enterprise data warehouse, like internal operating systems or external data
hybrid data mart
combine inputs from enterprise data warehouse, internal systems, and external data
data lake
data repository that can store large amounts of any type of data in its native (raw) format
benefits of data lakes
1 can store all types of data
2 can scale based on storage capacity
3 saves time of defining structures, schemas, and transformations
4 can repurpose data in different ways for many use cases
considerations for choice of data repository
1 types of data
2 schema of data
3 performance
4 whether data is at rest or streaming
5 data encryption needs
6 volume
7 storage requirements
8 frequency of access
9 organization's policies
data extraction types with description and tools
1 batch processing - data is moved in large chunks from source to target system at scheduled intervals - tools: Stitch, Blendo
2 stream processing - data is moved in real-time and transformed in transit - tools: Apache Samza, Apache Storm, Apache Kafka
types of loading in ETL process with descriptions
1 initial - populating all data in repository
2 incremental - applying ongoing updates and mods periodically
3 full refresh - erasing contents of one or more tables and reloading with fresh data
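A minimal sketch of an incremental load with sqlite3; the table names, columns, and watermark convention are assumptions:

```python
import sqlite3

# Incremental loading: copy only rows newer than the previous load's watermark.
# src_events/tgt_events and the updated_at column are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
    CREATE TABLE tgt_events (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT);
    INSERT INTO src_events VALUES (1, 'a', '2024-01-01'), (2, 'b', '2024-02-01');
""")

def incremental_load(watermark):
    rows = con.execute(
        "SELECT * FROM src_events WHERE updated_at > ?", (watermark,)
    ).fetchall()
    con.executemany("INSERT OR REPLACE INTO tgt_events VALUES (?, ?, ?)", rows)
    return max((r[2] for r in rows), default=watermark)  # new watermark

print(incremental_load("2024-01-15"))  # loads only row 2; prints 2024-02-01
```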
popular ETL tools
1 IBM InfoSphere
2 AWS Glue
3 Improvado
4 Skyvia
5 Hevo
6 Informatica PowerCenter
advantages of ELT process
1 processing large sets of unstructured and non-relational data
2 shortened cycle between extraction and delivery
3 can ingest data immediately as available
4 greater flexibility for exploratory analytics
data integration
discipline of the practices, architectural techniques, and tools that allow orgs to ingest, transform, combine, and provision data across various data types
big data
dynamic, large, and disparate volumes of data being created by people, tools, and machines
elements of big data
velocity, volume, variety, veracity, value
big data velocity
speed at which data accumulates
big data volume
scale of the data
big data variety
diversity of the data
big data veracity
quality and origin of data and conformity to facts and accuracy
big data value
ability and need to turn data into value
3 open source big data technologies
hadoop, hive, apache spark
hadoop
collection of tools that provides distributed storage and processing of big data
hive
data warehouse for data query and analysis built on top of hadoop
spark
distributed data analytics framework designed to perform complex data analytics in real-time
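A minimal PySpark sketch of distributed analysis; the input path and column name are hypothetical:

```python
from pyspark.sql import SparkSession   # third-party: pip install pyspark

# Count events per type; "events.json" and "event_type" are invented for illustration.
spark = SparkSession.builder.appName("event-counts").getOrCreate()
df = spark.read.json("events.json")
df.groupBy("event_type").count().show()
spark.stop()
```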
hadoop benefits
1 better real-time, data-driven decisions
2 improved data access and analysis
3 data offload and consolidation
4 main hadoop components
1 hadoop common - shared utilities that support the other modules
2 hadoop distributed file system (HDFS) - storage system for big data that runs on clusters of commodity hardware connected through a network
3 MapReduce - programming model for processing large datasets in parallel
4 YARN - resource manager that schedules jobs across the cluster
HDFS benefits
1 fast recovery from hardware failures
2 access to streaming data because of high throughput rates
3 accommodation of large datasets because it can scale to hundreds of nodes in single cluster
4 portability across multiple hardware platforms and compatibility with multiple operating systems
hive benefits
1 data warehousing tasks such as ETL, reporting, and data analysis
2 easy access to data via SQL
data platform layers
1 collection
2 storage and integration
3 processing
4 analysis and user interface
5 data pipeline
data collection layer
1 connect to sources
2 transfer data in streaming, batch, or both
3 maintain metadata of collection
data collection layer tools
google cloud dataflow
IBM streams
IBM streaming analytics on cloud
amazon kinesis
apache kafka
data storage layer
1 store data for processing
2 transform and merge extracted data, logically or physically
3 make data available for processing in streaming or batch modes
data storage tools
IBM Db2
microsoft SQL server
MySQL
oracle database
postgreSQL
data processing layer
1 read data from storage and apply transformations
2 support popular querying tools and programming languages
3 scale to meet the processing demands of a growing dataset
primary considerations for designing a data store
1 type of data 2 volume 3 intended use 4 storage 5 privacy, security, and governance
scalability
capability to handle growth in the amount of data, workloads, and users
normalization of the database
process of efficiently organizing data in a database
throughput and latency
rate at which info can be read from and written to the storage and the time it takes to access a specific location
Facets of security in data lifecycle management
physical infrastructure
network
application
data
3 components to creating an effective strategy for info security (known as CIA triad)
1 Confidentiality - through controlling unauthorized access
2 Integrity - through validating that your resources are trustworthy
3 Availability - ensuring users have access to resources when they need them
popular data exchange platforms
aws data exchange
crunchbase
lotame
snowflake
data exchange platforms
facilitate the exchange of data while ensuring security and governance are maintained
importing data process
combining data from different sources to provide a combined view and a single interface from which you can query and manipulate the data
data wrangling
iterative process that involves data exploration, transformation, validation, and making data available
transformation tasks with definition
1 structuring - actions that change the form and schema of your data
2 normalization/denormalization - normalizing cleans the database of unused data and reduces redundancy and inconsistency; denormalizing combines data from multiple tables into single tables for faster querying
3 cleaning - fix irregularities in data (see the cleaning sketch below)
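A small cleaning sketch with the third-party pandas package; the dataset and its irregularities are invented:

```python
import pandas as pd                 # third-party: pip install pandas

# Fix common irregularities: inconsistent casing, stray whitespace, wrong types, missing keys.
df = pd.DataFrame({"city": [" Toronto", "toronto", None], "sales": ["100", "20", "20"]})
df["city"] = df["city"].str.strip().str.title()    # normalize whitespace and casing
df["sales"] = pd.to_numeric(df["sales"])           # enforce a numeric type
df = df.dropna(subset=["city"]).drop_duplicates()  # drop missing keys and duplicates
print(df)
```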
types of performance threats to data pipelines
scalability
app failures
scheduled jobs not starting on schedule
tool incompatibilities
performance metrics for a data pipeline with definitions
1 latency - time it takes for a service to fulfill a request
2 failures - rate at which a service fails
3 resource utilization
4 traffic - number of user requests received in a given period
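A tiny sketch of measuring latency around a pipeline step with Python's standard library; do_work is a stand-in for real processing:

```python
import time

# Time a single pipeline step; do_work is a hypothetical placeholder for real work.
def do_work():
    time.sleep(0.05)   # pretend processing

start = time.perf_counter()
do_work()
latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.1f} ms")
```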
steps to troubleshoot performance issues in data pipeline
1 collect as much info as possible
2 check if working with all the right versions of software
3 check the logs and metrics to isolate whether issue is related to infrastructure, data, software, or combo
performance metrics for a database
1 system outages
2 capacity utilization
3 application slowdown
4 performance of queries
5 conflicting activities executed by multiple users submitting requests at the same time
6 batch activities
capacity planning
process of determining the optimal hardware and software resources required for performance
database monitoring tools
take frequent snapshots of the performance indicators of a database
application performance management tools
help measure and monitor the performance of applications by tracking request response time and error messages and the amount of resources being utilized by each process
query performance monitoring tools
gather stats about query throughput, execution performance, resource utilization, and patterns
pseudonymization
de-identification process where personally identifiable info is replaced with artificial identifiers so data can't be traced back to someone's identity
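A minimal pseudonymization sketch using Python's hashlib; the salt and record fields are hypothetical, and a production scheme would need proper key management and a re-identification policy:

```python
import hashlib

# Replace a direct identifier with a keyed hash so records can still be linked
# internally but not traced back to the person. The salt is a made-up secret.
SECRET_SALT = b"rotate-me-regularly"

def pseudonymize(email: str) -> str:
    return hashlib.sha256(SECRET_SALT + email.encode()).hexdigest()[:16]

record = {"email": "jane@example.com", "purchase": "laptop"}
record["email"] = pseudonymize(record["email"])
print(record)   # the identifier is now an artificial one
```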
data erasure
software-based method of permanently clearing data from a system by overwriting
DataOps
collaborative management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers