Course 1: Introduction to Data Engineering Flashcards

1
Q

Entities that form a modern data ecosystem

A

1 Data integrated from disparate sources
2 different types of analysis/skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data

2
Q

Roles and Responsibilities of Data Engineers

A

1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data repositories

3
Q

Data Engineer Competencies

A

1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases

4
Q

Roles and Responsibilities of Data Analysts

A

1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings

5
Q

Data Analyst Competencies

A

1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills

6
Q

Roles and Responsibilities of Data Scientist

A

1 analyze data for actionable insights

2 build machine learning models or deep learning models

7
Q

Data Scientist Competencies

A

1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge

8
Q

Roles and Responsibilities of Business Analysts

A

1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions

9
Q

Roles and Responsibilities of BI Analysts

A

1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions

10
Q

List tasks in typical data engineering lifecycle

A

1 collect data: by extracting, integrating, organizing data from disparate sources
2 process data: cleaning, transforming, prepping
3 storing data: for reliability, availability

11
Q

Needs for collecting data

A

1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories

12
Q

Needs for processing data

A

1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding, quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines

13
Q

Needs for storing data

A

1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 interfaces and dashboards to present data
6 ensure measures/checks and balances in place for secure and rights-based access

14
Q

Elements of data engineering ecosystem

A
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
15
Q

structured data with examples

A

objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems

16
Q

semi-structured data with examples

A

has some organizational properties but lacks a rigid schema — emails, binary executables, TCP/IP packets, zipped files

17
Q

unstructured data with examples

A

does not have easily identifiable structure and cannot be organized in database of rows and columns — web pages, social media feeds, images, audio files, pdfs

18
Q

standard file formats

A
1 delimited text - .CSV
2 microsoft excel open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document - .PDF
5 javascript object notation - .JSON
19
Q

delimited text file

A

1 store data as text
2 each value is separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV
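
a minimal Python sketch of the idea, using the standard csv module; the file name and column names are made up for illustration

import csv

# write a small comma-delimited .CSV file, then read it back
rows = [
    {"id": "1", "name": "Ada", "role": "data engineer"},
    {"id": "2", "name": "Grace", "role": "data analyst"},
]

with open("employees.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "role"])
    writer.writeheader()
    writer.writerows(rows)

with open("employees.csv", newline="") as f:
    # passing delimiter="\t" to DictReader would read a .TSV instead
    for record in csv.DictReader(f):
        print(record["name"], record["role"])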

20
Q

microsoft excel file format

A

1 spreadsheet
2 open file format meaning accessible to other apps
3 can use and save all functions available in excel
4 secure format meaning it cannot save malicious code

21
Q

extensible markup language file format

A

1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems
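
a small Python sketch parsing an XML snippet with the standard xml.etree.ElementTree module; the tags and values are invented for illustration

import xml.etree.ElementTree as ET

# self-descriptive markup: the tags describe the data they enclose
xml_data = """
<employees>
    <employee id="1"><name>Ada</name><role>data engineer</role></employee>
    <employee id="2"><name>Grace</name><role>data analyst</role></employee>
</employees>
"""

root = ET.fromstring(xml_data)
for emp in root.findall("employee"):
    print(emp.get("id"), emp.findtext("name"), emp.findtext("role"))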

22
Q

portable document file format

A

1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device

23
Q

javascript object notation file format

A
1 text-based open standard designed for transmitting structured data over web
2 language independent data format
3 can be read in any language
4 easy to use
5 compatible with wide array of browsers
6 one of the best tools for sharing data
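
a minimal Python sketch with the standard json module showing structured data going to and from JSON text; the payload is made up

import json

# a nested Python structure -> JSON text (the form that travels over the web)
payload = {"user": "ada", "scores": [98, 87], "active": True}
text = json.dumps(payload)
print(text)

# JSON text -> Python structure on the receiving side, independent of language
parsed = json.loads(text)
print(parsed["scores"][0])
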
24
Q

common sources of data

A
1 relational databases
2 flat files and XML databases
3 APIs and web services
4 web scraping
5 data streams and feeds
25
Q

relational database examples

A

1 microsoft SQL server
2 oracle
3 MySQL
4 IBM db2

26
Q

APIs and web services

A

1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in form of user web requests or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
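
a hedged Python sketch of calling a web API over HTTP with the standard urllib module; the URL and the response fields are hypothetical placeholders, not a real service

import json
import urllib.request

# hypothetical endpoint; a real API documents its own URL, parameters, and auth
url = "https://api.example.com/v1/quotes?symbol=IBM"

with urllib.request.urlopen(url) as response:   # the service listens for this request
    body = response.read().decode("utf-8")      # and returns data, JSON in this case

quote = json.loads(body)
print(quote.get("symbol"), quote.get("price"))  # fields assumed for illustration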

27
Q

popular examples of APIs

A

twitter and facebook for tweets and posts
stock market APIs
data lookup and validation

28
Q

web scraping or screen scraping

A

1 download specific data based on defined parameters

2 can extract text, contact info, images, videos, product items, etc.
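
a small Python sketch of scraping product names and prices from an HTML fragment using only the standard html.parser module; the markup is invented, and a real scraper would first download the page

from html.parser import HTMLParser

html = """
<ul>
  <li class="product">Laptop <span class="price">$899</span></li>
  <li class="product">Monitor <span class="price">$199</span></li>
</ul>
"""

class ProductScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.products = []   # product names found in the page
        self.prices = []     # price strings found in the page

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        (self.prices if self.in_price else self.products).append(text)

scraper = ProductScraper()
scraper.feed(html)
print(list(zip(scraper.products, scraper.prices)))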

29
Q

popular uses of web scraping or screen scraping

A

1 providing pricing comps by collecting product details from retailer eCommerce websites
2 generating sales leads through public data sources
3 extracting data from posts and authors on various forums
4 collecting training and testing datasets for machine learning models

30
Q

data streams and feeds

A

aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts

31
Q

popular data stream examples

A
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
32
Q

popular data stream technologies

A

1 kafka
2 apache spark
3 apache storm
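
a sketch of publishing to a stream, assuming the third-party kafka-python client and a Kafka broker already running on localhost:9092; the topic name and event fields are illustrative

import json
from kafka import KafkaProducer  # assumes `pip install kafka-python`

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# publish a few illustrative web-click events to a hypothetical topic
for page in ["/home", "/products", "/checkout"]:
    producer.send("web-clicks", {"page": page, "user": "u123"})

producer.flush()  # block until the buffered messages are delivered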

33
Q

RSS (really simple syndication) feeds

A

capturing updated data from online forums and news sites where data is refreshed on ongoing basis

34
Q

types of languages with usage description

A

1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks

35
Q

typical operations performed by shell scripts

A
1 file manipulation
2 program execution
3 system admin tasks
4 installation of complex programs
5 executing routine backups
6 running batches
36
Q

what is PowerShell and what is it used for?

A
1 cross-platform automation tool and configuration framework by microsoft, optimized for working with structured data
2 used for data mining, building GUIs, and creating charts, dashboards, and interactive reports
37
Q

metadata

A

data that provides info about other data

38
Q

3 main types of metadata with description

A

1 technical - defines data structures in repositories or platforms
2 process - processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways

39
Q

metadata management

A

includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise

40
Q

why is metadata management important?

A

helps understand both business context and data lineage, which helps improve data governance

41
Q

data repository

A

general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting

42
Q

database

A

collection of data designed for input, storage, search, and modification

43
Q

DBMS (database management system)

A

set of programs that creates and maintains the database

44
Q

relational database (RDBMS) and difference from flat files

A

data organized into a tabular format with rows and columns following a well-defined structure and schema; unlike flat files, it is optimized for data operations and querying
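
a minimal sketch using Python's built-in sqlite3 module to show a tabular schema and a query; the table and columns are invented

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory relational database
cur = conn.cursor()

# a well-defined schema: every row follows the same columns and types
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Ada", "engineering"), ("Grace", "analytics")],
)
conn.commit()

# querying is declarative and optimized, unlike scanning a flat file line by line
for (name,) in cur.execute("SELECT name FROM employees WHERE dept = ?", ("engineering",)):
    print(name)

conn.close()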

45
Q

non-relational databases (NoSQL)

A

built for speed, flexibility, and scale making it possible to store data in schema-less fashion

46
Q

data warehouse

A

central repository for info from disparate sources consolidated through ETL process that enables analytics and BI

47
Q

big data stores

A

distributed computational and storage infrastructure to store, scale, and process very large data sets

48
Q

popular cloud relational databases services

A
amazon relational database service RDS
google cloud SQL
IBM db2 on cloud
oracle cloud
SQL Azure
49
Q

advantages of relational databases

A
1 create meaningful info by joining tables
2 flexibility
3 reduced redundancy
4 ease of backup and disaster recovery
5 ACID compliance
50
Q

ACID (Atomicity, Consistency, Isolation, Durability) compliance

A

data in database remains accurate, consistent, reliable despite failures
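
a sketch of atomicity with Python's built-in sqlite3 module: a transfer either commits fully or rolls back, so balances stay consistent; the table and amounts are made up

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # one transaction: both updates commit together, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transfer")
        # the matching credit to bob would go here (never reached in this sketch)
except RuntimeError:
    pass  # sqlite3 rolled the failed transaction back automatically

# balances are unchanged: [(100,), (50,)]
print(conn.execute("SELECT balance FROM accounts ORDER BY name").fetchall())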

51
Q

limitations of relational databases

A

1 doesn't work well with semi-structured or unstructured data

52
Q

data warehouse typical architecture 3 tiers

A

1 bottom - database servers that extract data from various sources
2 middle - OLAP server that allows users to process and analyze info coming from multiple database servers
3 top - client front-end, tools and apps used for querying, reporting, and analyzing

53
Q

popularly used data warehouses

A
1 teradata enterprise data warehouse
2 oracle exadata
3 IBM Db2 warehouse on cloud
4 IBM netezza performance server
5 amazon redshift
6 BigQuery by Google
7 Cloudera's enterprise data hub
8 Snowflake cloud data warehouse
54
Q

data mart

A

sub-section of data warehouse built specifically for a business function or community of users

55
Q

types of data marts

A

dependent, independent, hybrid

56
Q

dependent data mart

A

sub-section of data warehouse, offers analytical capabilities for restricted area of the data warehouse therefore providing isolated security and performance

57
Q

independent data mart

A

created from sources other than enterprise data warehouse, like internal operational systems or external data

58
Q

hybrid data mart

A

combine inputs from enterprise data warehouse, internal systems, and external data

59
Q

data lake

A

data repository that can store large amounts of any type of data in their native format (raw)

60
Q

benefits of data lakes

A

1 can store all types of data
2 can scale based on storage capacity
3 saves time of defining structures, schemas, and transformations
4 can repurpose data in different ways for many use cases

61
Q

considerations for choice of data repository

A
1 types of data
2 schema of data
3 performance
4 whether data is at rest or streaming
5 data encryption needs
6 volume
7 storage requirements
8 frequency of access
9 organization's policies
62
Q

data extraction types with description and tools

A

1 batch processing - data is moved in large chunks from source to target system - Stitch, Blendo
2 stream processing - moved in real-time and transformed in transit - Apache Samza, Apache Storm, Apache Kafka
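
a toy Python sketch contrasting the two styles with plain generators; it is not tied to Stitch, Blendo, or the Apache tools named above

# toy "source": ten records that need to reach a target system
source = [{"id": i, "value": i * 10} for i in range(10)]

def batch_extract(records, chunk_size=4):
    """batch processing: move the data in large chunks on a schedule"""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]

def stream_extract(records):
    """stream processing: hand over each record as it arrives, transforming in transit"""
    for record in records:
        yield {**record, "value_scaled": record["value"] / 100}

for chunk in batch_extract(source):
    print("batch chunk of", len(chunk), "records")

for event in stream_extract(source[:2]):
    print("streamed:", event)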

63
Q

types of loading in ETL process with descriptions

A

1 initial - populating all data in repository
2 incremental - applying ongoing updates and modifications periodically
3 full refresh - erasing contents of one or more tables and reloading with fresh data
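
a sketch of the three loading patterns with Python's built-in sqlite3 module; the table and rows are invented

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT PRIMARY KEY, total INTEGER)")

# 1 initial load: populate the empty repository with all of the data
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("mon", 100), ("tue", 120)])

# 2 incremental load: periodically apply only new or changed rows
conn.execute("INSERT OR REPLACE INTO sales VALUES (?, ?)", ("wed", 90))

# 3 full refresh: erase the table contents and reload with fresh data
conn.execute("DELETE FROM sales")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("mon", 101), ("tue", 118), ("wed", 95)],
)

print(conn.execute("SELECT * FROM sales ORDER BY day").fetchall())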

64
Q

popular ETL tools

A
1 IBM Infosphere
2 AWS Glue
3 Improvado
4 Skyvia
5 HEVO
6 Informatica PowerCenter
65
Q

advantages of ELT process

A

1 processing large sets of unstructured and non-relational data
2 shortened cycle between extraction and delivery
3 can ingest data immediately as available
4 greater flexibility for exploratory analytics

66
Q

data integration

A

discipline of the practices, architectural techniques, and tools that allow orgs to ingest, transform, combine, and provision data across various data types

67
Q

big data

A

dynamic, large, and disparate volumes of data being created by people, tools, and machines

68
Q

elements of big data

A

velocity, volume, variety, veracity, value

69
Q

big data velocity

A

speed at which data accumulates

70
Q

big data volume

A

scale of the data

71
Q

big data variety

A

diversity of the data

72
Q

big data veracity

A

quality and origin of data and conformity to facts and accuracy

73
Q

big data value

A

ability and need to turn data into value

74
Q

3 open source big data technologies

A

hadoop, hive, apache spark

75
Q

hadoop

A

collection of tools that provides distributed storage and processing of big data

76
Q

hive

A

data warehouse for data query and analysis built on top of hadoop

77
Q

spark

A

distributed data analytics framework designed to perform complex data analytics in real-time
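
a hedged PySpark sketch, assuming pyspark is installed and a local Spark runtime is available; the DataFrame contents are invented

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# start a local Spark session (assumes `pip install pyspark`)
spark = SparkSession.builder.appName("sketch").master("local[*]").getOrCreate()

# a tiny distributed DataFrame standing in for a large data set
df = spark.createDataFrame(
    [("sensor-1", 20.5), ("sensor-2", 31.0), ("sensor-1", 22.1)],
    ["device", "reading"],
)

# declarative analytics that Spark distributes across the cluster's workers
df.groupBy("device").agg(F.avg("reading").alias("avg_reading")).show()

spark.stop()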

78
Q

hadoop benefits

A

1 better real-time data-driven decisions
2 improved data access and analysis
3 data offload and consolidation

79
Q

4 main hadoop components

A

1 hadoop distributed file system (HDFS) - storage system for big data that runs on commodity hardware connected through a network
2 MapReduce - programming model that processes the data in parallel across the cluster
3 YARN - resource manager and job scheduler for the cluster
4 Hadoop Common - shared utilities and libraries that support the other modules

80
Q

HDFS benefits

A

1 fast recovery from hardware failures
2 access to streaming data because of high throughput rates
3 accommodation of large datasets because it can scale to hundreds of nodes in single cluster
4 portability across multiple hardware platforms and compatibility with multiple operating systems

81
Q

hive benefits

A

1 data warehousing tasks such as ETL, reporting, and data analysis
2 easy access to data via SQL

82
Q

data platform layers

A
1 collection
2 storage and integration
3 processing
4 analysis and user interface
5 data pipeline
83
Q

data collection layer

A

1 connect to sources
2 transfer data in streaming, batch, or both
3 maintain metadata of collection

84
Q

data collection layer tools

A
google cloud DataFlow
IBM streams
IBM streaming analytics on cloud
amazon kinesis
apache kafka
85
Q

data storage layer

A

1 store data for processing
2 transform and merge extracted data, logically or physically
3 make data available for processing in streaming or batch modes

86
Q

data storage tools

A
ibm DB2
microsoft sql server
mysql
oracle database
postgreSQL
87
Q

data processing layer

A

1 read data from storage and apply transformations
2 support popular querying tools and programming languages
3 scale to meet the processing demands of a growing dataset

88
Q

primary considerations for designing a data store

A
1 type of data
2 volume
3 intended use
4 storage
5 privacy, security, and governance
89
Q

scalability

A

capability to handle growth in the amount of data, workloads, and users

90
Q

normalization of the database

A

process of efficiently organizing data in a database

91
Q

throughput or latency

A

throughput - the rate at which info can be read from and written to the storage; latency - the time it takes to access a specific location

92
Q

Facets of security in data lifecycle management

A

physical infrastructure
network
application
data

93
Q

3 components to creating an effective strategy for info security (known as CIA triad)

A

1 Confidentiality - through controlling unauthorized access
2 Integrity - through validating that your resources are trustworthy
3 Availability - ensuring users have access to resources when they need them

94
Q

Rss feeds

A

data source typically used for capturing updated data from online forums and news sites

95
Q

popular data exchange platforms

A

aws data exchange
crunchbase
lotame
snowflake

96
Q

data exchange platforms

A

facilitate the exchange of data while ensuring security and governance are maintained

97
Q

importing data process

A

combining data from different sources to provide combined view and a single interface where you can query and manipulate data

98
Q

data wrangling

A

iterative process that involves data exploration, transformation, validation, and making data available
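
a small sketch of the explore/transform/validate loop, assuming the third-party pandas library; the column names and values are made up

import pandas as pd  # assumes `pip install pandas`

# exploration: load the messy data and inspect it
raw = pd.DataFrame({
    "city": ["Paris", "paris", None, "Berlin"],
    "temp_c": [21.0, 21.5, 19.0, None],
})
print(raw.describe(include="all"))

# transformation: fix irregularities and standardize values
clean = raw.dropna(subset=["city"]).copy()
clean["city"] = clean["city"].str.title()
clean["temp_c"] = clean["temp_c"].fillna(clean["temp_c"].mean())

# validation: check the result before making it available downstream
assert clean["city"].notna().all()
print(clean)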

99
Q

transformation tasks with definition

A

1 structuring - actions that change the form and schema of your data
2 normalization/denormalization - normalization cleans the database of unused data and reduces redundancy and inconsistency; denormalization combines data from multiple tables into a single table so it can be queried faster
3 cleaning - fix irregularities in data

100
Q

types of performance threats to data pipelines

A

scalability
app failures
scheduled jobs not starting on schedule
tool incompatibilities

101
Q

performance metrics for a data pipeline with definition

A

1 latency - time it takes for a service to fulfill a request
2 failures - rate at which a service fails
3 resource utilization
4 traffic - number of user requests received in a given period

102
Q

steps to troubleshoot performance issues in data pipeline

A

1 collect as much info as possible
2 check if working with all the right versions of software
3 check the logs and metrics to isolate whether issue is related to infrastructure, data, software, or combo

103
Q

performance metrics for a database

A
1 system outages
2 capacity utilization
3 application slowdown
4 performance of queries
5 conflicting activities executed by multiple users submitting requests at the same time
6 batch activities
104
Q

capacity planning

A

process of determining the optimal hardware and software resources required for performance

105
Q

database monitoring tools def

A

take frequent snapshots of the performance indicators of a database

106
Q

application performance management tools def

A

help measure and monitor the performance of applications by tracking request response time and error messages and the amount of resources being utilized by each process

107
Q

query performance monitoring tools def

A

gather stats about query throughput, execution performance, resource utilization, and patterns

108
Q

pseudonymization

A

de-identification process where personally identifiable info is replaced with artificial identifiers so data can't be traced back to someone's identity
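
a minimal Python sketch using the standard hashlib and hmac modules to replace names with keyed artificial identifiers; the records and the secret key are made up, and a real deployment would follow its own policy for key handling

import hashlib
import hmac

SECRET_KEY = b"store-and-rotate-this-outside-the-dataset"  # made-up key

def pseudonymize(value: str) -> str:
    # replace personally identifiable info with a keyed artificial identifier
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

records = [{"name": "Jane Doe", "purchase": 42.5}, {"name": "John Roe", "purchase": 13.0}]
for record in records:
    record["name"] = pseudonymize(record["name"])

print(records)  # analysis can proceed without exposing identities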

109
Q

data erasure

A

software-based method of permanently clearing data from a system by overwriting

110
Q

DataOps

A

collaborative management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers