3er Parcial Flashcards
What are the steps of big data processing?
- Collect
- Store
- Process/Analyze
- Consume
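The four steps above can be sketched end to end. This is a toy, single-machine illustration (the function names and record fields are invented stand-ins for real collection, storage, and analytics systems):

```python
# Minimal sketch of the four big data stages, using in-memory lists
# as stand-ins for real collection, storage, and analytics systems.

def collect():
    # Collect: raw events arriving from applications.
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": 25}]

def store(events):
    # Store: persist the raw events (here, just an append-only list).
    storage = []
    storage.extend(events)
    return storage

def process(storage):
    # Process/Analyze: aggregate the stored data.
    return {"total": sum(e["amount"] for e in storage)}

def consume(result):
    # Consume: a report or dashboard reads the processed result.
    return f"Total amount: {result['total']}"

report = consume(process(store(collect())))
```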
Data collection gets divided into 3 categories:
- Transactions
- Files-object
- Event
What are transactions?
Database records and structured data, typically coming from web and mobile applications.
Typically stored in database systems (NoSQL, SQL, In-memory)
What are files-object?
Media files and log files.
Typically stored in a file-object store.
Managed in Amazon S3 (Simple Storage Service) to build a data lake framework.
What is metadata?
Data that provides information about other data and helps understand it.
To manage it we use tools like the AWS Glue Data Catalog, a fully managed data catalog that uses crawlers (programs that detect new data) and search (for metadata discovery).
What are events?
Streams of data records, also called event streams. Typically stored in stream storage.
Examples: Amazon Kinesis, Apache Kafka, Amazon MSK (Managed Streaming for Apache Kafka).
What is a data lake?
A centralized repository for storing raw data (structured, unstructured and semi-structured).
The data is stored in a variety of formats, offering flexibility and scalability for different types of data.
What is a data warehouse?
A centralized repository that stores large volumes of data (structured and processed) that is organized ready for analysis.
Data is cleaned, transformed and organized into a specific structure when written into the warehouse.
What are crawlers?
Automated programs used by search engines to systematically browse and index content from websites.
What are batch analytics?
Data analysis in which large volumes of data are processed as a whole.
Reports that are monthly, weekly or daily.
Takes minutes to hours.
e.g. financial institutions
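A batch job like the reports above can be sketched in a few lines. This is a hedged, single-machine toy (real batch analytics would run on a cluster over files; the record fields here are invented for illustration):

```python
from collections import defaultdict

# A month's worth of records, processed as one batch.
transactions = [
    {"day": "2024-01-01", "amount": 100.0},
    {"day": "2024-01-01", "amount": 50.0},
    {"day": "2024-01-02", "amount": 75.0},
]

# Process the whole data set at once and emit a daily totals report.
totals = defaultdict(float)
for t in transactions:
    totals[t["day"]] += t["amount"]

report = dict(totals)
```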
What are interactive analytics?
Data analysis for real-time or near real-time queries on demand processing.
Answers within seconds.
e.g. business intelligence
What are stream analytics?
Data analysis for continuous real time data.
One-minute metrics.
Takes milliseconds to seconds.
e.g. fraud alerts
What are predictive analytics?
Data analysis that uses statistical techniques and machine learning to analyze historical data and predict future trends and events.
Milliseconds (real-time) to minutes (batch)
e.g. fraud detection, demand forecasting
Where is the biggest data center concentration located?
Loudoun County, Virginia
70% of all web traffic goes there.
10 million ft^2 in 70 buildings
Greenpeace measured that if you put together all the data centers in the world, it would be the ___ ___ greatest electricity consumer in the world.
5th
AWS is growing _ _% a year
40
What is a buffer?
A temporary storage area in a computer's memory (RAM) or disk that holds data while it is being transferred from one place to another.
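The buffer idea can be shown with a chunked copy between two streams, the way file and network transfers work under the hood (a minimal sketch using in-memory streams):

```python
import io

# Copy data between two streams through a fixed-size buffer.
def buffered_copy(src, dst, buffer_size=4096):
    while True:
        chunk = src.read(buffer_size)  # fill the buffer
        if not chunk:                  # empty read: source exhausted
            break
        dst.write(chunk)               # drain the buffer

source = io.BytesIO(b"x" * 10_000)
dest = io.BytesIO()
buffered_copy(source, dest, buffer_size=1024)
```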
Where did the term “cloud” originate?
It was popularized by Eric Schmidt on August 9, 2006 at the "Search Engine Strategies" conference,
where he described Google services as residing "in a cloud somewhere".
What is the life cycle of information?
Input->Capture->Manage & Store->Deliver->Output
What are data sets?
A collection of data.
What is big data?
Data that is so voluminous and complex that traditional data-processing software is inadequate to deal with it.
What are some big data challenges?
- capturing data
- data storage
- data analysis
- search
- sharing
- transfer
- visualization
- querying
- updating
- information privacy
- data source
1024 gigabytes=
1 terabyte
1024 terabytes=
1 petabyte
1024 petabytes=
1 exabyte
1024 exabytes=
1 zettabyte
1024 zettabytes=
1 yottabyte
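Since each unit is 1024 times the previous one, the whole ladder is just powers of 1024 and can be checked programmatically:

```python
# Each unit is 1024x the previous one: a terabyte is 1024**4 bytes,
# a yottabyte is 1024**8 bytes.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]
bytes_per_unit = {name: 1024 ** (i + 1) for i, name in enumerate(units)}

assert bytes_per_unit["terabyte"] == 1024 * bytes_per_unit["gigabyte"]
assert bytes_per_unit["exabyte"] == 1024 * bytes_per_unit["petabyte"]
```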
What is the master-slave model?
A system architecture where the coordinator (“master”) controls one or more entities (“slaves”) giving commands so the slaves can perform specific tasks.
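The coordinator/worker relationship can be sketched with a queue of commands: the master pushes tasks, the workers pull and execute them. A minimal single-machine sketch (names and the squaring task are illustrative, not a real cluster protocol):

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    # Worker ("slave"): pull commands from the master and execute them.
    while True:
        n = tasks.get()
        if n is None:       # sentinel: the master tells the worker to stop
            break
        results.put(n * n)  # perform the assigned task

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

# Master: hand out work, then shut the workers down.
for n in range(5):
    tasks.put(n)
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

squares = sorted(results.queue)
```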
What is a node?
Any physical or virtual device, component or element that is part of a network or system that can send, receive or process data.
It can be a VM, a computer, a server, etc.
What is Spark?
Apache Spark is a unified analytics engine for large-scale data processing.
Up to 100x faster than Hadoop MapReduce for some in-memory workloads.
What are the differences between Hadoop and Spark?
On Hadoop you write the data, store and upload it, process it, write the results back, and repeat. Hadoop moves data through disk & network between steps.
write->store & upload->process->repeat
On Spark you read everything, process all the data, and then save. Spark caches intermediate data in memory.
read->process->store & upload
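The disk-versus-memory difference can be sketched on one machine. This is only an analogy (real Hadoop and Spark distribute the work across nodes; the functions below are invented for illustration):

```python
import json
import os
import tempfile

# Hadoop-style: each step writes its output to disk and the next step
# reads it back (real Hadoop also moves data over the network).
def hadoop_style(records):
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    doubled = [r * 2 for r in records]
    with open(path, "w") as f:
        json.dump(doubled, f)      # write intermediate result to disk
    with open(path) as f:
        doubled = json.load(f)     # read it back for the next step
    return sum(doubled)

# Spark-style: intermediate results stay cached in memory between steps.
def spark_style(records):
    doubled = [r * 2 for r in records]  # kept in memory, no disk round-trip
    return sum(doubled)

data = list(range(5))
assert hadoop_style(data) == spark_style(data) == 20
```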
What is Hadoop?
Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
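The MapReduce model itself fits in a few lines of plain Python. Hadoop distributes the two phases across many nodes; here both run on one machine for illustration:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big compute", "big data"]
word_counts = reduce_phase(map_phase(lines))
```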
What is I/O (Input/Output)?
It refers to the communication between a computer system and the outside world, involving the exchange of data between the computer’s hardware and other devices or systems.
What roles take care of building predictive models and designing the framework for efficient data management?
Data Scientist, Data Architect
What does an organization really look at when processing data?
The presentation and insights that we get from ingestion and processing.
What is NiFi?
A powerful tool for automating and managing the flow of data between systems. It facilitates data movement, pipelines and monitoring using built-in processors and components.
CRM/ERPS are _____ data.
structured
Social Media is ___ data.
unstructured
What are some tools used for data ingestion?
Sqoop, Kafka, Flume
They only take the data from the source.
e.g. social media
What are some Storage and Analysis tools?
Spark, HBase, Hadoop, BigQuery
They get the data from the ingestion tools and filter, process and sort info.
What are some presentation/architectures tools?
PowerBI, Kibana, Amazon Quicksight
Used to get insights and visualize information.
What is velocity on 5V?
The speed at which data is generated, collected and processed; it is categorized into:
- batch
- intervals
- process
- stream
- real-time
What does variety refer to for 5V?
The diversity or different forms of data.
80% unstructured
20% structured
What is a CRM?
Customer Relationship Management
A database for relationships (table view)
Often considered one of the most important kinds of business databases.
What is an ERP?
Enterprise Resource Planning
A type of software used by organizations to manage and integrate core business processes across departments, often alongside CRM.
What is sequential processing?
Data that is processed one step at a time by a single machine.
What is parallel processing?
Data that gets processed simultaneously by multiple machines or processors at the same time.
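The two models can be contrasted with the same workload run both ways. A minimal sketch (a thread pool stands in for multiple machines; with Python's GIL this only approximates true parallelism for CPU-bound work, but the model is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

numbers = list(range(8))

# Sequential: one worker handles each item in turn.
sequential = [square(n) for n in numbers]

# Parallel: the same work is split across multiple workers at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))

assert parallel == sequential  # same result, different execution model
```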
How does management divide for IaaS?
you manage:
- applications
- data
- runtime
- middleware
- O/S
the provider:
- virtualization
- servers
- storage
- networking
How does management divide for PaaS?
you manage:
- applications
- data
the provider:
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking
How does management divide for SaaS?
the provider:
- applications
- data
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking
What is T2M?
Time to Market: the amount of time it takes for a product or service to go from concept to launch.