Third Partial Exam Flashcards

1
Q

What are the steps of big data processing?

A
  1. Collect
  2. Store
  3. Process/Analyze
  4. Consume
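
The four steps can be pictured as a tiny pipeline. This is only a sketch: the function names, the hard-coded events and the in-memory list used as storage are assumptions made for illustration, not part of any real big data tool.

```python
from collections import Counter

def collect():
    """Collect: gather raw records (hard-coded here instead of a real source)."""
    return ["login", "purchase", "login", "logout", "purchase", "login"]

def store(records, storage):
    """Store: persist the raw records (a plain list stands in for real storage)."""
    storage.extend(records)
    return storage

def process(storage):
    """Process/Analyze: count how often each event type occurs."""
    return Counter(storage)

def consume(results):
    """Consume: turn the analysis into lines a person can read."""
    return [f"{event}: {count}" for event, count in results.most_common()]

storage = []
report = consume(process(store(collect(), storage)))
print(report)
```

Each function only depends on the output of the previous one, which mirrors the order of the steps.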
2
Q

Data collection gets divided into 3 categories:

A
  1. Transactions
  2. Files-object
  3. Event
3
Q

What are transactions?

A

They are data structures and database records, typically coming from web and mobile applications.

Typically stored in database systems (NoSQL, SQL, In-memory)

3
Q

What are files-object?

A

Media files and log files.
Typically stored in a file-object store.

Managed in Amazon S3 (Simple Storage Service) to build a data lake framework.

3
Q

What is metadata?

A

Data that provides information about other data and helps understand it.

To manage it we use tools like the AWS Glue Data Catalog, a fully managed data catalog that uses crawlers (programs that detect new data) and search (metadata discovery).

4
Q

What are events?

A

Data streams made up of events. Typically stored in stream storage.

Examples: Amazon Kinesis, Apache Kafka, Amazon MSK (Managed Streaming for Apache Kafka).

4
Q

What is a data lake?

A

A centralized repository for storing raw data (structured, unstructured and semi-structured).
The data is stored in a variety of formats, offering flexibility and scalability for different types of data.

4
Q

What is a data warehouse?

A

A centralized repository that stores large volumes of structured, processed data, organized and ready for analysis.
Data is cleaned, transformed and organized into a specific structure when written into the warehouse.

4
Q

What are crawlers?

A

Automated programs used by search engines to systematically browse and index content from websites.

4
Q

What are batch analytics?

A

Data analysis for large volumes of data that gets processed as a whole.

Reports that are monthly, weekly or daily.

Takes minutes to hours.

e.g. Financial institutions

4
Q

What are interactive analytics?

A

Data analysis for real-time or near-real-time queries processed on demand.

Answers within seconds.

e.g. Business intelligence

5
Q

What are stream analytics?

A

Data analysis for continuous real-time data.

One-minute metrics.
Takes milliseconds to seconds.
e.g. Fraud alerts
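
One way to picture "one-minute metrics" is to group events into one-minute windows and count them. The sketch below assumes each event is a hypothetical `(timestamp_in_seconds, name)` tuple; a real stream processor would update the windows incrementally as events arrive instead of looping over a finished list.

```python
from collections import defaultdict

def one_minute_counts(events):
    """Group (timestamp_seconds, name) events into one-minute windows.

    Returns a dict mapping each window start (in seconds) to its event count.
    """
    counts = defaultdict(int)
    for timestamp, _name in events:
        # Round the timestamp down to the start of its minute.
        window_start = (timestamp // 60) * 60
        counts[window_start] += 1
    return dict(counts)

events = [(5, "pay"), (42, "pay"), (61, "pay"), (119, "pay"), (120, "pay")]
print(one_minute_counts(events))
```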

6
Q

What are predictive analytics?

A

Data analysis that uses statistical techniques and machine learning to analyze historical data and predict future trends and events.

Milliseconds (real-time) to minutes (batch)

e.g. Fraud detection, forecasting demand
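
As a very small illustration of predicting from historical data, the sketch below fits a straight line to past values with ordinary least squares and extends it one step ahead. The function names and the sample demand figures are made up for illustration; real predictive analytics would normally rely on a statistics or machine learning library and far more data.

```python
def fit_line(values):
    """Fit y = slope * x + intercept to values observed at x = 0, 1, 2, ...

    Needs at least two values.
    """
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast_next(values):
    """Predict the value for the next period by extending the fitted line."""
    slope, intercept = fit_line(values)
    return slope * len(values) + intercept

weekly_demand = [10, 12, 14, 16]  # hypothetical historical data
print(forecast_next(weekly_demand))
```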

7
Q

Where is the biggest data center concentration located?

A

Loudoun County, Virginia
70% of all web traffic goes there.
10 million ft^2 in 70 buildings

8
Q

Greenpeace measured that if you put together all the data centers in the world, it would be the ___ ___ greatest electricity consumer in the world.

A

5th

9
Q

AWS is growing _ _% a year

A

40

10
Q

What is a buffer?

A

A temporary storage area in a computer’s memory (RAM) or on disk that holds data while it is being transferred from one place to another.
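
A minimal sketch of the idea: items are held in memory and passed on in batches once the buffer fills up. The `WriteBuffer` class and its `sink` list (standing in for a disk or network destination) are hypothetical names chosen for this example.

```python
class WriteBuffer:
    """Hold items in memory and hand them on in batches."""

    def __init__(self, capacity, sink):
        self.capacity = capacity
        self.sink = sink      # destination list, standing in for disk or network
        self.pending = []     # the temporary storage area

    def write(self, item):
        """Add one item; flush automatically when the buffer is full."""
        self.pending.append(item)
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        """Move everything currently buffered to the destination."""
        if self.pending:
            self.sink.append(list(self.pending))
            self.pending.clear()

destination = []
buffer = WriteBuffer(capacity=3, sink=destination)
for value in range(7):
    buffer.write(value)
buffer.flush()  # push the leftover item that never filled a batch
print(destination)
```

Batching like this trades a little latency for fewer, larger transfers.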

11
Q

Where did the term “cloud” originate?

A

It was introduced by Eric Schmidt on August 9th, 2006 at the “Search Engine Strategies” Conference, where he described Google services as belonging “in a cloud somewhere”.

12
Q

What is the life cycle of information?

A

Input->Capture->Manage & Store->Deliver->Output

13
Q

What are data sets?

A

A collection of data.

14
Q

What is big data?

A

Data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them.

15
Q

What are some big data challenges?

A
  • capturing data
  • data storage
  • data analysis
  • search
  • sharing
  • transfer
  • visualization
  • querying
  • updating
  • information privacy
  • data source.
16
Q

1024 gigabytes=

A

1 terabyte

17
Q

1024 terabytes=

A

1 petabyte

18
Q

1024 petabytes=

A

1 exabyte

19
Q

1024 exabytes=

A

1 zettabyte

20
Q

1024 zettabytes=

A

1 yottabyte
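
The unit cards above all apply the same factor of 1024 per step, so the conversions can be sketched in one small function. The `convert` helper and the `UNITS` list are illustrative names; note that the factor of 1024 follows the binary convention used on these cards, while SI prefixes strictly use a factor of 1000.

```python
UNITS = ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte", "yottabyte"]

def convert(value, from_unit, to_unit):
    """Convert between units, using a factor of 1024 per step as on the cards."""
    steps = UNITS.index(to_unit) - UNITS.index(from_unit)
    return value / (1024 ** steps)

print(convert(1024, "gigabyte", "terabyte"))   # one step up
print(convert(1, "petabyte", "gigabyte"))      # two steps down
```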

21
Q

What is the master-slave model?

A

A system architecture where the coordinator (“master”) controls one or more entities (“slaves”) giving commands so the slaves can perform specific tasks.

22
Q

What is a node?

A

Any physical or virtual device, component or element that is part of a network or system that can send, receive or process data.

It can be a VM, a computer, a server, etc.

23
Q

What is Spark?

A

Apache Spark is a unified analytics engine for large-scale data processing.

Claimed to be up to 100x faster than Hadoop MapReduce for in-memory workloads.

24
Q

What are the differences between Hadoop and Spark?

A

On Hadoop you write the data, store it, upload it, process it, get it back and then write it again. Hadoop moves data through disk and network.

write->store & upload->process->repeat

On Spark you read everything, process all the data and then save it. Spark caches data in memory.

write->process->store & upload

25
Q

What is Hadoop?

A

Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
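
To make the MapReduce model concrete, here is a word count written as plain Python functions. This is a single-machine sketch of the map, shuffle and reduce phases only; it does not use Hadoop itself, and the function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    """Run map over every document, then shuffle and reduce the results."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle(pairs))

print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls would run on different nodes, and the shuffle would move data between them over the network.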

26
Q

What is I/O (Input/Output)?

A

It refers to the communication between a computer system and the outside world, involving the exchange of data between the computer’s hardware and other devices or systems.

27
Q

What roles take care of building predictive models and designing the framework for efficient data management?

A

Data Scientist, Data Architect

28
Q

What does an organization really look at when processing data?

A

The presentation and insights that we get from ingestion and processing.

29
Q

What is NiFi?

A

A powerful tool for automating and managing the flow of data between systems. It facilitates data movement, pipelines and monitoring with the usage of built-in processors and components.

30
Q

CRMs/ERPs are _____ data.

A

structured

31
Q

Social Media is ___ data.

A

unstructured

32
Q

What are some tools used for data ingestion?

A

Sqoop, Kafka, Flume
They only take the data from the source.
e.g. social media

33
Q

What are some Storage and Analysis tools?

A

Spark, HBase, Hadoop, BigQuery
They get the data from the ingestion tools and filter, process and sort info.

34
Q

What are some presentation/architectures tools?

A

Power BI, Kibana, Amazon QuickSight
Used to get insights and visualize information.

35
Q

What is velocity on 5V?

A

The speed at which data is generated, collected and processed. It is categorized into:
- batch
- intervals
- process
- stream
- real-time

36
Q

What does variety refer to for 5V?

A

The diversity or different forms of data.
80% unstructured
20% structured

37
Q

What is a CRM?

A

Customer Relationship Management
A database for relationships (table view)
The most important kind of database.

38
Q

What is an ERP?

A

Enterprise Resource Planning
A type of software used by organizations to manage and integrate core business processes across departments and CRM.

39
Q

What is sequential processing?

A

Data that is processed one step at a time by a single machine.

40
Q

What is parallel processing?

A

Data that gets processed simultaneously by multiple machines or processors.
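
The contrast between sequential and parallel processing can be sketched with Python's standard `concurrent.futures` module. Here a thread pool stands in for "multiple processors"; for CPU-heavy work in CPython a process pool is the more usual choice, so treat this purely as an illustration of the two styles. The `square` task is a made-up placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def square(number):
    """A stand-in for one unit of work."""
    return number * number

def process_sequentially(numbers):
    """Handle the items one at a time, in order."""
    return [square(n) for n in numbers]

def process_in_parallel(numbers, workers=4):
    """Hand the items to a pool of workers that run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(square, numbers))

numbers = [1, 2, 3, 4, 5]
print(process_sequentially(numbers))
print(process_in_parallel(numbers))
```

Both functions return the same results; only the way the work is scheduled differs.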

41
Q

How does management divide for IaaS?

A

you manage:
- applications
- data
- runtime
- middleware
- O/S
the provider:
- virtualization
- servers
- storage
- networking

42
Q

How does management divide for PaaS?

A

you manage:
- applications
- data
the provider:
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking

43
Q

How does management divide for SaaS?

A

the provider:
- applications
- data
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking
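
The three cards on IaaS, PaaS and SaaS describe the same stack cut at different heights, which can be captured in a small lookup. The names `LAYERS`, `CUSTOMER_LAYERS` and `split_management` are hypothetical; the layer order and the splits are taken directly from the cards above.

```python
# The stack from top to bottom, as listed on the cards.
LAYERS = [
    "applications", "data", "runtime", "middleware", "O/S",
    "virtualization", "servers", "storage", "networking",
]

# How many layers, counted from the top, the customer manages in each model.
CUSTOMER_LAYERS = {"IaaS": 5, "PaaS": 2, "SaaS": 0}

def split_management(model):
    """Return which layers you manage and which the provider manages."""
    cut = CUSTOMER_LAYERS[model]
    return {"you": LAYERS[:cut], "provider": LAYERS[cut:]}

for model in ("IaaS", "PaaS", "SaaS"):
    print(model, split_management(model))
```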

44
Q

What is T2M?

A

Time to market: the amount of time it takes for a product or service to be launched.