3er Parcial Flashcards
What are the steps of big data processing?
- Collect
- Store
- Process/Analyze
- Consume
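The four steps above can be sketched end to end. This is a toy, single-machine illustration (the function names and record fields are invented stand-ins for real collection, storage, and analytics systems):

```python
# Minimal sketch of the four big data stages, using in-memory lists
# as stand-ins for real collection, storage, and analytics systems.

def collect():
    # Collect: raw events arriving from applications.
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": 25}]

def store(events):
    # Store: persist the raw events (here, just an append-only list).
    storage = []
    storage.extend(events)
    return storage

def process(storage):
    # Process/Analyze: aggregate the stored data.
    return {"total": sum(e["amount"] for e in storage)}

def consume(result):
    # Consume: a report or dashboard reads the processed result.
    return f"Total amount: {result['total']}"

report = consume(process(store(collect())))
```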
Data collection gets divided into 3 categories:
- Transactions
- Files-object
- Event
What are transactions?
Database records and structured data, typically coming from web and mobile applications.
Typically stored in database systems (NoSQL, SQL, In-memory)
What are files-object?
Media files and log files.
Typically stored in a file-object store.
Managed in Amazon S3 (Simple Storage Service) to build a data lake framework.
What is metadata?
Data that provides information about other data and helps understand it.
To manage it we use tools like the AWS Glue Data Catalog, a fully managed data catalog that uses crawlers (programs that detect new data) and search (for metadata discovery).
What are events?
Streams of data records, also called event streams. Typically stored in stream storage.
Examples: Amazon Kinesis, Apache Kafka, Amazon MSK (Managed Streaming for Apache Kafka).
What is a data lake?
A centralized repository for storing raw data (structured, unstructured and semi-structured).
The data is stored in a variety of formats, offering flexibility and scalability for different types of data.
What is a data warehouse?
A centralized repository that stores large volumes of data (structured and processed) that is organized ready for analysis.
Data is cleaned, transformed and organized into a specific structure when written into the warehouse.
What are crawlers?
Automated programs used by search engines to systematically browse and index content from websites.
What are batch analytics?
Data analysis in which large volumes of data are processed as a whole.
Reports that are monthly, weekly or daily.
Takes minutes to hours.
e.g. financial institutions
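A batch job like the reports above can be sketched in a few lines. This is a hedged, single-machine toy (real batch analytics would run on a cluster over files; the record fields here are invented for illustration):

```python
from collections import defaultdict

# A month's worth of records, processed as one batch.
transactions = [
    {"day": "2024-01-01", "amount": 100.0},
    {"day": "2024-01-01", "amount": 50.0},
    {"day": "2024-01-02", "amount": 75.0},
]

# Process the whole data set at once and emit a daily totals report.
totals = defaultdict(float)
for t in transactions:
    totals[t["day"]] += t["amount"]

report = dict(totals)
```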
What are interactive analytics?
Data analysis for real-time or near real-time queries on demand processing.
Answers within seconds.
e.g. business intelligence
What are stream analytics?
Data analysis for continuous real time data.
One-minute metrics.
Takes milliseconds to seconds.
e.g. fraud alerts
What are predictive analytics?
Data analysis that uses statistical techniques and machine learning to analyze historical data and predict future trends and events.
Milliseconds (real-time) to minutes (batch)
e.g. fraud detection, demand forecasting
Where is the biggest data center concentration located?
Loudoun County, Virginia
70% of all web traffic goes there.
10 million ft^2 in 70 buildings
Greenpeace measured that if you put together all the data centers in the world, it would be the ___ ___ greatest electricity consumer in the world.
5th
AWS is growing _ _% a year
40
What is a buffer?
A temporary storage area in a computer's memory (RAM) or disk that holds data while it is being transferred from one place to another.
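The buffer idea can be shown with a chunked copy between two streams, the way file and network transfers work under the hood (a minimal sketch using in-memory streams):

```python
import io

# Copy data between two streams through a fixed-size buffer.
def buffered_copy(src, dst, buffer_size=4096):
    while True:
        chunk = src.read(buffer_size)  # fill the buffer
        if not chunk:                  # empty read: source exhausted
            break
        dst.write(chunk)               # drain the buffer

source = io.BytesIO(b"x" * 10_000)
dest = io.BytesIO()
buffered_copy(source, dest, buffer_size=1024)
```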
Where did the term “cloud” originate?
It was popularized by Eric Schmidt on August 9, 2006 at the "Search Engine Strategies" conference,
where he described Google services as residing "in a cloud somewhere".
What is the life cycle of information?
Input->Capture->Manage & Store->Deliver->Output
What are data sets?
A collection of data.
What is big data?
Data that is so voluminous and complex that traditional data-processing software is inadequate to deal with it.
What are some big data challenges?
- capturing data
- data storage
- data analysis
- search
- sharing
- transfer
- visualization
- querying
- updating
- information privacy
- data source
1024 gigabytes=
1 terabyte
1024 terabytes=
1 petabyte
1024 petabytes=
1 exabyte
1024 exabytes=
1 zettabyte
1024 zettabytes=
1 yottabyte
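Since each unit is 1024 times the previous one, the whole ladder is just powers of 1024 and can be checked programmatically:

```python
# Each unit is 1024x the previous one: a terabyte is 1024**4 bytes,
# a yottabyte is 1024**8 bytes.
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]
bytes_per_unit = {name: 1024 ** (i + 1) for i, name in enumerate(units)}

assert bytes_per_unit["terabyte"] == 1024 * bytes_per_unit["gigabyte"]
assert bytes_per_unit["exabyte"] == 1024 * bytes_per_unit["petabyte"]
```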
What is the master-slave model?
A system architecture where the coordinator (“master”) controls one or more entities (“slaves”) giving commands so the slaves can perform specific tasks.
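The coordinator/worker relationship can be sketched with a queue of commands: the master pushes tasks, the workers pull and execute them. A minimal single-machine sketch (names and the squaring task are illustrative, not a real cluster protocol):

```python
import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    # Worker ("slave"): pull commands from the master and execute them.
    while True:
        n = tasks.get()
        if n is None:       # sentinel: the master tells the worker to stop
            break
        results.put(n * n)  # perform the assigned task

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

# Master: hand out work, then shut the workers down.
for n in range(5):
    tasks.put(n)
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

squares = sorted(results.queue)
```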
What is a node?
Any physical or virtual device, component or element that is part of a network or system that can send, receive or process data.
It can be a VM, a computer, a server, etc.
What is Spark?
Apache Spark is a unified analytics engine for large-scale data processing.
Up to 100x faster than Hadoop MapReduce for some in-memory workloads.
What are the differences between Hadoop and Spark?
On Hadoop you write the data, store and upload it, process it, write the results back, and repeat. Hadoop moves data through disk & network between steps.
write->store & upload->process->repeat
On Spark you read everything, process all the data, and then save. Spark caches intermediate data in memory.
read->process->store & upload
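The disk-versus-memory difference can be sketched on one machine. This is only an analogy (real Hadoop and Spark distribute the work across nodes; the functions below are invented for illustration):

```python
import json
import os
import tempfile

# Hadoop-style: each step writes its output to disk and the next step
# reads it back (real Hadoop also moves data over the network).
def hadoop_style(records):
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    doubled = [r * 2 for r in records]
    with open(path, "w") as f:
        json.dump(doubled, f)      # write intermediate result to disk
    with open(path) as f:
        doubled = json.load(f)     # read it back for the next step
    return sum(doubled)

# Spark-style: intermediate results stay cached in memory between steps.
def spark_style(records):
    doubled = [r * 2 for r in records]  # kept in memory, no disk round-trip
    return sum(doubled)

data = list(range(5))
assert hadoop_style(data) == spark_style(data) == 20
```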
What is Hadoop?
Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
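The MapReduce model itself fits in a few lines of plain Python. Hadoop distributes the two phases across many nodes; here both run on one machine for illustration:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big compute", "big data"]
word_counts = reduce_phase(map_phase(lines))
```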
What is I/O (Input/Output)?
It refers to the communication between a computer system and the outside world, involving the exchange of data between the computer’s hardware and other devices or systems.
What roles take care of building predictive models and designing the framework for efficient data management?
Data Scientist, Data Architect
What does an organization really look at when processing data?
The presentation and insights that we get from ingestion and processing.
What is NiFi?
A powerful tool for automating and managing the flow of data between systems. It facilitates data movement, pipelines and monitoring using built-in processors and components.
CRM/ERPS are _____ data.
structured
Social Media is ___ data.
unstructured
What are some tools used for data ingestion?
Sqoop, Kafka, Flume
They only take the data from the source.
e.g. social media
What are some Storage and Analysis tools?
Spark, HBase, Hadoop, BigQuery
They get the data from the ingestion tools and filter, process and sort info.
What are some presentation/architectures tools?
PowerBI, Kibana, Amazon Quicksight
Used to get insights and visualize information.
What is velocity on 5V?
The speed at which data is generated, collected and processed; it is categorized into:
- batch
- intervals
- process
- stream
- real-time
What does variety refer to for 5V?
The diversity or different forms of data.
80% unstructured
20% structured
What is a CRM?
Customer Relationship Management
A database for relationships (table view)
Often considered one of the most important kinds of business databases.
What is an ERP?
Enterprise Resource Planning
A type of software used by organizations to manage and integrate core business processes across departments, often alongside CRM.
What is sequential processing?
Data that is processed one step at a time by a single machine.
What is parallel processing?
Data that gets processed simultaneously by multiple machines or processors at the same time.
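The two models can be contrasted with the same workload run both ways. A minimal sketch (a thread pool stands in for multiple machines; with Python's GIL this only approximates true parallelism for CPU-bound work, but the model is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    return n * n

numbers = list(range(8))

# Sequential: one worker handles each item in turn.
sequential = [square(n) for n in numbers]

# Parallel: the same work is split across multiple workers at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(square, numbers))

assert parallel == sequential  # same result, different execution model
```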
How does management divide for IaaS?
you manage:
- applications
- data
- runtime
- middleware
- O/S
the provider:
- virtualization
- servers
- storage
- networking
How does management divide for PaaS?
you manage:
- applications
- data
the provider:
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking
How does management divide for SaaS?
the provider:
- applications
- data
- runtime
- middleware
- O/S
- virtualization
- servers
- storage
- networking
What is T2M?
Time to Market: the amount of time it takes for a product or service to go from concept to launch.