Third Partial Exam Flashcards
What are the steps of big data processing?
- Collect
- Store
- Process/Analyze
- Consume
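The four steps above can be pictured as a tiny pipeline. This is only a minimal sketch: the sample events and the function names `collect`, `store`, `process`, and `consume` are assumptions made for illustration, and a plain Python list stands in for real storage.

```python
from collections import Counter


def collect():
    # Hypothetical raw events, standing in for data from apps, files, or streams.
    return [
        {"user": "ana", "action": "login"},
        {"user": "ben", "action": "purchase"},
        {"user": "ana", "action": "purchase"},
    ]


def store(records, storage):
    # Append the collected records to storage (a list stands in for a database or data lake).
    storage.extend(records)
    return storage


def process(storage):
    # Analyze the stored data: count how often each action occurs.
    return Counter(record["action"] for record in storage)


def consume(summary):
    # Turn the analysis into something a person or dashboard can read.
    return [f"{action}: {count}" for action, count in sorted(summary.items())]


storage = []
report = consume(process(store(collect(), storage)))
print(report)
```

Each function maps to one step, so the output of one stage is the input of the next.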
Data collection is divided into three categories:
- Transactions
- Files/objects
- Events
What are transactions?
They are data structures and database records, typically coming from web and mobile applications.
Typically stored in database systems (NoSQL, SQL, In-memory)
What are files/objects?
Media files and log files.
Typically stored in a file/object store.
Often managed in Amazon S3 (Simple Storage Service), which can serve as the foundation of a data lake.
What is metadata?
Data that provides information about other data and helps understand it.
To manage it, we use tools like the AWS Glue Data Catalog, a fully managed data catalog that uses crawlers (programs that detect new data) and search (metadata discovery).
What are events?
Data streams made up of events. Typically stored in stream storage.
Examples: Amazon Kinesis, Apache Kafka, Amazon MSK (Managed Streaming for Apache Kafka).
What is a data lake?
A centralized repository for storing raw data (structured, unstructured and semi-structured).
The data is stored in a variety of formats, offering flexibility and scalability for different types of data.
What is a data warehouse?
A centralized repository that stores large volumes of structured, processed data, organized and ready for analysis.
Data is cleaned, transformed and organized into a specific structure when written into the warehouse.
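The idea of cleaning and transforming data at write time can be sketched as follows. This is an illustrative assumption, not a real warehouse API: the field names, the sample records, and the helper `to_warehouse_row` are all hypothetical.

```python
from datetime import date


def to_warehouse_row(raw):
    """Clean and transform one raw record into a fixed structure (hypothetical schema)."""
    return {
        "customer": raw["customer"].strip().title(),      # trim spaces, normalize casing
        "amount_usd": round(float(raw["amount"]), 2),     # text -> number
        "order_date": date.fromisoformat(raw["date"]),    # text -> date
    }


# Raw records as they might sit in a data lake: inconsistent casing, everything as text.
data_lake = [
    {"customer": "  ana lopez ", "amount": "19.990", "date": "2024-03-01"},
    {"customer": "BEN KIM", "amount": "5", "date": "2024-03-02"},
]

# The structure is enforced when the data is written into the warehouse.
warehouse = [to_warehouse_row(record) for record in data_lake]
print(warehouse)
```

The raw list keeps the data as it arrived, while every row in `warehouse` has the same cleaned fields and types.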
What are crawlers?
Automated programs used by search engines to systematically browse and index content from websites.
What are batch analytics?
Data analysis for large volumes of data that gets processed as a whole.
Reports that are monthly, weekly or daily.
Takes minutes to hours.
e.g., financial institutions
What are interactive analytics?
Data analysis for on-demand queries answered in real time or near real time.
Answers within seconds.
e.g., business intelligence
What are stream analytics?
Data analysis for continuous real-time data.
One-minute metrics.
Takes milliseconds to seconds.
e.g., fraud alerts
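One way to picture "one-minute metrics" is to group events into one-minute windows and count them. The sketch below is a simplification under stated assumptions: the events are a hypothetical in-memory list of (timestamp in seconds, name) pairs, and a real stream would arrive continuously instead of all at once.

```python
from collections import defaultdict


def one_minute_counts(events):
    """Count events per one-minute window; events are (timestamp_seconds, name) pairs."""
    counts = defaultdict(int)
    for timestamp, _name in events:
        window_start = (timestamp // 60) * 60  # start of the minute the event falls in
        counts[window_start] += 1
    return dict(counts)


# Hypothetical transaction events: three in the first minute, then one in each of the next two.
events = [(0, "tx"), (15, "tx"), (59, "tx"), (60, "tx"), (125, "tx")]
metrics = one_minute_counts(events)
print(metrics)
```

A sudden jump in the count for one window is the kind of signal a fraud alert could be built on.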
What are predictive analytics?
Data analysis that uses statistical techniques and machine learning to analyze historical data and predict future trends and events.
Milliseconds (real time) to minutes (batch).
e.g., fraud detection, demand forecasting
Where is the biggest data center concentration located?
Loudoun County, Virginia
70% of all web traffic goes there.
10 million ft^2 in 70 buildings
Greenpeace measured that if you put together all the data centers in the world, it would be the ___ ___ greatest electricity consumer in the world.
5th
AWS is growing _ _% a year
40
What is a buffer?
A temporary storage area in a computer's memory (RAM) or on disk that holds data while it is being transferred from one place to another.
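As a rough illustration, the sketch below moves data from a source to a destination through a small fixed-size buffer. The function name `copy_with_buffer` and the 4-byte buffer size are assumptions chosen to keep the example readable; in-memory `io.BytesIO` objects stand in for real files or network connections.

```python
import io


def copy_with_buffer(src, dst, buffer_size=4):
    """Copy src to dst through a fixed-size buffer; return how many times it was filled."""
    chunks = 0
    while True:
        buffer = src.read(buffer_size)  # hold up to buffer_size bytes temporarily
        if not buffer:                  # an empty read means the source is exhausted
            break
        dst.write(buffer)               # pass the buffered bytes on to the destination
        chunks += 1
    return chunks


source = io.BytesIO(b"hello world")
target = io.BytesIO()
chunks = copy_with_buffer(source, target)
print(chunks, target.getvalue())
```

The data never has to fit in memory all at once: only one buffer's worth is held at any moment.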
Where did the term “cloud” originate?
It was introduced by Eric Schmidt on August 9, 2006, at the "Search Engine Strategies" Conference.
He described Google services as belonging "in a cloud somewhere".
What is the life cycle of information?
Input->Capture->Manage & Store->Deliver->Output
What are data sets?
A collection of data.