Big Data Engineering Flashcards
What is Data Engineering?
The practice of designing and building systems for data aggregation, storage, and analysis at scale.
Define Data Analytics.
The science of fusing heterogeneous data, identifying relationships, making predictions, and supporting decision-making.
What is Big Data?
Large volumes of structured or unstructured data with high variety, velocity, and complexity.
What are the 5 Vs of Big Data?
Volume: Large amounts of data
Variety: Different types and sources
Velocity: Speed of data generation
Veracity: Trustworthiness of data
Value: Business and strategic importance
Why is Big Data significant?
Increasing data generation
Improved data storage and analysis capabilities
High business and research value
What are the major classifications of Data Analytics?
Descriptive Analytics: Summarizing past data
Predictive Analytics: Forecasting future trends
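Example (a minimal sketch, assuming a hypothetical series of monthly sales figures invented for illustration): the descriptive step summarizes values that were already observed, while the predictive step fits a least-squares line and extrapolates one period ahead. A straight-line fit is only one possible forecasting method, chosen here for brevity.

```python
from statistics import mean

# Hypothetical monthly sales figures, invented for illustration.
sales = [10.0, 12.0, 14.0, 16.0]
months = list(range(len(sales)))  # 0, 1, 2, 3

# Descriptive analytics: summarize what already happened.
summary = {"mean": mean(sales), "min": min(sales), "max": max(sales)}

# Predictive analytics: fit y = slope * x + intercept by ordinary least squares.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar

# Forecast the next, not yet observed month (index 4).
forecast = slope * len(sales) + intercept

print(summary)
print(forecast)
```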
What are the key processes in the Big Data lifecycle?
- Acquisition
- Extraction
- Integration
- Analysis
- Interpretation
- Decision-making
Describe the contents of a Data Lake.
Stores raw data
Stores any type of structured or unstructured data
Agile and low cost
Used for machine learning and IoT workloads
Describe the contents of a Data Warehouse.
Stores processed data
Stores structured data
Less agile and more expensive
Used for business intelligence and healthcare analytics
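Example (a rough sketch of the contrast between the Data Lake and Data Warehouse cards, not a real deployment): the lake is modeled as a plain list that keeps every raw record as it arrived, and the warehouse as an in-memory SQLite table with a fixed, typed schema. The events, the `purchases` table, and its column names are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, invented for illustration.
raw_events = [
    '{"customer": "ana", "amount": "19.90", "device": {"os": "ios"}}',
    '{"customer": "bo", "note": "free-form text, no amount"}',
]

# Data lake: keep every record exactly as it arrived; no schema is enforced.
lake = list(raw_events)

# Data warehouse: only records that fit a fixed, typed schema are loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)

for line in lake:
    record = json.loads(line)
    # Processing step: keep only the fields the schema knows, with proper types.
    if "customer" in record and "amount" in record:
        warehouse.execute(
            "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
            (record["customer"], float(record["amount"])),
        )

row_count = warehouse.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
total = warehouse.execute("SELECT SUM(amount) FROM purchases").fetchone()[0]
print(len(lake), row_count, total)
```

The lake still holds both events, including the one with no amount, while the warehouse holds only the record that matched its schema.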
What does step 1, acquisition, require?
Selection
Filtering
Metadata generation
Managing provenance
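Example (a minimal sketch of the four acquisition tasks above, assuming hypothetical sensor feeds): the source names, the readings, the `-999.0` failure marker, and the `celsius` unit are all invented for illustration, and `acquire` is a hypothetical helper, not a library function.

```python
from datetime import datetime, timezone

# Hypothetical sensor readings; source names and values are invented.
sources = {
    "sensor_a": [21.5, 22.0, -999.0, 21.8],  # -999.0 marks a failed reading
    "sensor_b": [19.0, 19.4],
    "debug_feed": [0.0, 0.0],
}


def acquire(sources, wanted, is_valid):
    """Select sources, filter readings, and attach metadata and provenance."""
    records = []
    for name in wanted:  # selection: only the requested sources are read
        for value in sources[name]:
            if not is_valid(value):  # filtering: drop unusable readings
                continue
            records.append(
                {
                    "value": value,
                    # Metadata generation (the unit is an assumption).
                    "metadata": {"unit": "celsius"},
                    # Managing provenance: where and when the value was acquired.
                    "provenance": {
                        "source": name,
                        "acquired_at": datetime.now(timezone.utc).isoformat(),
                    },
                }
            )
    return records


records = acquire(
    sources, wanted=["sensor_a", "sensor_b"], is_valid=lambda v: v > -100
)
print(len(records))
```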
What does step 2, extraction, require?
Transformation
Normalization
Cleaning
Aggregation
Error handling
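Example (a minimal sketch of the five extraction tasks above, assuming hypothetical rows of the form "city,temperature in Fahrenheit"): normalization is read here as min-max scaling, which is only one of several meanings the term can have. All row values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical raw rows as "city,temperature_fahrenheit"; values are invented.
raw_rows = ["Paris,68", " paris ,50", "Rome,86", "Rome,n/a", "broken-row"]

readings = defaultdict(list)
errors = []

for row in raw_rows:
    try:
        city, temp_f = row.split(",")  # raises ValueError on malformed rows
        city = city.strip().title()  # cleaning: trim whitespace, unify case
        temp_c = (float(temp_f) - 32) * 5 / 9  # transformation: F to C
    except ValueError as exc:
        errors.append((row, str(exc)))  # error handling: record and skip
        continue
    readings[city].append(temp_c)

# Aggregation: mean temperature per city.
means = {city: sum(vals) / len(vals) for city, vals in readings.items()}

# Normalization: min-max scale the per-city means into [0, 1].
# This assumes at least two distinct means; otherwise hi - lo would be zero.
lo, hi = min(means.values()), max(means.values())
normalized = {city: (m - lo) / (hi - lo) for city, m in means.items()}

print(means)
print(normalized)
print(len(errors))
```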
What does step 3, integration, require?
Standardization
Conflict management
Reconciliation
Mapping definition
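Example (a minimal sketch of the four integration tasks above, assuming two hypothetical systems, `crm` and `shop`, that describe the same customers with different field names): the records, field names, and the "most recently updated wins" conflict rule are all assumptions made for illustration.

```python
# Hypothetical records from two systems; all names and values are invented.
crm = [{"cust_id": 1, "e_mail": "Ana@Example.com", "updated": "2024-03-01"}]
shop = [
    {"customerId": 1, "email": "ana@example.org", "updated": "2024-05-10"},
    {"customerId": 2, "email": "bo@example.com", "updated": "2024-04-02"},
]

# Mapping definition: source field name -> shared target field name.
mappings = {
    "crm": {"cust_id": "id", "e_mail": "email", "updated": "updated"},
    "shop": {"customerId": "id", "email": "email", "updated": "updated"},
}


def standardize(record, mapping):
    """Rename fields to the shared schema and lowercase e-mail addresses."""
    out = {target: record[source] for source, target in mapping.items()}
    out["email"] = out["email"].lower()
    return out


merged = {}
for system, rows in (("crm", crm), ("shop", shop)):
    for row in rows:
        rec = standardize(row, mappings[system])  # standardization
        current = merged.get(rec["id"])
        # Conflict management: when two records share an id, keep the most
        # recently updated one (ISO dates compare correctly as strings).
        if current is None or rec["updated"] > current["updated"]:
            merged[rec["id"]] = rec

# Reconciliation result: exactly one record per customer id.
integrated = sorted(merged.values(), key=lambda r: r["id"])
print(integrated)
```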
What does step 4, analysis, require?
Exploration
Data mining
Machine Learning
Visualization
What does step 5, interpretation, require?
Knowledge of the domain
Knowledge of the provenance
Identification of patterns of interest
Flexibility of the process
What does step 6, decision-making, require?
Managerial skills
Continuous improvement of the project
What is the software stack for data analytics?
- Ingestion
- Storage (HDFS)
- Data preprocessing (Spark)
- Knowledge extraction