Module 1 Themes Flashcards
What is Data Governance?
The practice of defining and enforcing data policies and of managing metadata and data lineage. Key tools include Egeria and Apache Atlas, which support compliance, data quality improvement, and collaborative metadata management.
What is Data Lake Management?
The management of vast data repositories that store raw data for processing. Tools like Kylo help build and manage data ingestion pipelines, offering features such as self-service data preparation and quality monitoring.
What is Version Control?
A system for tracking and managing changes to source code over time. Git is the distributed version control system itself; GitHub and GitLab are hosting platforms built around Git that add CI/CD pipelines and collaborative project management.
What is Bias Mitigation?
The identification and correction of unfair biases in machine learning models. AI Fairness 360 evaluates fairness metrics and applies algorithms to reduce bias, which is crucial in sensitive domains like hiring and healthcare.
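As a rough illustration of what "evaluating fairness metrics" can mean, the sketch below computes two widely used group-fairness metrics, disparate impact and statistical parity difference, in plain Python. It does not use the AI Fairness 360 API; the function names and the hiring data are hypothetical and chosen only for illustration.

```python
def selection_rate(outcomes):
    """Fraction of favorable outcomes (1 = favorable, 0 = unfavorable)."""
    return sum(outcomes) / len(outcomes)


def disparate_impact(unprivileged, privileged):
    """Ratio of selection rates; values near 1.0 suggest parity."""
    return selection_rate(unprivileged) / selection_rate(privileged)


def statistical_parity_difference(unprivileged, privileged):
    """Difference of selection rates; values near 0.0 suggest parity."""
    return selection_rate(unprivileged) - selection_rate(privileged)


# Hypothetical hiring decisions (1 = hired, 0 = rejected).
group_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # unprivileged group, 2 of 10 hired
group_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]  # privileged group, 5 of 10 hired

di = disparate_impact(group_a, group_b)
spd = statistical_parity_difference(group_a, group_b)
print(f"disparate impact: {di:.2f}, statistical parity difference: {spd:.2f}")
```

A disparate impact well below 1.0, as in this made-up data, is the kind of signal that would prompt a bias mitigation step.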
What is Model Interpretability?
The ability to explain and understand machine learning model predictions. AI Explainability 360 offers tools for feature importance analysis, rule-based explanations, and counterfactual reasoning.
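One generic way to estimate feature importance is permutation importance: shuffle one feature at a time and measure how much the model's error grows. The sketch below is a minimal pure-Python version under stated assumptions; it is not the AI Explainability 360 API, and the toy model, data, and function names are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the dataset with only feature j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def model(row):
    """Toy model that depends only on feature 0 and ignores feature 1."""
    return 3.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(8)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(model, X, y)
print(importances)
```

Because the toy model ignores feature 1, shuffling that feature leaves the error unchanged and its importance is zero, while shuffling feature 0 increases the error.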
What is Adversarial Testing?
The practice of testing machine learning models against adversarial attacks to ensure robustness. The Adversarial Robustness 360 Toolbox (now known as the Adversarial Robustness Toolbox, ART) provides defenses such as adversarial training and input preprocessing.
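To make "adversarial attack" concrete, the sketch below applies the fast gradient sign method (FGSM), one well-known attack, to a tiny logistic regression model in plain Python. It does not use the toolbox's API; the weights, input, and epsilon are hypothetical values picked so that the effect is easy to see.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w, b, x, y, eps):
    """Move each input feature by eps in the direction that increases the loss.

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input x is (p - y) * w.
    """
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


# Hypothetical model parameters and a correctly classified input (label 1).
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.1], 1

p_clean = predict_proba(w, b, x)
x_adv = fgsm(w, b, x, y, eps=0.5)
p_adv = predict_proba(w, b, x_adv)
print(f"clean: {p_clean:.3f}, adversarial: {p_adv:.3f}")
```

Here the perturbed input pushes the predicted probability below 0.5, so the predicted class flips. Adversarial training would add such perturbed examples, with their correct labels, back into the training set.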
What are Monitoring Systems?
Tools that track application performance, collect metrics, and raise alerts. Prometheus collects time-series metrics and supports near-real-time querying and alerting.
What are Machine Learning Pipelines?
Frameworks and tools for executing scalable machine learning workflows. Apache Spark is geared primarily toward batch processing, while Apache Flink specializes in real-time stream processing.
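The batch versus stream distinction can be sketched without either framework. In the plain-Python example below, the batch function needs the complete dataset before it produces one answer, while the streaming function emits an updated answer after every record. The function names and sensor readings are hypothetical, and this illustrates the processing model only, not the Spark or Flink APIs.

```python
def batch_mean(values):
    """Batch style: the whole dataset is available before computing."""
    return sum(values) / len(values)


def streaming_mean(stream):
    """Stream style: keep small running state and emit a result per record."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


# Hypothetical sensor readings.
readings = [4.0, 8.0, 6.0, 2.0]

batch_result = batch_mean(readings)
stream_results = list(streaming_mean(iter(readings)))

print(batch_result)    # 5.0
print(stream_results)  # [4.0, 6.0, 6.0, 5.0]
```

The last streaming result matches the batch result, but the streaming version produced usable intermediate answers along the way.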
What are Integrated Visual Tools?
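Visual platforms that allow drag-and-drop functionality for tasks like data transformation and model building. Node-RED provides a browser-based flow editor for wiring together integration workflows. TensorFlow Lite, by contrast, is not itself a visual tool; it focuses on deploying machine learning models to embedded and mobile devices.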
What is Data Streaming?
The continuous, real-time processing and analysis of data as it arrives. Apache Kafka is a distributed streaming platform that supports scalable, fault-tolerant data pipelines.