Monitoring ML solutions Flashcards
Why is privacy a critical consideration in AI, and how does it relate to Google’s AI principles?
Privacy is integral to ethical AI design because:
Adheres to legal and regulatory standards.
Aligns with social norms and individual expectations.
Safeguards sensitive information.
Privacy is a cornerstone of Google’s fifth AI principle: Incorporate privacy design principles, ensuring AI systems respect user data.
What are sensitive attributes, and how do they impact AI system design?
Sensitive attributes include personally identifiable information (PII) and other critical data, such as:
PII: Names, addresses, SSNs.
Social Data: Ethnicity, religion.
Health Data: Diagnoses, genetic information.
Financial Data: Credit card details, income.
Biometric Data: Fingerprints, facial recognition.
AI systems must handle sensitive data with heightened security and legal compliance, as misuse can result in privacy violations and user mistrust.
What are common de-identification techniques in AI, and their benefits and drawbacks?
Redaction: Deletes sensitive data; irreversible but may reduce model utility.
Replacement: Substitutes values; irreversible, can impact learning.
Masking: Hides parts of data; retains structure but not the original value.
Tokenization: Maps data to unique tokens; reversible via the token mapping, which must itself be protected against attack.
Bucketing: Groups numeric data into ranges; reduces granularity.
Shifting: Shifts timestamps by a random offset; preserves sequence but is reversible if the offset is known.
Each technique balances privacy and utility based on context.
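A minimal sketch of a few of these techniques in plain Python; the field names, salt, and bucket width are illustrative only:

```python
import hashlib

def mask_email(email: str) -> str:
    """Masking: hide part of the value but keep its structure."""
    user, domain = email.split("@")
    return user[0] + "*" * (len(user) - 1) + "@" + domain

def tokenize(value: str, salt: str = "demo-salt") -> str:
    """Tokenization: map the value to a surrogate token (reversal requires
    a separately stored mapping, which must itself be protected)."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def bucket_age(age: int, width: int = 10) -> str:
    """Bucketing: replace an exact number with a coarse range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

record = {"email": "jane.doe@example.com", "age": 34}
print(mask_email(record["email"]))  # j*******@example.com
print(tokenize(record["email"]))    # deterministic surrogate token
print(bucket_age(record["age"]))    # 30-39
```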
Explain k-anonymity and l-diversity. How do they enhance privacy?
k-Anonymity: Ensures each record is indistinguishable from at least k-1 others, reducing re-identification risks.
l-Diversity: Ensures that each anonymized group has l distinct sensitive values, addressing homogeneity in k-anonymized data.
These methods collectively enhance privacy while maintaining data utility.
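As a worked illustration, a table is k-anonymous over its quasi-identifiers if every combination of those values appears at least k times, and l-diverse if each such group contains at least l distinct sensitive values. A small pandas check (column names and rows are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "age_bucket": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "zip3":       ["981",   "981",   "981",   "980",   "980"],
    "diagnosis":  ["flu",   "flu",   "cold",  "flu",   "asthma"],
})

quasi_identifiers = ["age_bucket", "zip3"]
group_sizes = df.groupby(quasi_identifiers).size()
k = group_sizes.min()                     # smallest equivalence class -> k-anonymity
l = (df.groupby(quasi_identifiers)["diagnosis"]
       .nunique()
       .min())                            # fewest distinct sensitive values -> l-diversity
print(f"k = {k}, l = {l}")                # here k = 2, l = 2
```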
How does differential privacy protect individual data during analysis?
Differential privacy ensures that the inclusion or exclusion of any individual’s data minimally affects the analysis outcome by:
Adding calibrated noise.
Preventing sensitive attribute identification.
Providing strong, mathematically proven privacy guarantees through parameters like epsilon (privacy strength).
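A toy sketch of the Laplace mechanism for a counting query; the epsilon values and data are arbitrary, for illustration only:

```python
import numpy as np

def dp_count(values, predicate, epsilon=1.0):
    """Differentially private count: true count plus Laplace noise.
    A count query has sensitivity 1 (one person changes it by at most 1),
    so the noise scale is 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 37, 41, 52, 29, 64]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # more noise, stronger privacy
print(dp_count(ages, lambda a: a >= 40, epsilon=5.0))  # less noise, weaker privacy
```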
What are the trade-offs involved in setting epsilon for differential privacy?
Lower Epsilon: Stronger privacy, but higher noise can degrade data utility.
Higher Epsilon: Less privacy, but better model accuracy.
Selecting epsilon involves balancing privacy with analytical and model performance.
What is DP-SGD, and how does it enhance model training security?
Differentially Private Stochastic Gradient Descent (DP-SGD) integrates differential privacy into SGD by:
Gradient Clipping: Limits the influence of individual samples.
Noise Addition: Adds calibrated noise to the clipped gradients during updates. DP-SGD can be implemented with libraries such as TensorFlow Privacy.
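A hedged sketch using the TensorFlow Privacy library; the import path and hyperparameter values follow the library's public tutorial and may differ slightly between versions:

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import (
    DPKerasSGDOptimizer,
)

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,       # gradient clipping: bound each example's influence
    noise_multiplier=1.1,   # noise addition: calibrated Gaussian noise on updates
    num_microbatches=32,    # must evenly divide the training batch size
    learning_rate=0.05,
)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# The loss must be computed per example so clipping can be applied per sample.
loss = tf.keras.losses.BinaryCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
model.compile(optimizer=optimizer, loss=loss)
```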
Describe federated learning and its advantages for privacy.
Federated learning trains models locally on user devices, sharing only gradients with central servers:
Preserves data privacy by avoiding raw data transfer.
Supports personalization, e.g., Gboard predictions.
Updates central models without exposing sensitive user inputs.
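A minimal NumPy sketch of the federated averaging idea; the linear model, client data, and round count are toy stand-ins:

```python
import numpy as np

def client_update(global_weights, local_x, local_y, lr=0.1):
    """Each client takes a gradient step on its own data locally;
    only the updated weights, never the raw data, leave the device."""
    preds = local_x @ global_weights
    grad = local_x.T @ (preds - local_y) / len(local_y)
    return global_weights - lr * grad

rng = np.random.default_rng(0)
global_w = np.zeros(3)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]

for _ in range(10):                       # communication rounds
    updates = [client_update(global_w, x, y) for x, y in clients]
    global_w = np.mean(updates, axis=0)   # server-side aggregation (FedAvg)
print(global_w)
```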
What are key privacy challenges in federated learning?
Membership Inference Attacks: Revealing whether specific data points were used in training.
Sensitive Property Breaches: Exposing private attributes.
Model Poisoning: Malicious users manipulate training data to degrade models.
How does secure aggregation enhance privacy in federated learning?
Secure aggregation encrypts user gradients before sharing with central servers:
Ensures gradients are only decrypted after aggregation.
Protects individual data contributions.
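A toy sketch of the additive-masking idea behind secure aggregation: each pair of clients agrees on a random mask that cancels in the sum, so the server only learns the aggregate. Real protocols add key agreement, dropout handling, and more:

```python
import numpy as np

rng = np.random.default_rng(42)
true_updates = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
n = len(true_updates)

# Pairwise masks: client i adds +m_ij and client j adds -m_ij, so they cancel.
pair_masks = {(i, j): rng.normal(size=2) for i in range(n) for j in range(i + 1, n)}

masked = []
for i, update in enumerate(true_updates):
    mask = (sum(pair_masks[(i, j)] for j in range(i + 1, n))
            - sum(pair_masks[(j, i)] for j in range(i)))
    masked.append(update + mask)      # the server only ever sees this masked vector

print(np.sum(masked, axis=0))         # equals the true sum: [ 9. 12.]
print(np.sum(true_updates, axis=0))
```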
How does Google Cloud prevent training data extraction attacks in generative AI?
Google Cloud:
Excludes customer data from training foundation models.
Encrypts data at rest and in transit.
Ensures generated content cannot reveal specific training data.
What are the risks of training data extraction attacks, and how do they occur?
Risks:
Revealing sensitive information (e.g., addresses).
Violating user privacy.
These occur through iterative prompt crafting to extract memorized training examples from generative models.
How does Google ensure privacy compliance in its AI/ML systems?
Privacy by Default: No customer data in foundation models.
Encryption: TLS in transit; Customer-Managed Encryption Keys (CMEK) for data at rest.
Access Control: IAM for minimal privilege.
How does the Cloud Data Loss Prevention API support sensitive data protection?
The API:
Detects PII in structured/unstructured data.
Applies de-identification techniques like masking and tokenization.
Monitors re-identification risks.
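A hedged sketch of de-identification with the google-cloud-dlp client library; the request shape follows the v2 API's published samples, and the project ID is a placeholder:

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project"   # placeholder project ID

item = {"value": "Contact Jane Doe at jane.doe@example.com"}
inspect_config = {"info_types": [{"name": "PERSON_NAME"}, {"name": "EMAIL_ADDRESS"}]}

# Replace each detected finding with its infoType name, e.g. [EMAIL_ADDRESS].
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

response = client.deidentify_content(
    request={
        "parent": parent,
        "item": item,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
    }
)
print(response.item.value)
```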
Why is encryption critical for AI systems, and how does Google implement it?
Encryption ensures data security:
Default Encryption: For data at rest and in transit.
Cloud KMS: Centralized management of cryptographic keys.
What rules does IAM enforce to ensure secure access control in Google Cloud?
IAM enforces:
Least-privilege access.
Fine-grained roles for resources.
Audit trails to monitor actions.
What is differential privacy’s role in federated learning?
It prevents gradient leaks by:
Adding noise to gradients before aggregation.
Ensuring individual updates cannot be inferred.
What are the security concerns specific to generative AI models?
Memorization of sensitive data.
Output leakage via prompts.
Vulnerability to adversarial prompts.
How does Google secure generative AI inference pipelines?
Encrypts inputs and outputs in transit.
Stores tuned weights securely.
Provides CMEK for customer-managed encryption.
Summarize the privacy principles applied in AI/ML by Google.
Data Minimization: Collect only necessary data.
Transparency: Document usage and policies.
Security: Encrypt, monitor, and audit all interactions.
What is the relationship between AI safety and Google’s AI principles?
AI safety is grounded in Google’s AI principles, specifically:
Principle 3: “Be built and tested for safety,” emphasizing robust testing to minimize risks.
Principle 2: Avoid creating or reinforcing unfair bias.
Principle 6: Ensure accountability to people, promoting transparency and oversight.
AI safety overlaps with fairness and accountability, ensuring ethical use.
What makes safety more challenging in generative AI compared to discriminative AI models?
Unknown Output Space: Generative AI can produce unexpected and creative outputs, making prediction difficult.
Diverse Training Data: Models trained on large datasets might generate outputs significantly different from the input data.
Adversarial Inputs: Generative AI is more prone to malicious prompt exploitation.
Unlike discriminative models (e.g., classifiers), generative models require extensive safeguards to manage risks.
What are the two primary approaches to AI safety?
Technical Approach: Implements engineering solutions, such as model safeguards, input-output filters, and adversarial testing.
Institutional Approach (AI Governance): Focuses on industry-wide policies, national regulations, and ethical guidelines to govern AI use.
Both approaches complement each other.
What are input and output safeguards in generative AI systems?
Input Safeguards: Block or rewrite harmful prompts before processing.
Output Safeguards: Detect and mitigate unsafe outputs using classifiers, error messages, or response ranking based on safety scores.
These safeguards ensure compliance with safety standards.
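A schematic sketch of how such safeguards can wrap a generative model; safety_score and generate are hypothetical stand-ins for a real safety classifier and model:

```python
BLOCK_THRESHOLD = 0.8   # illustrative policy threshold

def safety_score(text: str) -> float:
    """Placeholder for a real safety classifier returning P(harmful)."""
    return 0.0

def generate(prompt: str) -> str:
    """Placeholder for the underlying generative model."""
    return "model response"

def safe_generate(prompt: str) -> str:
    # Input safeguard: block harmful prompts before they reach the model.
    if safety_score(prompt) >= BLOCK_THRESHOLD:
        return "Sorry, I can't help with that request."
    candidate = generate(prompt)
    # Output safeguard: re-check the response and fall back if it is unsafe.
    if safety_score(candidate) >= BLOCK_THRESHOLD:
        return "The generated response was withheld for safety reasons."
    return candidate
```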
Explain adversarial testing and its significance in AI safety.
Adversarial testing evaluates how an AI system responds to malicious or harmful inputs by:
Creating test datasets with edge cases and adversarial examples.
Running model inference on the dataset to identify failures.
Annotating and analyzing outputs for policy violations.
It guides model improvements and informs product launch decisions.
Differentiate between malicious and inadvertently harmful inputs.
Malicious Inputs: Explicitly designed to elicit harmful responses (e.g., asking for hate speech).
Inadvertently Harmful Inputs: Benign inputs that result in harmful outputs due to biases or context sensitivity (e.g., stereotypes in descriptions).
Both require mitigation through testing and safeguards.
What are some common ways a generative AI can fail to meet guidelines?
Generating harmful content (e.g., hate speech).
Revealing PII or SPII.
Producing biased or unethical outputs.
Misaligning with user contexts.
Avoiding these requires robust safety frameworks.
How can safety classifiers mitigate harmful content in generative AI?
Safety classifiers evaluate inputs and outputs based on predefined harm categories (e.g., hate speech, explicit content) and suggest actions:
Block harmful inputs.
Rewrite risky prompts.
Rank outputs by safety scores.
Examples: Google’s Perspective API and OpenAI’s Moderation API.
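A hedged sketch of scoring text with the Perspective API via the Google API client; the call pattern follows the public Perspective documentation, and the API key is a placeholder:

```python
from googleapiclient import discovery

API_KEY = "YOUR_API_KEY"   # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "You are a wonderful person."},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(f"Toxicity score: {score:.2f}")   # block, rewrite, or rank if above a threshold
```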
What is the role of human oversight in AI safety workflows?
Validate classifier predictions.
Annotate complex or subjective outputs (e.g., hate speech).
Correct errors in automated processes.
Human-in-the-loop mechanisms ensure accountability for high-risk applications.
Describe instruction fine-tuning and its relevance to AI safety.
Instruction fine-tuning teaches models safety-related tasks using curated datasets with specific instructions:
Embed safety concepts (e.g., toxic language detection).
Reduce harmful outputs by training on safety-related scenarios.
This enhances model alignment with human values.
What is RLHF, and how does it embed safety into AI systems?
Reinforcement Learning from Human Feedback (RLHF) involves:
Training a reward model using human preferences.
Iteratively fine-tuning models to align with the reward model.
Evaluating responses for safety and helpfulness.
RLHF integrates safety preferences into AI systems effectively.
What is constitutional AI, and how does it enhance safety training?
Constitutional AI is a method for training AI systems to be helpful, honest, and harmless. It uses a written set of principles (a "constitution") to guide AI behavior and self-improvement, greatly reducing reliance on human feedback.
These principles draw on sources such as legal and human-rights frameworks and include: human rights, privacy protections, due process, and equality before the law.
Constitutional AI uses:
Self-Critique: AI revises its outputs to align with predefined principles.
RLAIF (Reinforcement Learning from AI Feedback): an AI model critiques outputs and generates the preference datasets used for safety fine-tuning.
This reduces reliance on manual supervision.
How do safety thresholds in Gemini API ensure content safety?
Gemini API provides adjustable thresholds:
Block low and above: strictest setting; blocks content with even a low probability of being unsafe.
Block medium and above: the default threshold for most use cases.
Block only high: the most lenient setting; blocks only content with a high probability of being unsafe.
These thresholds align with use-case-specific needs.
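A hedged sketch with the google-generativeai Python SDK; the category and threshold names follow the public docs, while the model name and API key are placeholders:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")   # placeholder

safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_LOW_AND_ABOVE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
]

model = genai.GenerativeModel("gemini-1.5-flash", safety_settings=safety_settings)
response = model.generate_content("Explain adversarial testing in one paragraph.")
print(response.text)
```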
How does Google Cloud’s Natural Language API support AI safety?
It provides text moderation capabilities by:
Classifying content based on safety attributes.
Assigning confidence scores for each category.
Allowing customizable thresholds for moderation decisions.
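A hedged sketch with the google-cloud-language client; moderate_text is available in recent versions of the library, and the threshold here is arbitrary:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = {
    "type_": language_v1.Document.Type.PLAIN_TEXT,
    "content": "Some user-generated text to check.",
}

response = client.moderate_text(document=document)
for category in response.moderation_categories:
    # Apply a customizable per-category threshold to decide whether to block.
    if category.confidence >= 0.7:
        print(f"Flagged: {category.name} ({category.confidence:.2f})")
```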
Explain the trade-offs between safety and fairness in training AI models.
Enhanced Safety: Filtering toxic data reduces harmful outputs but risks over-correcting for sensitive topics.
Fairness Impact: Filtering can suppress representation of marginalized groups, limiting diversity in outputs.
Balancing these requires nuanced dataset curation and tuning.
How do lexical and semantic diversity impact adversarial datasets?
Lexical Diversity: Ensures varied vocabulary for better robustness testing.
Semantic Diversity: Covers a broad range of meanings and contexts.
Both dimensions enhance the effectiveness of adversarial testing.
What role does safety evaluation play and how does this affect product launch decisions?
Safety evaluation identifies unmitigated risks, such as:
Likelihood of policy violations.
Potential harm to users.
Findings guide safeguards and launch readiness.
How does prompt engineering support safety in generative AI?
Prompt engineering:
Shapes inputs to reduce risky outputs.
Uses control tokens or style transfers to steer model behavior.
Works alongside tuned models for maximum safety.
What are semi-scripted outputs, and when are they useful?
Semi-scripted outputs:
Combine AI generation with pre-defined messages.
Explain safety restrictions to users effectively.
They enhance transparency while mitigating harmful responses.
What are the safety categories and confidence levels used in Gemini?
Categories include harassment, hate speech, sexually explicit, and dangerous content.
Confidence levels: Negligible, Low, Medium, and High.
Thresholds determine whether content is blocked or allowed.
What are Google’s AI principles related to fairness, and why is it important in machine learning?
Google’s second AI principle is to avoid creating or reinforcing unfair bias. Fairness in AI ensures equity, inclusion, and ethical decision-making across diverse applications, including high-stakes domains like healthcare, hiring, and lending. Achieving fairness mitigates negative societal impacts and fosters trust in AI systems.
Define bias in the context of AI, and provide examples of five common biases. (data collection biases)
Bias refers to stereotyping or favouritism towards certain groups or perspectives, often due to data or model design.
Examples:
Reporting Bias: Over-representation of unusual events in datasets.
Automation Bias: Over-reliance on AI outputs, even if incorrect.
Selection Bias: Non-representative data sampling.
Group Attribution Bias: Generalizing traits from individuals to groups.
Implicit Bias: Hidden assumptions based on personal experience.
What is selection bias, and what are its three subtypes?
Selection bias occurs when a dataset does not reflect real-world distributions. Subtypes:
Coverage Bias: Incomplete representation of groups.
Non-Response Bias: Gaps due to lack of participation.
Sampling Bias: Non-randomized data collection.
What causes bias during the ML lifecycle, and how can it be mitigated?
Bias can arise during:
Data Collection: Sampling and reporting errors.
Model Training: Amplification of biases in training data.
Evaluation and Deployment: Feedback loops introducing new biases.
Mitigation includes careful dataset curation, bias-aware training, and post-deployment monitoring.
How is fairness defined, and why is it difficult to standardize?
Fairness is context-dependent, encompassing equity and inclusion across sensitive variables like gender and ethnicity. Standardization is challenging because:
Fairness criteria vary across cultural, legal, and social contexts.
Metrics can be incompatible (e.g., demographic parity vs. equality of opportunity).
Explain TensorFlow Data Validation (TFDV) and its role in identifying data bias.
TFDV supports:
Data Exploration: Provides statistical summaries (e.g., mean, std dev).
Data Slicing: Analyzes subsets (e.g., location-based distributions).
Schema Inference: Automates validation criteria.
Anomaly Detection: Flags issues like missing values or skewed distributions.
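A short sketch of that workflow with the tensorflow_data_validation package; the GCS paths are placeholders:

```python
import tensorflow_data_validation as tfdv

# Data exploration: descriptive statistics over the training data.
train_stats = tfdv.generate_statistics_from_csv("gs://my-bucket/train.csv")
tfdv.visualize_statistics(train_stats)

# Schema inference: derive validation criteria from the statistics.
schema = tfdv.infer_schema(train_stats)

# Anomaly detection: check evaluation/serving data against the schema.
eval_stats = tfdv.generate_statistics_from_csv("gs://my-bucket/eval.csv")
anomalies = tfdv.validate_statistics(eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```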
What is the What-If Tool, and how does it facilitate fairness analysis?
The What-If Tool allows:
Visualization of dataset interactions and model predictions.
Counterfactual Analysis: Tests sensitivity to feature changes.
Flip Rate Metrics: Quantify how often predictions change when sensitive features vary.
Slicing: Evaluates performance across demographic groups.
How does TensorFlow Model Analysis (TFMA) assist in fairness evaluation?
TFMA:
Analyzes model performance using fairness metrics.
Slices data by sensitive features (e.g., racial group) to detect gaps.
Automates validation in MLOps pipelines.
Links to fairness indicators for deeper insights.
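A hedged sketch of a Fairness Indicators evaluation with tensorflow_model_analysis; the paths, label key, and slicing feature are placeholders, and config fields may vary by version:

```python
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],
    slicing_specs=[
        tfma.SlicingSpec(),                               # overall metrics
        tfma.SlicingSpec(feature_keys=["racial_group"]),  # metrics per group
    ],
    metrics_specs=tfma.metrics.specs_from_metrics([
        tfma.metrics.FairnessIndicators(thresholds=[0.25, 0.5, 0.75]),
    ]),
)

eval_result = tfma.run_model_analysis(
    eval_shared_model=tfma.default_eval_shared_model(
        eval_saved_model_path="gs://my-bucket/saved_model"),
    eval_config=eval_config,
    data_location="gs://my-bucket/eval_data.tfrecord",
)
tfma.view.render_slicing_metrics(eval_result, slicing_column="racial_group")
```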
What techniques can mitigate bias during data preparation?
Diversify data sources (e.g., new data collection).
Balance datasets via upsampling or downsampling.
Use synthetic data to augment underrepresented groups.
Relabel data to correct harmful or outdated labels.
Describe the Monk Skin Tone (MST) scale and its purpose in fairness.
The MST scale, developed by Harvard sociologist Dr. Ellis Monk in partnership with Google, provides a 10-shade range for evaluating skin tone representation in datasets. It ensures inclusivity and mitigates biases in facial recognition or image-based systems.
How does threshold calibration address fairness issues in ML systems?
Threshold calibration adjusts classification cutoffs for fairness.
Example: In loan approvals, thresholds can be tuned separately for groups (e.g., based on demographic parity or equality of opportunity) to address systemic disparities.
What are demographic parity and equality of opportunity?
Demographic Parity: Equal prediction rates across groups.
Equality of Opportunity: Equal true positive rates for eligible groups.
Each aligns fairness goals with specific use cases (e.g., access vs. success rates).
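A small sketch computing both quantities per group from toy labels, thresholded predictions, and a sensitive attribute:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])            # thresholded predictions
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

for g in np.unique(group):
    mask = group == g
    positive_rate = y_pred[mask].mean()                  # demographic parity compares these
    tpr = y_pred[mask & (y_true == 1)].mean()            # equality of opportunity compares these
    print(f"group {g}: positive rate = {positive_rate:.2f}, TPR = {tpr:.2f}")
```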
How do MinDiff and Counterfactual Logit Pairing (CLP) improve fairness during model training?
MinDiff: Minimizes prediction distribution gaps across sensitive subgroups.
CLP: Reduces sensitivity to changes in counterfactual examples by penalizing inconsistent logits during training.
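A hedged sketch of MinDiff with the tensorflow-model-remediation library; the model, datasets, and loss weight are toy stand-ins, and the packing utility follows the library's documented pattern but may differ by version:

```python
import tensorflow as tf
from tensorflow_model_remediation import min_diff

original_model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])

def toy_ds(n):
    x = tf.random.normal((n, 4))
    y = tf.cast(tf.reduce_sum(x, axis=1, keepdims=True) > 0, tf.float32)
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

train_ds = toy_ds(256)          # main training data
sensitive_ds = toy_ds(256)      # examples drawn from the sensitive subgroup
nonsensitive_ds = toy_ds(256)   # examples drawn from everyone else

# Pack the datasets so MinDiff can compare subgroup prediction distributions.
min_diff_data = min_diff.keras.utils.pack_min_diff_data(
    train_ds, sensitive_ds, nonsensitive_ds)

min_diff_model = min_diff.keras.MinDiffModel(
    original_model=original_model,
    loss=min_diff.losses.MMDLoss(),   # penalizes distribution gaps between groups
    loss_weight=1.0,
)
min_diff_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
min_diff_model.fit(min_diff_data, epochs=2)
```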
What is flip rate, and why is it important in fairness evaluation?
Flip rate measures how frequently predictions change when sensitive features are altered (e.g., gender). A lower flip rate indicates higher robustness and fairness.
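A tiny sketch of the computation, using a hypothetical predict function and a counterfactual that swaps the sensitive feature:

```python
def flip_rate(model_predict, examples, flip_sensitive_feature):
    """Fraction of examples whose prediction changes when only the
    sensitive feature is altered (lower is better)."""
    flips = 0
    for x in examples:
        original = model_predict(x)
        counterfactual = model_predict(flip_sensitive_feature(x))
        flips += int(original != counterfactual)
    return flips / len(examples)

# Toy usage: the classifier ignores gender, so no predictions flip.
examples = [{"income": 40, "gender": "f"}, {"income": 90, "gender": "m"}]
predict = lambda x: int(x["income"] > 50)
swap_gender = lambda x: {**x, "gender": "m" if x["gender"] == "f" else "f"}
print(flip_rate(predict, examples, swap_gender))   # 0.0
```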
How can fairness trade-offs be addressed in ML systems?
Fairness trade-offs require prioritization based on context:
Define fairness metrics relevant to stakeholders.
Use tools like the Aequitas Fairness Tree for guidance.
Balance conflicting goals through iterative evaluation.
How does relabeling data mitigate bias in models?
Relabeling corrects harmful annotations and updates labels to modern standards.
Example: Sentiment analysis for movie reviews may remove stereotypical labels to prevent biased associations.
What challenges arise when training models on synthetic data?
Models may overfit synthetic patterns, leading to performance issues.
Domain gaps can complicate adaptation to real-world data.
Synthetic examples may unintentionally introduce biases.
Describe the fairness factors tested in threshold calibration.
Fairness constraints include:
Demographic Parity: Equal outcomes across groups.
Equality of Odds: Equal error rates (false positives/negatives) across groups.
Equality of Opportunity: Equal true positive rates.
What is counterfactual fairness, and how does CLP enforce it?
Counterfactual fairness ensures predictions are unaffected by sensitive attribute changes. CLP enforces it by minimizing prediction differences in counterfactual scenarios using added loss terms.
How can fairness indicators in TFMA guide decision-making?
Fairness indicators in TFMA evaluate model performance using multiple fairness metrics, identifying trade-offs and guiding actions like threshold adjustments or retraining with MinDiff or CLP.
What is Responsible AI, and why is it necessary?
Responsible AI refers to the ethical development and deployment of AI systems by understanding and mitigating issues, limitations, or unintended consequences. It ensures that AI is socially beneficial, trustworthy, and accountable. Without Responsible AI practices, even well-intentioned systems can cause ethical issues, reduce user trust, or fail to achieve their intended benefits.
What are Google’s AI principles, and how do they guide AI development?
Google’s AI principles provide a framework for developing ethical AI:
Be socially beneficial.
Avoid creating or reinforcing unfair bias.
Be built and tested for safety.
Be accountable to people.
Incorporate privacy design principles.
Uphold high standards of scientific excellence.
Be made available for beneficial uses aligned with these principles.
They guide AI projects by setting boundaries on what is acceptable, ensuring safety, fairness, and accountability.
What are the four areas in which Google will not pursue AI applications?
Google will not pursue AI applications in the following areas:
Technologies that cause or are likely to cause harm.
Weapons or technologies designed to facilitate injury.
Technologies for surveillance that violate internationally accepted norms.
Technologies contravening widely accepted principles of international law and human rights.
How does responsible AI differ from legal compliance?
Responsible AI extends beyond legal compliance:
Ethics: Focuses on what ought to be done, even if laws don’t mandate it.
Law: Codified rules derived from ethical principles.
Responsible AI incorporates ethical considerations, such as fairness and accountability, that may not yet be codified in regulations.
Why is fairness a central theme in Responsible AI?
Fairness ensures AI systems do not create or reinforce biases related to sensitive characteristics like race, gender, or ability. It is context-dependent and requires continuous evaluation to prevent harm or inequity, especially in high-stakes applications like hiring or criminal justice.
What role do humans play in Responsible AI?
Humans are central to Responsible AI:
Design datasets and models.
Make deployment decisions.
Evaluate and monitor performance. Human decisions reflect personal values, which underscores the need for diverse perspectives and ethical considerations throughout the AI lifecycle.
What are the six recommended practices for Responsible AI development?
Use a human-centered design approach.
Define and assess multiple metrics during training and monitoring.
Directly examine raw data.
Be aware of dataset and model limitations.
Test the system thoroughly to ensure proper functioning.
Continuously monitor and update the system post-deployment.
What is human-centered design, and why is it important for Responsible AI?
Human-centered design focuses on understanding how users interact with AI systems:
Involves diverse user groups to ensure inclusivity.
Models adverse feedback early in the design process.
Ensures clarity, control, and actionable outputs for users.
How does Google Flights incorporate Responsible AI practices?
Google Flights employs:
Transparency: Explaining predictions and data sources.
Actionable Insights: Providing clear indicators like “high,” “typical,” or “low” prices.
Iterative User Research: Adapting design based on user trust and understanding.
Why is transparency critical in Responsible AI?
Transparency builds trust by:
Allowing users to understand how decisions are made.
Offering explanations for predictions and recommendations.
Ensuring ethical practices and accountability.
How does monitoring improve Responsible AI systems post-deployment?
Monitoring ensures models remain effective in dynamic real-world conditions by:
Detecting input drift.
Gathering user feedback.
Updating models based on new data and behaviours.
What are the risks of failing to build trust in AI systems?
Reduced adoption by users or organizations.
Ethical controversies or public backlash.
Potential harm to stakeholders affected by AI decisions.
How can metrics ensure Responsible AI development?
Metrics provide quantitative benchmarks for:
User feedback.
System performance.
Equity across demographic subgroups. Metrics like recall and precision ensure models align with their intended goals.
What is the significance of explainability in Responsible AI?
Explainability allows:
Stakeholders to understand and trust AI outputs.
Identification of biases or errors in decision-making.
Users to appeal or challenge AI-based decisions.
How can raw data examination improve Responsible AI outcomes?
Analyzing raw data ensures:
Data accuracy and completeness.
Representation of all user groups.
Mitigation of training-serving skew and sampling bias.
What is training-serving skew, and how can it be mitigated?
Training-serving skew occurs when data used in training differs from real-world serving data.
Mitigation involves:
Adjusting training objectives.
Ensuring representative evaluation datasets.
What role does the “poka-yoke” principle play in Responsible AI testing?
The poka-yoke principle builds quality checks into systems to:
Prevent failures (e.g., a missing input feature triggers a system alert rather than a silent failure).
Ensure AI outputs only when conditions are met.
Why is iterative user testing crucial for Responsible AI?
Iterative testing:
Captures diverse user needs and perspectives.
Identifies unintended consequences.
Improves system usability and trustworthiness.
What are Google’s design principles for price intelligence in Google Flights?
The design principles are:
Honest: Provide clear and truthful insights.
Actionable: Help users make informed decisions.
Concise yet explorable: Deliver useful summaries with deeper details available.
How does Responsible AI contribute to innovation?
Ethical development fosters:
Increased trust in AI systems.
Better adoption rates in enterprises.
Encouragement of creative, user-focused solutions that align with societal values.
Explain the core architectural concept of TensorFlow’s computation model and how it enables language and hardware portability.
TensorFlow uses a directed acyclic graph (DAG) to represent computations. This graph is a language-independent representation that allows the same model to be:
Built in Python
Stored in a saved model
Restored and executed in different languages (e.g., C++)
Run on multiple hardware platforms (CPUs, GPUs, TPUs)
This approach is analogous to Java’s bytecode and JVM, providing a universal representation that can be efficiently executed across different environments. The TensorFlow execution engine, written in C++, optimizes the graph for specific hardware capabilities, enabling flexible model deployment from cloud training to edge device inference.
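A short TensorFlow 2 illustration: wrapping Python code in tf.function traces it into a graph that can be saved and executed independently of the original Python:

```python
import tensorflow as tf

@tf.function
def scale_and_shift(x, a, b):
    # Traced into a graph of TensorFlow ops on first call.
    return a * x + b

x = tf.constant([1.0, 2.0, 3.0])
print(scale_and_shift(x, tf.constant(2.0), tf.constant(0.5)))   # [2.5 4.5 6.5]

# The concrete function exposes the underlying, language-independent graph.
concrete = scale_and_shift.get_concrete_function(
    x, tf.constant(2.0), tf.constant(0.5))
print(len(concrete.graph.as_graph_def().node), "graph nodes")
```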
Describe the TensorFlow API hierarchy and explain the significance of each layer of abstraction.
TensorFlow’s API hierarchy consists of:
1) Hardware Implementation Layer: Low-level platform-specific implementations
2) C++ API: For creating custom TensorFlow operations
3) Core Python API: Numeric processing (add, subtract, matrix multiply)
4) Python Modules: High-level neural network components (layers, metrics, losses)
5) High-Level APIs (Keras, Estimators):
Simplified model definition
Distributed training
Data preprocessing
Model compilation and training
Checkpointing and serving
The hierarchy allows developers to choose the appropriate level of abstraction, from low-level hardware manipulation to high-level model creation with minimal code.
What are tensors in TensorFlow, and how do they differ from traditional arrays?
Tensors are n-dimensional arrays of data in TensorFlow, characterized by:
Scalars (0D): Single numbers
Vectors (1D): Arrays of numbers
Matrices (2D): Rectangular arrays
3D/4D Tensors: Stacked matrices with increasing dimensions
Key differences from traditional arrays:
Can be created as constants (tf.constant) or variables (tf.Variable)
Variables allow modifiable values, critical for updating model weights
Support automatic differentiation
Designed for efficient numerical computation across different hardware
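A few illustrative lines (shapes chosen arbitrarily):

```python
import tensorflow as tf

scalar = tf.constant(3.0)                        # 0-D tensor
vector = tf.constant([1.0, 2.0, 3.0])            # 1-D tensor
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # 2-D tensor
print(scalar.shape, vector.shape, matrix.shape)  # () (3,) (2, 2)

weights = tf.Variable(tf.random.normal((2, 2)))  # mutable: updated during training
print(weights.trainable)                         # True by default
```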
Explain the concept of automatic differentiation in TensorFlow using GradientTape.
Automatic differentiation in TensorFlow allows automatic calculation of partial derivatives through:
Forward Pass: operations executed inside the GradientTape context are recorded in order
Backward Pass: tape.gradient() replays the recorded operations in reverse to:
Compute gradients using reverse-mode differentiation
Enable automatic calculation of derivatives for loss functions
The process involves:
Tracking computational graph operations
Storing operation sequence
Reversing the graph to compute gradients
Supporting custom gradient calculations for numerical stability or optimization
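A minimal example differentiating a squared-error loss with respect to a weight and a bias:

```python
import tensorflow as tf

w = tf.Variable(2.0)
b = tf.Variable(0.5)
x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([3.0, 5.0, 7.0])    # roughly y = 2x + 1

with tf.GradientTape() as tape:     # forward pass: operations are recorded
    y_pred = w * x + b
    loss = tf.reduce_mean(tf.square(y_pred - y))

# Backward pass: reverse-mode differentiation over the recorded operations.
dw, db = tape.gradient(loss, [w, b])
print(dw.numpy(), db.numpy())       # -2.0 -1.0
```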
How does TensorFlow enable model portability between cloud and edge devices?
TensorFlow facilitates model portability through:
Training models on powerful cloud infrastructure
Exporting trained models to edge devices (mobile phones, embedded systems)
Reducing model complexity for edge deployment
Enabling offline inference
Practical example: Google Translate app
Full translation model trained in the cloud
Reduced, optimized model stored on the phone
Allows offline translation
Trades some model complexity for:
Faster response times
Reduced computational requirements
Enhanced privacy
Improved user experience
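A hedged sketch of the usual path from a cloud-trained Keras model to an optimized on-device model with TensorFlow Lite (toy model; exact converter behaviour can vary between TensorFlow versions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
# ... train the model in the cloud ...

# Convert for edge deployment; optimization trades some fidelity for size and speed.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)   # shipped to the device for offline inference
```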
What is the significance of tf.Variable in TensorFlow model training?
tf.Variable is crucial for machine learning because:
Represents trainable parameters (weights, biases)
Allows modification during training
Supports assignment methods (assign, assign_add, assign_sub)
Fixes type and shape after initial construction
Enables automatic gradient computation
Tracks parameters that change during optimization processes
Key characteristics:
Mutable tensor type
Essential for updating neural network weights
Integral to gradient-based learning algorithms
Supports efficient parameter updates
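The assignment methods in a few lines:

```python
import tensorflow as tf

bias = tf.Variable([0.0, 0.0])
bias.assign([1.0, 2.0])       # replace the value (shape and dtype stay fixed)
bias.assign_add([0.5, 0.5])   # in-place add, e.g. applying an optimizer update
bias.assign_sub([0.1, 0.1])   # in-place subtract, e.g. a gradient step
print(bias.numpy())           # [1.4 2.4]
```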
Describe the shape manipulation techniques in TensorFlow for tensor transformations.
TensorFlow provides several tensor shape manipulation methods:
Stacking: Combining tensors along new dimensions
Increases tensor rank
Creates higher-dimensional representations
Slicing: Extracting specific tensor segments
Zero-indexed access
Can extract rows, columns, or specific elements
Reshaping (tf.reshape):
Changes tensor dimensions while preserving total element count
Rearranges elements systematically
Maintains data integrity across transformations
Example: 2x3 matrix can be reshaped to 3x2 by row-wise element redistribution
These techniques enable flexible data preprocessing and feature engineering in machine learning workflows.
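A few illustrative lines, including the 2x3 to 3x2 reshape mentioned above:

```python
import tensorflow as tf

a = tf.constant([1, 2, 3])
b = tf.constant([4, 5, 6])

stacked = tf.stack([a, b])               # shape (2, 3): new leading dimension
sliced = stacked[:, 1]                   # second column of each row -> [2 5]
reshaped = tf.reshape(stacked, (3, 2))   # same 6 elements, redistributed row-wise

print(stacked.shape)       # (2, 3)
print(sliced.numpy())      # [2 5]
print(reshaped.numpy())    # [[1 2] [3 4] [5 6]]
```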
Explain how TensorFlow supports distributed machine learning training.
TensorFlow supports distributed machine learning through:
High-level APIs handling distributed training complexities
Automatic device placement
Memory management across multiple devices/machines
Seamless scaling of training processes
Key distributed training capabilities:
Parallel computing across GPUs/TPUs
Synchronization of model parameters
Efficient gradient aggregation
Abstraction of low-level distributed computing details
Support for various distribution strategies
Recommended approach: Use high-level APIs (such as Keras with tf.distribute strategies, or Estimators) to manage distributed training complexity.
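A hedged sketch of synchronous data-parallel training with tf.distribute.MirroredStrategy; the model and data are toy stand-ins, and the same code runs on one device or several GPUs:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()    # data-parallel across available GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                         # variables are mirrored on each replica
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Gradients are aggregated across replicas automatically during fit().
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64, epochs=2)
```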