NVIDIA AI Final Flashcards

2
Q

Unit 4. Employs algorithms and statistical models that enable computer systems to find patterns in massive amounts of data, then uses a model that recognizes those patterns to make predictions or descriptions about new data. Is this Deep Learning, Machine Learning, Neural Network, or Deep Neural Network learning?

A

Unit 4. Machine Learning

3
Q

Unit 4. This type of framework is an essential tool for data scientists: an interface, library, or tool for computer vision, natural language processing, speech and audio processing, robot learning, and more. What kind of framework is this? An AI, ML, DNN, or MDL framework?

A

Unit 4. Machine and Deep Learning (MDL) Frameworks

4
Q

Unit 4. PyTorch Geometric, DGL, and others rely on libraries such as cuDNN, NCCL, and DALI to deliver high-performance, accelerated training. What type of accelerated training is this? Deep Learning, machine-accelerated, GPU-accelerated, or AI?

A

Unit 4. GPU-accelerated training

5
Q

Unit 4. This type of framework offers building blocks for designing, training, and validating deep neural networks through a high-level programming interface; widely used examples include PyTorch and TensorFlow. Is this an AI, DL, DNN, or ML framework?

A

Unit 4. Deep Learning Frameworks
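A minimal sketch of the building blocks such a framework provides, using PyTorch; the layer sizes and toy data below are illustrative assumptions, not from the course:

```python
# Minimal sketch: defining and training a tiny network in PyTorch,
# one of the deep learning frameworks named above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 4)          # toy batch of 8 samples (illustrative)
y = torch.randint(0, 2, (8,))  # toy labels

for _ in range(10):            # a few training steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()            # backpropagation
    optimizer.step()           # update weights
```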

6
Q

Unit 4. A subclass of Machine Learning that uses neural networks to train a model on very large datasets, in the range of terabytes or more of data. Is the answer the Machine Learning, AI, Deep Learning, or Deep Neural Network approach?

A

Unit 4. Deep Learning Approach

7
Q

Unit 4. This type of neural network model is an algorithm that mimics the human brain in understanding complex patterns. Once trained, it can make predictions on new images. What type of neural network model is this?

A

Unit 4. Deep Neural Network Model

8
Q

Unit 4. What is this type of training data? It is a set of data with “_ _ _ _ _” that help the neural network learn. These “_ _ _ _ _” can be the objects in the images: cars, trucks, cranes. The errors that the classifier makes on the training data are used to incrementally improve the network structure.

…Name this type of training data (- - - - - -)

A

Unit 4. ‘Labels’ as in labeled Training Data

9
Q

Unit 4. Once the neural network based model is trained, it can make this type of “prediction” on new images. Once trained, the network and classifier are deployed against previously unseen data, which is not labeled. If the training was done correctly, the network will be able to apply its feature representation to correctly classify similar classes in different situations. These “predictions” are also referred to as a certain “class.”

A

Unit 4. Object Class Predictions

10
Q

Unit 4. A modern open-source machine/deep learning framework used to train and deploy deep neural networks. It is scalable, allowing for fast model training, and supports a flexible programming model and multiple languages. This type of library is portable and can scale to multiple GPUs and multiple machines.

A

Unit 4. Machine and Deep Learning Frameworks - MXNet
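A minimal sketch of MXNet's high-level Gluon API; the GPU fallback and toy shapes are illustrative assumptions:

```python
# Minimal sketch: MXNet NDArray plus the Gluon high-level API.
import mxnet as mx
from mxnet.gluon import nn

# Use a GPU if one is visible, otherwise fall back to CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

net = nn.Dense(2)        # a single dense layer
net.initialize(ctx=ctx)

x = mx.nd.random.uniform(shape=(8, 4), ctx=ctx)
print(net(x).shape)      # (8, 2)
```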

11
Q

Unit 4. Machine/Deep Learning Frameworks. This free machine learning library (framework) for the Python programming language features various classification, regression, and clustering algorithms. Choose MXNet, scikit-learn, or TensorFlow.

A

Unit 4. Machine and Deep Learning Frameworks - scikit-learn,
…which is designed to interoperate with the Python numerical and scientific libraries.
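A minimal sketch of the classification workflow scikit-learn provides; the iris dataset and k-NN classifier are arbitrary illustrative choices:

```python
# Minimal sketch: train/test split and a classifier in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```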

12
Q

Unit 4. This is an essential tool for data scientists among the machine and deep learning frameworks. It is also a popular open-source software library (framework) for dataflow programming across a range of tasks. It is a symbolic math library and is commonly used for deep learning applications.
Is it MXNet, scikit-learn, or TensorFlow?

A

Unit 4. Machine and Deep Learning Frameworks - TensorFlow
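A minimal sketch of TensorFlow's dataflow and symbolic-math style; the tensors are illustrative only:

```python
# Minimal sketch: tensor ops and automatic differentiation in TensorFlow.
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable(tf.ones((2, 1)))

with tf.GradientTape() as tape:          # records ops for autodiff
    y = tf.reduce_sum(tf.matmul(x, w))
grad = tape.gradient(y, w)               # dy/dw
print(grad.numpy())
```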

13
Q

Unit 4. This NVIDIA software stack comprises the host OS and NVIDIA driver, NGC containers, and DL frameworks.

A

Unit 4. Nvidia Deep Learning Software Stack

14
Q

Unit 4. This layer of the NVIDIA Deep Learning Software Stack enables the deep learning framework to use the GPU functions.

A

Unit 4. Host OS and NVIDIA Driver

15
Q

Unit 4. These publicly available containers are optimized to run on NVIDIA GPUs in the NVIDIA Deep Learning Software Stack.

A

Unit 4. NGC Containers

16
Q

Unit 4. This popular type of framework is available inside the containers of the NVIDIA Deep Learning Software Stack. Is it ML, AI, DL, or DNN?

A

Unit 4. DL (Deep Learning) Frameworks

17
Q

Unit 4. NVIDIA Deep Learning Software Stack: the name of NVIDIA's groundbreaking parallel programming model that provides essential optimizations for deep learning.

A

Unit 4. CUDA

18
Q

Unit 4. Accelerate data preparation, model training, and visualization with this type of software stack.

A

Unit 4 Machine Learning Software Stack

19
Q

Unit 4. Machine Learning Software Stack: the columnar, in-memory data structure “_ _ _ _ _ _” Arrow. Name it.

A

Unit 4. Apache Arrow (Machine Learning Software Stack), which delivers efficient and fast data interchange with the flexibility to support complex data models.
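A minimal sketch of that columnar in-memory layout using pyarrow, the Python bindings for Apache Arrow; the column names are illustrative:

```python
# Minimal sketch: a columnar, in-memory Arrow table.
import pyarrow as pa

table = pa.table({
    "vehicle": ["car", "truck", "crane"],
    "count":   [12, 5, 2],
})
print(table.schema)                       # column-oriented layout
print(table.column("count").to_pylist())  # columnar access by name
```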

20
Q

Unit 4. A suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, which can reduce training times from days to minutes. Built on NVIDIA® CUDA-X AI.

A

Unit 4. RAPIDS (Machine Learning Software Stack)

21
Q

(Unit 4) A framework and collection of graph analytics libraries that seamlessly integrates into the RAPIDS data science platform.

A

Unit 4. cuGraph (Machine Learning Software Stack), part of the NVIDIA GPU software ecosystem.
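A minimal sketch, assuming a RAPIDS installation and an NVIDIA GPU; the three-edge graph is illustrative:

```python
# Minimal sketch: PageRank on a tiny graph with cuGraph.
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 0]})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

print(cugraph.pagerank(G))  # one score per vertex, computed on the GPU
```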

22
Q

Unit 4. A dataframe manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulating data for model-training data preparation. Is it Dask, cuDF, cuML, or cuDNN?

A

Unit 4. cuDF (Machine Learning Software Stack)
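A minimal sketch, assuming a RAPIDS installation; cuDF deliberately mirrors the pandas API, with the work running on the GPU:

```python
# Minimal sketch: loading, filtering, and aggregating with cuDF.
import cudf

df = cudf.DataFrame({"label": ["car", "truck", "car"],
                     "score": [0.9, 0.4, 0.7]})

filtered = df[df["score"] > 0.5]         # GPU-accelerated filtering
print(filtered.groupby("label").mean())  # GPU-accelerated aggregation
```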

23
Q

Unit 4. A collection of GPU-accelerated machine learning libraries that provides GPU versions of the machine learning algorithms available in scikit-learn, including KNN, k-means, random forest, and regressions. Is it RAPIDS, cuML, Dask, or Python?

A

Unit 4. cuML (Machine Learning Software Stack)
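A minimal sketch, assuming a RAPIDS installation; cuML mirrors scikit-learn's estimator API on the GPU, and the data shape and cluster count here are illustrative:

```python
# Minimal sketch: k-means clustering with cuML.
import cupy as cp
from cuml.cluster import KMeans

X = cp.random.rand(1000, 8, dtype=cp.float32)  # data already on the GPU

km = KMeans(n_clusters=4).fit(X)
print(km.cluster_centers_[:2])
```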

24
Q

Unit 4. Gives users the ability to run jobs in the MapReduce style of programming, which allows pipelines to stage data in main memory when everything doesn't fit in GPU memory. Is it cuML, cuDF, Dask, or cuGraph?

A

Unit 4. Dask (Machine Learning Software Stack)
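A minimal sketch of Dask's partitioned, MapReduce-style execution; the toy dataframe is illustrative:

```python
# Minimal sketch: staging data in main memory as lazy partitions with Dask.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000), "y": [1, 2] * 500_000})
ddf = dd.from_pandas(pdf, npartitions=8)  # 8 lazy partitions

result = ddf.groupby("y")["x"].mean()     # builds a task graph
print(result.compute())                   # executes it map-reduce style
```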

25
Q

Unit 4. Developers use this simple programming language to develop models using the libraries above.

A

Unit 4. Python

26
Q

Unit 4. This is a collection of more than 13 software acceleration libraries built on top of CUDA. It increases productivity.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X AI

27
Q

Unit 4. NVIDIA Deep Learning Software Stack library for accelerating deep learning primitives. Is this cuDL, cuDNN, cuML, or Python?

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X-AI - cuDNN

28
Q

Unit 4. Accelerating data science workflows and machine learning algorithms. Name this part of the NVIDIA CUDA-X AI ecosystem.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X-AI - cuML

29
Q

Unit 4. NVIDIA Deep Learning Software Stack component for optimizing trained models for inference.

A

Unit 4. NVIDIA CUDA-X AI Ecosystem - NVIDIA TensorRT
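A minimal sketch, assuming the TensorRT Python bindings are installed; it shows only the entry points for building an optimized engine, not a full conversion pipeline:

```python
# Minimal sketch: the starting points of a TensorRT build.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()        # populated via parsers/APIs
config = builder.create_builder_config()  # precision, workspace, etc.
# A trained model (e.g., ONNX) would be parsed into `network`,
# then built into a serialized engine for deployment.
```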

30
Q

Unit 4. NVIDIA Deep Learning Software Stack: this CUDA-X AI library provides dataframe manipulation. Name this part of the NVIDIA DL software stack.

A

Unit 4. Nvidia CUDA-X AI Ecosystem -CUDA-X-AI-NVIDIA cuDF

31
Q

Unit 4. NVIDIA Deep Learning Software Stack component for performing high-performance analytics on graphs.

A

Unit 4. Nvidia CUDA-X AI Ecosystem-CUDA-X-AI-NVIDIA cuGraph

32
Q

Unit 4. NVIDIA Deep Learning Software Stack: cuDNN, cuML, TensorRT, cuDF, and cuGraph work seamlessly with this NVIDIA product to accelerate the development and deployment of AI-based applications.
Is it DL, ML, AI, or Tensor Core?

A

Unit 4. NVIDIA CUDA-X AI Ecosystem - NVIDIA Tensor Core GPUs

33
Q

Unit 4. This part of the NVIDIA CUDA-X AI ecosystem spans desktops, workstations, servers, cloud computing deployments, and software acceleration libraries.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - Frameworks, Cloud ML, Deployments

34
Q

Unit 1. Clinical care, operational efficiency (no-shows), precision medicine (radiomics, vs. one-size-fits-all), drug discovery (monitoring). Is it ML, DL, or AI?

A

Unit 1. AI in Healthcare

35
Q

Applications include Radiomics (biomarker), At-Risk Patients, Medical billing, Disease/Genetic correlation, Medical Transcription, Drug Interactions, Cancer Detection.

A

Applications of AI in Healthcare

36
Q

A broad field of study focused on using computers to do things that require human-level intelligence

A

Artificial Intelligence

37
Q

An approach that uses statistical learning algorithms.

A

Machine Learning

38
Q

A technique inspired by how human beings learn.

A

Deep Learning

39
Q

Where computations can run on CPU cores and on GPUs.

A

Compute Nodes (AI Cluster Components)

40
Q

Where data is stored

A

Storage Nodes (AI Cluster Components)

41
Q

6.1 These types of nodes in a multi-system AI cluster are used for system monitoring, provisioning, and troubleshooting. Services required can include user authentication, network proxies, workload, data, fabric, and system management and monitoring, plus general user access and services.

Tip: Containerization tools such as Docker are often used to separate and manage services. Reliable, resilient, and robust servers are often required to ensure a highly available system.

A

6.1 Management Nodes (AI Cluster Components)

42
Q

6.1 Which AI cluster network connects compute nodes, storage nodes, and management network services, and is also used specifically when the nodes are powered off? (In-band, out-of-band, rubber band, GUI, or IPMI networking)

A

6.1 Out of Band Networking (AI Cluster Components)

43
Q

6.1 These nodes are GPU-based servers that provide most of the computational resources and are more power-efficient. All other components must keep up, sharing data across multiple systems and multiple users.

A

6.1 Compute nodes (AI Cluster)

44
Q

6.1 These nodes provide the functionality that turns a rack of servers into a system. Services required include user authentication, network proxies, workload, data, fabric, and system management and monitoring. Is it Network, Switch, Storage, or Out-of-band?

A

6.1 Management Nodes (AI Cluster)

45
Q

Connects Compute Nodes

A

Compute Network (AI Cluster)

46
Q

Connects storage nodes

A

Storage Network (AI Cluster)

47
Q

Used by all services necessary for system to operate

A

Management Network (AI Cluster)

48
Q

Provides best practices for designing systems for AI workloads: proven designs that organizations can leverage for their own needs, as well as a recipe for getting started. (Model, Container, or Reference?)

A

Reference Architectures

49
Q

This NVIDIA DGX reference architecture comprises two to eight DGX A100 systems, compute servers, and storage from NVIDIA partners.

A

NVIDIA DGX POD

50
Q

This NVIDIA DGX solution uses configurations starting with 20 systems, infused with NVIDIA expertise, and is designed to support the widest range of DL and HPC workloads.

A

NVIDIA DGX SuperPOD

51
Q

(Unit 6.2) Training DL and ML models requires massive datasets to obtain high accuracy, and this increase in complexity leads to increased accuracy. What type of consideration for AI workloads is this? DL, ML, AI, DNN, RA, cuFL, or storage?

A

Storage for AI Workloads Unit 6.2

52
Q

6.2 Data should be visible, labeled, resilient, recallable, and reconstructable, with controls, vetting, monitoring, robustness, attention to end-user needs, high performance, sharing, and data stewardship. These are the AI characteristics of this type of system.

A

6.2 Storage Systems Characteristics for AI

53
Q

6.2 These questions should be asked when deciding on this one specific type of data solution: How often will it be accessed? How often will it be written to? How often will it be read? When will it be retired? What if there are system failures? Will this be fast storage? Is the data private? Once again, these are questions for a very particular type of data solution.

A

6.2 Deciding on a storage solution. When deciding on a storage solution, the full life cycle of the data should be considered.

54
Q

(Unit 6.2) This is a type of storage…
Simpler than a traditional shared file system
Scales storage massively (PB)
High level of data protection via data replication
Traditionally used in large cloud data storage repos
No directory structure; files are referenced by keys
Files are accessed via a REST API
Not standards-based, so applications
must be rewritten to directly access data

A

Object storage Unit 6.2 Storage considerations.

55
Q

6.2 SQL, NoSQL, and SQL-like databases, with unique performance characteristics and access methods. Not as general as other file system types; they belong in this category of data storage systems…not parallel or distributed, but “- - - - -” data storage systems.

A

6.2 Other Data Storage Systems

56
Q

6.2 This type of storage file system (in a data hierarchy) can share data, group servers, and scale out, and it can offer the highest read and write speeds. It is not NFS or local. It is:

A

6.2 Parallel and distributed file systems (storage systems data hierarchy)

57
Q

6.2 This file system can provide a local-like view of data to a group of servers, often accomplished using open, standards-based protocols. However, it is not local or parallel. It most often uses remote SSH commands.

A

6.2 NFS Storage systems data hierarchy

58
Q

6.2 This type of file system is fast, with strong performance, and simple; it is not shared and not NFS.

A

6.2 Local FS

59
Q

6.2 AI apps need large storage that is read-IO focused, but it should also have good write-IO performance. This approach is key to this type of storage solution.

A

6.2 Overall Storage Solution

60
Q

6.2 Each piece of this simpler type of data hierarchy has an important role to play in storing AI data and models.

A

6.2 Storage Hierarchy

61
Q

6.2 These pieces can all be combined into this type of “- - - - -” tiered approach to balance the strengths of the storage hierarchy.

A

6.2 Multi-Tiered Storage Hierarchy

62
Q

6.2 Training ML and DL models reads data repeatedly and randomly; optimize storage for read and re-read; both read IO and write IO matter; models keep getting larger; storage with fast read IO that caches data read many times offers the best performance. These are all key characteristics for understanding this type of IO access.

A

6.2 Understanding Data Access

63
Q

6.2 Data is the most important asset, and many differences are shared for this type of consideration: DL training reads and re-reads data; read rate and IO matter; model sizes grow; many models train at the same time; ample storage is needed; partners can reduce deployment time. These are all a certain type of data consideration.

A

6.2 Storage Considerations

64
Q

6.2q Bandwidth, IOPS, and metadata ops

A

6.2q Storage Performance

65
Q

6.2q Controls should be in place for data stewardship; data should be shared; high performance.

A

6.2q Key Characteristics of Storage Systems for AI

66
Q

6.2q Data records are repeatedly accessed in random order… read
Models get larger in size… write
Many models are trained at the same time… needs

A

Fast read IO with cache
Fast write IO
Amplified storage needs

67
Q

Compute, Storage, In-Band Management, Out-of-Band Management

A

AI Cluster Networks

68
Q

6.3 This type of “- - - - - - -” network, as it relates to AI cluster networks, maximizes the performance of AI workloads. It is also designed to minimize system bottlenecks, provide redundancy in the event of hardware failure, and minimize costs.

A

6.3 Compute Network

69
Q

6.3 This type of network, as it relates to AI cluster networks, provides high-throughput access to shared storage and high-bandwidth capabilities with advanced fabric-management features, and provides significant benefits for the storage fabric. This is a “s - - - - - -” network.

A

6.3 Storage Network

70
Q

6.3 This type of network, as it relates to AI cluster networks, provides connectivity to the management nodes: SSH, DNS, NFS, code repositories. It's the primary network for everything that isn't related to inter-job communication or high-speed storage access. This is the “– - —-” Management Network.

A

6.3 In Band Network

71
Q

6.3 This type of network, related to AI clusters, provides remote management functions even if servers are off or not reachable in-band: remote power control, remote serial console, temperature and power sensors, on a separate network.

A

6.3 Out of Band Network

72
Q

6.3 This is one important characteristic regarding the GPU. The network is crucial for maximum GPU acceleration: as datasets grow, the IO bottleneck between storage and GPU memory increases, and to maximize GPU acceleration, data must always be available. Because the GPU has so many processing elements, this can become a challenge; CPU, memory, and network speeds must keep up. The GPU must be kept “- - - -”.

A

6.3 Importance of keeping GPU busy

73
Q

This type of training requires an exponential increase in compute; a single AI interaction or experience can require only a small amount of compute.

A

Multi-node GPU-accelerated AI training

74
Q

Uses more than one GPU at large scale: thousands of nodes, multiple nodes, higher performance. Network technology matters, as do the speed and latency of the network, and it controls data transfer. Used in engineering, physics, and genetics.

A

Multi-node GPU-accelerated AI

75
Q

Network topology, bandwidth and latency, network protocols, data transfer techniques, and management tools are key factors when determining this.

A

Performance

76
Q

6.3 Access the management functions of individual devices remotely, determine the status of any network component, minimize downtime, and provide support without a technician on site.

A

6.3 Out of Band Management

77
Q

6.3 Designed for the DGX A100 system. This type of management controller is a specialized microcontroller embedded on the motherboard of a computer, generally a server. It manages the interface between system management software and platform hardware, and is used for monitoring and controlling various hardware devices on the system, like system sensors and other parameters.

A

6.3 Out of Band Management (BMC) Baseboard Management Controller.

78
Q

6.3 Gives you the ability to manage servers in remote physical locations, regardless of the OS: before an OS has booted, when the system is powered down, or after an OS failure. There is no need for remote login using SSH commands to configure things like IP addresses.

A

Data Center Network IPMI (Unit 6.3)
Intelligent Platform Management Interface

79
Q

(Unit 6.3) This technology features high throughput and low latency with low processing overhead, with standards set by the IBTA: Upper, Transport, Network, Link, and Physical layers. EDR, HDR, NDR (100, 200, 400 Gbps). Low latency and high throughput for HPC, cloud, data center, and AI. RDMA, DMA, Host Channel Adapters.

A

InfiniBand (Unit 6.3)

80
Q

(Unit 6.3) This protocol allows for efficient data transfer and compute; reduces power, cooling, and space; supports message passing and sockets; is supported by all major OSes; requires no CPU intervention; hardware offload.

A

Remote Direct Memory Access (Unit 6.3)

81
Q

(Unit 6.3) The ability of a device to access host memory directly, without the intervention of the CPU. This is not IPMI-related.

A

Direct Memory Access (Infiniband Unit 6.3)

82
Q

(Unit 6.3) The predominant LAN technology (1979, IEEE). It describes how network devices format and transmit data to other devices; speeds have increased to 400 Gbps; broad range of applications.

A

Ethernet for AI Workloads (Unit 6.3)

83
Q

(Unit 6.3) This is an open networking technology, but it is not InfiniBand. It accelerates AI, storage, and big data over Ethernet networks; the OS is bypassed for low latency. InfiniBand RDMA is superior to this networking technology. It is not NFS.

A

RoCE (pronounced “Rocky”) (Unit 6.3)

84
Q

(Unit 6.3) Direct communication between GPUs across remote systems: better performance, higher ROI, 2.5x performance, better scaling.

A

NVIDIA GPUDirect RDMA (Unit 6.3)

85
Q

(Unit 6.3) This NVIDIA product provides high-performance SmartNICs, with all speeds from 10 Gb to 400 Gb Ethernet connectivity: software-defined, hardware-accelerated networking.

A

Nvidia ConnectX (Unit 6.3)

86
Q

(Unit 6.3) This NVIDIA hardware product is a fully programmable DPU that accelerates networking, storage, and security: powerful Arm cores, advanced hardware accelerations, 200 Gb/s Ethernet and InfiniBand.

A

NVIDIA BlueField (Unit 6.3)

87
Q

(Unit 6.3) A built-for-scale Ethernet switch: easiest AI configuration, best telemetry, highest operational efficiency, high performance, predictable QoS.

A

Nvidia Spectrum (Unit 6.3)

88
Q

(Unit 6.3) InfiniBand switch: HDR 200 Gbps, full transport offload, in-network computing, RDMA, GPUDirect Storage (GDS), adaptive routing, congestion control, and QoS.

A

Nvidia Quantum (Unit 6.3)

89
Q

(Unit 6.3) Copper direct-attach cables of unmatched quality: DAC splitter cables and adapters, active optical cables, multi-mode and single-mode transceivers.

A

6.3 Nvidia LinkX

90
Q

6.4 This is a type of “architecture” and best-practices guide covering dense computing, multiple server types, fabrics, storage, and management systems, for achieving maximum performance, minimal bottlenecks, and best-of-breed systems; the design provides high performance cost-effectively.

A

6.4 Reference Architectures Best Practices

91
Q

6.4 The benefits of this type of “architecture” include solving known problems, offering a solution that can be tailored, providing a roadmap for quick deployment, and capturing lessons learned for design and deployment.

A

6.4 Benefits of Reference Architectures

92
Q

6.4 A blueprint for designing AI clusters of up to 8 nodes. This piece of NVIDIA DGX architecture specifically covers compute, network, power, cooling, and architecture with storage.

A

6.4 Nvidia DGX POD

93
Q

6.4 This NVIDIA DGX “Reference Architecture” consists of four DGX A100s with full bisection bandwidth, 2 InfiniBand switches for balance and growth, and 2 storage switches on the back end. Management servers and PDUs are also defined by the specific partner. Name this NVIDIA DGX “RA.”

A

6.4 Nvidia DGX POD Reference Architecture

94
Q

6.4 This NVIDIA DGX product is
the world's fastest turnkey AI solution: no complexity, optimized software, a full-stack offering, and white-glove service (ramp-up: plan/deploy, ramp/optimize).

A

6.4 NVIDIA DGX SuperPOD

95
Q

6.4 Name this NVIDIA DGX solution. Here is the layout for this device:
Compute: start with 20 DGX A100 systems (100 PF), InfiniBand spine switches, management nodes, Ethernet connectivity, and full high-speed leaf switches for the compute and storage fabrics.

A

6.4 NVIDIA DGX SuperPOD layout

96
Q

6.4 White-Glove implementation

A

6.4 Ramp-Up service

97
Q

6.4 Ground Control to Major Tom. This DGX SuperPOD software is used for the following: deployment, provisioning, monitoring, alerting, Slurm, logging, resource utilization, and
analytics. Name this NVIDIA DGX SuperPOD software.

A

6.4 Nvidia Base Command Manager

98
Q

A portable unit of software that combines the application and all its dependencies into a single package. These enable you to focus on building AI.

A

Container

99
Q

Slurm,
Containers,
Kubernetes

A

Key Technologies for Deployment

100
Q

Packages the app with all its dependencies so it can
move from one system to another seamlessly:
Libraries
Compilers
Network drivers
Other components

A

Containers

101
Q

The workload manager and job scheduler that manages the allocation of resources and launches jobs on a cluster.

A

Slurm

102
Q

Orchestration tool to easily deploy containers on various nodes.
Automatically spins up nodes to meet demand.

A

Kubernetes
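A minimal sketch, assuming the official `kubernetes` Python client and a configured kubeconfig; it simply lists the pods the orchestrator is managing:

```python
# Minimal sketch: querying a Kubernetes cluster from Python.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name)
```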

103
Q

The following is an AI workflow; specifically, what type of AI workflow is it?
It allows for Model Optimization, a Data Factory, AI Model Training, and AI Model Testing.

A

AI Development Workflow

104
Q

AI model training uses this “LD” with a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on servers with Tensor Core GPUs.

A

AI model training: labeled data

105
Q

This type of factory collects raw data and includes tools used to pre-process, index, label, and manage data.

A

Data Factory

106
Q

What type of optimization, used for production deployment, is completed using the NVIDIA TensorRT optimizing inference accelerator?

A

Model Optimization

107
Q

This type of testing is used for validation; it adjusts model parameters as needed and repeats training until the desired accuracy is reached.

A

AI Model Testing

108
Q

NGC containers
Pre-Trained Models
Helm Charts
Collections
Industry

A

NGC Catalog

109
Q

This type of container offers certified images that have been scanned and is designed to support multi-GPU and multi-node workloads.
It offers pre-trained models across a variety of domains.
It allows packaging the application and its dependencies.
It does not need to be recompiled when moved from one system to another.

A

NGC Containers

110
Q

NGC catalog assets: models that have already been trained on large datasets and can be used as-is or fine-tuned for a specific use case.

A

Pre-trained models

111
Q

NGC catalog assets that define and automate the deployment of containerized applications on Kubernetes clusters.

A

Helm Charts

112
Q

NGC catalog groupings of related containers, models, charts, and other resources that make it easier to find everything needed for a given use case.

A

Collections

113
Q

NGC catalog assets organized for industry-specific applications, such as healthcare.

A

Industry

114
Q

A repository of AI models tagged with their histories and attributes; an automated “_ _” pipeline that manages datasets, models, and experiments through their lifecycle; and software containers, typically based on Kubernetes, to simplify running these jobs. What is this type of operation referred to as?

…Best practices for businesses to run AI successfully, with help from an expanding smorgasbord of software products and cloud services, plus the data sources and the datasets created from them.

A

MLOps, or Machine Learning Operations

115
Q

Which side of the on-prem vs. cloud value comparison do these traits describe?
Early exploration
Limited access to capital, low-cost start
Great resource elasticity
Modest datasets local to the cloud
Fewer experiments
Slower pace

A

Cloud (value of on-prem and cloud infrastructure)

116
Q

Deep Learning Enterprise
Avoids data privacy risks
Requires a GPU-ready data center
Large datasets local to on-premises
Frequent experiments
Rapid pace

A

On-Prem (Value of On-Prem and Cloud Infrastructure)

117
Q

Data Gravity, sovereignty and security
Maintaining lowest cost per training run
Ensuring ability to fail fast and learn faster

A

Factors to weigh (value of on-prem and cloud infrastructure)

118
Q

The tendency for processing activities to gravitate toward where the data resides.

A

Data Gravity

119
Q

Unit 5. Cloud-backed GPUs - a hybrid workflow across cloud and on-prem:
Best practice is to keep compute co-resident with where datasets live.
Weigh the benefit of the cloud's cost predictability for small-scale experiments.
The GPU platform (in the cloud or on-prem) should be purpose-built for the workload.
Watch the IO cost curve for moving datasets to compute.
The best models are created when your team can experiment without fear of cost overruns.
Ensure time-to-solution meets training objectives.

A

Unit 5 Hybridized Workflow

120
Q

A team collaboration platform for model experimentation, tuning, and optimization (for the office). Is this DGX A100 system for experimentation or training?

A

Unit 5 Experimentation DGX Station A100

121
Q

Unit 5. Effortless portability of models from experimentation to production (to the data center) via a private registry. Is this DGX A100 system for training or experimentation?

A

Unit 5 Training DGX A100 System

122
Q

Unit 5. Expanding to the DGX POD platform for training at scale.

A

Unit 5 Training at Scale Nvidia DGX POD

123
Q

Unit 5. Trained models can be used for inference on different platforms or endpoints, such as unmanned autonomous robots.

A

Unit 5. Inference Nvidia EGX

124
Q

Easily accessible platform for the initial phase of model experimentation and development.

A

Cloud Hosted Environment. Accelerating AI Value and Time.

125
Q

Unit 5. Highest-performance compute, AI/HPC data processing
AI inference and mainstream compute
Highest-performance graphics and visual computing
Highest-density virtual desktops

A

Unit 5. NVIDIA GPU Portfolio A100, A30, A40, A16

126
Q

Unit 5. Use this type of interconnect for the exponential growth in required computing capacity. Multi-GPU systems allow for near-linear performance scaling; flexible, high-bandwidth inter-GPU communication is required; the NVIDIA NVLink interconnect allows GPUs to communicate at high speeds.

A

Unit 5. NVIDIA NVLink interconnect (GPU-to-GPU connection)

127
Q

Unit 5. With AI and HPC workloads, all-to-all GPU communication is required.
NVIDIA NVSwitch technology enables direct communication between any GPU pair without bottlenecks.
Each GPU uses NVLink interconnects to communicate with all NVSwitch fabrics.

A

Unit 5. Nvidia NVSwitch Fabric

128
Q

Unit 5. This DGX hardware architecture contains six 2nd-generation NVSwitch fabrics that interconnect the A100 GPUs using the NVLink high-speed interconnect.
Each GPU uses NVLink interconnects to communicate with all six fabrics. This NVSwitch technology eliminates bottlenecks and allows up to 5 petaflops of AI performance for the next generation of AI networks. What type of DGX device is being referred to?

A

Unit 5. DGX A100 System Architecture

129
Q

7.1 Deploying this type of DIY system - basic, unoptimized, untuned, just up and running - involves how many steps and how many pages of documentation? The manual install comprises drivers, libraries, primitives, and packages.

A

DIY Standing up a Deep Learning Platform (server) in 10 steps and 380 pages (Unit 7.1)

130
Q

7.1 The steps to install this type of DIY deep learning platform include: install Linux, install CUDA, install cuDNN, install and upgrade pip, install Bazel, install TensorFlow, upgrade protobuf, install Docker, test the install, and debug and fix the install.

A

DIY Standing up a Deep Learning Platform (server) in 10 steps and 380 pages (Unit 7.1)

131
Q

7.1 These are the pre-install requirements for what type of DIY platform?
1. Verify that the system has a CUDA-capable GPU.
2. Ascertain whether the system is running a supported version of Linux.
3. Ensure the system has GCC installed.
4. Check that the system has the correct kernel headers and development packages installed.
5. Download the NVIDIA CUDA Toolkit.
6. Handle conflicting installation methods.

A

DIY Standing up a Deep Learning Platform server. Booting and Installing Compute Nodes (Unit 7.1)

132
Q

The benefits, with regard to “Infrastructure Provisioning and Management,” include increased utilization, performance scaling, and user profiles, where a typical user may use 2 GPUs but a power user could use 8 GPUs or more. These are the benefits of a…

A

Benefits of a cluster. Infrastructure Provisioning and Management (Unit 7.1)

133
Q

The following steps are necessary for this process of booting and installing compute nodes with infrastructure provisioning and management:
1. Create the admin node and configure it to act as an installation server for the compute nodes in the cluster.
2. Boot the compute nodes one by one, connecting to the admin server and launching the installation.
3. Install the job queue system on them, enabling them to work together as a high-performance cluster.
…Name this process:

A

How to get a cluster up and running (Booting and Installing Compute Nodes) Infrastructure Provisioning and Management (Unit 7.1)

134
Q

7.1 These steps are necessary when using this special type of file for booting and installing compute nodes (NVIDIA driver, CUDA Toolkit, CUDA samples):
1. Distribution-specific instructions for disabling the Nouveau driver.
2. Steps for verifying device node creation.
3. Advanced options for the installer and uninstall steps.
4. This does not include cross-platform development.
…Name this file.

A

Install Runfile for Booting and Installing Compute Nodes (Unit 7.1)

135
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) Consists of these 4 traits _ _ _ _ factory, _ _ _ _ _ training, _ _ _ _ _ testing, and _ _ _ _ _ optimization

A
1. Data Factory
2. AI Model Training
3. AI Model Testing
4. Model Optimization

136
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2): this component collects raw data and includes tools used to pre-process, index, label, and manage data.

A

7.2 Data Factory - collects raw data and includes tools used to pre-process, index, label and manage data.

137
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) AI Model Training

A

7.2 AI Model Training: labeled data using a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on servers with Tensor Core GPUs.

138
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) AI Model Testing

A

7.2 AI Model Testing: validation adjusts model parameters as needed and repeats training until the desired accuracy is reached. (Unit 7.2)

139
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) Model Optimization

A

7.2 Model Optimization: production deployment (inference) is completed using the NVIDIA TensorRT optimizing inference accelerator.

140
Q

What is the term for the following workflow (Unit 7.2, AI as a service)?
1. The user defines a pipeline, each step of which uses a container, and submits it to the cluster.
2. The cluster finds resources for each step of the pipeline, spawning the necessary containers and tapping into GPUs.
3. Results are then written to disk and the user analyzes them.

A

7.2 Machine Learning Workflow…AI as a service