NVIDIA AI Final Flashcards

2
Q

Unit 4. Employs algorithms and statistical models that enable computer systems to find patterns in massive amounts of data, then uses a model that recognizes those patterns to make predictions or descriptions about new data. Is this Deep Learning, Machine Learning, Neural Network, or Deep Neural Network learning?

A

Unit 4. Machine Learning

3
Q

Unit 4. This type of framework is an essential tool for data scientists: an interface, library, or tool for computer vision, natural language processing, speech and audio processing, robot learning, and more. What kind of framework is this? An AI, ML, DNN, or MDL framework?

A

Unit 4. Machine and Deep Learning (MDL) Frameworks

4
Q

Unit 4. PyTorch Geometric, DGL, and others rely on libraries such as cuDNN, NCCL, and DALI to deliver high-performance, accelerated training. What type of accelerated training is this? Deep Learning, machine-accelerated, GPU-accelerated, or AI?

A

Unit 4. GPU-accelerated training

5
Q

Unit 4. This type of framework offers building blocks for designing, training, and validating deep neural networks through a high-level programming interface; widely used examples include PyTorch and TensorFlow. Is this an AI, DL, DNN, or ML framework?

A

Unit 4. Deep Learning Frameworks
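A minimal sketch of the building blocks such a framework provides, using PyTorch; the layer sizes and toy data below are illustrative assumptions, not from the course:

```python
# Minimal sketch: defining and training a tiny network in PyTorch,
# one of the deep learning frameworks named above.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 4)          # toy batch of 8 samples (illustrative)
y = torch.randint(0, 2, (8,))  # toy labels

for _ in range(10):            # a few training steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()            # backpropagation
    optimizer.step()           # update weights
```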

6
Q

Unit 4. A subclass of Machine Learning that uses neural networks to train a model on very large datasets, in the range of terabytes or more of data. Is the answer the Machine Learning, AI, Deep Learning, or Deep Neural Network approach?

A

Unit 4. Deep Learning Approach

7
Q

Unit 4. This type of neural network model is an algorithm that mimics the human brain in understanding complex patterns. Once trained, it can make predictions on new images. What type of neural network model is this?

A

Unit 4. Deep Neural Network Model

8
Q

Unit 4. What is this type of training data? It is a set of data with “_ _ _ _ _” that help the neural network learn. These “_ _ _ _ _” can be the objects in the images: cars, trucks, cranes. The errors that the classifier makes on the training data are used to incrementally improve the network structure.

…Name this type of training data (- - - - - -)

A

Unit 4. ‘Labels’ as in labeled Training Data

9
Q

Unit 4. Once the neural network based model is trained, it can make this type of “prediction” on new images. Once trained, the network and classifier are deployed against previously unseen data, which is not labeled. If the training was done correctly, the network will be able to apply its feature representation to correctly classify similar classes in different situations. These “predictions” are also referred to as a certain “class.”

A

Unit 4. Object Class Predictions

10
Q

Unit 4. A modern open-source machine/deep learning framework used to train and deploy deep neural networks. It is scalable, allowing for fast model training, and supports a flexible programming model and multiple languages. This type of library is portable and can scale to multiple GPUs and multiple machines.

A

Unit 4. Machine and Deep Learning Frameworks - MXNet
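A minimal sketch of MXNet's high-level Gluon API; the GPU fallback and toy shapes are illustrative assumptions:

```python
# Minimal sketch: MXNet NDArray plus the Gluon high-level API.
import mxnet as mx
from mxnet.gluon import nn

# Use a GPU if one is visible, otherwise fall back to CPU.
ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

net = nn.Dense(2)        # a single dense layer
net.initialize(ctx=ctx)

x = mx.nd.random.uniform(shape=(8, 4), ctx=ctx)
print(net(x).shape)      # (8, 2)
```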

11
Q

Unit 4. Machine/Deep Learning Frameworks. This free machine learning library (framework) for the Python programming language features various classification, regression, and clustering algorithms. Choose MXNet, scikit-learn, or TensorFlow.

A

Unit 4. Machine and Deep Learning Frameworks - scikit-learn,
…which is designed to interoperate with the Python numerical and scientific libraries.
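A minimal sketch of the classification workflow scikit-learn provides; the iris dataset and k-NN classifier are arbitrary illustrative choices:

```python
# Minimal sketch: train/test split and a classifier in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```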

12
Q

Unit 4. This is an essential tool for data scientists among the machine and deep learning frameworks. It is also a popular open-source software library (framework) for dataflow programming across a range of tasks. It is a symbolic math library and is commonly used for deep learning applications.
Is it MXNet, scikit-learn, or TensorFlow?

A

Unit 4. Machine and Deep Learning Frameworks - TensorFlow
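A minimal sketch of TensorFlow's dataflow and symbolic-math style; the tensors are illustrative only:

```python
# Minimal sketch: tensor ops and automatic differentiation in TensorFlow.
import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.Variable(tf.ones((2, 1)))

with tf.GradientTape() as tape:          # records ops for autodiff
    y = tf.reduce_sum(tf.matmul(x, w))
grad = tape.gradient(y, w)               # dy/dw
print(grad.numpy())
```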

13
Q

Unit 4. This NVIDIA software stack comprises the host OS and NVIDIA driver, NGC containers, and DL frameworks.

A

Unit 4. Nvidia Deep Learning Software Stack

14
Q

Unit 4. This layer of the NVIDIA Deep Learning Software Stack enables the deep learning framework to use the GPU functions.

A

Unit 4. Host OS and NVIDIA Driver

15
Q

Unit 4. These publicly available containers are optimized to run on NVIDIA GPUs in the NVIDIA Deep Learning Software Stack.

A

Unit 4. NGC Containers

16
Q

Unit 4. This popular type of framework is available inside the containers of the NVIDIA Deep Learning Software Stack. Is it ML, AI, DL, or DNN?

A

Unit 4. DL (Deep Learning) Frameworks

17
Q

Unit 4. NVIDIA Deep Learning Software Stack: the name of NVIDIA's groundbreaking parallel programming model that provides essential optimizations for deep learning.

A

Unit 4. CUDA

18
Q

Unit 4. Accelerate data preparation, model training, and visualization with this type of software stack.

A

Unit 4 Machine Learning Software Stack

19
Q

Unit 4. Machine Learning Software Stack: the columnar, in-memory data structure “_ _ _ _ _ _” Arrow. Name it.

A

Unit 4. Apache Arrow (Machine Learning Software Stack), which delivers efficient and fast data interchange with the flexibility to support complex data models.
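A minimal sketch of that columnar in-memory layout using pyarrow, the Python bindings for Apache Arrow; the column names are illustrative:

```python
# Minimal sketch: a columnar, in-memory Arrow table.
import pyarrow as pa

table = pa.table({
    "vehicle": ["car", "truck", "crane"],
    "count":   [12, 5, 2],
})
print(table.schema)                       # column-oriented layout
print(table.column("count").to_pylist())  # columnar access by name
```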

20
Q

Unit 4. A suite of open-source software libraries and APIs for executing end-to-end data science and analytics pipelines entirely on GPUs, which can reduce training times from days to minutes. Built on NVIDIA® CUDA-X AI.

A

Unit 4. RAPIDS (Machine Learning Software Stack)

21
Q

(Unit 4) A framework and collection of graph analytics libraries that seamlessly integrates into the RAPIDS data science platform.

A

Unit 4. cuGraph (Machine Learning Software Stack), part of the NVIDIA GPU software ecosystem.
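A minimal sketch, assuming a RAPIDS installation and an NVIDIA GPU; the three-edge graph is illustrative:

```python
# Minimal sketch: PageRank on a tiny graph with cuGraph.
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 0]})

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

print(cugraph.pagerank(G))  # one score per vertex, computed on the GPU
```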

22
Q

Unit 4. A dataframe manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulating data for model-training data preparation. Is it Dask, cuDF, cuML, or cuDNN?

A

Unit 4. cuDF (Machine Learning Software Stack)
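A minimal sketch, assuming a RAPIDS installation; cuDF deliberately mirrors the pandas API, with the work running on the GPU:

```python
# Minimal sketch: loading, filtering, and aggregating with cuDF.
import cudf

df = cudf.DataFrame({"label": ["car", "truck", "car"],
                     "score": [0.9, 0.4, 0.7]})

filtered = df[df["score"] > 0.5]         # GPU-accelerated filtering
print(filtered.groupby("label").mean())  # GPU-accelerated aggregation
```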

23
Q

Unit 4. A collection of GPU-accelerated machine learning libraries that provides GPU versions of the machine learning algorithms available in scikit-learn, including KNN, k-means, random forest, and regressions. Is it RAPIDS, cuML, Dask, or Python?

A

Unit 4. cuML (Machine Learning Software Stack)
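A minimal sketch, assuming a RAPIDS installation; cuML mirrors scikit-learn's estimator API on the GPU, and the data shape and cluster count here are illustrative:

```python
# Minimal sketch: k-means clustering with cuML.
import cupy as cp
from cuml.cluster import KMeans

X = cp.random.rand(1000, 8, dtype=cp.float32)  # data already on the GPU

km = KMeans(n_clusters=4).fit(X)
print(km.cluster_centers_[:2])
```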

24
Q

Unit 4. Gives users the ability to run jobs in the MapReduce style of programming, which allows pipelines to stage data in main memory when everything doesn't fit in GPU memory. Is it cuML, cuDF, Dask, or cuGraph?

A

Unit 4. Dask (Machine Learning Software Stack)
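A minimal sketch of Dask's partitioned, MapReduce-style execution; the toy dataframe is illustrative:

```python
# Minimal sketch: staging data in main memory as lazy partitions with Dask.
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(1_000_000), "y": [1, 2] * 500_000})
ddf = dd.from_pandas(pdf, npartitions=8)  # 8 lazy partitions

result = ddf.groupby("y")["x"].mean()     # builds a task graph
print(result.compute())                   # executes it map-reduce style
```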

25
Q

Unit 4. Developers use this simple programming language to develop models using the libraries above.

A

Unit 4. Python

26
Q

Unit 4. This is a collection of more than 13 software acceleration libraries built on top of CUDA. It increases productivity.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X AI

27
Q

Unit 4. NVIDIA Deep Learning Software Stack library for accelerating deep learning primitives. Is this cuDL, cuDNN, cuML, or Python?

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X-AI - cuDNN

28
Q

Unit 4. Accelerating data science workflows and machine learning algorithms. Name this part of the NVIDIA CUDA-X AI ecosystem.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - CUDA-X-AI - cuML

29
Q

Unit 4. NVIDIA Deep Learning Software Stack component for optimizing trained models for inference.

A

Unit 4. NVIDIA CUDA-X AI Ecosystem - NVIDIA TensorRT
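A minimal sketch, assuming the TensorRT Python bindings are installed; it shows only the entry points for building an optimized engine, not a full conversion pipeline:

```python
# Minimal sketch: the starting points of a TensorRT build.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network()        # populated via parsers/APIs
config = builder.create_builder_config()  # precision, workspace, etc.
# A trained model (e.g., ONNX) would be parsed into `network`,
# then built into a serialized engine for deployment.
```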

30
Q

Unit 4. NVIDIA Deep Learning Software Stack: this CUDA-X AI library provides dataframe manipulation. Name this part of the NVIDIA DL software stack.

A

Unit 4. Nvidia CUDA-X AI Ecosystem -CUDA-X-AI-NVIDIA cuDF

31
Q

Unit 4. NVIDIA Deep Learning Software Stack component for performing high-performance analytics on graphs.

A

Unit 4. Nvidia CUDA-X AI Ecosystem-CUDA-X-AI-NVIDIA cuGraph

32
Q

Unit 4. NVIDIA Deep Learning Software Stack: cuDNN, cuML, TensorRT, cuDF, and cuGraph work seamlessly with this NVIDIA product to accelerate the development and deployment of AI-based applications.
Is it DL, ML, AI, or Tensor Core?

A

Unit 4. NVIDIA CUDA-X AI Ecosystem - NVIDIA Tensor Core GPUs

33
Q

Unit 4. This part of the NVIDIA CUDA-X AI ecosystem spans desktops, workstations, servers, cloud computing deployments, and software acceleration libraries.

A

Unit 4. Nvidia CUDA-X AI Ecosystem - Frameworks, Cloud ML, Deployments

34
Q

Unit 1. Clinical care, operational efficiency (no-shows), precision medicine (radiomics, vs. one-size-fits-all), drug discovery (monitoring). Is it ML, DL, or AI?

A

Unit 1. AI in Healthcare

35
Q

Applications include Radiomics (biomarker), At-Risk Patients, Medical billing, Disease/Genetic correlation, Medical Transcription, Drug Interactions, Cancer Detection.

A

Applications of AI in Healthcare

36
Q

A broad field of study focused on using computers to do things that require human-level intelligence

A

Artificial Intelligence

37
Q

An approach that uses statistical learning algorithms.

A

Machine Learning

38
Q

A technique inspired by how human beings learn.

A

Deep Learning

39
Q

Where computations can run on CPU cores and on GPUs.

A

Compute Nodes (AI Cluster Components)

40
Q

Where data is stored

A

Storage Nodes (AI Cluster Components)

41
Q

6.1 These types of nodes in a multi-system AI cluster are used for system monitoring, provisioning, and troubleshooting. Services required can include user authentication, network proxies, workload, data, fabric, and system management and monitoring, plus general user access and services.

Tip: Containerization tools such as Docker are often used to separate and manage services. Reliable, resilient, and robust servers are often required to ensure a highly available system.

A

6.1 Management Nodes (AI Cluster Components)

42
Q

6.1 Which AI cluster network connects compute nodes, storage nodes, and management network services, and is also used specifically when the nodes are powered off? (In-band, out-of-band, rubber band, GUI, or IPMI networking)

A

6.1 Out of Band Networking (AI Cluster Components)

43
Q

6.1 These nodes are GPU-based servers that provide most of the computational resources and are more power-efficient. All other components must keep up, sharing data across multiple systems and multiple users.

A

6.1 Compute nodes (AI Cluster)

44
Q

6.1 These nodes provide the functionality that turns a rack of servers into a system. Services required include user authentication, network proxies, workload, data, fabric, and system management and monitoring. Is it Network, Switch, Storage, or Out-of-band?

A

6.1 Management Nodes (AI Cluster)

45
Q

Connects Compute Nodes

A

Compute Network (AI Cluster)

46
Q

Connects storage nodes

A

Storage Network (AI Cluster)

47
Q

Used by all services necessary for system to operate

A

Management Network (AI Cluster)

48
Q

Provides best practices for designing systems for AI workloads: proven designs that organizations can leverage for their own needs, as well as a recipe for getting started. (Model, Container, or Reference?)

A

Reference Architectures

49
Q

This NVIDIA DGX reference architecture comprises two to eight DGX A100 systems, compute servers, and storage from NVIDIA partners.

A

NVIDIA DGX POD

50
Q

This NVIDIA DGX solution uses configurations starting with 20 systems, infused with NVIDIA expertise, and is designed to support the widest range of DL and HPC workloads.

A

NVIDIA DGX SuperPOD

51
Q

(Unit 6.2) Training DL and ML models requires massive datasets to obtain high accuracy, and this increase in complexity leads to increased accuracy. What type of consideration for AI workloads is this? DL, ML, AI, DNN, RA, cuFL, or storage?

A

Storage for AI Workloads Unit 6.2

52
Q

6.2 Data should be visible, labeled, resilient, recallable, and reconstructable, with controls, vetting, monitoring, robustness, attention to end-user needs, high performance, sharing, and data stewardship. These are the AI characteristics of this type of system.

A

6.2 Storage Systems Characteristics for AI

53
Q

6.2 These questions should be asked when deciding on this one specific type of data solution: How often will it be accessed? How often will it be written to? How often will it be read? When will it be retired? What if there are system failures? Will this be fast storage? Is the data private? Once again, these are questions for a very particular type of data solution.

A

6.2 Deciding on a storage solution. When deciding on a storage solution, the full life cycle of the data should be considered.

54
Q

(Unit 6.2) This is a type of storage…
Simpler than a traditional shared file system
Scales storage massively (PB)
High level of data protection via data replication
Traditionally used in large cloud data storage repos
No directory structure; files are referenced by keys
Files are accessed via a REST API
Not standards-based, so applications
must be rewritten to directly access data

A

Object storage Unit 6.2 Storage considerations.

55
Q

6.2 SQL, NoSQL, and SQL-like databases, with unique performance characteristics and access methods. Not as general as other file system types; they belong in this category of data storage systems…not parallel or distributed, but “- - - - -” data storage systems.

A

6.2 Other Data Storage Systems

56
Q

6.2 This type of storage file system (in a data hierarchy) can share data, group servers, and scale out, and it can offer the highest read and write speeds. It is not NFS or local. It is:

A

6.2 Parallel and distributed file systems (storage systems data hierarchy)

57
Q

6.2 This file system can provide a local-like view of data to a group of servers, often accomplished using open, standards-based protocols. However, it is not local or parallel. It most often uses remote SSH commands.

A

6.2 NFS Storage systems data hierarchy

58
Q

6.2 This type of file system is fast, with strong performance, and simple; it is not shared and not NFS.

A

6.2 Local FS

59
Q

6.2 AI apps need large storage that is read-IO focused, but it should also have good write-IO performance. This approach is key to this type of storage solution.

A

6.2 Overall Storage Solution

60
Q

6.2 Each piece of this simpler type of data hierarchy has an important role to play in storing AI data and models.

A

6.2 Storage Hierarchy

61
Q

6.2 These pieces can all be combined into this type of “- - - - -” tiered approach to balance the strengths of the storage hierarchy.

A

6.2 Multi-Tiered Storage Hierarchy

62
Q

6.2 Training ML and DL models reads data repeatedly and randomly; optimize storage for read and re-read; both read IO and write IO matter; models keep getting larger; storage with fast read IO that caches data read many times offers the best performance. These are all key characteristics for understanding this type of IO access.

A

6.2 Understanding Data Access

63
Q

6.2 Data is the most important asset, and many differences are shared for this type of consideration: DL training reads and re-reads data; read rate and IO matter; model sizes grow; many models train at the same time; ample storage is needed; partners can reduce deployment time. These are all a certain type of data consideration.

A

6.2 Storage Considerations

64
Q

6.2q Bandwidth, IOPS, and metadata ops

A

6.2q Storage Performance

65
Q

6.2q Controls should be in place for data stewardship; data should be shared; high performance.

A

6.2q Key Characteristics of Storage Systems for AI

66
Q

6.2q Data records are repeatedly accessed in random order… read
Models get larger in size… write
Many models are trained at the same time… needs

A

Fast read IO with cache
Fast write IO
Amplified storage needs

67
Q

Compute, Storage, In-Band Management, Out-of-Band Management

A

AI Cluster Networks

68
Q

6.3 This type of “- - - - - - -” network, as it relates to AI cluster networks, maximizes the performance of AI workloads. It is also designed to minimize system bottlenecks, provide redundancy in the event of hardware failure, and minimize costs.

A

6.3 Compute Network

69
Q

6.3 This type of network, as it relates to AI cluster networks, provides high-throughput access to shared storage and high-bandwidth capabilities with advanced fabric-management features, and provides significant benefits for the storage fabric. This is a “s - - - - - -” network.

A

6.3 Storage Network

70
Q

6.3 This type of network, as it relates to AI cluster networks, provides connectivity to the management nodes: SSH, DNS, NFS, code repositories. It's the primary network for everything that isn't related to inter-job communication or high-speed storage access. This is the “– - —-” Management Network.

A

6.3 In Band Network

71
Q

6.3 This type of network, related to AI clusters, provides remote management functions even if servers are off or not reachable in-band: remote power control, remote serial console, temperature and power sensors, on a separate network.

A

6.3 Out of Band Network

72
Q

6.3 This is one important characteristic regarding the GPU. The network is crucial for maximum GPU acceleration: as datasets grow, the IO bottleneck between storage and GPU memory increases, and to maximize GPU acceleration, data must always be available. Because the GPU has so many processing elements, this can become a challenge; CPU, memory, and network speeds must keep up. The GPU must be kept “- - - -”.

A

6.3 Importance of keeping GPU busy

73
Q

This type of training requires an exponential increase in compute; a single AI interaction or experience can require only a small amount of compute.

A

Multi-node GPU-accelerated AI training

74
Q

Uses more than one GPU at large scale: thousands of nodes, multiple nodes, higher performance. Network technology matters, as do the speed and latency of the network, and it controls data transfer. Used in engineering, physics, and genetics.

A

Multi-node GPU-accelerated AI

75
Q

Network topology, bandwidth and latency, network protocols, data transfer techniques, and management tools are key factors when determining this.

A

Performance

76
Q

6.3 Access the management functions of individual devices remotely, determine the status of any network component, minimize downtime, and provide support without a technician on site.

A

6.3 Out of Band Management

77
Q

6.3 Designed for the DGX A100 system. This type of management controller is a specialized microcontroller embedded on the motherboard of a computer, generally a server. It manages the interface between system management software and platform hardware, and is used for monitoring and controlling various hardware devices on the system, like system sensors and other parameters.

A

6.3 Out of Band Management (BMC) Baseboard Management Controller.

78
Q

6.3 Gives you the ability to manage servers in remote physical locations, regardless of the OS: before an OS has booted, when the system is powered down, or after an OS failure. There is no need for remote login using SSH commands to configure things like IP addresses.

A

Data Center Network IPMI (Unit 6.3)
Intelligent Platform Management Interface

79
Q

(Unit 6.3) This technology features high throughput and low latency with low processing overhead, with standards set by the IBTA: Upper, Transport, Network, Link, and Physical layers. EDR, HDR, NDR (100, 200, 400 Gbps). Low latency and high throughput for HPC, cloud, data center, and AI. RDMA, DMA, Host Channel Adapters.

A

InfiniBand (Unit 6.3)

80
Q

(Unit 6.3) This protocol allows for efficient data transfer and compute; reduces power, cooling, and space; supports message passing and sockets; is supported by all major OSes; requires no CPU intervention; hardware offload.

A

Remote Direct Memory Access (Unit 6.3)

81
Q

(Unit 6.3) The ability of a device to access host memory directly, without the intervention of the CPU. This is not IPMI-related.

A

Direct Memory Access (Infiniband Unit 6.3)

82
Q

(Unit 6.3) The predominant LAN technology (1979, IEEE). It describes how network devices format and transmit data to other devices; speeds have increased to 400 Gbps; broad range of applications.

A

Ethernet for AI Workloads (Unit 6.3)

83
Q

(Unit 6.3) This is an open networking technology, but it is not InfiniBand. It accelerates AI, storage, and big data over Ethernet networks; the OS is bypassed for low latency. InfiniBand RDMA is superior to this networking technology. It is not NFS.

A

RoCE (pronounced “Rocky”) (Unit 6.3)

84
Q

(Unit 6.3) Direct communication between GPUs across remote systems: better performance, higher ROI, 2.5x performance, better scaling.

A

NVIDIA GPUDirect RDMA (Unit 6.3)

85
Q

(Unit 6.3) This NVIDIA product provides high-performance SmartNICs, with all speeds from 10 Gb to 400 Gb Ethernet connectivity: software-defined, hardware-accelerated networking.

A

Nvidia ConnectX (Unit 6.3)

86
Q

(Unit 6.3) This NVIDIA hardware product is a fully programmable DPU that accelerates networking, storage, and security: powerful Arm cores, advanced hardware accelerations, 200 Gb/s Ethernet and InfiniBand.

A

NVIDIA BlueField (Unit 6.3)

87
Q

(Unit 6.3) A built-for-scale Ethernet switch: easiest AI configuration, best telemetry, highest operational efficiency, high performance, predictable QoS.

A

Nvidia Spectrum (Unit 6.3)

88
Q

(Unit 6.3) InfiniBand switch: HDR 200 Gbps, full transport offload, in-network computing, RDMA, GPUDirect Storage (GDS), adaptive routing, congestion control, and QoS.

A

Nvidia Quantum (Unit 6.3)

89
Q

(Unit 6.3) Copper direct-attach cables of unmatched quality: DAC splitter cables and adapters, active optical cables, multi-mode and single-mode transceivers.

A

6.3 Nvidia LinkX

90
Q

6.4 This is a type of “architecture” and best-practices guide covering dense computing, multiple server types, fabrics, storage, and management systems, for achieving maximum performance, minimal bottlenecks, and best-of-breed systems; the design provides high performance cost-effectively.

A

6.4 Reference Architectures Best Practices

91
Q

6.4 The benefits of this type of “architecture” include solving known problems, offering a solution that can be tailored, providing a roadmap for quick deployment, and capturing lessons learned for design and deployment.

A

6.4 Benefits of Reference Architectures

92
Q

6.4 A blueprint for designing AI clusters of up to 8 nodes. This piece of NVIDIA DGX architecture specifically covers compute, network, power, cooling, and architecture with storage.

A

6.4 Nvidia DGX POD

93
Q

6.4 This NVIDIA DGX “Reference Architecture” consists of four DGX A100s with full bisection bandwidth, 2 InfiniBand switches for balance and growth, and 2 storage switches on the back end. Management servers and PDUs are also defined by the specific partner. Name this NVIDIA DGX “RA.”

A

6.4 Nvidia DGX POD Reference Architecture

94
Q

6.4 This NVIDIA DGX product is
the world's fastest turnkey AI solution: no complexity, optimized software, a full-stack offering, and white-glove service (ramp-up: plan/deploy, ramp/optimize).

A

6.4 NVIDIA DGX SuperPOD

95
Q

6.4 Name this NVIDIA DGX solution. Here is the layout for this device:
Compute: start with 20 DGX A100 systems (100 PF), InfiniBand spine switches, management nodes, Ethernet connectivity, and full high-speed leaf switches for the compute and storage fabrics.

A

6.4 NVIDIA DGX SuperPOD layout

96
Q

6.4 White-Glove implementation

A

6.4 Ramp-Up service

97
Q

6.4 Ground Control to Major Tom. This DGX SuperPOD software is used for the following: deployment, provisioning, monitoring, alerting, Slurm, logging, resource utilization, and
analytics. Name this NVIDIA DGX SuperPOD software.

A

6.4 Nvidia Base Command Manager

98
Q

A portable unit of software that combines the application and all its dependencies into a single package. These enable you to focus on building AI.

A

Container

99
Q

Slurm,
Containers,
Kubernetes

A

Key Technologies for Deployment

100
Q

Packages the app with all its dependencies so it can
move from one system to another seamlessly:
Libraries
Compilers
Network drivers
Other components

A

Containers

101
Q

The workload manager and job scheduler that manages the allocation of resources and launches jobs on a cluster.

A

Slurm

102
Q

Orchestration tool to easily deploy containers on various nodes.
Automatically spins up nodes to meet demand.

A

Kubernetes
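A minimal sketch, assuming the official `kubernetes` Python client and a configured kubeconfig; it simply lists the pods the orchestrator is managing:

```python
# Minimal sketch: querying a Kubernetes cluster from Python.
from kubernetes import client, config

config.load_kube_config()  # reads ~/.kube/config
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name)
```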

103
Q

The following is an AI workflow; specifically, what type of AI workflow is it?
It allows for Model Optimization, a Data Factory, AI Model Training, and AI Model Testing.

A

AI Development Workflow

104
Q

AI model training uses this “LD” with a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on servers with Tensor Core GPUs.

A

AI model training: labeled data

105
Q

This type of factory collects raw data and includes tools used to pre-process, index, label, and manage data.

A

Data Factory

106
Q

What type of optimization, used for production deployment, is completed using the NVIDIA TensorRT optimizing inference accelerator?

A

Model Optimization

107
Q

This type of testing is used for validation; it adjusts model parameters as needed and repeats training until the desired accuracy is reached.

A

AI Model Testing

108
Q

NGC containers
Pre-Trained Models
Helm Charts
Collections
Industry

A

NGC Catalog

109
Q

This type of container offers certified images that have been scanned and is designed to support multi-GPU and multi-node workloads.
It offers pre-trained models across a variety of domains.
It allows packaging the application and its dependencies.
It does not need to be recompiled when moved from one system to another.

A

NGC Containers

110
Q

NGC catalog assets: models that have already been trained on large datasets and can be used as-is or fine-tuned for a specific use case.

A

Pre-trained models

111
Q

NGC catalog assets that define and automate the deployment of containerized applications on Kubernetes clusters.

A

Helm Charts

112
Q

NGC catalog groupings of related containers, models, charts, and other resources that make it easier to find everything needed for a given use case.

A

Collections

113
Q

NGC catalog assets organized for industry-specific applications, such as healthcare.

A

Industry

114
Q

A repository of AI models tagged with their histories and attributes; an automated “_ _” pipeline that manages datasets, models, and experiments through their lifecycle; and software containers, typically based on Kubernetes, to simplify running these jobs. What is this type of operation referred to as?

…Best practices for businesses to run AI successfully, with help from an expanding smorgasbord of software products and cloud services, plus the data sources and the datasets created from them.

A

MLOps, or Machine Learning Operations

115
Q

Which side of the on-prem vs. cloud value comparison do these traits describe?
Early exploration
Limited access to capital, low-cost start
Great resource elasticity
Modest datasets local to the cloud
Fewer experiments
Slower pace

A

Cloud (value of on-prem and cloud infrastructure)

116
Q

Deep Learning Enterprise
Avoids data privacy risks
Requires a GPU-ready data center
Large datasets local to on-premises
Frequent experiments
Rapid pace

A

On-Prem (Value of On-Prem and Cloud Infrastructure)

117
Q

Data Gravity, sovereignty and security
Maintaining lowest cost per training run
Ensuring ability to fail fast and learn faster

A

Factors to weigh (value of on-prem and cloud infrastructure)

118
Q

The tendency for processing activities to gravitate toward where the data resides.

A

Data Gravity

119
Q

Unit 5. Cloud-backed GPUs - a hybrid workflow across cloud and on-prem:
Best practice is to keep compute co-resident with where datasets live.
Weigh the benefit of the cloud's cost predictability for small-scale experiments.
The GPU platform (in the cloud or on-prem) should be purpose-built for the workload.
Watch the IO cost curve for moving datasets to compute.
The best models are created when your team can experiment without fear of cost overruns.
Ensure time-to-solution meets training objectives.

A

Unit 5 Hybridized Workflow

120
Q

A team collaboration platform for model experimentation, tuning, and optimization (for the office). Is this DGX A100 system for experimentation or training?

A

Unit 5 Experimentation DGX Station A100

121
Q

Unit 5. Effortless portability of models from experimentation to production (to the data center) via a private registry. Is this DGX A100 system for training or experimentation?

A

Unit 5 Training DGX A100 System

122
Q

Unit 5. Expanding to the DGX POD platform for training at scale.

A

Unit 5 Training at Scale Nvidia DGX POD

123
Q

Unit 5. Trained models can be used for inference on different platforms or endpoints, such as unmanned autonomous robots.

A

Unit 5. Inference Nvidia EGX

124
Q

Easily accessible platform for the initial phase of model experimentation and development.

A

Cloud Hosted Environment. Accelerating AI Value and Time.

125
Q

Unit 5. Highest-performance compute, AI/HPC data processing
AI inference and mainstream compute
Highest-performance graphics and visual computing
Highest-density virtual desktops

A

Unit 5. NVIDIA GPU Portfolio A100, A30, A40, A16

126
Q

Unit 5. Use this type of interconnect for the exponential growth in required computing capacity. Multi-GPU systems allow for near-linear performance scaling; flexible, high-bandwidth inter-GPU communication is required; the NVIDIA NVLink interconnect allows GPUs to communicate at high speeds.

A

Unit 5. NVIDIA NVLink interconnect (GPU-to-GPU connection)

127
Q

Unit 5. With AI and HPC workloads, all-to-all GPU communication is required.
NVIDIA NVSwitch technology enables direct communication between any GPU pair without bottlenecks.
Each GPU uses NVLink interconnects to communicate with all NVSwitch fabrics.

A

Unit 5. Nvidia NVSwitch Fabric

128
Q

Unit 5. This DGX hardware architecture contains six 2nd-generation NVSwitch fabrics that interconnect the A100 GPUs using the NVLink high-speed interconnect.
Each GPU uses NVLink interconnects to communicate with all six fabrics. This NVSwitch technology eliminates bottlenecks and allows up to 5 petaflops of AI performance for the next generation of AI networks. What type of DGX device is being referred to?

A

Unit 5. DGX A100 System Architecture

129
Q

7.1 Deploying this type of DIY system - basic, unoptimized, untuned, just up and running - involves how many steps and how many pages of documentation? The manual install comprises drivers, libraries, primitives, and packages.

A

DIY Standing up a Deep Learning Platform (server) in 10 steps and 380 pages (Unit 7.1)

130
Q

7.1 The steps to install this type of DIY deep learning platform include: install Linux, install CUDA, install cuDNN, install and upgrade pip, install Bazel, install TensorFlow, upgrade protobuf, install Docker, test the install, and debug and fix the install.

A

DIY Standing up a Deep Learning Platform (server) in 10 steps and 380 pages (Unit 7.1)

131
Q

7.1 These are the pre-install requirements for what type of DIY platform?
1. Verify that the system has a CUDA-capable GPU.
2. Ascertain whether the system is running a supported version of Linux.
3. Ensure the system has GCC installed.
4. Check that the system has the correct kernel headers and development packages installed.
5. Download the NVIDIA CUDA Toolkit.
6. Handle conflicting installation methods.

A

DIY Standing up a Deep Learning Platform server. Booting and Installing Compute Nodes (Unit 7.1)

132
Q

The benefits, with regard to “Infrastructure Provisioning and Management,” include increased utilization, performance scaling, and user profiles, where a typical user may use 2 GPUs but a power user could use 8 GPUs or more. These are the benefits of a…

A

Benefits of a cluster. Infrastructure Provisioning and Management (Unit 7.1)

133
Q

The following steps are necessary for this process of booting and installing compute nodes with infrastructure provisioning and management:
1. Create the admin node and configure it to act as an installation server for the compute nodes in the cluster.
2. Boot the compute nodes one by one, connecting to the admin server and launching the installation.
3. Install the job queue system on them, enabling them to work together as a high-performance cluster.
…Name this process:

A

How to get a cluster up and running (Booting and Installing Compute Nodes) Infrastructure Provisioning and Management (Unit 7.1)

134
Q

7.1 These steps are necessary when using this special type of file for booting and installing compute nodes (NVIDIA driver, CUDA Toolkit, CUDA samples):
1. Distribution-specific instructions for disabling the Nouveau driver.
2. Steps for verifying device node creation.
3. Advanced options for the installer and uninstall steps.
4. This does not include cross-platform development.
…Name this file.

A

Install Runfile for Booting and Installing Compute Nodes (Unit 7.1)

135
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) Consists of these 4 traits _ _ _ _ factory, _ _ _ _ _ training, _ _ _ _ _ testing, and _ _ _ _ _ optimization

A
1. Data Factory
2. AI Model Training
3. AI Model Testing
4. Model Optimization

136
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2): this component collects raw data and includes tools used to pre-process, index, label, and manage data.

A

7.2 Data Factory - collects raw data and includes tools used to pre-process, index, label and manage data.

137
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) AI Model Training

A

7.2 AI Model Training: labeled data using a DL framework from the NVIDIA GPU Cloud (NGC) container repository, running on servers with Tensor Core GPUs.

138
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) AI Model Testing

A

7.2 AI Model Testing: validation adjusts model parameters as needed and repeats training until the desired accuracy is reached. (Unit 7.2)

139
Q

AI Development Workflow Orchestration and Job Scheduling (Unit 7.2) Model Optimization

A

7.2 Model Optimization: production deployment (inference) is completed using the NVIDIA TensorRT optimizing inference accelerator.

140
Q

What is the term for the following workflow (Unit 7.2, AI as a service)?
1. The user defines a pipeline, each step of which uses a container, and submits it to the cluster.
2. The cluster finds resources for each step of the pipeline, spawning the necessary containers and tapping into GPUs.
3. Results are then written to disk and the user analyzes them.

A

7.2 Machine Learning Workflow…AI as a service