Unit 14 Orchestration, MLOps and Job Scheduling Flashcards
Objectives:
* Describe the difference between orchestration and scheduling
* Describe common tools for orchestration and scheduling
* Discuss the value of MLOps
What is K8S?
Kubernetes is an open-source container orchestration system for automating software deployment, scaling, and management. It is often used in container-based environments that need to scale to meet user demand, and is useful for inference in AI clusters.
What is the difference between orchestration and scheduling?
Orchestration:
* Container-based, designed for microservices, and adapted for AI
* Scales up/down for inferencing
* Manages entire workflows and processes
* Load balances to distribute traffic across containers
Scheduling:
* Bare-metal based, with container support, and designed for HPC
* Advanced scheduling features built in
* Assigns tasks and jobs to available resources
* Determines which hosts have available resources to run containers
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is an open-source cluster management and job scheduling system for Linux clusters, widely used by supercomputers and computing clusters around the world. It is highly scalable, fault-tolerant, and requires no kernel modifications. SLURM efficiently schedules jobs across a subset of cluster resources, including CPUs and GPUs, making it ideal for high-performance computing (HPC) tasks and AI training.
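As an illustration, a minimal SLURM batch script for a single-GPU job might look like the sketch below; the partition name and training script are hypothetical and would vary per cluster.

```shell
#!/bin/bash
# Minimal SLURM batch script (partition name and script are hypothetical)
#SBATCH --job-name=train-model   # job name shown in the queue
#SBATCH --partition=gpu          # hypothetical partition name
#SBATCH --nodes=1                # number of nodes to allocate
#SBATCH --ntasks=1               # number of tasks to launch
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --time=01:00:00          # wall-clock time limit
#SBATCH --output=%x-%j.out       # log file named jobname-jobid.out

srun python train.py             # hypothetical training script
```

The script would be submitted with `sbatch` and monitored with `squeue`; SLURM queues the job until the requested resources become available.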
What is MLOps?
MLOps is short for Machine Learning Operations. MLOps tools improve user productivity, speed up workflows, maximize resource utilization, and allow projects to scale. They also bring consistency and repeatability to AI and ML workloads.
List Kubernetes Components
Node: a server added to a K8S cluster.
Cluster: A collection of one or more servers
Container: a self-contained, deployable application
Pod: a container and associated meta-data
Volume: attached storage that can be shared in K8S
Service: a mechanism for connecting applications over a network and managing ports.
Workload management: managing workloads (applications packaged in containers) running in a cluster. In K8s, workloads are collections of objects such as Job, DaemonSet, Deployment, and CronJob.
* Job: Launches a Pod to perform a one-time task and then completes.
* DaemonSet: Ensures a copy of a Pod runs on every (or selected) node, often used for monitoring or network agents.
* Deployment: Maintains a specified number of Pod replicas and redeploys them if deleted, ensuring Pods stay in the desired state.
* CronJob: Runs Jobs on a repeating schedule, similar to cron.
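To make the workload objects concrete, a minimal Job manifest might look like the sketch below; the name and container image are illustrative.

```yaml
# Minimal Kubernetes Job: launches one Pod, runs a one-time task, then completes
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job             # illustrative name
spec:
  template:
    spec:
      containers:
      - name: hello
        image: busybox:1.36   # illustrative image
        command: ["echo", "hello from a one-time task"]
      restartPolicy: Never    # a Job's Pods should not restart on completion
  backoffLimit: 4             # retries before the Job is marked failed
```

Applying the manifest with `kubectl apply -f hello-job.yaml` creates the Job; Kubernetes launches the Pod, and the Job completes once the command exits successfully.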
NVIDIA GPU Operator
NVIDIA GPU Operator - an open-source tool that provides IT infrastructure teams with the necessary resources and tools to efficiently manage and deploy GPUs in Kubernetes environments.
Container Engine
Container Engine - virtualization software (for instance, Docker) that makes developing and deploying applications much easier by packaging applications with all necessary dependencies, configuration, system tools, and runtime.
NVIDIA Network Operator
NVIDIA Network Operator - a tool that can be installed on top of the GPU Operator to enable GPUDirect RDMA (remote direct memory access).
Name MLOps Partners
Run:ai
Paperspace
Determined AI