3 Months Free Update
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
A GPU administrator needs to virtualize AI/ML training in an HGX environment.
How can the NVIDIA Fabric Manager be used to meet this demand?
You are managing an on-premises cluster using NVIDIA Base Command Manager (BCM) and need to extend your computational resources into AWS when your local infrastructure reaches peak capacity.
What is the most effective way to configure cloudbursting in this scenario?
A system administrator manages a high-performance computing (HPC) cluster that uses an InfiniBand fabric for high-speed interconnects between nodes. Researchers report unusually slow data transfer rates between two specific compute nodes, and the administrator needs to verify that the path between these two nodes is optimal.
What command should be used?
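As one illustration of path inspection, the sketch below assumes the `ibtracert` utility from the infiniband-diags package, which traces the route between a source and a destination address (LIDs by default). The helper name `build_ibtracert` and the LID values 4 and 16 are hypothetical. The code only assembles the argument list and does not contact the fabric.

```python
import shlex


def build_ibtracert(src_lid, dst_lid):
    """Return the argv that traces the fabric path between two LIDs."""
    # ibtracert takes the source address first, then the destination address.
    return ["ibtracert", str(src_lid), str(dst_lid)]


if __name__ == "__main__":
    # Print the command instead of running it, since the tool may not be installed.
    print(shlex.join(build_ibtracert(4, 16)))
```

On a host with infiniband-diags installed, the resulting list could be passed to `subprocess.run` to list each hop between the two nodes.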
You are managing a high-performance computing environment. Users have reported storage performance degradation, particularly during peak usage hours when both small metadata-intensive operations and large sequential I/O operations are being performed simultaneously. You suspect that the mixed workload is causing contention on the storage system.
Which of the following actions is most likely to improve overall storage performance in this mixed workload environment?
Your organization is running multiple AI models on a single A100 GPU using MIG in a multi-tenant environment. One of the tenants reports a performance issue, but you notice that other tenants are unaffected.
What feature of MIG ensures that one tenant's workload does not impact others?
A Slurm user needs to submit a batch job script for execution tomorrow.
Which command should be used to complete this task?
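As an illustration of deferred submission, the sketch below assumes Slurm's `sbatch` command and its `--begin` option, which accepts the keyword `tomorrow` among other time formats. The helper name `build_deferred_sbatch` and the script name `job.sh` are hypothetical. The code only assembles the argument list and does not submit anything.

```python
import shlex


def build_deferred_sbatch(script, begin="tomorrow"):
    """Return the argv for submitting `script` with a deferred start time."""
    # --begin asks Slurm not to start the job before the given time.
    return ["sbatch", f"--begin={begin}", script]


if __name__ == "__main__":
    # Print the command instead of running it, since sbatch may not be installed.
    print(shlex.join(build_deferred_sbatch("job.sh")))
```

The same helper accepts other `--begin` values, such as `now+1hour` or an explicit timestamp, through the `begin` argument.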
An organization only needs basic network monitoring and validation tools.
Which UFM platform should they use?
A cloud engineer is looking to deploy a digital fingerprinting pipeline using NVIDIA Morpheus and the NVIDIA AI Enterprise Virtual Machine Image (VMI).
Where would the cloud engineer find the VMI?
You are managing a high availability (HA) cluster that hosts mission-critical applications. One of the nodes in the cluster has failed, but the application remains available to users.
What mechanism is responsible for ensuring that the workload continues to run without interruption?
You are monitoring the resource utilization of a DGX SuperPOD cluster using NVIDIA Base Command Manager (BCM). The system is experiencing slow performance, and you need to identify the cause.
What is the most effective way to monitor GPU usage across nodes?
A system administrator is troubleshooting a Docker container that crashes unexpectedly due to a segmentation fault. They want to generate and analyze core dumps to identify the root cause of the crash.
Why would generating core dumps be a critical step in troubleshooting this issue?
An administrator requires full access to the NGC Base Command Platform CLI.
Which command should be used to accomplish this action?