Practice Free NCP-AII NVIDIA AI Infrastructure Exam Questions Answers With Explanation

We at Crack4sure are committed to giving students who are preparing for the NVIDIA NCP-AII exam the most current and reliable questions. To help people study, we've made some of our NVIDIA AI Infrastructure exam materials available for free to everyone. You can take the free NCP-AII practice test as many times as you want. Answers to the practice questions are provided, and each answer is explained.


Question # 6

During HPL execution on a DGX cluster, the benchmark fails with "not enough memory" errors despite sufficient physical RAM. Which HPL.dat parameter adjustment is most effective?

Reduce the problem size while maintaining the same block size.

Set PMAP to 1 to enable process mapping.

Increase block size to 6144 to maximize GPU utilization.

Disable double-buffering via BCAST parameter.
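For context on the problem-size option: HPL factors a dense N x N matrix of double-precision values, so the matrix alone needs roughly 8 * N^2 bytes summed over all processes. The sketch below estimates that footprint. The function names and the sample values of N are illustrative assumptions, not values taken from the question.

```python
def hpl_matrix_bytes(n: int) -> int:
    """Approximate bytes needed for an N x N matrix of 8-byte doubles."""
    if n < 0:
        raise ValueError("problem size must be non-negative")
    return 8 * n * n


def hpl_matrix_gib(n: int) -> float:
    """Same estimate expressed in GiB (2**30 bytes)."""
    return hpl_matrix_bytes(n) / 2**30


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show the quadratic growth.
    for n in (100_000, 200_000):
        print(f"N={n}: ~{hpl_matrix_gib(n):.1f} GiB")
```

Because the footprint grows with the square of N, halving N cuts the matrix memory to a quarter, which is why the problem size is the first value to look at when memory runs out.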

Question # 7

After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?

Reduction of problem size (N) to accelerate computation.

MPI-aware GPU communication that reduces CPU bottlenecks and GPU idle time.

Doubling of GPU clock speeds through firmware updates and relevant configuration.

Automatic NVLink bandwidth doubling via driver updates.

Question # 8

During a multi-day NeMo burn-in, intermittent "GPU fell off bus" errors occur. Which diagnostic approach isolates hardware faults?

Enable HPL_USE_NVSHMEM for alternative memory sharing.

Run DCGM diagnostics alongside burn-in to monitor GPU health metrics.

Switch from BERT to GPT models for simpler computations.

Reduce blocksize to 500MB to lower memory pressure.

Question # 9

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

export HPL_OOC_SAFE_SIZE=4.0

export HPL_OOC_MODE=0

export HPL_OOC_NUM_STREAMS=8

export HPL_OOC_MAX_GPU_MEM=90

Question # 10

After a firmware upgrade on a DGX H100, the administrator notices that one GPU is not detected by the system. Which troubleshooting step should be performed first to identify the root cause?

Review firmware update logs and run nvsm show health to check for hardware or firmware errors on the affected GPU.

Remove the GPU from the system and replace it with a new one before any diagnostics.

Ignore the issue and proceed with production workloads if the other GPUs are operational.

Immediately re-run the firmware upgrade on all system components.

Question # 11

An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?

sudo ubuntu-drivers uninstall

sudo rm -rf /usr/lib/nvidia

sudo apt-get remove nvidia-driver-550

sudo apt-get purge nvidia-* && sudo apt-get autoremove

Question # 12

A DGX H100 system shows intermittent “Link Down” errors on a 200G DAC cable. CVT reports “No Signal” despite physical connection. What is the first hardware check?

Replace the switch’s optical transceiver with a higher-wattage model.

Reconfigure the port for 100G speeds via NVIDIA MST.

Upgrade all leaf switches to support RS-FEC.

Verify cable compatibility via the ConnectX-7 firmware validated adapters list and inspect connectors for damage.

Question # 13

During a 48-hour NeMo question-answering model burn-in test, GPU memory errors occur when processing large datasets. Which configuration strategy prevents Out-of-Memory (OOM) errors while maintaining processing efficiency?

Set blocksize="1GB" for data loading and enable RMM asynchronous allocation.

Switch from FP16 to FP32 precision for numerical stability.

Disable add_filename for Parquet files to reduce metadata.

Increase files_per_partition to 1000 for larger batch processing.

Question # 14

What is the purpose of using NCCL in verifying East-West fabric in an NVIDIA AI Factory?

Pick the 2 correct responses below.

To measure the storage network performance.

To measure the latency between GPUs.

To measure the power consumption of GPUs.

To measure bandwidth between GPUs.

Question # 15

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.

Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.

Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.

A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Question # 16

A company has a registered NGC account and their server has NGC CLI installed. What step should be taken first to gain access to NGC?

ngc config get

ngc init

ngc config set

ngc config update

Question # 17

An engineer needs to verify the current firmware versions of all components (ATF, BSP, NIC, UEFI) on a BlueField-3 DPU's BMC. Which Redfish API command provides this information?

mlxconfig -d <dev> q

curl -k -u root:<password> -X GET https://<DPU-BMC-IP>/redfish/v1/UpdateService/FirmwareList

mstflint -d <PCI_ID> query full

curl -k -u root:<password> -X GET https://<DPU-BMC-IP>/redfish/v1/UpdateService/FirmwareInventory

Question # 18

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

Implement redundant switches with spanning tree protocol.

MLAG for bonded interfaces across redundant switches.

Use only one switch for all management and storage traffic.

Disable VLANs and use unmanaged switches.

Question # 19

A system administrator needs to configure a BlueField DPU and enable RShim on the baseboard management controller (BMC). Which command should be executed?

ipmitool raw 0x32 0x6a 1

systemctl restart rshim

systemctl enable bmc-rshim.service

scp <path_to_bfb> root@<bmc_ip>:/dev/rshim0/boot

Question # 20

You are standing up an NVIDIA DGX system for enterprise production. Stakeholder teams require system reliability, performance consistency under load, and proper escalation processes before release. A recent system in another cluster experienced intermittent GPU failures attributed to missed early-stage validation. Which deployment and validation sequence best addresses production readiness and mitigates the risk of avoidable downtime or performance loss?

Install latest OS images and drivers, confirm OS and container functionality, invite users for a monitored production trial, and collect workload feedback to plan any further diagnostics or updates.

Complete hardware and cabling, power on the system, update firmware and drivers, run full hardware health checks and stress diagnostics using NVSM, verify all GPU and system sensor logs, and validate GPU accessibility.

Update network topology, assign static IPs and DNS entries, register the system with NVIDIA, then conduct basic OS-level checks and enable user access after login testing is successful.

Power on the system, install all AI frameworks, configure the CUDA and library stack, set up user environments, then plan stress tests and diagnostics as part of ongoing routine operations.

Question # 21

A 24-hour HPL burn-in fails with "illegal value" errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?

Switch from FP64 to FP32 precision.

Disable GPU affinity.

Reduce test duration to 12 hours.

Verify the matrix size is divisible by block size.
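The divisibility option above can be checked mechanically. A minimal sketch, assuming the problem size N and block size NB have already been read out of HPL.dat; the helper names and the sample values are hypothetical.

```python
def is_block_aligned(n: int, nb: int) -> bool:
    """Return True when problem size n is an exact multiple of block size nb."""
    if nb <= 0:
        raise ValueError("block size must be positive")
    return n % nb == 0


def round_down_to_block(n: int, nb: int) -> int:
    """Largest multiple of nb that does not exceed n."""
    if nb <= 0:
        raise ValueError("block size must be positive")
    return (n // nb) * nb


if __name__ == "__main__":
    n, nb = 100_000, 384  # illustrative values only
    if not is_block_aligned(n, nb):
        print(f"N={n} is not a multiple of NB={nb}; "
              f"nearest lower aligned N is {round_down_to_block(n, nb)}")
```

Rounding N down rather than up keeps the adjusted run inside the memory budget that the original N was sized for.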

Question # 22

You are installing the operating system as part of the initial setup for a new NVIDIA Base Command Manager cluster. Which two of the following actions are essential for a successful OS installation on the cluster’s head node?

Pick the 2 correct responses below.

Download the latest BCM ISO and verify its integrity using the provided checksum, then start the installation.

Configure network switches for PXE boot to all compute nodes before installing the OS on the head node.

Set the desired time zone and configure NTP synchronization during the OS installation wizard.

Start the head node OS installation process with the system BIOS set to legacy boot mode instead of UEFI.

Question # 23

ClusterKit’s NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

Critical failure; expected is greater than 390 GB/s for HDR InfiniBand.

Suboptimal performance; requires FEC tuning to reach 380+ GB/s.

Optimal performance, indicating healthy fabric and GPUDirect RDMA.

Inconclusive; rerun with --stress=cpu to validate.
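When reading a result like this, note that link speeds are quoted in gigabits per second while the benchmark figure is in gigabytes per second. The sketch below does the unit conversion and computes the measured fraction of line rate. The port count of 8 is an assumption made for illustration; the question does not state how many fabric ports the measurement spans or how ClusterKit aggregates them.

```python
def line_rate_gbytes(gbits_per_s: float, ports: int = 1) -> float:
    """Theoretical line rate in GB/s for `ports` links of `gbits_per_s` Gb/s each."""
    if ports < 1:
        raise ValueError("need at least one port")
    return gbits_per_s / 8 * ports


def efficiency(measured_gbytes: float, gbits_per_s: float, ports: int) -> float:
    """Measured bandwidth as a fraction of the theoretical line rate."""
    return measured_gbytes / line_rate_gbytes(gbits_per_s, ports)


if __name__ == "__main__":
    ports = 8  # assumed number of 400G compute-fabric ports per node
    peak = line_rate_gbytes(400, ports)
    print(f"peak {peak:.0f} GB/s, measured fraction {efficiency(350, 400, ports):.1%}")
```

Under that assumption the aggregate peak is 400 GB/s, so 350 GB/s corresponds to 87.5% of line rate before protocol overhead is considered.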

Question # 24

During a DGX cluster deployment, what is the most effective way to verify the health and integrity of the local RAID storage array?

Run a read/write benchmark utility, such as FIO, across the RAID array, looking for expected speed and latency metrics as proof of storage integrity.

Verify that all configured RAID volumes are mounted and available in the operating system, and that disk utilization levels are within recommended limits.

Use the mdadm --examine and mdadm --detail commands to review the RAID array’s status, checking for drive failures, array consistency, and error events.

Question # 25

A system administrator noticed a failure on a DGX H100 server. After a reboot, only the BMC is available. What could be the reason for this behavior?

The network card has no link / connection.

A boot disk has failed.

Multiple GPUs have failed.

There are more than two failed power supplies.

Question # 26

A leaf switch shows "FW Version Mismatch" alerts for transceivers after cluster expansion. Which tool validates transceiver firmware against expected versions?

flint

iblinkinfo

mlxconfig

ethtool

Question # 27

A network engineer is tasked with configuring the management, storage, and compute networks for a new DGX BasePOD deployment. Which statement best describes the network segmentation required for optimal operation?

A single VLAN for all types of network traffic.

Two networks: one for management and one for compute.

Four networks: compute, storage, out-of-band, and management.

Question # 28

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration

nvidia-smi topo -m to inspect GPU topology connections

DCGM diagnostics: dcgmi diag -r 2

ib_write_bw to measure InfiniBand bandwidth between nodes

Question # 29

A system administrator has upgraded the firmware of the DPU. What will be the state of the firmware after the upgrade?

The firmware is installed on the DPU.

The firmware is deleted from the DPU.

The firmware is copied to the DPU but not installed.

The firmware is waiting on reboot to become active.

Question # 30

A systems engineer is updating firmware across a large DGX cluster using automation. What is the best practice for minimizing risk and ensuring cluster health during and after the process?

Drain nodes from the scheduler, run pre-update diagnostics, update firmware in batches, and verify health post-update before scaling to the next batch.

To save time, simultaneously update all nodes in the cluster without draining or diagnostics.

Update nodes that have reported faults, leaving others on older firmware.

Drain nodes from the scheduler, update firmware in batches, skip diagnostics and verify health post-update before scaling to the next batch.
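The drain, diagnose, update, and verify flow described in the options can be sketched as a small driver loop. Everything here is hypothetical scaffolding: the four callables stand in for whatever scheduler, diagnostic, and firmware tooling a site actually uses, and no real NVIDIA or scheduler API is implied.

```python
from typing import Callable, Iterator, List, Sequence


def batches(nodes: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `nodes` with at most `size` entries."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def rolling_update(
    nodes: Sequence[str],
    batch_size: int,
    drain: Callable[[str], None],       # hypothetical: remove node from scheduler
    precheck: Callable[[str], bool],    # hypothetical: pre-update diagnostics
    update: Callable[[str], None],      # hypothetical: apply firmware
    postcheck: Callable[[str], bool],   # hypothetical: post-update health check
) -> List[str]:
    """Update nodes batch by batch; stop before the next batch on any failure."""
    done: List[str] = []
    for batch in batches(nodes, batch_size):
        for node in batch:
            drain(node)
        if not all(precheck(node) for node in batch):
            raise RuntimeError(f"pre-update diagnostics failed in batch {batch}")
        for node in batch:
            update(node)
        if not all(postcheck(node) for node in batch):
            raise RuntimeError(f"post-update health check failed in batch {batch}")
        done.extend(batch)
    return done
```

Raising on the first failed check means a bad firmware image affects at most one batch, which is the point of not updating every node at once.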

Question # 31

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

The command output is ignored if the system powers on without errors.

At least half of the GPUs report Status_Health = OK.

All GPUs report Status_Health = OK and Health = OK for each device.

Only the head node's GPUs need to be healthy.

Question # 32

If two ports must be connected, but one is SFP and one is QSFP, for example, to connect a 25 GbE Host Channel Adapter to a QSFP port capable of both 100 GbE and 25 GbE, which solution would best meet this requirement?

QSA adapter.

SFP connectors.

SFP-to-1G BASE-T RJ45 adapter.

Standard QSFP-to-QSFP DAC cable.

Question # 33

You are tasked with updating both NVIDIA GPU drivers and DOCA drivers on a set of servers used for AI workloads. The environment previously had an older driver stack and custom kernel modules. What is the most important step to successfully upgrade the drivers without causing conflicts?

Update the GPU driver leaving the DOCA and OFED drivers unchanged as long as they are detecting the hardware properly.

Validate the driver version post-install since the fresh install will overwrite the legacy drivers.

Keep the older driver running alongside the new version in case you need to roll back the upgrade.

Uninstall all existing GPU and DOCA-related drivers and associated kernel modules before the new install.

Question # 34

Refer to the output:

~ $ sudo nvsm show healthinfo

Timestamp: Sat Dec 16 16:26:32 2017 -0800

Version: 17.12-5

Checks

BIOS Revision [5.11].........................

DGX Serial Number [YSY72800016]..................

Verify installed DIMM memory sticks........................Healthy

...[output truncated]

Verify Ethernet controllers...........................Healthy

Verify installed GPU's..............................Unhealthy

Checking output of 'lspci' for expected GPU's

Missing GPU at PCI address '07:00.0'

Verify installed InfiniBand controllers....................Healthy

Verify PCIe switches..................................Healthy

...[output truncated]

What insights can a system administrator gain regarding the DGX system's health?

A GPU tray upgrade failed.

A GPU is missing on the DGX system.

A GPU driver upgrade has failed.

The system has passed the hardware health check successfully.
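Output in this shape can be scanned programmatically for failed checks. The sketch below is a minimal example that assumes each check line ends in a run of dots followed by Healthy or Unhealthy, as in the excerpt above; other NVSM versions may format their output differently, and the function name is hypothetical.

```python
import re

# Check lines copied from the excerpt in the question (quotes normalized).
SAMPLE = """\
Verify installed DIMM memory sticks........................Healthy
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
"""

# A check name, at least three dots of padding, then the status word.
CHECK_RE = re.compile(r"^(?P<name>.+?)\.{3,}\s*(?P<status>Healthy|Unhealthy)\s*$")


def unhealthy_checks(text: str) -> list:
    """Return the names of all checks whose status is Unhealthy."""
    failed = []
    for line in text.splitlines():
        match = CHECK_RE.match(line.strip())
        if match and match.group("status") == "Unhealthy":
            failed.append(match.group("name"))
    return failed


if __name__ == "__main__":
    for name in unhealthy_checks(SAMPLE):
        print("FAILED:", name)
```

Detail lines such as the missing PCI address do not match the pattern and are skipped, so only the check names themselves are reported.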

Question # 35

A system administrator receives an alert about a potential hardware fault on an NVIDIA DGX A100. The GPU performance seems degraded, and the system fans are operating loudly. What step should be recommended to identify and troubleshoot the hardware fault?

Run a deep learning workload to stress test the GPUs and check whether the issue persists.

Check the NVIDIA System Management Interface (nvidia-smi) for GPU status and temperatures.

Power drain then restart the DGX and check if the performance degradation resolves.

Increase the fan speed to maximum and check whether the performance improves.

Question # 36

What command sequence is used to identify the exact name of the server that runs as the master SM in a multi-node fabric?

sminfo, then smpquery ND

ibstat, then sminfo

ibnetdiscover, then ibsim

sminfo, then smpquery NI


