Calculating on GPU
==================

PSMN offers L4 (Cascade-GPU) and RTX2080Ti (E5-GPU) GPUs; see :doc:`../clusters_usage/computing_resources` for detailed information about their specifications.

Basic Commands
--------------

``nvidia-smi`` shows information about the NVIDIA GPU.

.. code-block:: console

   $ nvidia-smi
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:82:00.0 Off |                  N/A |
   | 28%   25C    P8               1W / 250W |      1MiB / 11264MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

``nvtop``, similar to the ``top`` command, gives you real-time information about processes running on the GPU.

.. code-block:: console

   $ nvtop
   Device 0 [NVIDIA GeForce RTX 2080 Ti] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
   GPU 300MHz   MEM 405MHz   TEMP 24°C   FAN 28%   POW 1 / 250 W
   GPU[ 0%]   MEM[| 0.248Gi/11.000Gi]
   (scrolling utilization graph: GPU % and MEM % over time)
   PID   USER   GPU   TYPE   GPU MEM   CPU   HOST MEM   Command

Getting to know your GPU with CUDA
----------------------------------

CUDA is a proprietary application programming interface (API) from NVIDIA that can help you optimize your use of GPU devices by:

- Specifying thread parallelism
- Optimizing memory access patterns
- Managing occupancy

.. TIP::

   To run NVIDIA CUDA you need to be connected to a node with a CUDA-capable GPU and to load a GCC compiler and toolchain. You can gather fundamental information about the GPU by running the ``deviceQuery`` program (see below).

Compiled CUDA Sample Programs
-----------------------------

Samples of compiled CUDA programs for the RTX2080Ti GPU are available at: ``/applis/PSMN/debian11/CUDA/RTX2080Ti``

Samples of compiled CUDA programs for the L4 GPU are available at: ``/applis/PSMN/debian11/CUDA/L4``

The source code for all programs can be found at: https://github.com/NVIDIA/cuda-samples/tree/master/Samples

**List of CUDA sample programs available:**

* bandwidthTest

  A simple test program to measure the memcopy bandwidth of the GPU and the memcopy bandwidth across PCIe. It can measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

* clock

  This example shows how to use the clock function to accurately measure the performance of a block of threads of a kernel.
* deviceQuery

  This sample enumerates the properties of the CUDA devices present in the system.

* deviceQueryDrv

  This sample enumerates the properties of the CUDA devices present in the system using CUDA Driver API calls.

* eigenvalues

  The computation of all or a subset of the eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for computing all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.

* graphMemoryFootprint

  This sample demonstrates how graph memory nodes re-use virtual addresses and physical memory.

* matrixMul

  This sample implements matrix multiplication, exactly as in Chapter 6 of the programming guide. It is written for clarity of exposition, to illustrate various CUDA programming principles, rather than to provide the most performant generic matrix-multiplication kernel. To illustrate GPU performance for matrix multiplication, this sample also shows how to use the CUDA 4.0 interface for cuBLAS to achieve high performance.

* MonteCarloMultiGPU

  This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. It uses double-precision hardware if a GTX 200-class GPU is present, and takes advantage of the CUDA 4.0 capability of using a single CPU thread to control multiple GPUs.

* simpleOccupancy

  This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by launching a kernel with the launch configurator, and measures the utilization difference against a manually configured launch.
* topologyQuery

  A simple example of how to query the topology of a system with multiple GPUs.

* vectorAdd

  This CUDA Runtime API sample is a very basic sample that implements element-by-element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide, with some additions such as error checking.

Compiling Programs with CUDA (NVCC)
-----------------------------------

You can compile your code with the NVIDIA CUDA Compiler Driver (NVCC):

.. code-block:: bash

   $ nvcc -o clock clock.cu -I/path/to/custom/library/include

Additional documentation for the CUDA compiler is available here: https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf

.. IMPORTANT::

   Our NVIDIA GeForce RTX 2080 Ti GPUs currently have driver version 535.183.01, which is compatible with the CUDA 12.2 Toolkit. You *cannot* currently compile a program with NVCC using a CUDA Toolkit newer than 12.2.

   Toolchain compatibility with CUDA Toolkit 12.2:

   - GCC 10.2.0
   - Clang 9.0.0
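Before compiling, you can check that the toolkit and driver versions on your node satisfy the constraint above. Both commands are standard (``nvcc --version`` prints the toolkit release; the ``--query-gpu`` fields are documented in the ``nvidia-smi`` manual); run them on a GPU node with the toolchain loaded:

.. code-block:: console

   $ nvcc --version
   $ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

The toolkit release reported by ``nvcc --version`` must be 12.2 or lower for the driver currently installed on the RTX 2080 Ti nodes.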
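As a minimal, hand-written example of a source file you could build with the ``nvcc`` command shown above, the sketch below performs element-wise vector addition in the style of the ``vectorAdd`` sample. It is a simplified illustration under our own naming (it is *not* the sample's actual source), using managed memory to keep the host code short:

.. code-block:: cuda

   #include <cstdio>
   #include <cuda_runtime.h>

   // Kernel: each thread adds one pair of elements.
   __global__ void vecAdd(const float *a, const float *b, float *c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)  // guard: the last block may be only partially full
           c[i] = a[i] + b[i];
   }

   int main()
   {
       const int n = 1 << 20;
       size_t bytes = n * sizeof(float);

       // Unified (managed) memory keeps the example short; explicit
       // cudaMalloc/cudaMemcpy is the more common pattern.
       float *a, *b, *c;
       cudaMallocManaged(&a, bytes);
       cudaMallocManaged(&b, bytes);
       cudaMallocManaged(&c, bytes);
       for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

       int threads = 256;
       int blocks = (n + threads - 1) / threads;  // round up to cover all n
       vecAdd<<<blocks, threads>>>(a, b, c, n);
       cudaDeviceSynchronize();

       printf("c[0] = %f\n", c[0]);  // expect 3.000000
       cudaFree(a); cudaFree(b); cudaFree(c);
       return 0;
   }

Saved as ``vecAdd.cu``, this compiles with ``nvcc -o vecAdd vecAdd.cu`` and runs only on a node with a CUDA-capable GPU.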
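When specifying thread parallelism (see the CUDA section above), a recurring piece of arithmetic is computing how many blocks of a fixed size are needed to cover ``n`` elements: rounding down either drops the tail of the array, while the ceiling division shown below launches exactly one extra, partially-full block when needed. A small host-side C++ sketch (the function name is ours, for illustration):

.. code-block:: cpp

   #include <cstdio>

   // Number of blocks of `threadsPerBlock` threads needed to cover n
   // elements. Integer ceiling division: (n + t - 1) / t rounds up, so
   // the last (possibly partial) block still covers the array's tail.
   int numBlocks(int n, int threadsPerBlock)
   {
       return (n + threadsPerBlock - 1) / threadsPerBlock;
   }

   int main()
   {
       // 1000 elements, 256-thread blocks -> 4 blocks (3 full + 1 partial).
       printf("%d\n", numBlocks(1000, 256));  // prints 4
       // An exact multiple needs no extra block.
       printf("%d\n", numBlocks(1024, 256));  // prints 4
       return 0;
   }

Inside a kernel, the matching guard is ``if (i < n)``, since the final block's extra threads fall past the end of the array.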