Calculating on GPU
==================

PSMN offers L4 (Cascade-GPU) and RTX2080Ti (E5-GPU) GPUs; see :doc:`../clusters_usage/computing_resources` for detailed information about their specifications.

Basic Commands
--------------

``nvidia-smi`` shows information about the NVIDIA GPU.

.. code-block:: console

   $ nvidia-smi
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA GeForce RTX 2080 Ti     On  | 00000000:82:00.0 Off |                  N/A |
   | 28%   25C    P8               1W / 250W |      1MiB / 11264MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

``nvtop``, similar to the ``top`` command, gives you real-time information about processes running on the GPU.

.. code-block:: console

   $ nvtop
   Device 0 [NVIDIA GeForce RTX 2080 Ti] PCIe GEN 1@16x RX: 0.000 KiB/s TX: 0.000 KiB/s
   GPU 300MHz   MEM 405MHz   TEMP 24°C   FAN 28%   POW 1 / 250 W
   GPU[ 0%]   MEM[| 0.248Gi/11.000Gi]
   (scrolling utilization graph: GPU % and MEM % over time)
   PID   USER   GPU   TYPE   GPU MEM   CPU   HOST MEM   Command

Getting to know your GPU with CUDA
----------------------------------

CUDA is a proprietary application programming interface (API) from NVIDIA that can help you optimize your use of GPU devices by:

- Specifying thread parallelism
- Optimizing memory access patterns
- Managing occupancy

.. TIP::

   To run NVIDIA CUDA you need to be connected to a node with a CUDA-capable GPU and to load a GCC compiler and toolchain. You can gather fundamental information about the GPU by running the ``deviceQuery`` program (see below).

Compiled CUDA Sample Programs
-----------------------------

Samples of compiled CUDA programs for the RTX2080Ti GPU are available at: ``/applis/PSMN/debian11/CUDA/RTX2080Ti``

Samples of compiled CUDA programs for the L4 GPU are available at: ``/applis/PSMN/debian11/CUDA/L4``

The source code for all programs can be found at: https://github.com/NVIDIA/cuda-samples/tree/master/Samples

**List of CUDA sample programs available:**

* bandwidthTest

  A simple test program to measure the memcopy bandwidth of the GPU and the memcopy bandwidth across PCIe. It can measure device-to-device copy bandwidth, host-to-device copy bandwidth for pageable and page-locked memory, and device-to-host copy bandwidth for pageable and page-locked memory.

* clock

  This example shows how to use the clock function to accurately measure the performance of a block of threads of a kernel.
* deviceQuery

  This sample enumerates the properties of the CUDA devices present in the system.

* deviceQueryDrv

  This sample enumerates the properties of the CUDA devices present in the system using CUDA Driver API calls.

* eigenvalues

  The computation of all or a subset of the eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for computing all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.

* graphMemoryFootprint

  This sample demonstrates how graph memory nodes re-use virtual addresses and physical memory.

* matrixMul

  This sample implements matrix multiplication, exactly as in Chapter 6 of the programming guide. It is written for clarity of exposition, to illustrate various CUDA programming principles, rather than to provide the most performant generic matrix-multiplication kernel. To illustrate GPU performance for matrix multiplication, this sample also shows how to use the CUDA 4.0 interface for cuBLAS to achieve high performance.

* MonteCarloMultiGPU

  This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. It uses double-precision hardware if a GTX 200-class GPU is present, and takes advantage of the CUDA 4.0 capability of using a single CPU thread to control multiple GPUs.

* simpleOccupancy

  This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by launching a kernel with the launch configurator, and measures the utilization difference against a manually configured launch.
* topologyQuery

  A simple example of how to query the topology of a system with multiple GPUs.

* vectorAdd

  This CUDA Runtime API sample is a very basic sample that implements element-by-element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide, with some additions such as error checking.

Compiling Programs with CUDA (NVCC)
-----------------------------------

You can compile your code with the NVIDIA CUDA Compiler Driver (NVCC):

.. code-block:: bash

   $ nvcc -o clock clock.cu -I/path/to/custom/library/include

Additional documentation for the CUDA compiler is available here: https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf

.. IMPORTANT::

   Our NVIDIA GeForce RTX 2080 Ti GPUs currently have driver version 535.183.01, which is compatible with the CUDA 12.2 Toolkit. You *cannot* currently compile a program with NVCC using a CUDA Toolkit newer than 12.2.

   Toolchain compatibility with CUDA Toolkit 12.2:

   - GCC 10.2.0
   - Clang 9.0.0
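Before compiling, you can check that the toolkit and driver versions on your node satisfy the constraint above. Both commands are standard (``nvcc --version`` prints the toolkit release; the ``--query-gpu`` fields are documented in the ``nvidia-smi`` manual); run them on a GPU node with the toolchain loaded:

.. code-block:: console

   $ nvcc --version
   $ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader

The toolkit release reported by ``nvcc --version`` must be 12.2 or lower for the driver currently installed on the RTX 2080 Ti nodes.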
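As a minimal, hand-written example of a source file you could build with the ``nvcc`` command shown above, the sketch below performs element-wise vector addition in the style of the ``vectorAdd`` sample. It is a simplified illustration under our own naming (it is *not* the sample's actual source), using managed memory to keep the host code short:

.. code-block:: cuda

   #include <cstdio>
   #include <cuda_runtime.h>

   // Kernel: each thread adds one pair of elements.
   __global__ void vecAdd(const float *a, const float *b, float *c, int n)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)  // guard: the last block may be only partially full
           c[i] = a[i] + b[i];
   }

   int main()
   {
       const int n = 1 << 20;
       size_t bytes = n * sizeof(float);

       // Unified (managed) memory keeps the example short; explicit
       // cudaMalloc/cudaMemcpy is the more common pattern.
       float *a, *b, *c;
       cudaMallocManaged(&a, bytes);
       cudaMallocManaged(&b, bytes);
       cudaMallocManaged(&c, bytes);
       for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

       int threads = 256;
       int blocks = (n + threads - 1) / threads;  // round up to cover all n
       vecAdd<<<blocks, threads>>>(a, b, c, n);
       cudaDeviceSynchronize();

       printf("c[0] = %f\n", c[0]);  // expect 3.000000
       cudaFree(a); cudaFree(b); cudaFree(c);
       return 0;
   }

Saved as ``vecAdd.cu``, this compiles with ``nvcc -o vecAdd vecAdd.cu`` and runs only on a node with a CUDA-capable GPU.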
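When specifying thread parallelism (see the CUDA section above), a recurring piece of arithmetic is computing how many blocks of a fixed size are needed to cover ``n`` elements: rounding down either drops the tail of the array, while the ceiling division shown below launches exactly one extra, partially-full block when needed. A small host-side C++ sketch (the function name is ours, for illustration):

.. code-block:: cpp

   #include <cstdio>

   // Number of blocks of `threadsPerBlock` threads needed to cover n
   // elements. Integer ceiling division: (n + t - 1) / t rounds up, so
   // the last (possibly partial) block still covers the array's tail.
   int numBlocks(int n, int threadsPerBlock)
   {
       return (n + threadsPerBlock - 1) / threadsPerBlock;
   }

   int main()
   {
       // 1000 elements, 256-thread blocks -> 4 blocks (3 full + 1 partial).
       printf("%d\n", numBlocks(1000, 256));  // prints 4
       // An exact multiple needs no extra block.
       printf("%d\n", numBlocks(1024, 256));  // prints 4
       return 0;
   }

Inside a kernel, the matching guard is ``if (i < n)``, since the final block's extra threads fall past the end of the array.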