GPU Resources in O2

Six GPU nodes are available on O2, providing the following GPU cards: 8 Tesla V100, 8 Tesla M40, and 16 Tesla K80. To list information about all the nodes with GPU resources you can use the command:

login01:~$ sinfo --Format=nodehost,cpusstate,memory,gres|grep 'HOSTNAMES\|gpu'
HOSTNAMES           CPUS(A/I/O/T)       MEMORY              GRES
compute-g-16-254    0/32/0/32           373760              gpu:teslaV100:4
compute-g-16-255    0/32/0/32           373760              gpu:teslaV100:4
compute-g-16-175    11/9/0/20           257548              gpu:teslaM40:4
compute-g-16-176    18/2/0/20           257548              gpu:teslaM40:4
compute-g-16-194    5/15/0/20           257548              gpu:teslaK80:8
compute-g-16-177    0/24/0/24           257548              gpu:teslaK80:8
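
You can also restrict the report to the gpu partition; this is a minimal variant of the command above using sinfo's standard -p filter:

login01:~$ sinfo -p gpu --Format=nodehost,cpusstate,memory,gres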

GPU Partition Limits

The following limits are applied to this partition in order to facilitate fair use of the limited resources:

GPU hours

The amount of GPU resources that can be used by each user at any time on the O2 cluster is measured in GPU hours per user; currently there is an active limit of 160 GPU hours for each user.

For example, at any time each user can allocate* at most 1 GPU card for 120 hours (due to the partition wall-time limit), 2 GPU cards for 80 hours, 16 GPU cards for 10 hours, or any other combination that does not exceed the total GPU-hours limit.

* as resources allow 
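
To make the arithmetic concrete, the sbatch directives below request 2 GPU cards for 80 hours, which consumes exactly the 160 GPU-hours allowance (2 cards x 80 hours); the application name is a hypothetical placeholder:

#!/bin/bash
#SBATCH -p gpu                 # GPU partition
#SBATCH --gres=gpu:2           # 2 GPU cards
#SBATCH -t 80:00:00            # 80 hours of wall time; 2 x 80 = 160 GPU hours, the per-user cap
#SBATCH -c 4                   # 4 CPU cores (within the 34-core limit)

module load gcc/6.2.0 cuda/10.1
./my_gpu_program               # hypothetical application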

Memory

Each user can have a total of up to 420 GB of memory allocated across all currently running GPU jobs.

CPU cores

Each user can have a total of up to 34 CPU cores allocated across all currently running GPU jobs.


Those limits will be adjusted as our GPU capacity evolves. If these limits are reached by running jobs, any remaining pending jobs will display AssocGrpGRESRunMinutes in the NODELIST(REASON) field.
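
You can check the reason your pending jobs are being held with squeue, for example:

login01:~$ squeue -u $USER -t PENDING

If the per-user GPU limits are the cause, AssocGrpGRESRunMinutes will appear in the NODELIST(REASON) column.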

How to compile CUDA programs

In most cases a CUDA library and a compiler module must be loaded in order to compile CUDA programs. To see which CUDA modules are available, use the command module spider cuda, then use module load to load the desired version. Several versions of the CUDA toolkit are currently available, as shown below:


login04:~$ module spider cuda

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  cuda:
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        cuda/8.0
        cuda/9.0
        cuda/10.0
        cuda/10.1
        cuda/10.2

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "cuda" module (including how to load the modules) use the module's full name.
  For example:

     $ module spider cuda/9.0
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




login01:~$ module load gcc/6.2.0 cuda/10.1
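
With the modules loaded, the CUDA compiler nvcc is on your PATH. As a minimal sketch, assuming a source file named hello.cu (hypothetical), compilation looks like:

login01:~$ nvcc hello.cu -o hello

Note that the login nodes have no GPU hardware, so the resulting binary must be run inside a GPU job (see the next section).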


How to submit a GPU job

Most GPU applications require access to the CUDA Toolkit libraries, so before submitting a job you will likely need to load one of the available CUDA modules, for example:

login01:~$ module load gcc/6.2.0 cuda/10.1


Note that if you are running a precompiled GPU application, for example a pip-installed TensorFlow, you will need to load the same version of CUDA that was used to compile your application (TensorFlow==2.2.0 was compiled against CUDA 10.1).
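
As a quick sanity check, assuming a pip-installed TensorFlow 2.2.0, you can verify from within a GPU job that the matching CUDA libraries are found and a GPU is visible:

compute-g-16-176:~$ module load gcc/6.2.0 cuda/10.1
compute-g-16-176:~$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

An empty list indicates that TensorFlow could not initialize the GPU, typically because of a CUDA version mismatch.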

To submit a GPU job on O2, use the partition gpu and add a flag like --gres=gpu:1 to request a GPU resource. The example below starts an interactive bash job requesting 1 CPU core and 1 GPU card. This starts a session on one of the GPU-containing nodes, where you can test and debug programs that use GPU.

login01:~$ srun -n 1 --pty -t 1:00:00 -p gpu --gres=gpu:1 bash

srun: job 6900282 queued and waiting for resources
srun: job 6900282 has been allocated resources
compute-g-16-176:~$
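
Once the interactive session starts on the GPU node, you can confirm which GPU card was allocated with nvidia-smi (output omitted here):

compute-g-16-176:~$ nvidia-smi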


The next example submits a batch job requesting 2 GPU cards and 4 CPU cores:

login01:~$ sbatch gpujob.sh
Submitted batch job 6900310


where gpujob.sh contains


#-----------------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -c 4                   # 4 CPU cores
#SBATCH -t 6:00:00             # 6 hours of wall time
#SBATCH -p gpu                 # GPU partition
#SBATCH --gres=gpu:2           # 2 GPU cards

module load gcc/6.2.0
module load cuda/9.0

./deviceQuery                  # this is just an example
#-----------------------------------------------------------------------------------------


It is also possible to request a specific type of GPU card with the --gres flag. For example, --gres=gpu:teslaM40:3 requests 3 Tesla M40 cards. Currently three GPU types are available: teslaM40, teslaK80, and teslaV100.
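
For instance, to start an interactive session with one V100 card:

login01:~$ srun -n 1 --pty -t 1:00:00 -p gpu --gres=gpu:teslaV100:1 bash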




