Author: Dmytro Tymoshenko | Last update: November 10, 2025

If you've been wondering why your GPU jobs aren't running or you're seeing them on some random nodes instead of perun24, this guide will help you switch to the proper workflow.

Quick Start (TLDR)

Currently, we have only one GPU node: perun24. Information about all nodes can be found here. Its key specs:

  • CPU: 24 threads
  • GPU: NVIDIA H100 NVL (94 GB VRAM)
  • RAM: 512 GB

To work with the GPU node, either:

  • SSH directly to perun24 for interactive sessions.
  • In your script header, define submission queue as #$ -q GPU.

Pro tip: add echo "Hostname: $(hostname)" to your script so the node name appears in the output.
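If you use that tip, the node name lands in the job's output file. With #$ -cwd and no -o override, SGE names that file <job_name>.o<job-ID>, where the job name defaults to the script's file name. A small sketch of finding the line afterwards, with hypothetical names:

```shell
#!/bin/bash
# Reconstruct the default SGE output file name for a finished job.
# Both values below are hypothetical examples.
script_name="my_gpu_job.sh"
job_id=500
log="${script_name}.o${job_id}"
echo "$log"                      # my_gpu_job.sh.o500
# On the cluster, once the job has finished:
# grep '^Hostname:' "$log"
```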

Understanding the Current GPU Setup

Good Ol' Days

When the GPU node was first introduced, we had to SSH to it and submit jobs directly from there, because it was isolated: the GPU node's queue was not visible from the other cluster nodes, and vice versa. Queue management was fragmented between the main cluster and the GPU node, which limited monitoring, management, and submission options.

Current setup

In the current implementation, perun24 is fully integrated into the main SGE scheduler and has its own queue, visible in qconf -sql. No more SSH juggling between nodes, and proper monitoring is available, which makes job management much more flexible.

Now, to request a GPU node you can:

  • Modify header of your script (recommended) so you can submit it as qsub <script.sh>:
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q GPU         # This line will define the needed queue
...
  • Submit a script with the following command if the header of the script is not modified:
> qsub -q GPU <script.sh>
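Putting the pieces together, a complete minimal GPU job script might look like the sketch below. The job name and the nvidia-smi check are illustrative additions, not required by the scheduler; the guard lets the same script run on CPU-only nodes without erroring:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q GPU                 # route the job to the GPU queue (perun24)
#$ -N gpu_example         # hypothetical job name
#$ -j y                   # merge stderr into stdout

echo "Hostname: $(hostname)"

# Print the GPU model and memory if the driver tools are present
# (they are on perun24); fall back gracefully elsewhere.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "nvidia-smi not found on $(hostname)"
fi
```

Submit it with plain qsub <script.sh>; the Hostname line in the output file confirms the job actually ran on perun24.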

How Can I Benefit From the New Setup?

For example, let's imagine a pipeline of two scripts, script_1.sh and script_2.sh, where script_2.sh works on the output of script_1.sh. Suppose script_1.sh runs on CPU (or requires extensive resources not available on perun24, or runs for an extremely long time), while script_2.sh needs the GPU. You don't want script_1.sh hogging the GPU node, and the appropriate queues are defined in the headers of both scripts. Now you can maximize efficiency and submit them simultaneously with -hold_jid:

username@perun: qsub script_1.sh
username@perun: qstat | grep $USER      # We are looking for ID value from job-ID column
job-ID  prior   name           user     state     submit/start at          queue          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 500   0.54090 script_1.sh   username     r     10/10/2025 12:12:12     2T-batch@perun23         15        

username@perun: qsub -hold_jid 500 script_2.sh

Or using -terse:

username@perun: job1_id=$(qsub -terse script_1.sh) # Submits script_1.sh and assigns job-ID to job1_id
username@perun: qsub -hold_jid $job1_id script_2.sh

This will result in script_2.sh starting only after script_1.sh finishes.
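If you chain pipelines like this often, the two commands can be wrapped in a small helper. This is a hypothetical sketch, not part of the cluster tooling; note that for array jobs qsub -terse prints ID.task-range rather than a bare number, so the numeric ID is stripped before passing it to -hold_jid:

```shell
#!/bin/bash
# Hypothetical helper: submit two scripts so the second waits for the first.
# Assumes each script sets its own queue in its header (e.g. "#$ -q GPU").
submit_chain() {
    local first="$1" second="$2" id
    id=$(qsub -terse "$first") || return 1
    id="${id%%.*}"                # array jobs print "ID.first-last:step"
    qsub -hold_jid "$id" "$second"
}

# Usage on the cluster:
# submit_chain script_1.sh script_2.sh
```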

Existing Limitations

  • The current scheduler uses queue-based GPU access rather than resource flags. Commands like #$ -l gpu=1 will not work — you can verify this by running qhost -F gpu, which shows GPU resources aren't configured for direct resource requests. Instead, use the mentioned #$ -q GPU queue as described above.

Contacts

Questions? Suggestions? Assistance? > Dmytro Tymoshenko
