Scheduling jobs using Slurm:
Jobs must be scheduled using the Slurm workload manager through the control node, cassio.cs.nyu.edu. Access to this cluster and the control node is restricted to those who are part of the CILVR group.
If you are on a network outside of CIMS, you will first have to log in to access.cims.nyu.edu, then use ssh to get to cassio.cs.nyu.edu. Once you launch a job, you will be able to ssh directly to any of the assigned nodes.
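If your local OpenSSH client supports the -J (ProxyJump) option, the two hops can be combined into a single command; <netid> below is just a placeholder for your CIMS username:
ssh -J <netid>@access.cims.nyu.edu <netid>@cassio.cs.nyu.edu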
Slurm's documentation is very thorough; this quickstart guide should provide enough information to start exploring our Slurm setup.
There are currently two Quality of Service (QOS) policies that affect jobs run via the scheduler:
(1) Interactive: Each student can start interactive jobs using up to 2 GPUs in total. This can be one interactive job with two GPUs or two interactive jobs with one GPU each. Each job can run for up to 1 week. This QOS should be used mainly for development.
(2) Batch: This is the default QOS. You can use up to 8 GPUs for batch jobs. As with the interactive QOS, you can launch as many jobs as you want as long as the total number of allocated GPUs does not exceed eight. Each job can run for up to 2 days; this limit was chosen to ensure fair allocation of resources among users. Please use checkpointing for any longer-running job.
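If you want to double-check these limits yourself, Slurm's accounting tool can list the QOS definitions (the exact columns that are populated depend on how the QOS limits are configured on cassio):
sacctmgr show qos format=Name,MaxWallDurationPerJob,MaxTRESPerUser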
Below are a few command examples that are relevant to our environment:
- Request a "batch" QOS bash shell session with one "titanblack" model GPU:
srun --qos=batch --gres=gpu:titanblack:1 --pty bash
Currently, the GPU models you can specify using the --gres option are: 1080ti, titanxp, titanblack, k40, k20, k20x, and m2090.
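- To see which GPU models are actually attached to each node, the "%G" (generic resources) field of sinfo lists the configured GRES; for example:
sinfo --Node --format="%.10N %.30G"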
- Request 2 GPU cards by their memory size, using the --constraint option along with the associated "feature" label:
srun --qos=interactive --gres=gpu:2 --constraint=gpu_12gb --pty bash
- Use boolean operators with the --constraint option to group feature requests:
srun --qos=interactive --gres=gpu:2 --constraint="gpu_12gb&kepler" --pty bash
- Show all nodes along with their associated state, number of CPUs, memory (in MB), and "features":
sinfo --Node --format="%.6N %.8T %.4c %.10m %.20f"
- squeue displays job status, and can be formatted like sinfo:
squeue -l --format="%.5i %.15q %.6j %.6b %.6D %.6N %.25S %.16L"
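- To show only your own jobs, filter squeue by username:
squeue -u $USER -l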
- You can launch a job by creating a job script and running it with the sbatch command. Below is an example sbatch script ("tf.sbatch"), taken from http://sherlock.stanford.edu/mediawiki/index.php/Tensor_flow, that sends an email notification when the job completes; it has been modified to run a TensorFlow python script in our environment:
#!/bin/bash
#
# all lines that start with #SBATCH are directives used only by Slurm for
# scheduling
#################
# set a job name
#SBATCH --job-name=GPUTFRtest
#################
# a file for job output, you can check job progress
#SBATCH --output=GPUTFtest.out
#################
# a file for errors from the job
#SBATCH --error=GPUTFtest.err
#################
# time you think you need; default is one hour
# Format is hh:mm:ss (or just minutes); select whatever time you want.
# The less time you ask for, the faster your job will run.
# This example will run in less than 5 minutes.
#SBATCH --time=15:00
#################
# --gres will give you one GPU; you can ask for more, up to however many are
# available on the node
#SBATCH --gres gpu:2
# We are submitting to the batch QOS
#SBATCH --qos=batch
#################
# number of nodes you are requesting
#SBATCH --nodes=1
#################
# memory per node; default is 4000 MB per CPU
#SBATCH --mem=4000
#################
# Have Slurm send you an email when the job ends or fails; careful, the email
# could end up in your clutter folder
#SBATCH --mail-type=END,FAIL # notifications for job done & fail
#SBATCH --mail-user=user@courant.nyu.edu
# Please note: before using python3 you will need to load it with our module system
module load python-3
srun python3 ./tf_test.py
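Once the script is saved as "tf.sbatch", submit it with sbatch and monitor it with squeue; <jobid> below is a placeholder for the job ID that sbatch prints when the job is submitted:
sbatch tf.sbatch
squeue -u $USER
scancel <jobid>    # only if you need to cancel the job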
NVIDIA provides online GPU computing seminars.
Let us know if there are any other specific examples you'd like us to provide or if you need anything else to get started.
If you run into problems, please contact helpdesk@cims.nyu.edu.