Using the Graphite GPU Environment

Introduction

  • graphite.coecis.cornell.edu is an Ubuntu 16.04 LTS cluster managed by the SLURM workload manager and running CUDA 9.2.
  • In this cluster, each participating group’s nodes are added to both a member queue and a private preemptive queue.  All participating member nodes are available for use by the member community; however, a group’s own jobs have immediate preemptive priority and can reclaim the group’s node through its priority (private) queue whenever it is in use by others.
  • If you are a member of a research group that has GRAPHITE research nodes, you may request an account via the help-ticket system.
    • Please explicitly include your research group information in the ticket.
    • cc your PI on the request.
  • Users log in with their Cornell NetID credentials (see the example after this list).
  • Assume that there are no backups of any data.  Some research groups may subscribe to EZ-Backup but most do not.
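
For example, a typical login from a terminal looks like this (replace netid with your own NetID):

  • ssh netid@graphite.coecis.cornell.edu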

Directories

  • Home directories are automounted onto /home from the NAS that belongs to your research group.
  • Data directories are automounted in /share/DATASERVER/export, where DATASERVER is the NFS server where your research data lived before being brought into the cluster.
  • Scratch directories reside in /scratch.
    • This is a local directory on each server and is not shared across the cluster.
    • There is a folder named /scratch/datasets that can hold common datasets and is usable by all groups.
    • The size of the /scratch directory depends on the server on which it resides (see the example below).
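
For example, to check how much local scratch space the current node has, and to see which common datasets are already staged there (a quick check using standard commands):

  • df -h /scratch                             # Show the size and free space of this node's local scratch.
  • ls /scratch/datasets                       # List common datasets usable by all groups.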

Software

  • Anaconda (Python 2.7 version)
  • Anaconda (Python 3.7 version)
  • CUDA v10.1 with cuDNN v7.5.1
  • CUDA v9.2
  • CUDA v8.0
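
Because several CUDA versions are installed side by side, the version a job picks up is controlled by the job's environment. The following is only a sketch: it assumes the toolkits live under /usr/local/cuda-<version>, a common layout that may differ on Graphite, so confirm the actual install path before relying on it.

  • export PATH=/usr/local/cuda-10.1/bin:$PATH                    # Assumed install path; adjust to the real CUDA location on Graphite.
  • export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64:$LD_LIBRARY_PATH
  • nvcc --version                                                # Confirm which CUDA compiler is now on the PATH.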

Using Graphite

  • All jobs, including interactive jobs, must be submitted to a specific partition (or queue).
    • Batch jobs: use the “default_gpu” partition.
    • Interactive jobs: use the “interactive” partition (see the srun example after this list).
    • High priority jobs: use the priority (private) queue that belongs to your group.
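
For example, a minimal way to start an interactive shell on a GPU node (the partition name comes from the list above, and the GPU request mirrors the batch example below):

  • srun --partition=interactive --gres=gpu:1 --pty /bin/bash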

Create a SLURM Submission Script:

Example: test-gpu.sub
#!/bin/bash
#SBATCH -J test_file                         # Job name
#SBATCH -o test_file_%j.out                  # Name of stdout output log file (%j expands to jobID)
#SBATCH -e test_file_%j.err                  # Name of stderr output log file (%j expands to jobID)
#SBATCH -N 1                                 # Total number of nodes requested
#SBATCH -n 1                                 # Total number of cores requested
#SBATCH --mem=15000                          # Total amount of (real) memory requested (per node)
#SBATCH -t 48:00:00                          # Time limit (hh:mm:ss)
#SBATCH --partition=default_gpu              # Request partition for resource allocation
#SBATCH --gres=gpu:1                         # Specify a list of generic consumable resources (per node)
cd /home/netid; ./test-datasets.sh

Optional entries here can include:

#SBATCH --partition=default_gpu              # Request partition for resource allocation
--partition specifies which partition (queue) the job should run on, where the partition name can be:
	default_gpu
	interactive
	<group name> - for example kilian or ramin

#SBATCH --gres=gpu:1                         # Specify a list of generic consumable resources (per node)
--gres specifies a list of generic consumable resources (per node):
--gres=gpu:1080ti:1 means one GPU of type GeForce GTX 1080Ti
--gres=gpu:2 means two GPUs of any type
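
As a quick sanity check, you can add a couple of lines to the body of a submission script (such as test-gpu.sub above) to confirm which GPUs were actually granted. This is only a sketch; it assumes SLURM exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu and that nvidia-smi is installed on the GPU nodes:

echo "Granted GPUs: $CUDA_VISIBLE_DEVICES"   # GPU indices SLURM assigned to this job (assumed to be set by the gres plugin)
nvidia-smi                                   # Confirm GPU type, memory, and current utilization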

Create a shell script to be run on the cluster:

Example: test-datasets.sh
/share/apps/anaconda3/5.2.0/bin/python /home/"netid"/"clustername"/code/examples/imagenet/main.py -a alexnet --lr 0.01 /home/"netid"/"clustername"/datasets/imagenet

Submit the job:

sbatch --requeue test-gpu.sub

Scheduler Notes:

  • When submitting a job to either the “default_gpu” or “interactive” partition, there is a possibility that a job may be preempted.
    • Use the switch “--requeue” with the sbatch command, and the job will be resubmitted if it is preempted.
  • It is important to tell the scheduler what resources the job will need.
    • The scheduler does not necessarily use these numbers to control the job, but it uses them to ensure that jobs are not scheduled on nodes that cannot support them or that are already too busy (provided each job accurately requests the resources it needs).
  • It is also important to tell the application what resources it can use.
    • For example, if you do not limit a MATLAB job, it will use every core on every server that it is running on.
    • Please either request every core for the job, or tell MATLAB to limit its use (see the sketch after this list).
  • The cluster scheduler is currently set up to kill a job that tries to use too much memory (more memory than the job asked for).
    • This behavior can be changed, but please be mindful to properly set parameters before scheduling a job.
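
For example, one way to keep a MATLAB job within the cores it requested is to start MATLAB single-threaded, or to cap its thread count to the allocation SLURM granted. This is only a sketch: my_script.m is a hypothetical script name, while -singleCompThread and maxNumCompThreads are standard MATLAB options.

# Option 1: start MATLAB with a single computational thread
matlab -nodisplay -singleCompThread -r "run('my_script.m'); exit"

# Option 2: cap MATLAB's thread count to the CPUs SLURM allocated on this node
matlab -nodisplay -r "maxNumCompThreads(str2double(getenv('SLURM_CPUS_ON_NODE'))); run('my_script.m'); exit"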

SLURM Commands:

  • srun                                       # When using srun, select the “interactive” partition.
  • squeue -l                                  # Get a list of active or recently completed jobs.
  • scancel 183                                # Cancel an existing job, where 183 is the job number retrieved by the squeue command.
  • sinfo -o %G,%N,%P                          # Get info on available GPUs, the nodes they are on, and the partition to use.
  • sinfo                                      # Get info on compute resources.
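
A few common variations built from the commands above (the -u, -p, and scontrol options are standard SLURM; 183 is the example job ID used above):

  • squeue -u $USER -l                         # Show only your own jobs.
  • sinfo -p default_gpu                       # Show node states for a single partition.
  • scontrol show job 183                      # Show the full details of a job, including the resources it was granted.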

Jupyter Notebook Information (Tunneling the notebook):

Be sure the anaconda environment is defined in your ~/.bashrc file by adding one of the following lines to it.

  • export PATH=/share/apps/anaconda2/5.2.0/bin:$PATH
  • export PATH=/share/apps/anaconda3/5.2.0/bin:$PATH
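
After editing ~/.bashrc, you can confirm that the intended interpreter is being picked up (a quick check using standard commands):

  • source ~/.bashrc                           # Reload your shell configuration.
  • which python                               # Should point into /share/apps/anaconda2 or /share/apps/anaconda3.
  • python --version                           # Confirm the Python version you expect.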

Starting a Jupyter notebook session

# tunnel NODE port 8881 back to the user’s local machine

  • ssh netid@graphite.coecis.cornell.edu -L8881:NODE:8881

# start an interactive session to NODE

  • srun --pty --nodelist=NODE /bin/bash

# start Jupyter notebook for the first time

  • XDG_RUNTIME_DIR=/tmp jupyter-notebook --ip=0.0.0.0 --port=8881

# Open a browser on the user’s local machine using 127.0.0.1:8881 along with the token provided in the previous step.  For example:

  • http://127.0.0.1:8881/?token=2b2319597a034d0b9e06193aa17f69571a45f539fe69dda5

Note on using Jupyter Notebook Ports

Port 8881 was used in this example; however, users will want to pick a high-numbered port at random for their instance so as not to conflict with other sessions.
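
One convenient way to do this is to generate a random high-numbered port and reuse that value in place of 8881 in both the ssh -L tunnel and the --port option above (a sketch using the standard shuf utility):

PORT=$(shuf -i 20000-60000 -n 1)             # Pick a random high-numbered port
echo $PORT                                   # Note this value and substitute it for 8881 in the steps above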

Graphite CPU Environment

A CPU environment is under development for Graphite.  More information can be found on the Graphite CPU MPI Environment page.