- Graphite is an Ubuntu 16.04 LTS cluster using a SLURM workload manager running various versions of CUDA.
- In this cluster, each participating group’s nodes are added to both the member queue and that group’s private preemptive queue. All member nodes are available for use by the member community; however, jobs that a group submits to its priority (private) queue immediately preempt other users’ jobs running on that group’s nodes, reclaiming the nodes as needed.
- If you are a member of a research group that has Graphite research nodes, you may request an account via the help-ticket system.
- Please explicitly include your research group information in the ticket.
- CC your PI on the request.
- Users log in with their Cornell NetID credentials.
- Assume that there are no backups of any data. Some research groups may subscribe to EZ-Backup but most do not.
- Home directories are automounted onto /home from the NAS that belongs to your research group.
- Data directories are automounted in /share/DATASERVER/export, where DATASERVER is the NFS server where your research data is stored.
- Scratch directories reside in /scratch.
- This is a local directory on each server and is not shared across the cluster.
- There is a folder named /scratch/datasets that can hold common datasets and is usable by all groups.
- The size of the /scratch directory depends upon the server upon which it resides.
- Anaconda3 (Python 3.8 version)
- CUDA v11.0 || cuDNN v8.0.2
- CUDA v10.2 || cuDNN v7.6.5
- CUDA v10.1 || cuDNN v7.6.5
- CUDA v10.0
- CUDA v9.2
- CUDA v8.0
- Log into the cluster via SSH at graphite-login.coecis.cornell.edu using your Cornell NetID and password.
- All jobs, including interactive jobs, must be submitted to a specific partition (or queue).
- Batch jobs: use the “default_gpu” partition.
- Interactive jobs: use the “interactive” partition.
- High priority jobs: use the priority queue that belongs to your group.
Create a SLURM Submission Script:
Optional entries here can include:
#SBATCH --partition=default_gpu # Request partition for resource allocation
--partition=<queue name> specifies which partition the job should run on, where <queue name> can be:
- default_gpu
- interactive
- <group name>, for example kilian or ramin
#SBATCH --gres=gpu:1 # Specify a list of generic consumable resources (per node)
--gres specifies a list of generic consumable resources (per node):
- --gres=gpu:1080ti:1 means one GPU of type GeForce GTX 1080Ti
- --gres=gpu:2 means two GPUs of any type
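The directives above can be combined into a complete submission script. The following is a minimal sketch, not the cluster's official template: the file name example.sub, the job name, the memory and time values, and the workload script example.sh are all assumptions chosen for illustration.

```shell
# Write a minimal submission script to example.sub (hypothetical file name).
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > example.sub <<'EOF'
#!/bin/bash
#SBATCH -J example_job               # Job name (hypothetical)
#SBATCH -o example_job_%j.out        # Standard output; %j expands to the job ID
#SBATCH -e example_job_%j.err        # Standard error
#SBATCH -N 1                         # Number of nodes
#SBATCH -n 1                         # Number of tasks
#SBATCH --mem=4000                   # Memory per node in MB (illustrative value)
#SBATCH -t 01:00:00                  # Time limit hh:mm:ss (illustrative value)
#SBATCH --partition=default_gpu      # Partition, as described above
#SBATCH --gres=gpu:1                 # One GPU of any type
./example.sh                         # Hypothetical workload script
EOF
```

Adjust the memory, time, and GPU requests to what the job actually needs before using it.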
Create a shell script to be run on the cluster:
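As a rough illustration of such a script, the sketch below only reports where the job landed and which GPUs it can see. The function name report_node is hypothetical, and the assumption that SLURM sets CUDA_VISIBLE_DEVICES for jobs that request GPUs depends on how the cluster is configured.

```shell
#!/bin/bash
# Hypothetical workload: print basic facts about the node the job runs on.
report_node() {
    echo "Running on host: $(uname -n)"

    # CUDA_VISIBLE_DEVICES is typically set by SLURM when GPUs are allocated
    # through --gres; print "none" when it is unset.
    echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"

    # Show GPU details only when nvidia-smi is installed on the node.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,memory.total --format=csv \
            || echo "nvidia-smi could not query the GPUs"
    fi
}

report_node
```

Replace the body of the function with the real workload, for example a call to a Python training script.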
Submit the job:
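Submission is done with sbatch. The sketch below wraps it in a small helper, assuming the submission script is named example.sub (a hypothetical name); the helper submit_job is also illustrative and simply refuses to run where sbatch is not installed.

```shell
# Submit a script with --requeue so the job is resubmitted if it is preempted.
submit_job() {
    script="$1"
    if command -v sbatch >/dev/null 2>&1; then
        sbatch --requeue "$script"
    else
        echo "sbatch not found: run this on the Graphite login node" >&2
        return 1
    fi
}

# Usage on the cluster:
#   submit_job example.sub
```

On success, sbatch prints the job number, which can then be passed to squeue or scancel.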
- When submitting a job to either the “default_gpu” or “interactive” partition, there is a possibility that a job may be preempted.
- Use the switch “--requeue” with the sbatch command, and the job will be resubmitted if it is preempted.
- It is important to tell the scheduler what resources the job will need.
- The scheduler does not necessarily use the numbers given to control the job, but it makes sure that jobs will not be scheduled on nodes that CANNOT support them or that don’t have the resources requested available (if each job accurately requests the resources needed).
- It is also important to tell the application what resources it can use.
- For example, if you do not limit a MATLAB job, it will use every core on every server that it is running on.
- Please either request every core for the job, or tell MATLAB to limit its use.
- The cluster scheduler is currently set up to kill a job that tries to use too much memory (more memory than the job asked for).
- This behavior can be changed, but please be mindful to properly set parameters before scheduling a job.
- srun # When using srun, select the “interactive” partition.
- squeue -l # Get list of your active or pending jobs.
- scancel 183 # Cancel an existing job, where 183 is the job number retrieved by the squeue command.
- sinfo -o %G,%N,%P # Get info on GPUs available, the nodelist they are on and the partition to use.
- sinfo # Get info on compute resources.
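The advice above about telling the application what resources it can use can be sketched as follows. This assumes the job requested cores with --cpus-per-task, in which case SLURM exports SLURM_CPUS_PER_TASK; the helper name set_thread_limit and the MATLAB script name are hypothetical.

```shell
# Derive a thread limit from the SLURM allocation.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset (for example, outside a job).
set_thread_limit() {
    OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
    # Many OpenMP-based numerical libraries honor OMP_NUM_THREADS.
    export OMP_NUM_THREADS
}

set_thread_limit
echo "Limiting to $OMP_NUM_THREADS thread(s)"

# MATLAB can be limited from inside the session with maxNumCompThreads, e.g.
#   matlab -batch "maxNumCompThreads($OMP_NUM_THREADS); my_script"
# where my_script is a hypothetical script name.
```

Placing this near the top of the job script keeps the application's thread count in line with what the scheduler was told.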
Starting a Jupyter notebook session (Tunneling the notebook):
Be sure the Anaconda environment is defined by adding the following line to your ~/.bashrc file or by typing it on the command line before the rest of this process is executed.
On the user’s local machine, create an SSH tunnel between the user’s local machine and NODE, on PORT (where NODE is the name of the server you plan to connect to and PORT is a single unused port number between 8000 and 10000).
ssh email@example.com -LPORT:NODE:PORT
On the graphite login node, start an interactive session to NODE (defined above) using the interactive partition
srun -p interactive --pty --nodelist=NODE /bin/bash
Once logged into NODE (defined above), start jupyter-notebook for the first time (where use_your_netid in /tmp/use_your_netid is replaced by your Cornell NetID)
XDG_RUNTIME_DIR=/tmp/use_your_netid jupyter-notebook --ip=0.0.0.0 --port=PORT
Open a browser on the user’s local machine using the string containing “127.0.0.1:” displayed by the jupyter-notebook command in the previous step. It typically has the form http://127.0.0.1:PORT/?token=..., where the token is generated by Jupyter.
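To keep NODE and PORT consistent across the steps above, the tunnel command can be assembled from variables. This is only a sketch: the NetID abc123 is a placeholder, the helper names check_port and tunnel_cmd are hypothetical, and it assumes the tunnel goes through the login host graphite-login.coecis.cornell.edu named earlier on this page.

```shell
# Placeholders: replace all three values with your own.
NETID="abc123"    # hypothetical Cornell NetID
NODE="NODE"       # compute node name, for example one listed by sinfo
PORT=8765         # any unused port between 8000 and 10000

# Succeeds only when the port is inside the range recommended above.
check_port() {
    [ "$1" -ge 8000 ] && [ "$1" -le 10000 ]
}

# Print the SSH tunnel command to run on the local machine.
tunnel_cmd() {
    echo "ssh ${NETID}@graphite-login.coecis.cornell.edu -L ${PORT}:${NODE}:${PORT}"
}

if check_port "$PORT"; then
    tunnel_cmd
else
    echo "PORT must be between 8000 and 10000" >&2
fi
```

The same PORT value must then be passed to jupyter-notebook with --port on NODE.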
Graphite CPU Environment
A CPU-only environment is available for Graphite. More information can be found here: Graphite CPU MPI Environment