Using the Unicorn Cluster


The Unicorn cluster uses SchedMD's SLURM scheduler on Ubuntu 24.04.

Email the ITSG for assistance.

Connect

Log in to the cluster login node with your Cornell NetID and password, from on campus or over the Cornell VPN.
Use your favorite SSH client (Terminal, VSCode, MobaXterm, etc.):

ssh netid@unicorn-login-01.coecis.cornell.edu

Replace netid with your NetID.
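
If you connect frequently, an entry in your SSH config can shorten the command. A minimal sketch, where the host alias unicorn is just an example name:

# ~/.ssh/config
Host unicorn
    HostName unicorn-login-01.coecis.cornell.edu
    User netid

With this in place, ssh unicorn connects to the login node.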

Login Node Notes:

  • The login node(s) are designed only for logging in and submitting jobs to the scheduler.
  • Please use a quick interactive job for tasks like compiling code or building conda environments (see the sketch after this list).
  • To ensure the login nodes can serve the needs of the community, any process that consumes a large amount of resources is automatically terminated, and the user is notified by email with suggestions on best practices.
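
For example, a short interactive job is usually enough for compiling or building an environment. A minimal sketch; the resource amounts and 30-minute limit are only illustrative:

salloc --cpus-per-task=2 --mem=4g --time=0:30:00   # short interactive shell on a compute node
# ... compile code or build your conda environment here ...
exit                                               # release the allocation when done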

Submit Jobs

Interactive job:

Get an interactive shell:

salloc

This uses the cluster defaults: 1 CPU, 1 GB of RAM, and a 4-hour time limit on default_partition.

You can request additional resources like this:

salloc --mem=5g --gres=gpu:1 --cpus-per-task=2

Disconnect from an interactive session by pressing CTRL-D or typing exit

Scheduled job:

Create a SLURM submission script:

Example: test-gpu.sub

#!/bin/bash
#SBATCH -J test_file                         # Job name
#SBATCH -o test_file_%j.out                  # output file (%j expands to jobID)
#SBATCH -e test_file_%j.err                  # error log file (%j expands to jobID)
#SBATCH -N 1                                 # Total number of nodes requested
#SBATCH -n 1                                 # Total number of cores requested
#SBATCH --cpus-per-task=1                    # Total number of cores requested per task
#SBATCH --get-user-env                       # retrieve the users login environment
#SBATCH --mem=2000                           # server memory requested in MB (per node)
#SBATCH -t 2:00:00                           # Time limit (hh:mm:ss)
#SBATCH --partition=default_partition        # Request partition
#SBATCH --gres=gpu:r6000:1                  # Type/number of GPUs needed

echo "Hello, world! This is the GPU I'm using:"  # The commands or script to run
nvidia-smi -L

Submit the job:

sbatch --requeue test-gpu.sub

 

SBATCH Notes

  • Use --requeue so the job is resubmitted if it is preempted.
  • Do not put blank lines between the top of the script and the last #SBATCH line.
  • Use full paths for scripts (/home/netid/test.sh).

Monitor Jobs

See your active jobs:

squeue --me

To monitor a job more closely, you can log into any node where you have a job running (see the sketch below).
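
For example, the NODELIST column of squeue shows where a running job landed; you can then SSH to that node and inspect it. The node name gpu-node-01 below is only a placeholder:

squeue --me          # note the NODELIST column for your running job
ssh gpu-node-01      # replace with the node name shown by squeue
nvidia-smi           # e.g., check GPU utilization for a GPU job
exit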

Cancel a job

scancel <jobid>

Review Jobs

The above example generates an output file of the form test_file_<jobid>.out with contents:

Hello, world! This is the GPU I'm using:
GPU 0: Quadro RTX 6000 (UUID: GPU-a3f29002-16ac-d69e-185e-bec63d41ed44)

Important SLURM Commands

SchedMD SLURM quickstart

Command                            Purpose
srun                               Interactive job
squeue -l                          List active/pending jobs
scancel                            Cancel a job
sinfo                              Resource information
sinfo -o "%30N %10c %15m %30G"     GPU information

Resource Limits and Defaults

Default resources (if unspecified):

  • 4-hour time limit
  • 1 CPU
  • 1 GB RAM
  • default_partition

Note: Specify the resources your job needs (memory, CPUs, GPUs, partition, and time limit) accurately so the scheduler does not terminate it.
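
As an illustration, the same resource flags work for salloc and as #SBATCH directives in a script; the amounts below are placeholders, not recommendations:

salloc --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=16g --time=8:00:00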

More Information

Directories

  • Home directories: /home/<NETID>, served from your group's NFS server.
  • Shared data: /share/DATASERVER/ (DATASERVER = your NFS server).
  • Shared datasets: /scratch/datasets.
  • Temporary storage for active jobs (see the sketch after this list):
    • Tmp directory: /tmp, automatically monitored and cleared.
    • Scratch directories: local storage on GPU servers (/scratch); users manage cleanup.
  • Disk quotas are in place for some research groups; check usage with:
    • quota -s
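
For example, a job script can stage its working files on local scratch and remove them when it exits. A minimal sketch; the /scratch/$USER layout is an assumption about how that space is organized:

# inside a job script -- assumes /scratch/$USER is writable on the node
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
trap 'rm -rf "$WORKDIR"' EXIT    # clean up the scratch directory when the job ends
cd "$WORKDIR"
# ... run your computation here ...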

Software

  • Slurm v24.11.4 (workload manager/job scheduler)
  • Anaconda3 (Python environment — /share/apps/anaconda3/2022.10)
  • Apptainer v1.4.0 (formerly Singularity; Docker compatibility — /share/apps/apptainer/1.4.0); see the container example after this list
  • OpenMPI 5.0.7 (default MPI capability)
  • CUDA v12.8.1 – cuDNN v8.9.7
  • CUDA v12.0 – cuDNN v8.8.1 (Default CUDA installation)
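
Because Apptainer can run Docker images, a container workflow might look like the sketch below. The PyTorch image is only an example; --nv exposes the host's NVIDIA GPU drivers inside the container:

apptainer pull pytorch.sif docker://pytorch/pytorch:latest
apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"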

Licensed Software

These packages are often installed on specific nodes at the request of a research group:

  • Ansys Fluent
  • Ansys Lumerical
  • Gaussian
  • Gurobi Optimizer v12.0.1
  • MATLAB R2023a

Additional software

For software not listed above, we recommend Anaconda (conda), which lets you install software packages into virtual environments in your home directory for use across the cluster. Search for packages at https://anaconda.org/

Example: creating a conda environment for PyTorch

Run this the first time you use conda:

/share/apps/software/anaconda3/bin/conda init

Follow the prompts, then start a new shell (for example, run bash or log out and back in).

Proceed from here if you have already initialized conda:

conda create -p ~/myenv python=3.12
conda activate ~/myenv
conda install pytorch::pytorch
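
To verify the install, you could activate the environment inside a short GPU job and check that PyTorch sees the GPU. A sketch reusing the salloc flags from earlier:

salloc --gres=gpu:1 --mem=5g
conda activate ~/myenv
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
exit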

Partitions and Priority

  • All jobs, including interactive jobs, must be submitted to a specific partition (or queue).
  • Partitions form a preemption order; please submit your job to the lowest-priority partition that meets your needs (see the example after this list).
  • default_partition (low priority)
    • For batch and interactive jobs requiring CPUs and/or GPUs, use the "default_partition" partition.
    • This partition can send jobs to any node on the cluster.
  • gpu (medium priority, GPU required)
    • For batch and interactive jobs requiring GPUs, use the "gpu" partition.
    • This partition can send jobs to any node with a GPU on the cluster.
    • This partition will preempt jobs on a GPU server that were submitted to the low-priority partition.
  • High priority: for batch and interactive jobs requiring CPUs and/or GPUs, use the priority partition that belongs to your group.
    • Only the servers owned by the faculty to whom this priority partition belongs are available through it.
    • This partition will preempt jobs on those servers that were submitted to the low- or medium-priority partitions.
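
For instance, a GPU job that should not be preempted by low-priority work could go to the gpu partition, and a job on your group's own servers to its high-priority partition; the partition name your-group below is only a placeholder:

sbatch --partition=gpu --requeue test-gpu.sub          # medium priority, GPU nodes only
sbatch --partition=your-group --requeue test-gpu.sub   # high priority, your group's servers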

When you log into a node, you are limited to the resources your original job requested. For example, if your job asked for 1 GPU, that is the only GPU you will be able to see and use on that node. Once the job ends, you will no longer be able to access the node directly.