Using the Magma Compute Environment

Introduction

  • The Magma cluster is built with Rocks (v7.0) and the Slurm (v19.05.2) scheduler, running on the CentOS 7.x operating system. The cluster currently has ~500 compute cores (CPUs) in total.
  • In this cluster, each participating group’s nodes belong to both the member (default) partition/queue and a private preemptive partition/queue. All member nodes are available to the whole member community via the default partition/queue; however, a group’s jobs have immediate preemptive priority on that group’s own nodes: submitting to the group’s priority (private) partition/queue reclaims those nodes if others are using them.
  • If you are a member of a research group that has MAGMA research nodes, you may request an account via the help-ticket system. Please explicitly include your research group information and cc your PI with the request.
  • Users log into magma-login.coecis.cornell.edu with their Cornell NetID credentials.
  • Please assume that there are no backups of any data.
  • All jobs are submitted via Slurm, a specialized workload management system for compute-intensive jobs. This is a batch system: users submit batch and interactive jobs to the default, interactive, and priority partitions/queues. Interactive jobs reserve resources just as batch jobs do and are governed by the same scheduling algorithms.

User Access

File Systems

  • Users’ home directories may live on project or shared NAS appliances. These file systems are shared via NFS and are NOT backed up. Home directories and select data directories are mounted on all compute nodes.
  • Local compute-node file systems accessible to users include /tmp and /state/partition1. /tmp (size depends on the disk available on each node) is for temporary run-time data and is cleaned out after a reboot or once data becomes older than ~30 days.
  • /share/apps is where applications are installed. /share/apps is available on all the compute nodes.
  • Research groups may also have a shared directory available as /share/research_group_name.

Where is MATLAB and how do I use it on compute nodes?

  • MATLAB is installed on the login node in /opt/matlab and is not directly available on the compute nodes. Our recommendation and best practice for MATLAB cluster use is to employ the MATLAB Compiler (http://www.mathworks.com/products/compiler/). This allows users to build standalone MATLAB executables and later run them without a local MATLAB installation, which suits a cluster environment where installing MATLAB on every node is prohibitively expensive. The MATLAB Compiler is included as one of the standard toolboxes that Cornell purchases.
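As a sketch (the script name my_analysis.m is hypothetical; paths assume the /opt/matlab install noted above), compiling on the login node and running the result might look like:

```shell
# Compile on the login node, where MATLAB is installed.
# my_analysis.m is a hypothetical example script.
/opt/matlab/bin/mcc -m my_analysis.m

# mcc generates a run_my_analysis.sh wrapper; run the compiled program
# (e.g., from a Slurm batch script) by passing it the MATLAB install root.
./run_my_analysis.sh /opt/matlab
```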

Can I get Scipy (or some other tool) installed?

  • Anaconda3 is available in /share/apps/anaconda/anaconda3/. Use Anaconda to build the desired Python environment and packages in your home directory.
  • For other packages that would help you with your research and are available via academic licensing, open a help-ticket and we’ll work on getting it installed. We can potentially install commercial software if a license is purchased and honored.
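For example, a sketch using the Anaconda path above (the environment name scipy-env is an arbitrary choice):

```shell
# Put the cluster's Anaconda install on your PATH.
export PATH=/share/apps/anaconda/anaconda3/bin:$PATH

# Create a personal environment (stored under your home directory) with SciPy.
conda create --yes --name scipy-env scipy

# Activate it and verify the install.
source activate scipy-env
python -c "import scipy; print(scipy.__version__)"
```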

Usage

Logging into the cluster

  • To request an account:
    • If you are a member of a research group that has MAGMA research nodes, you may request an account via the help-ticket system.
      • Please explicitly include your research group information in the ticket.
      • cc your PI on the request.
    • If the PI does not own any equipment in the cluster, a decision will be made after reviewing why you want the account, whether the software you require is available, and how long you need access.
  • Connect to the login node.
    • Use ssh and connect to magma-login.coecis.cornell.edu
    • Users log in with their Cornell NetID credentials
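For example (replace netid with your own Cornell NetID):

```shell
ssh netid@magma-login.coecis.cornell.edu
```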


Submitting a job

Note that jobs need to be submitted from the login node via a job submission script.

The submission script can call a system file or a shell script written by you.

EX: job_submission.sub

************************************************

#!/bin/bash

#SBATCH -J host # Job name

#SBATCH -o host.o%j # Name of stdout output file(%j expands to jobId)

#SBATCH -e host.e%j # Name of stderr output file(%j expands to jobId)

#SBATCH -n 1 # Total number of cores requested

#SBATCH --get-user-env # tells sbatch to retrieve the user's login environment.

#SBATCH --mem=1000 # Requested amount of memory in MB; the maximum your job will need.

#SBATCH --partition=default_cpu # Which partition/queue it should run on.

/bin/hostname # Name of compiled executable/shell script.

************************************************


EX: job_submission_MPI_job.sub

************************************************

#!/bin/bash

#SBATCH -J hello_world # Job name

#SBATCH -o hello_world.o%j # Name of stdout output file(%j expands to jobId)

#SBATCH -e hello_world.e%j # Name of stderr output file(%j expands to jobId)

#SBATCH --nodes=9 # Total number of nodes requested

#SBATCH --ntasks=9 # Total number of tasks to be configured for.

#SBATCH --cpus-per-task=1 # Sets number of CPUs needed by each task.

#SBATCH --ntasks-per-node=1 # Sets number of tasks to run on each node.

#SBATCH --get-user-env # tells sbatch to retrieve the user's login environment.

#SBATCH -t 00:10:00 # Run time (hh:mm:ss)

#SBATCH --mem-per-cpu=1000 # Memory in MB required per allocated CPU.

#SBATCH --partition=default_cpu # Which partition/queue it should run on.

./job_submission_MPI_script.sh # Executable script that launches the MPI program (see below).

************************************************


EX: job_submission_MPI_script.sh (this file must be executable e.g. permissions of 700 minimum)

************************************************

#!/bin/bash

/opt/openmpi/bin/mpirun -np 9 /home/maw349/mpi_hello_world

************************************************
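If you need to build the example MPI program itself, a sketch (assuming the Open MPI install implied by the mpirun path above, and a hypothetical source file mpi_hello_world.c):

```shell
# Compile the MPI program with the cluster's Open MPI compiler wrapper.
/opt/openmpi/bin/mpicc -o mpi_hello_world mpi_hello_world.c
```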


Submit a job with the following:

$ sbatch --requeue job_submission_MPI_job.sub

Check if jobs are submitted by executing:

$ squeue -l

The output should be similar to:

[maw349@en-cluster07 mpi]$ squeue -l

Tue Mar 3 12:03:20 2020

JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)

852 default_c hello_wo maw349 RUNNING 0:01 10:00 9 coecis-3-[2-8],csl-10-[2-3]

Cancelling a job

To cancel or “remove” a job, use the scancel command with a job ID number. To cancel all of your jobs, use scancel with “-u NetID”, where NetID is your Cornell NetID. For example, to cancel the job submitted in the example above:

[user@c4 ~]$ scancel 852


Design

The design of the magma cluster attempts to strike a balance between a private and shared compute cluster. The system uses Rocks on top of CentOS 7 with the Slurm scheduler.

The cluster consists of multiple groups of machines, each purchased by a research group, and a group of machines donated by COECIS. Every user in the cluster is affiliated with one of these groups. “Unaffiliated” users (i.e., those that haven’t purchased machines themselves) are actually affiliated with the COECIS group.

The Slurm scheduler allows us to configure the system to prioritize jobs on a machine based on the group that owns those machines and the user that is submitting a job. The scheduler may preempt a job in order to give a user access to their own machines. This is the “balance”: normally users can use machines in other groups, but if resources are strained, users are guaranteed access to their own group’s machines.

Normally, when the cluster is not fully utilized, a user could use their own machines and machines in other groups including the COECIS group, thereby enabling more jobs to be run and/or making their job run faster (presumably).

Because jobs may be preempted, it is strongly recommended that you use the --requeue switch on the sbatch submission command and make your jobs checkpoint their progress. A preempted job (if it was submitted with --requeue) is stopped and rescheduled to start again, probably on another machine; checkpoints let the requeued job pick up where it left off instead of losing progress.
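As a minimal sketch of checkpoint-friendly job logic (the file name checkpoint.txt and the loop are hypothetical; real jobs would use their application's own checkpoint mechanism):

```shell
#!/bin/bash
# Hypothetical sketch: save progress so a requeued job can resume.
CKPT=checkpoint.txt

# Resume from the last completed unit of work, if a checkpoint exists.
i=0
[ -f "$CKPT" ] && i=$(cat "$CKPT")

# If Slurm sends SIGTERM before preemption, record progress and exit.
trap 'echo "$i" > "$CKPT"; exit 143' TERM

while [ "$i" -lt 5 ]; do
    # ... one unit of real work would go here ...
    i=$((i + 1))
    echo "$i" > "$CKPT"   # checkpoint after each completed unit
done
```

When the job is requeued, the same script reads the checkpoint file and skips work that is already done.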

Preemption will only happen if jobs are submitted to a group’s priority partition/queue while there are jobs from the default/interactive partitions on that group’s compute nodes. Otherwise the system attempts to spread the load around. For example:

Let’s assume the following:

  • Bob is in research group Foo
  • Jane is in the COECIS group (her group has not donated any machines).
  • Jane is running a job on a bunch of machines, including some of the machines owned by group Foo. She used the --requeue switch when submitting her jobs.
  • Bob submits jobs to the priority partition for group Foo.

Since Bob submits a new job to his group’s priority partition, Jane’s job running on machines owned by group Foo may be preempted, killed, and rescheduled if there are not enough resources for Bob’s job to run. Bob’s job would run on his group’s hardware, while Jane’s job would be put back into the queue.


Additional Documentation

Rolls – Rocks (http://www.rocksclusters.org/) installs packages (groups of related RPMs) and Rocks-specific configuration via a ‘Rolls’ mechanism. The Rolls documentation tends to be informative and Rocks-centric, but not always rigorous.

Slurm (19.05.2) – Rolls documentation for Slurm is brief. See the Users’ Manual section for user or job topics. In particular: