Using the Magma Compute Environment

Introduction

  • magma.coecis.cornell.edu is a Rocks cluster (v6.1.1) using the HTCondor (v8.0.6) scheduler, running on Red Hat 6.6.  There are currently ~532 CPUs total in the magma cluster.
  • In this cluster, participating groups’ nodes are added to both member and private preemptive queues.  All participating member nodes are available for use by the member community; however, an individual group’s jobs have immediate preemptive priority on that group’s own nodes, reclaiming them into the group’s private queue as needed if they are in use by others.
  • If you are a member of a research group that has MAGMA research nodes, you may request an account via the help-ticket system.  Please explicitly include your research group information and cc your PI on the request.
  • Users log in with their Cornell net-id credentials.
  • Please assume that there are no backups of any data.  Some research groups may subscribe to EZBackup, but most do not.
  • All jobs are submitted via HTCondor, a specialized workload management system for compute-intensive jobs.  This is a batch system and there are NO interactive queues.  Interactive jobs would compete with scheduled jobs on the compute nodes, resulting in node paging and/or crashes.

User Access

File Systems

  • Users’ home directories may live on project or shared NAS appliances.  These file systems are shared via NFS and are NOT backed up.  Jobs that have ‘expensive’ I/O should limit read/write traffic to the local compute-node file systems when possible.  Condor has built-in mechanisms to make this easier (should_transfer_files, when_to_transfer_output, etc.); see the example submit file after this list.
  • Local compute-node file systems accessible to users include /tmp and /state/partition1.  /tmp (100GB/node) is for temporary run-time data and is cleaned out after a reboot or once the data becomes older than ~30 days.  /state/partition1 is a place to store data that is used more than once; it will survive most reboots unless the whole node, including the file system, needs to be rebuilt.  /state/partition1 on the compute nodes is often larger than 500GB.  As this is a shared file system, clean up after use and write data into subdirectories, not the top level of the file system.
  • /share/apps is where applications are installed, and it is available on all the compute nodes.  MATLAB is the exception: it is installed in /usr/local/MATLAB on the head node only, as the cost of licensing it on every node (approximately $36,000/year for 200 licenses) is prohibitive.
  • Research groups may also have a shared directory available as /share/research_group_name.
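
For jobs with heavy I/O, Condor can stage files to and from the compute node for you.  Below is a minimal sketch of a submit file using that mechanism; the executable my_job.sh and input file input.dat are placeholder names, not files that ship with the cluster.

# Run against the compute node's local disk instead of doing heavy I/O over NFS
universe                = vanilla
executable              = my_job.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
output                  = my_job.out
error                   = my_job.err
log                     = my_job.log
queue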

Where is MATLAB and how do I use it on compute nodes?

  • MATLAB is installed on the head node at en-cluster03:/usr/local/MATLAB and is not available directly on the compute nodes.  Our recommendation and best practice for MATLAB use on the cluster is to employ the MATLAB Compiler (http://www.mathworks.com/products/compiler/).  This lets users build standalone MATLAB executables and later run them without having MATLAB installed locally, which suits a cluster environment where installing MATLAB on every node is prohibitively expensive.  The MATLAB Compiler is included as one of the standard toolboxes that Cornell purchases.
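
The workflow, roughly, is as follows (a sketch assuming a script named hello.m and MATLAB’s bin directory on your PATH on the head node; adjust paths as needed):

$ mcc -m hello.m                     # produces the executable hello plus a run_hello.sh wrapper
$ ./run_hello.sh /usr/local/MATLAB   # quick local test; the first argument is the MATLAB root

The compiled executable can then be listed as the executable in a Condor submit file.  At run time it needs only the freely redistributable MATLAB Compiler Runtime, not a MATLAB license.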

Can I get Scipy (or some other tool) installed?

  • Python-based tools are installed in /share/apps/epd.  EPD (the Enthought Python Distribution) is a commercial implementation of Python with various packages from Enthought.  If there is another package that would help with your research and is available via academic licensing, open a help-ticket and we’ll work on getting it installed.  We can also install commercial software; the process is the same: open a help-ticket so that we can work on your request.
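
For example, to check which SciPy version the Enthought Python provides (the bin/python path under /share/apps/epd is an assumption; adjust to the actual layout):

$ /share/apps/epd/bin/python -c 'import scipy; print scipy.__version__'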

Usage

Job files

Copy over test files into a new directory.

$ mkdir test

$ cd test

$ cp /opt/condor/tests/hello.* .
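
For reference, a Condor submit file such as hello.sub typically looks something like the following sketch (the actual file in /opt/condor/tests may differ):

universe   = vanilla
executable = hello.sh
output     = hello.out
error      = hello.err
log        = hello.log
queue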

Submitting a job

Note that jobs need to be submitted from the head node.  Submit a job with the following:

$ condor_submit hello.sub

Check that your job was submitted by executing:

$ condor_q

The output should be similar to:

[mjb43@en-cluster03 test]$ condor_q

-- Submitter: en-cluster03.coecis.cornell.edu : <132.236.91.36:46424> : en-cluster03.coecis.cornell.edu
ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1162.0   mjb43           5/13 15:44   0+00:00:00 I  0   0.0  hello.sh

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

An R in the status column (ST) means the job is running; an I means it is idle.

Cancelling a job

To cancel or “remove” a job, you can use the condor_rm command and a Job ID number. For example:

[user@c4 ~]$ condor_q

-- Submitter: c4.coecis.cornell.edu : <132.236.91.x:45110> : c4.coecis.cornell.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1167.0   user            7/17 12:58   0+00:00:00 I  0   0.0  hello.sh
1168.0   user            7/17 12:58   0+00:00:00 I  0   0.0  hello.sh
1171.0   user            7/17 12:58   0+00:00:00 I  0   0.0  hello.sh
1172.0   user            7/17 13:00   0+00:00:00 I  0   0.0  hello.sh

4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended
[user@c4 ~]$ condor_rm 1167.0
Job 1167.0 marked for removal
[user@c4 ~]$ condor_rm 1168.0
Job 1168.0 marked for removal
[user@c4 ~]$ condor_rm 1171.0
Job 1171.0 marked for removal
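
condor_rm can also remove jobs in bulk: passing a bare cluster ID removes every job in that cluster, and passing a username removes all of that user’s jobs.  For example:

$ condor_rm 1172        # remove all jobs in cluster 1172
$ condor_rm user        # remove all jobs owned by user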

Restricting a job to run on certain hosts

If you want or need to run your job only on certain hosts (e.g., if your job takes a long time and cannot tolerate being preempted or cancelled), you may do so in two ways. Both involve modifying your job submission file:

  1. You can cause the job to run only on the nodes belonging to a certain group. You do this by adding Requirements = TARGET.C4_GROUP == "group name" to your job submission file.
  2. You can cause the job to prefer nodes belonging to a certain group. You do this by adding Rank = TARGET.C4_GROUP == "group name" to your job submission file.

In both of the above methods, you use a C4 group’s name to indicate which group of nodes is required or preferred, as shown in the sketch below. Keep in mind that if the cluster is full, your jobs may be preempted if they are running on a different group’s nodes.
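
For example, a submit file restricted to (or merely preferring) a hypothetical group named "foo" would include one of the following lines:

# Hard requirement: run only on nodes owned by group foo
Requirements = TARGET.C4_GROUP == "foo"

# Soft preference: rank group foo's nodes above all others
Rank = TARGET.C4_GROUP == "foo"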

Viewing status

Other Useful Condor Commands:

Display information about jobs in the queue.

$ condor_q -pool en-cluster03

Checking the status of the machines in the pool:

$ condor_status

Checking the history of jobs that have run:

$ condor_history

Checking to see which compute node a job is running on:

$ condor_status -run
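
Checking why a particular job is still idle, i.e. which machines match its requirements (the job ID here is just an example):

$ condor_q -analyze 1162.0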

See the Additional Documentation section at the end for more detail.

Design

The design of the magma cluster attempts to strike a balance between a private and shared compute cluster. The system uses Rocks on top of Red Hat Enterprise Linux with the HTCondor scheduler.

The cluster consists of multiple groups of machines, each purchased by a research group, and one group of machines donated by COECIS. Every user in the cluster is affiliated with one of these groups. “Unaffiliated” users (i.e., those that haven’t purchased machines themselves) are actually affiliated with the COECIS group.

The HTCondor scheduler allows us to configure the system to prioritize jobs on a machine based on the group that owns the machine and the user submitting the job. If the cluster is full, the scheduler may preempt a job in order to give a user access to their own group’s machines (including unaffiliated users on the COECIS machines). This is the “balance”: normally users can use machines in other groups, but if resources are strained, users are guaranteed access to their own group’s machines.

When the cluster is not fully utilized, a user can use their own machines as well as machines in other groups, including the COECIS group, thereby (presumably) making their job finish faster.

Because jobs may be preempted, it is strongly recommended to make your jobs checkpoint, if possible, so that they can be preempted without losing progress. A preempted job is stopped and rescheduled to start again (probably on another machine). Checkpoints save your progress so the job can pick up where it left off.
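
One option in this version of HTCondor is the standard universe, which checkpoints eligible programs transparently if they are relinked with condor_compile (a sketch; restrictions apply, e.g. multi-threaded programs are not eligible):

$ condor_compile gcc -o myprog myprog.c   # relink for transparent checkpointing

Then use universe = standard instead of universe = vanilla in the submit file. Applications that cannot use the standard universe can instead write their own periodic checkpoint files and restart from them.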

Preemption only happens when the cluster is full; otherwise the system attempts to spread the load around. For example:

Let’s assume the following:

  • Bob is in research group Foo
  • Jane is in the COECIS group (her group has not donated any machines).
  • Jane is running a job on a bunch of machines, including some of the machines owned by group Foo.
  • The cluster is 100% used.

Since the cluster is full, if Bob submits a new job, Jane’s job may be preempted, killed, and rescheduled. Bob’s job would run on his group’s hardware, while Jane’s job would be put back into the queue.

Please note that the opposite may also happen: if Bob were running a job on the COECIS machines, his job may get preempted if Jane submits a new job while the cluster is full.

Additional Documentation

Rolls – Rocks installs packages (groups of related RPMs) and Rocks-specific configuration via a ‘Rolls’ mechanism.  The Rolls documentation tends to be informative and Rocks-centric, but not always rigorous.

HTCondor (7.8) – The Rolls documentation for HTCondor is brief.  See the Users’ Manual section for user and job topics.  In particular: