Contents
- 1 Why is my job being killed?
- 2 My job uses /tmp (or /var/tmp) and fills it up.
- 3 How do I allocate memory (or other resources) for the job I am submitting to the cluster?
- 4 Why is my priority access to a compute node not working?
- 5 My job is dying without creating any log files
- 6 Can you help me understand SLURM preemption on G2?
- 7 How many GPUs can I ask for in my job?
- 8 Why won’t my job run on a node that I know has a free GPU?
- 9 How can I request a GPU based on the amount of memory it has?
- 10 How can I request a GPU based on the specific model of GPU?
Why is my job being killed?
- Question: My jobs are being killed with a message saying they are out of memory, but the server still has memory available. Why?
- Question: My job gets killed after 4 hours of running. Why?
- Question: My job fails when it cannot find a GPU. I know that the server that the job is assigned to has GPUs. Why can’t it use the GPUs?
- Answer: By default, a job is assigned 4 hours of runtime, 1 CPU, 1G of memory and no GPUs. If your job tries to use more CPU or memory than that, it will be killed automatically. You must request, when you submit the job, the maximum amount of each resource that your job will need.
- EX: If your job spends most of its runtime using 1G of memory, but occasionally spikes to 5G, you must request 5G of memory at job submission time (see the sketch below).
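- A minimal submission sketch for this example (the job command is a placeholder; the point is to request the peak usage, not the average):
#!/bin/bash
#SBATCH --mem=5120      # 5G requested in MB, sized for the occasional spike, not the 1G average
#SBATCH -t 8:0:0        # also extend the runtime if the default 4 hours is not enough
python my_program.py    # placeholder for your actual workload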
My job uses /tmp (or /var/tmp) and fills it up.
- Question: My job uses /tmp (or /var/tmp) and fills it up. This causes my job to fail, and the failed job does not clean up /tmp.
- Answer: Create a tmpdir on your /share directory and have your program use it instead of /tmp. You can define it in a couple of ways:
export TMPDIR=/home/NetID/test
OR
TMPDIR=/home/NetID/app-name/tmp start_my_app
To test that the new location is being used, use the command "mktemp -u".
EX:
root@slurmctld:~# mktemp -u
/tmp/tmp.sT4cVqCbTb
root@slurmctld:~# TMPDIR=/home/maw349/app-name/tmp mktemp -u
/home/maw349/app-name/tmp/tmp.Kf0cHlM955
root@slurmctld:~# mktemp -u
/tmp/tmp.OoPLxTHWf0
root@slurmctld:~# export TMPDIR=/home/maw349/test
root@slurmctld:~# mktemp -u
/home/maw349/test/tmp.eci5c6xSZt
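The same idea inside a job submission script might look like the sketch below; the parent directory (here /home/NetID/tmp) is a placeholder and must already exist, and mktemp -d gives each job its own scratch directory that is removed when the job exits:
#!/bin/bash
#SBATCH -t 4:0:0
export TMPDIR=$(mktemp -d /home/NetID/tmp/job.XXXXXX)   # per-job scratch space in your own storage
trap 'rm -rf "$TMPDIR"' EXIT                            # clean up even if the job fails
start_my_app                                            # placeholder for your actual workload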
How do I allocate memory (or other resources) for the job I am submitting to the cluster?
- Question: I’d like to know how to claim runtime, memory, CPUs or GPUs on the G2 cluster. For example, if my program uses 16GB of memory at peak, should I claim "--mem=16000"?
- Answer: When submitting a job, memory is defined in MB, not GB. Therefore, if you know that your job's memory usage will top out at 16GB, you should request "--mem=16384" (16 x 1024). To request more than one CPU, use "-c 5" (for 5 CPUs), and if you need to use a GPU, use "--gres=gpu:1" (for one GPU). To increase the amount of time that your job can run, use "-t 8:0:0" for 8 hours.
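- Put together in a job submission script, the switches above might look like this sketch (the job command is a placeholder):
#!/bin/bash
#SBATCH -t 8:0:0        # 8 hours of runtime
#SBATCH --mem=16384     # 16GB, specified in MB
#SBATCH -c 5            # 5 CPUs
#SBATCH --gres=gpu:1    # one GPU
python train.py         # placeholder for your actual workload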
Why is my priority access to a compute node not working?
- Question: I know that I have priority access to my group's compute node. No one else in our group has any jobs running, but I cannot get my job to run on our compute node. Why?
- Question: I know that I have priority access to my group's compute node and I am submitting my job to the priority partition, but my job is stuck pending. Why?
- Answer: To get priority access to your compute node, you must submit your job to your priority partition, not the “gpu” or “default_partition” partitions. This is done with the “-p” switch. If you are submitting to the appropriate priority partition and your job is still pending, it may be because others in your group are using the node or it may be that you are asking for more resources than are available. To find out what resources others in your group are using on your priority compute node, get in touch with us. We will set it up so that you can see what others in your group are using.
- EX: If you are in Prof. Garg’s group, you should use the switch “-p garg” in your job submission script.
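- For example, an interactive session on your group's node could be requested with something like the following (the partition name "garg" is just the example above; substitute your own group's partition):
srun -p garg --gres=gpu:1 --mem=8192 -c 2 --pty bash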
My job is dying without creating any log files
- Question: I submitted a job and received a job-id. However, the job never ran and created no log files, even though I configured the job submission script to create them.
- Answer 1:
- You must have write permission on the directory where the log files will be created, and the directory must already exist.
- EX: If you give a path of "output/%N.o%j", the directory "output" must already exist and you must have permission to create files in it.
- Answer 2:
- The first line of the job submission script must be a line containing only "#!/bin/bash" (or whichever shell you want to use). It must be followed immediately by all of the "#SBATCH" lines, with no blank lines until all of the "#SBATCH" lines have been completed.
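- Putting both answers together, a submission that reliably produces its log files might look like this sketch (the "output" directory and the job command are placeholders):
mkdir -p output                 # the log directory must exist before the job starts
Then submit a script of the form:
#!/bin/bash
#SBATCH -o output/%N.o%j        # stdout log; no blank lines between the #SBATCH lines
#SBATCH -e output/%N.e%j        # stderr log
#SBATCH -t 4:0:0
echo "running on $(hostname)"   # placeholder for your actual workload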
Can you help me understand SLURM preemption on G2?
- There are three priority levels of partitions (see below): the default_partition, gpu, and high-priority partitions. Each compute-node owner has a high-priority partition. The fewer resources your job needs, the more likely it is to be scheduled on a node and the less likely it is to be preempted. Jobs submitted to a priority partition cannot preempt other jobs already submitted to that priority partition. You can only submit jobs to the priority partition of the faculty member that you are associated with.
- All jobs, including interactive jobs, must be submitted to a specific partition (or queue).
- Preemption order. Please submit your job to the lowest priority that is needed.
- Low-Priority: For batch and interactive jobs requiring CPUs and/or GPUs, use the “default_partition” partition.
- Medium-priority, GPU required: For batch and interactive jobs requiring GPUs, use the “gpu” partition.
- This partition will preempt any jobs running on a GPU server which were submitted to the Low-Priority partition.
- High-priority: For batch and interactive jobs requiring CPUs and/or GPUs, use the priority queue that belongs to your group.
- Only the servers owned by the faculty member to whom this priority partition belongs are available through this partition.
- This partition will preempt any jobs running on any server (owned by faculty to which it belongs) which were submitted to the Low/Medium-Priority partitions.
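- As an illustration, the same batch job could be sent to each level like this (my_job.sh and the "garg" partition are placeholders; use your own group's partition name):
sbatch -p default_partition my_job.sh      # low priority, may be preempted by the gpu and priority partitions
sbatch -p gpu --gres=gpu:1 my_job.sh       # medium priority, for jobs that need a GPU
sbatch -p garg --gres=gpu:1 my_job.sh      # high priority, runs only on your group's own nodes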
How many GPUs can I ask for in my job?
- Unless you are using MPI, your job cannot ask for more GPUs than any one server in the cluster has (which is 8). When you submit a job asking for just one GPU, your job can run on any server that has enough resources (CPU and memory) plus one free GPU. This gives you a potential pool of 60+ servers that your job can run on, and frequently many of those servers have one GPU available. When your job asks for more resources, such as 4 GPUs, it can run on any server that has enough resources plus 4 free GPUs. While this is still a potential pool of 50+ servers (not all servers have 4 physical GPUs), only a very small percentage of the servers will actually have 4 GPUs unallocated, so the chances of your job running in a timely manner are much smaller than if you submitted 4 separate jobs, each asking for one GPU (see the sketch below). If you ask for even more GPUs (say 6 or 8), the pool of available servers is reduced to 25+, as only a small percentage of servers have enough GPUs to service your request, and very few of them will have that many GPUs available in a timely manner. In short, the more resources you request per job, the less likely it is that your job will be able to run right away. If you ask for the maximum resources available on any one compute node, it may take weeks before your job runs. If you ask for more resources than are available on any compute node, the job will not even be submitted.
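- If your workload can be split, submitting several single-GPU jobs instead of one multi-GPU job usually starts much sooner. A minimal sketch (the script name and shard argument are placeholders):
for i in 1 2 3 4; do
  sbatch --gres=gpu:1 --mem=16384 -c 4 train_shard.sh $i   # one single-GPU job per shard
done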
Why won’t my job run on a node that I know has a free GPU?
- For any job to be assigned to a compute node, the compute node must have at least as many GPUs, CPUs and GB of memory available as your job requests. The scheduler does not know how much of any of these resources is actually in use; it does its scheduling by taking what each job asks for and looking for compute nodes that have at least that much resource unscheduled. If a compute node does not have any one of these resources unscheduled in sufficient quantity, the job will not be scheduled to that compute node (see the sketch below for one way to check). For example: if you ask for a specific node with 6 CPUs, 8 GPUs and 240GB of memory, and the compute node has 8 CPUs, 8 GPUs and 220GB of memory, the job will not be assigned to that node even though it has enough CPUs and GPUs, because it does not have enough memory.
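- You can compare a node's configured resources with what is already allocated using scontrol; the CfgTRES and AllocTRES lines list CPUs, memory and GPUs:
NODE_NAME=some-compute-node                  # placeholder; set this to the node you are asking about
scontrol show node $NODE_NAME | grep -i TRES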
How can I request a GPU based on the amount of memory it has?
- G2 has 3 different categories (gpu classes) for GPU memory sizes. Insert these into the “srun/sbatch” commands as demonstrated below.
- Please utilize the gpu class with the lowest amount of GPU-memory that your job requires.
- If you don’t care which type of gpu that it runs on, don’t add this switch.
- Multiple gpu classes can be added with the “|” separator.
- It is better to not use the constraint switch than to request all three gpu classes in the same job (same result, less overhead).
- gpu-low # each GPU has less than 20G of memory.
- These servers are automatically the first requested when submitting to the default partitions.
- gpu-mid # each GPU has between 20G and 39G of memory
- gpu-high # each GPU has 40G or more of memory
- EX: Add the following switch to your:
- srun command
- --constraint="gpu-low"
- sbatch job submission script.
- #SBATCH --constraint="gpu-low|gpu-mid"
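- A full request combining a GPU class with the other switches might look like this sketch (the partition and script name are placeholders); note that "--constraint" only restricts which nodes are eligible, while "--gres" is still what actually allocates the GPU:
sbatch -p gpu --gres=gpu:1 --constraint="gpu-mid|gpu-high" --mem=16384 -c 4 my_job.sh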
How can I request a GPU based on the specific model of GPU?
- Each GPU model has its own designation.
- Insert the appropriate model name into your "srun/sbatch" command (as demonstrated below) to request that your job be scheduled on a node, within the requested partition, that contains the requested GPU model.
- Multiple gpu models can be added with the “|” separator.
- titanx # 12G memory
- titanxp # 12G memory
- 1080ti # 11G memory
- 2080ti # 11G memory
- titanrtx # 24G memory
- r6000 # 22G memory
- 3090 # 24G memory
- a5000 # 24G memory
- a40 # 48G memory
- a6000 # 48G memory
- a100-40 # 40G memory
- EX: Add the following switch to your:
- srun command
- --constraint="titanx"
- sbatch job submission script.
- #SBATCH --constraint="titanx|titanxp|1080ti|2080ti"
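- To see which model tags (features) each node advertises, a standard sinfo query such as the following can help (output will vary):
sinfo -o "%N %G %f"    # node names, GPUs (gres) and feature tags such as titanx or 3090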