
Accounts and logging in

How do I request an O2 account?

The prerequisite for obtaining an O2 account is an HMS eCommons account, as O2 uses eCommons credentials for cluster authentication. If you do not already have one, you can request an eCommons account at https://ecommons.med.harvard.edu. You can then request an O2 account using the "Account Request" form available at: https://rc.hms.harvard.edu/#cluster. You will receive an email notification once we have created your account.

I have an O2 account. How do I login to the O2 cluster?

You can connect to O2 using ssh (secure shell) at the hostname: o2.hms.harvard.edu. If you're on Linux or Mac, you can use the native terminal application. If you're on Windows, you will need to install a program to connect to O2; we recommend MobaXterm. In either terminal or MobaXterm, type the following command:

ssh yourecommons@o2.hms.harvard.edu

where yourecommons is your eCommons ID (something like js123 if your name is John Smith). Make sure your eCommons ID is in lowercase. You will be prompted for your eCommons password. Once you authenticate, you'll be on one of the O2 login servers.

For more details on how to login to the cluster, please reference this wiki page.

I can't login to O2!

All cluster logins from outside the HMS network require two-factor authentication. For more details, please reference Two Factor Authentication on O2 and Two Factor Authentication FAQ. Please contact us if you are having trouble with two factor authentication on O2. 

Please do NOT send us or anyone else your password. Ever. We can assist you without knowing your password, and sharing accounts on the cluster is prohibited by Harvard security policy.

If you're having difficulty logging in to O2, make sure you're using your eCommons ID (in lowercase) and eCommons password. If that does not resolve the problem, then try logging in to https://ecommons.med.harvard.edu. Contact the IT Service Desk (itservicedesk@hms.harvard.edu, or 617-432-2000) if you're unable to login to the eCommons website. Your eCommons may have been locked due to too many failed login attempts to the O2 cluster. Once you are able to login to eCommons, wait 1 hour and try logging in to O2 again. If you're still facing problems, then send in a ticket to us.

Files, Storage, Quotas

Where can I put my data?

There are several different filesystems that each researcher has access to. See Filesystems, which starts with a basic rundown of where to put each kind of data.

Are my files automatically backed up?

It depends. See Filesystems. Temporary filesystems (like the scratch filesystem in /n/scratch3, or /tmp, which is a hard drive on individual compute nodes) are not backed up, and are occasionally purged of data. We strongly encourage you to use a backed-up filesystem for important data. Don't store the only copy of your data on your desktop unless it is reliably backed up. 

Help! I deleted a file/directory/thesis!

See the Restoring backups sections of Filesystems. As that section describes, IF the data was on a backed-up filesystem, you can actually restore the data yourself. If you run into trouble, contact Research Computing, and we'll do our best to help you. We strongly encourage you to use a backed-up filesystem for important data. For example, even Research Computing has no way to restore deleted data on the scratch3 filesystem.

How do I get data to/from O2?

See File Transfer.

How much can I store on O2?

It depends. See Filesystem Quotas.

Starting Jobs

I just want to run a job!

If you want to run a program called Analyze that you would run like this from the command line:

Analyze -i input.fasta -o output.txt

then to run it on the cluster you would need to create an sbatch script to submit it as a batch job.

For example, an sbatch script called analyze.sh contains:

#!/bin/bash
#SBATCH -p priority
#SBATCH -t 0-1
#SBATCH -o analyze.%j.out
#SBATCH --mem 2G 
#SBATCH -c 1

Analyze -i input.fasta -o output.txt

The job will be run in the priority partition for one hour, using 1 core, and 2G of memory. The output for the job will go to a filename called analyze.%j.out, where %j will be replaced with the job ID.

Note: A job gets 1 GB memory if you don't explicitly ask for more (or less), and 1 GB is plenty for many applications. Your job will start faster the less memory you ask for. So only ask for extra memory if you need it – i.e., if you run a job that dies with an error that it went over the memory limit.

You submit the sbatch script to the Slurm scheduler by:

sbatch analyze.sh

Please reference the Using Slurm Basic page for a longer introduction to what it means to submit jobs to O2.
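After submitting, you can check on your job with the standard Slurm squeue command (a brief sketch; the job ID shown is just an example):

```shell
# List all of your own jobs (ST column: R = running, PD = pending):
squeue -u $USER

# Show the status of one specific job (job ID is an example):
squeue -j 12345678
```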


If you want to debug or compile code, where you'll need to run a bunch of different programs one at a time, the fastest way to get started is to request an interactive job:

$ srun --pty -p interactive -t 0-2 --mem 2G bash

This will start an interactive job with a two hour time limit. From here, you can compile applications or run programs. Note that there are still limits on the amount of CPU or memory resources available to you. Your job will be limited to the actual number of core(s) you request; unless specified, a job will be allocated 1 core by default. Additionally, your job will be killed if you try to use more memory than what you have requested.

How do I choose which partition to run in?

See How to choose a partition in O2.

There are thousands of jobs PENDING (or PD) in a partition. Will my job take forever to start?

Probably not, though the dispatch time of the job depends on the job priority, the resources you've requested, and the current availability of cluster resources. For example, if the cluster is very busy and you need 250 GB of RAM for your analysis, your job will pend for a while, as you're essentially asking for a whole compute node to be empty. You can reference the Job Priority page, which details the six factors contributing to a job's priority. The most important factors for the average O2 user are: age (increases the longer your job sits in the queue), partition (jobs in the interactive and priority partitions are dispatched first), and fairshare (tracks the resources you have recently used and compares them with the fair share of computational resources available to each user). Your fairshare depletes with usage, but fully rebounds within two days of no cluster activity. When the cluster is busy, the lower your fairshare, the longer your jobs will pend.

Short vs. Long jobs: Is it better to run 48 separate 30 minute jobs or 1 single 24 hour job?

The best job submission strategy depends on many factors. For example, if each of the 48 jobs requires multiple CPU cores (more than 2) and a large amount of memory (20-40 GB or more), then a single longer job is preferable to 48 shorter jobs. If each job only requires 1 CPU core and 1G or less of memory, then running the jobs separately will usually be faster. If you have any questions about optimizing your workflows, then please contact us!

My O2 jobs are very important. How can I guarantee that I will be able to run them when I need to?

Please contact us and we'll be happy to work with your needs.

How am I supposed to know how long my job will take?

You can use the O2sacct command to get detailed information on resource usage of your completed jobs, including how long the job took. See here for an introduction on using O2sacct.

By running test versions of your workflow (to make sure that the process is correct), you can get a sense of how long the full workflow will run. Remember that jobs can die for a variety of reasons so it's always best to design your workflow so you can quickly recover if it gets interrupted. Please contact us if you would like help with this. Especially if you are just running something once, it's fine to overestimate the runtime limit.

How long should my job(s) take?

The O2 cluster is not designed to work with extremely short jobs (<1 minute). The minimum run time that you can aim for is ~10-15 minutes, based upon the scheduler and the cluster's underlying configuration. If you have a large batch of very short running jobs, the time to process the job submissions will be substantially longer than actually running the jobs, and this may slow the cluster down for everyone. You can write a script to batch sets of jobs together. Please contact us if you want assistance with this process. Another option is to start a session in the interactive partition, and run the many short running jobs in that session.
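As a sketch of the batching approach, the script below runs many short tasks back to back inside a single job; the Analyze program and the input file names are hypothetical placeholders, and the partition and resource requests are examples you should adjust:

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-4
#SBATCH -c 1
#SBATCH --mem 2G

# Run many short tasks sequentially inside one job, instead of
# submitting each one as its own job. "Analyze" and the input
# file names are hypothetical examples.
for f in input_*.fasta; do
    Analyze -i "$f" -o "${f%.fasta}.out"
done
```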

My job has to run on a node that has 16GB of memory free. How can I make sure it goes to the right node?

Use the --mem parameter in your job submission command with the amount of memory you want to request. See the Using Slurm Basic page for more information.
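For example, either of the following requests 16 GB for a job (the script name is an example):

```shell
# Request 16 GB of memory for a batch job:
sbatch --mem 16G analyze.sh

# Or for an interactive job:
srun --pty -p interactive -t 0-2 --mem 16G bash
```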

Problems with jobs

Why hasn't my job started yet? Why has it been in PEND state for so long?

Running O2squeue followed by the job ID will show the expected start time of your job (START_TIME column) and the reason why your job is pending (NODELIST(REASON) column). See this page for information on using O2squeue. For further troubleshooting tips for pending jobs, please refer to the "Slurm Job Reasons" and "Jobs that never start" sections of the Troubleshooting Slurm Jobs page.

Why did my job exit before it finished the analysis?

Probably because it ran too long or used too much memory. If your job runs longer than the runtime limit you give with -t, it will be killed with the TIMEOUT state. This can also occur if your job uses resources incorrectly (e.g., a multi-threaded job that doesn't use sbatch -c, or a badly-behaved Matlab job). See the "Exceeded run time" section of Troubleshooting Slurm jobs for more information on avoiding this error. 

If you use more memory than you reserved (or more than the default of 1G, if you didn't explicitly request an amount), your job will be killed with the CANCELLED state. In output from the sacct command, such a job appears as CANCELLED by 0, which distinguishes a job killed by the scheduler from one cancelled by the user (in that case you will see your numeric user ID instead of 0). See the "Exceeded requested memory" section of Troubleshooting Slurm Jobs for more information.
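You can inspect a finished job's final state with the standard Slurm sacct command (a minimal sketch; the job ID is an example):

```shell
# Show the state, runtime, peak memory, and exit code of a finished job:
sacct -j 12345678 --format=JobID,State,Elapsed,MaxRSS,ExitCode
```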

What does "oom-kill event" mean?

If you're seeing that in a job output, that means your job was killed because it exceeded the memory allocation you requested. Your job should also be in OUT_OF_MEMORY state, which hopefully is fairly clear. Simply request more memory (with --mem or --mem-per-cpu), and eventually your job will complete successfully (or you'll run out of available memory to request, in which case you should contact Research Computing for next steps). See "Exceeded Requested Memory" section of Troubleshooting Slurm Jobs for more information about this specific error.

Why am I getting a "permission denied" error on a previously writable directory? OR Why have all my jobs since a certain time failed when they used to run fine?

Either of these problems can be due to going over the 100G quota in your home directory, or over the set quota in a shared group directory. Please read here for more information.

Why are jobs that exceed a partition's time limit killed instead of just being moved to a partition with a longer time limit?

If we moved jobs that exceeded a time limit, users could inappropriately take advantage of the scheduling system by always submitting to the shortest time-limited partition to get quick job execution, after which their long-running jobs would cascade through the partitions with longer limits.

An interactive job works, but running it as batch doesn't

This is likely because you're using sbatch --wrap, i.e., sbatch without writing a script. The --wrap option is not foolproof: more complex commands, such as those that use pipes (|), are not always interpreted correctly with this submission method. We recommend that you package the commands into a script and submit the script with sbatch instead. See the "Submitting Jobs" section of the Using Slurm Basic page for more information on sbatch scripts.
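As a sketch, a pipeline that might be mangled by --wrap quoting can be placed in a small script (file names and resource requests here are hypothetical examples):

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-1
#SBATCH --mem 1G

# A pipe like this is the kind of construct that can be misinterpreted
# by sbatch --wrap; inside a script it behaves exactly as it does on
# the command line. File names are hypothetical examples.
grep "^>" input.fasta | wc -l > sequence_count.txt
```

You would then submit it with, e.g., sbatch count.sh.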

Why can't I plot to files or use graphical user interfaces in my job?

If you are plotting (e.g. in R or Matlab) or trying to use a graphical user interface (GUI) on the cluster, you must set up an X11 session. This is a multi-step process that involves running an X11 server on your desktop/laptop, connecting with ssh -XY, and using either srun --x11 or sbatch --x11-batch in your job submission command. Additionally, you must authenticate for the job to be dispatched appropriately. For interactive jobs, you can type your password, but for batch jobs, you must configure ssh keys prior to job submission. See here for instructions on creating ssh keys.
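Putting those steps together for an interactive session (with an X11 server already running on your machine):

```shell
# From your own computer, connect with X11 forwarding enabled:
ssh -XY yourecommons@o2.hms.harvard.edu

# Then, on a login node, request an interactive job with X11 support:
srun --pty --x11 -p interactive -t 0-2 --mem 2G bash
```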

Please keep in mind that we do not recommend using GUI applications (like the graphical mode of Matlab) on the O2 cluster, as you can experience laggy performance as graphics are forwarded over X11 in real time. 

Specific Programs or Programming Languages

Can I run Matlab on O2?

Yes! Matlab in particular is so popular that we have a whole separate page for Using MATLAB on O2. As a brief summary:

- You can run Matlab in graphical mode (editing your program, graphing, etc.) or batch mode (simply running a script)
- Many Matlab programs written on the desktop can be copied directly to the cluster and work with minimal or no changes
- O2 has many Matlab toolkits available
- O2 is particularly useful if you want to run jobs that require a lot of memory (RAM) or processing power. Many jobs can be split into pieces and run in parallel on O2 for a substantial speedup

Can I use RStudio on O2?

Currently, RStudio is only available on O2 under the BioGrids module, as a remote GUI console using X11 forwarding. We have noticed growing user interest in such a service, and in the future we plan to implement RStudio Server or a similar service.

Can I run Jupyter notebooks on O2?

Yes, we have instructions for setting up Jupyter notebooks here. However, note that this process can be prone to failure. We are investigating the feasibility of offering a more robust solution in the future.

How do I run a particular version of Matlab, Java, Perl, Python, R, or some bioinformatics program?

Many programs have multiple versions installed. See which versions of Java, R, or STAR are available with commands like module spider java or module spider R or module spider star, for example. See Using Applications on O2 for more detail.
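For example, to find and load a specific R version (the version number below is just an example; run module spider first to see what is actually installed):

```shell
module spider R          # list every installed version of R
module spider R/4.1.1    # show what is needed to load this version (example version)
module load R/4.1.1      # load it (some modules require loading a compiler module first)
```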

How do I get a library for R, Perl, Python, etc.?

We might already have it under a different version of the language. (See Using Applications on O2 for more detail.) You can also install R, Perl, or Python packages in your own directory. See Personal R Packages, Personal Perl Packages, and Personal Python Packages.

Can I do deep learning/machine learning/GPU analyses on O2?

Yes, please reference this page for more information on GPU resources on O2.
