
This page shows all service outages for the O2 cluster, including planned maintenance and unplanned events.

We also post updates on the HMS RC Twitter page.

DEGRADED PERFORMANCE

September 18 - Service Degradation

O2 has been experiencing intermittent problems with its authentication system on login, transfer, and compute nodes. This issue can potentially result in slow or failed logins to O2, missing group memberships, and failed job submissions. We are working with the software vendor to resolve this.

August 9 - Service Degradation

New jobs are intermittently failing to start on the cluster (or the sbatch command returns errors) due to an issue with cluster-storage communication. We believe that currently running jobs are still executing normally. Disk reads and writes may be slower than usual, which can make other commands slow. We will provide details as we get them.


July 8: notes after the July OS/Slurm update

  • Jupyter Notebook users should start a new environment and remove any old runtime directories.
  • "sbatch" no longer uses the "--x11" option in the new version of Slurm. Just remove it from your script and X forwarding should work by default.
    • "srun" commands still require "--x11" to enable X forwarding, though (see the example after this list).
  • If you have any custom-built software, you may need to recompile or relink it on O2.
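
For reference, a minimal sketch of the two cases; the partition, time limit, and script name below are illustrative placeholders, not specific recommendations:

    # Interactive session with X forwarding: srun still needs --x11
    srun --pty --x11 -p interactive -t 0-01:00 /bin/bash

    # Batch submission: simply omit the old --x11 flag from the sbatch line or script
    sbatch -p short -t 0-01:00 my_job.sh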


Two Factor Authentication:

All O2 cluster logins from outside of the HMS network require two-factor authentication. Please see: Two Factor Authentication (2FA) on O2 and the Two Factor Authentication FAQ.

Scheduled Maintenance and Current Outages:

Date | Service | Issue
2020-11-14 | /n/files

To improve performance and keep our storage systems updated, HMS IT will migrate data on the research.files.med.harvard.edu server to a new storage array.

Outage window: Saturday, November 14, 2020, from 8:00 AM to 8:00 PM

  • This will only affect the O2 filesystem /n/files, which is only accessible from the transfer servers (transfer.rc.hms.harvard.edu) and transfer compute nodes.

2020-08-09 | O2 cluster

This issue may still occur sporadically and isn't yet considered fully resolved:

--

New jobs are intermittently failing to start on the cluster (or the sbatch command returns errors) due to an issue with cluster-storage communication. We believe that currently running jobs are still executing normally. Disk reads and writes may be slower than usual, which can make other commands slow. We will provide details as we get them.


Previous Service Outages:


Date | Service | Issue
2020-09-26 | O2 cluster

On Saturday, September 26, 2020, from 6 AM to 1 PM EDT, HMS IT will be completing a strategic network upgrade which will increase the HMS campus internet connectivity from 40 to 100 gigabits per second. This upgrade improves support for data-intensive science, online education, and remote work.

The O2 cluster will remain fully operational. However, there is the potential for issues related to O2’s authentication service during the maintenance. This could result in any of the following issues:

  • Difficulty logging into O2
  • Authentication timeouts
  • New job submissions could be slow or may fail

Jobs which are already running are expected to continue without any problems.

2020-09-18 | O2 authentication

Intermittent problems with authentication for O2 login, transfer, and compute nodes. This issue can potentially result in slow or failed logins to O2, missing group memberships, and failed job submissions.

2020-08-26 | O2 cluster

HMS IT will be performing minor maintenance on the O2 cluster, which is expected to improve the responsiveness of the SLURM job scheduler (see the outage notes for 8/9/2020).

MAINTENANCE WINDOW:

  • Wednesday, Aug 26, 8:00 am - 9:00 am

IMPACT: 

  • You will not be able to submit new jobs to O2 during this time.
  • SLURM commands (squeue, sbatch, srun, etc.) will likely fail with a timeout error.
  • Already running jobs will continue to run normally.
2020-07-30 | Full O2 cluster

Unplanned SLURM outage due to unbalanced filesystem allocations on a primary storage cluster. Service was restored at 3:00 PM.

2020-07-29 → 2020-07-30 | /n/no_backup2

Scheduled Maintenance window: 2020-07-29 5:00 PM to 2020-07-30 5:00 PM

HMS IT will be migrating data from /n/no_backup2 to a newer filesystem.

2020-07-07 | Full O2 cluster outage

Scheduled Maintenance window: All day on July 7:

  • Tue July 7, 12:00am - 11:59pm

Actual Maintenance window: 5:00 am - 11:45 pm

Once the upgrade is completed on Tuesday evening, all O2 services will become available.

O2 will be completely offline to allow for an update to the Linux operating system (to CentOS 7.7) on all cluster systems, as well as an update to the Slurm job scheduler (to version 20.02).

These are standard maintenance and security updates. No changes are expected from a usability perspective to O2 or its installed software (e.g. modules).

Impact:

  • Logins to O2 (o2.hms.harvard.edu) will be unavailable.
  • O2's job scheduler will be down. Jobs will not run and new job submissions will not work until after the work is completed.
  • Logins to the transfer servers (transfer.rc.hms.harvard.edu) will still be available to access all data on O2, after the morning's storage maintenance is completed (see the next entry).

Websites hosted by HMS Research Computing will not be affected unless they run jobs on the cluster, since job submissions will be unavailable.


2020-07-07 | /home data and logins to O2 transfer servers

Scheduled Maintenance window:

  • Tue July 7, 07:30 am - 10:00 am (service was restored at 1:30 pm)

The /home filesystem may be unavailable during this window due to planned storage maintenance.

While the O2 cluster will also be offline all day on July 7 (see the previous entry), logins to the transfer servers at transfer.rc.hms.harvard.edu will still work, so research data will be accessible.

However, this separate storage maintenance will result in /home being unavailable during the 7:30 - 10am window, which could disrupt logins.

2020-07-03 | /www data and websites hosted by Research Computing

Scheduled Maintenance window: 4:00pm - 6:00pm

Actual Maintenance window: 4:00 pm - 6:30 pm

HMS IT will be performing maintenance on the /www filesystem, which will result in a temporary outage of websites and any cluster jobs that access data under /www.

Websites hosted outside of Research Computing, such as through WARP, OpenScholar, or HMS Windows web hosting, will not be affected.

2020-06-27 | High-throughput research network

Scheduled Maintenance window: 6:00am - 1pm

Actual Maintenance window: 6:00am - 12pm

HMS IT will make upgrades to the high-throughput research network that may sometimes block access between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet.

Note that the actual outage may end sooner than 1pm depending on the day's progress.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally, except:
    • jobs which rely on connections to external networks (e.g. to download data) will be affected during the outage.
  • Jobs in a PENDING state will remain pending until after the outage is complete.
  • Interactive jobs and active logins to O2 and transfer servers will be killed.
  • Websites hosted by Research Computing on infrastructure in the data center (which is most of them) will be inaccessible.
2020-06-26 | /n/scratch2 goes offline

The /n/scratch2 filesystem is being taken offline and retired.

Any data left on /n/scratch2 will be LOST and NOT RECOVERABLE.

All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users. More details at: Scratch3 Storage
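
If you still have data on /n/scratch2 that you need, a minimal sketch of copying it over is below; the exact destination layout under /n/scratch3/users is described on the Scratch3 Storage page, and the paths shown here are placeholders:

    # Copy remaining data from scratch2 to your scratch3 directory before the retirement date
    # (replace both paths with your own scratch2 and scratch3 locations)
    rsync -av /n/scratch2/my_user/ /n/scratch3/users/m/my_user/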

2020-06-15 | /n/scratch2 becomes READ-ONLY

The /n/scratch2 filesystem will be made READ-ONLY in preparation for its retirement on June 26.

All users of scratch space must switch their workflows to the new filesystem under /n/scratch3/users. More details at: Scratch3 Storage

2020-05-16 | Network connectivity between O2 and networks outside the HMS data center

Scheduled Maintenance window: 5:30am - 1pm

Actual Maintenance window: 5:30 am - 10:00 am

A planned upgrade to the HMS interior firewall will result in an outage between O2 and all external networks, including the HMS Quad, all Harvard networks, and the internet.

Note that the actual outage may end sooner than 1pm depending on the day's progress.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally, except:
    • jobs which rely on connections to external networks (e.g. to download data) will be affected during the outage.
  • Jobs in a PENDING state will remain pending until after the outage is complete.
  • Interactive jobs and active logins to O2 and transfer servers will be killed.
  • Websites hosted by Research Computing on infrastructure in the data center (which is most of them) will be inaccessible.
2020-04-13 | /n/app

Maintenance window: 6:00am - 10:00am

The filesystem /n/app, which is used to host scientific software applications on O2, will be migrated onto newer, more performant storage.

  • HMS DevOps and Research Computing have tested this change in a development environment and do not expect it to affect jobs on O2 unless they are trying to directly access /n/app (e.g. to reload a Module).
  • As a precaution, ALL new jobs submitted during this time window will remain pending until after the work is completed. This includes both batch and interactive jobs.
  • Please plan accordingly. If you are very concerned about the robustness of your job to this change, we encourage you to make sure jobs finish before this time, and then wait to submit new ones until after the change.
2020-03-29 | O2 cluster, /n/data2, /n/groups

Maintenance window: 3:30 pm - 7:00 pm

High load on one of the storage servers, known on the cluster as /n/data2 and /n/groups.

Impact:

  • Intermittent issues with logins to the O2 cluster and transfer nodes
  • Intermittent issues with data access.

The issue was resolved after the high-load processes finished.

2020-02-27 | O2 Cluster

The O2 job scheduler became unavailable due to an unforeseen bug in the scheduler control process.

The problem was resolved with a patch applied to the scheduler software.

2020-01-12 | O2 Cluster

Maintenance window: 4am - 12pm (noon)

Network maintenance being performed in the HMS data center will result in outages of 1-3 minutes on the O2 network.

Impact:

  • To minimize the possibility of job failures, we will pause all jobs on O2 during this maintenance, and resume the jobs after the maintenance is complete.
  • O2 logins should still work, at least intermittently, during the maintenance. Any new jobs submitted during this period will remain pending until after the maintenance is complete.

This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, and virtual machine infrastructure) will be running on a 100-gigabit network!

2020-01-11 | Network connectivity between O2 and networks outside the HMS data center

Maintenance window: 4am - 8am

Network maintenance being performed on the HMS core network will result in outages of < 5 minutes between O2 and all external networks, including the HMS Quad and all Harvard networks.

Impact:

  • Batch jobs which are already running on O2 will continue to run normally.
  • Interactive jobs will get killed.
  • Jobs which rely on connections to external networks (e.g. to download data) will also be affected during these outages.

This work over Jan 11-12 is being done to increase network performance in the HMS data center. After completion, all HMS systems hosted in the data center (including O2, storage, and virtual machine infrastructure) will be running on a 100-gigabit network!

2019-09-02 | /n/scratch2

Unplanned service degradation for /n/scratch2 filesystem.

  • Date: Monday Sept 2 2019
  • Duration: 5:00 AM to 11:30 AM

Resolved by stopping a misbehaving service on the filesystem. We are working with the vendor to prevent issues like this in the future.

2019-08-25 | O2 job submissions / queries

The O2 cluster will have planned maintenance during this window:

  • Begins: Sunday Aug 25 2019, 08:00 AM
  • Ends: Sunday Aug 25 2019, 11:59PM
    • Maintenance was completed by 06:00PM on Aug 25

An update to the /n/scratch2 filesystem requires a service outage for all O2 systems. Cluster services will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for the full day if needed.

No user data will be deleted or otherwise changed during the outage. But, as a precaution, please make sure you have copies of any critical data under /n/scratch2 in particular, since that filesystem is not backed up.

Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:

  • Any job submitted with a wall time which crosses into the maintenance window will remain pending until the outage is over. 
  • If there are any running jobs on O2 when the outage begins (e.g. long jobs that were started awhile ago), they will be paused and Slurm will attempt to restart them after the outage, but we cannot guarantee such jobs will run successfully.
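
For example, one way to keep a submission eligible to start before the window is to request a wall time that ends before the outage begins; squeue will show why a job is being held otherwise. The partition, time limit, and script name below are placeholders:

    # Request a wall time short enough to finish before the maintenance window
    sbatch -p short -t 0-06:00 my_job.sh

    # Check your pending jobs and the scheduler's reason for holding them
    squeue -u $USER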

During the outage, you WILL NOT be able to:

  • Log in to the O2 login servers or file transfer servers
  • Run any Slurm commands, such as: sbatch, srun, [etc.]
  • Run or start any cluster jobs on O2

Websites hosted by Research Computing will not be functionally affected, unless they submit jobs to the cluster (only a few websites do this). However, web developers will be unable to log in and edit files.

2019-08-23 → 2019-08-25 | /n/scratch2

Planned service outage for /n/scratch2 filesystem:

  • Begins: Friday Aug 23 2019, 08:00 AM
  • Ends: Sunday Aug 25 2019, 11:59PM
    • Maintenance was completed by 06:00PM on Aug 25

An update to the /n/scratch2 filesystem requires a service outage. Service will be restored as soon as possible on Sunday 8/25, although the outage is scheduled for the full day if needed.

During this outage, all other O2 cluster services will be up and running until Sunday morning 8/25 (see below).


Please note:

  • We will disable the auto-deletion script for old files under /n/scratch2 for a few days after the outage.
  • For jobs requiring /n/scratch2 which may need to run during this outage window, make sure to submit those with the following sbatch option so they will not start running until the maintenance is completed:  --constraint=scratch2
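
For example, a minimal sketch of such a submission; the partition, time limit, and script name are placeholders, and only the --constraint flag comes from the note above:

    # Hold this scratch2-dependent job until the scratch2 maintenance is complete
    sbatch --constraint=scratch2 -p short -t 0-02:00 my_scratch2_job.sh
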
2019-08-21 | O2 job submissions / queries

The Slurm job scheduler went offline at approximately 3:30 am on 2019-08-21. We are currently working to restore this service.

  • 7:30am: The Slurm job scheduler has been restored to service, and O2 job submissions should be operating normally again.
    We are still investigating the root cause of this issue.
2019-08-17 | O2 logins, Slurm job submissions

Scheduled power maintenance at the data center led to an unexpected power outage, causing login nodes and other critical infrastructure services to become unresponsive. The issue was fixed by restoring power.

  • Date: Saturday August 17 2019
  • Duration: 6:30 AM to 6:00 PM
2019-08-09 | O2 logins

The /home filesystem experienced a service degradation that prevented users from logging in to the O2 cluster and submitting jobs. The issue was fixed by the vendor.

  • Date: Friday August 9 2019
  • Duration: 8:00 AM to 11:00 AM
2019-07-07 | O2 logins

A network firewall issue during planned maintenance caused O2 cluster logins to fail and new SLURM job submissions to remain pending. Jobs already running on compute nodes should not have been affected.

  • Date: Sunday July 7 2019
  • Duration: 6:50 AM to 8:00 AM


2019-06-30 → 2019-07-01 | Network issues

Unplanned service outage for the entire O2 cluster. One of the networking devices failed and caused multiple issues across HMS, including O2 cluster logins and SLURM job submissions.

  • Date: Sunday June 30 2019
  • Duration: 10:30 PM to 3:30 AM

The issue was resolved by replacing the faulty hardware.

2019-05-24 → 2019-05-25 | /n/scratch2

Unplanned service degradation for /n/scratch2 filesystem.

  • Date: Friday May 24 2019
  • Duration: 10:30 PM to 1:00 AM

Resolved by restarting a service on the filesystem.

2019-03-18 → 2019-03-22 | /n/scratch2

Unplanned service degradation. The /n/scratch2 filesystem is currently showing intermittent instability. We are monitoring it closely and will be implementing a number of hardware and software fixes this week to resolve the performance problem.

  • Duration: 4 days
Implemented hardware and software fixes to resolve the core issue on the scratch2 fileserver.
2019-03-09 | Slurm Job Scheduler

The Slurm Job Scheduler will have planned maintenance during this window:

  • Date: Saturday, Mar 9
  • Time: 08:00-19:00

Cluster jobs will not be able to run during the upgrade, so we have configured Slurm such that:

  • Any job submitted with a wall time which crosses into the maintenance window will remain pending until the outage is over. 
  • If there are any running jobs on O2 when the outage begins (e.g. long jobs that were started awhile ago), they will be paused and Slurm will attempt to restart them after the outage, but we cannot guarantee such jobs will run successfully.

During the outage, you WILL still be able to:

  • Log in to O2 to access data
  • Copy data to/from the O2 file transfer servers (transfer.rc.hms.harvard.edu) – except to /n/files (due to the storage outage for /n/files also on Mar 9)
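
For example, a typical copy through the file transfer servers might look like the following; the username and paths are placeholders:

    # Copy results from O2 to your local machine via the transfer servers
    scp my_user@transfer.rc.hms.harvard.edu:/home/my_user/results.tar.gz .

    # Or sync a directory from your local machine up to O2
    rsync -av data/ my_user@transfer.rc.hms.harvard.edu:/home/my_user/data/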

During the outage, you WILL NOT be able to:

  • Run any Slurm commands, such as: sbatch, srun, [etc.]
  • Run or start any cluster jobs on O2

Websites hosted by Research Computing will not be affected, unless they submit jobs to the cluster (only a few websites do this).

2019-03-09 | /n/files filesystem

The research.files server will have planned maintenance during this window:

  • Date: Saturday, Mar 9
  • Time: 09:00-15:00

During this window, the directory /n/files will not be available from the O2 file transfer servers and compute nodes.

2019-02-28 | /n/scratch2

Unplanned Outage: A performance degradation on /n/scratch2 could cause jobs using /n/scratch2 to fail.

Duration: 7:00 AM - 9:00 PM

2018-12-05 | /n/scratch2 filesystem

The automated process that deletes old files under /n/scratch2 (specifically, files last accessed more than 29 days ago) was intentionally disabled by Research Computing for approximately the past month due to an issue on the scratch2 fileserver. So, there are currently files older than 30 days on /n/scratch2 which have not yet been purged as they normally would have been.

We fixed that fileserver issue and resumed the purging of these old files starting Wed, Dec 5.
2018-12-03 | O2 logins

All O2 cluster logins from outside of the HMS network will start requiring two-factor authentication.

For more details, please see: Two Factor Authentication (2FA) on O2 and Two Factor Authentication FAQ

Currently, O2 only requires a password login using your eCommons ID. Due to increased hacking attempts on O2, it is necessary to increase the security of our systems, and moving to two-factor authentication is a big step.

HMS users must already use two-factor authentication for Harvard Key and HMS VPN logins. O2 logins will work similarly.

Two-factor authentication will be required when logging in from:

  • the HMS Public wireless network
  • Other Harvard networks (FAS, etc)
  • Networks at HMS affiliates (hospitals, etc)
  • Any other external network (home, etc), NOT using the HMS VPN
  • an HMS system (even on campus) which has a public-facing IP address (this is mostly for web and other application servers, not your desktop)
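
In practice, O2 logins remain a standard SSH connection; when connecting from one of the networks above, you should simply see an additional second-factor prompt after your password. The username below is a placeholder:

    # Standard O2 login; expect a two-factor prompt when outside the HMS network
    ssh my_ecommons_id@o2.hms.harvard.edu
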
2018-11-28 | MySQL and PostgreSQL databases, TWiki server

A planned maintenance window on Wednesday, 2018-11-28, from 6:00 pm to 7:00 pm, for the following services:

  • PostgreSQL     (production and staging database servers)
  • MySQL             (production and staging database servers)
  • TWiki                (the website: wiki.med.harvard.edu)

Only websites and cluster jobs using these database services were affected.

2018-11-20 | /n/scratch2

Intermittent storage issues affected the availability of the /n/scratch2 directories across O2 systems.

Duration: 6:00 AM - 6:00 PM

2018-10-24 | /n/groups, /n/data2

Intermittent storage issues affected the availability of the /n/groups and /n/data2 directories across O2 systems.
2018-10-10 | Authentication service

Instability in O2's authentication service was causing some user accounts to lose group memberships across O2 systems.

Services were restored to normal at approximately 10:18 am.
2018-10-01 | /n/scratch2 directory

When attempting to write to files under /n/scratch2, you may see erratic behavior such as:

  • Files are successfully written, but warning/error messages are generated
  • Files cannot be written, with error messages such as "Bad Address"

Issue was resolved with a bug fix on the scratch2 storage server.

2018-09-08 | O2 login servers

Unplanned outage: a core HMS network outage made the O2 login nodes unreachable. The issue was resolved by the HMS networking team.

Duration: 2:30 PM - 5:30 PM

2018-08-17 | PostgreSQL (production, staging), MySQL (staging), Request Tracker (RT)

These will be offline for approximately 1 hour starting at 9:00 pm EDT for urgent maintenance.
2018-08-14 | O2 Cluster and web services

Unplanned outage: a failure in the HMS virtual machine hosting infrastructure caused service outages in Research Computing's web services and, to a lesser extent, on the O2 cluster. The outage did not affect running cluster jobs, though.

Duration: 2:20 pm - 6:20 pm

2018-08-06 | O2 Cluster

Unplanned outage: Cisco networking hardware failed and caused many jobs to fail. The defective hardware has been replaced and everything is stable.

Duration: 5:00 am - 8:00 pm

2018-04-25 → 2018-04-26 | O2 login servers

Two login servers, login03 and login05, required reboots due to resource-intensive end-user processes locking up those systems.

2018-04-11 | O2 /home cluster

Severe network latency to the /home storage cluster impacted logins and processes trying to access this cluster. Duration: 11:00 am - 5:00 pm

2018-04-10 | O2 Cluster

Unplanned outage: networking issues disrupted communication to/from the login nodes. Running/pending jobs were not impacted.

2018-04-03 | /home filesystem

The fileserver for /home was getting close to maximum capacity and running on older hardware.

This planned maintenance involved migrating all /home data to a new fileserver with more capacity. This required a full shutdown of O2's Slurm job scheduler and unmounting /home from all cluster and infrastructure systems.

2018-03-13 → 2018-03-14 | /n/scratch2 filesystem

A hardware failure on the /n/scratch2 fileserver resulted in /n/scratch2 being non-writable.

On 3/14, hardware was replaced and the filesystem repaired, after which service returned to normal.



