Where do I put my files? It Depends

O2 and Orchestra have access to most of the same filesystems. For example, if you modify a file in your home directory on Orchestra, those modifications will be visible when you log into O2.

For most users' purposes, a filesystem is just a directory, like /n/groups or /home. However, the different filesystems have different speeds of reading/writing data, different backup policies, different access permissions, and different limits on how much can be stored on them. The filesystem quotas page describes how to find out how much space you are using.

Where to put your files depends on how big they are, who needs to see them, whether they are temporary, and how you will be accessing them. (That said, it is almost never a good idea to have more than, say, 10,000 files in a directory. Your work and others' will be faster if you split that huge directory into a bunch of smaller sub-directories.)
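
If you want a rough, generic check from the command line (these are standard Linux tools, not the O2-specific quota report described on the filesystem quotas page), you can measure a directory's size and file count like this:

    # total size of a directory tree (replace the path with your own)
    du -sh /home/ab123
    # number of files under that directory
    find /home/ab123 -type f | wc -l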

There are a few important differences between O2 and Orchestra:

  • /groups on Orchestra is mounted on O2 as /n/groups
  • O2 login and compute nodes do not mount /n/files  (i.e. research.files.med.harvard.edu).
  • You can access /n/files on designated systems using the transfer partition. To run jobs in this partition, you will need to request access. See File Transfer for more information.  
  • Orchestra's filesystems for web ( /www ) and code repositories ( /srv/git and /srv/svn ) are not available on O2 at this time.

NOTE: None of the standard filesystems are automatically encrypted, so they cannot be used for HIPAA-protected or other secure data (Harvard's data security level 3 or above) unless those data have been de-identified.

Home directory (/home/ab123)

Every user gets a home directory. A user with eCommons ID ab123 will have a home directory in /home/ab123. When you login, you will be in this directory. This is a good place to put small data sets, lab notes, scripts, and important analysis results. Your home directory is of limited size, so if it fills up you'll need to use other filesystems. Home directories are backed up nightly and snapshotted.

For a small data analysis not requiring large data sets or huge output, a standard workflow would be (a sketch follows the list):

  • (Optionally) Copy data from a desktop or other location to the home directory
  • Run analysis, writing output to the home directory
  • (Optionally) Copy data back to a desktop
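
As a concrete sketch of that workflow (the script name analyze.sh, the project paths, and the login hostname placeholder <o2-login> are illustrative; the partition name and time limit are examples, not requirements):

    # 1. from your desktop: copy input data into your home directory on O2
    scp mydata.csv ab123@<o2-login>:/home/ab123/project1/
    # 2. on O2: run the analysis as a batch job, writing output to the home directory
    sbatch -p short -t 0-02:00 --wrap="./analyze.sh /home/ab123/project1/mydata.csv /home/ab123/project1/results.txt"
    # 3. from your desktop: copy the results back
    scp ab123@<o2-login>:/home/ab123/project1/results.txt .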

Group directories

  • /n/groups/mygroup/
  • /n/data1/institution/department/lab/
  • /n/data2/institution/department/lab/

A group directory is used by a lab (or a set of researchers sharing data). These directories can be read by any member of the lab, which is quite useful when multiple researchers need to see the same data. Unlike home directories, where each user has an individual quota, a group directory has a single quota for the entire lab, and lab members work together to keep the space from filling up. These directories are used for large data sets, reference data, or scripts used by a whole lab. Use the Storage Request Form to get a directory for your lab, or to increase its quota. Group directories are backed up and snapshotted. PIs are not currently charged for storage, but may be charged for usage beyond a certain base level in the future.

You might run an analysis on data in your home directory using reference data from your lab directory. You might then put results into the lab directory for other lab members to use.

Scratch directory (/n/scratch2/ab123)

Each user is entitled to space (10 TB) in the /n/scratch2 filesystem. You can create your own directories inside /n/scratch2/ and put data in there. These files are not backed up and will be deleted if they are not accessed for 30 days.
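
For example, to create a personal scratch directory named after your eCommons ID:

    # one-time setup of a personal scratch directory
    mkdir -p /n/scratch2/ab123
    # equivalently, $USER expands to your own ID
    mkdir -p /n/scratch2/$USER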

Scratch will not work very well with workflows that write many thousands of small files. It is designed for workflows with medium and large files (> 100 MB). Luckily, many next-gen sequencing, image analysis, and other bioinformatics workflows use large files.

HMS RC does not recommend using "striping", or reading/writing a single file through multiple "pipes" to the filesystem, for most cases when writing to the /n/scratch2 filesystem. Contact rchelp@hms.harvard.edu if you have any questions.

For workflows that allow for full control of temp/intermediate files, you can leave your input data in a /n/groups or /home directory, make the first step in the workflow read from the original directory, do all of the temp/intermediate writes to /n/scratch2, and perform the final write back to /n/groups or /home. So in a 5-step pipeline, step 1 reads from /n/groups or /home, steps 2-4 write intermediate files to /n/scratch2, and step 5 reads from /n/scratch2 and writes the final output back to the /n/groups or /home directory. Here is a suggested workflow (a command sketch follows the list):

  • Create a directory in /n/scratch2 if needed. We recommend you use your eCommons ID.
  • Set up your workflow so that the input is read from /n/groups or /home, but temporary/intermediate files are written to /n/scratch2
  • Write any needed results back to /n/groups or /home
  • Delete temporary data, or let it be auto-deleted
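
A minimal sketch of this pattern, assuming a hypothetical two-step pipeline (step1 and step2 stand in for your own tools, and the paths are illustrative):

    # create a scratch directory for this run
    mkdir -p /n/scratch2/ab123/myrun
    # step 1 reads the original input and writes its intermediate output to scratch
    step1 --in /n/groups/mylab/input.bam --out /n/scratch2/ab123/myrun/intermediate.bam
    # the final step reads from scratch and writes the result back to the group directory
    step2 --in /n/scratch2/ab123/myrun/intermediate.bam --out /n/groups/mylab/results/final.bam
    # clean up, or let the 30-day auto-deletion remove it
    rm -rf /n/scratch2/ab123/myrun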

For workflows that write temp/intermediate files to the current directory, you can create a directory in /n/scratch2 and cd to it. Run the workflow from /n/scratch2, specifying full paths for the input files and for the final output in /n/groups or /home. Here is a suggested workflow (a command sketch follows the list):

  • Create a directory in /n/scratch2 if needed. We recommend you use your eCommons ID.
  • Set up your workflow so that full paths are used to refer to input files in /n/groups or /home.
  • Change directories (cd) to your /n/scratch2 directory, and run the analysis from there
  • Write or copy any needed results back to /n/groups, /home, or your desktop, with copies submitted as an sbatch job or from an interactive session (e.g. srun --pty -p interactive -t 0-12:00 /bin/bash)
  • Delete temporary data, or let it be auto-deleted
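
Sketched as commands (mytool stands in for a program that writes its temporary files to the current directory; the partition name and time limit are examples only):

    # run from inside scratch so temporary/intermediate files land there
    mkdir -p /n/scratch2/ab123/myrun
    cd /n/scratch2/ab123/myrun
    # use full paths for the input and for the final output
    mytool --in /n/groups/mylab/input.fastq --out /n/groups/mylab/results/output.vcf
    # copy any extra results back as a batch job rather than on a login node
    sbatch -p short -t 0-01:00 --wrap="cp extra_report.txt /home/ab123/project1/"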

For workflows that allow little flexibility in the location of temporary/intermediate files, data can be copied over to /n/scratch2, processed there, and copied back to /n/groups or /home. This creates a redundant copy of the input, takes up storage space, and requires time to transfer the data to and from /n/scratch2. Here is a suggested workflow (a command sketch follows the list):

  • Create a directory in /n/scratch2 if needed. We recommend you use your eCommons ID.
  • Copy data from /n/groups, /home, or your desktop to /n/scratch2. We recommend submitting this copy as an sbatch job or running it from an interactive session (e.g. srun --pty -p interactive -t 0-12:00 /bin/bash)
  • Run the analysis in /n/scratch2, writing all temporary/intermediate files to this space
  • Copy any needed results back to /n/groups, /home, or your desktop, again as a job or from an interactive session
  • Delete temporary data, or let it be auto-deleted
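
A sketch of this copy-in/copy-out pattern (analyze.sh and the paths are placeholders; the partition names and time limits are examples, and each job should finish before the next is submitted, e.g. by chaining them with sbatch --dependency):

    # copy the input data to scratch as a batch job
    sbatch -p short -t 0-04:00 --wrap="cp -r /n/groups/mylab/dataset /n/scratch2/ab123/"
    # run the analysis entirely inside scratch
    sbatch -p medium -t 1-00:00 --wrap="cd /n/scratch2/ab123/dataset && ./analyze.sh"
    # copy the results back, then remove the scratch copy
    sbatch -p short -t 0-04:00 --wrap="cp -r /n/scratch2/ab123/dataset/results /n/groups/mylab/ && rm -rf /n/scratch2/ab123/dataset"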

Restoring Backups

Most shared filesystems retain snapshots for up to 60 days; temporary filesystems are the exception and are never snapshotted. If snapshots are available for a directory, they are located in a hidden directory called .snapshot. (This directory will not be visible when doing an ls or even ls -a.) To retrieve a backup (an example follows the list):

  • From a command prompt on O2, type cd .snapshot and then ls to see the available backups of that directory.
  • Inside the .snapshot directory, there will be directories with date/times in their names, containing a copy of all files at that date/time. Each sub-directory will also have its own .snapshot directory.
  • You can't write files to these directories, but you can copy files from here back to the original directories with the cp command.
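
For example, to recover an accidentally deleted file (the date-stamped snapshot name below is illustrative; run ls inside .snapshot to see the names that actually exist):

    # enter the hidden snapshot directory of the directory that held the file
    cd /n/groups/mylab/project1/.snapshot
    ls
    # copy the old version of the file back into the live directory
    cp daily.2019-01-15_0010/important_results.txt /n/groups/mylab/project1/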

Copying data to O2 and between filesystems

See File Transfer for information on moving data to/from desktops, or between filesystems.

Shared Filesystems

These filesystems are housed on a central file server and are available from any system within O2.

filesystem     use
/n/groups      shared group data storage (contact Research Computing if you need a group space)
/n/data1       shared group data storage
/n/data2       shared group data storage
/home          individual account data storage
/n/app         add-on software packages

Note: The /n/files filesystem, which allowed shared group data storage (access to eCommons collaborations), is not accessible from O2 compute or login nodes, only from the transfer partition. This partition has restricted access, so you will need to request access to run jobs there. See File Transfer for more details. Additionally, Orchestra's /www (web hosting) and /srv (for Subversion or Git code repositories) are not currently available on O2. 

Temporary Filesystems

These filesystems tend to allow fast reads and writes, but are not backed up. If you are doing significant I/O, it is often better to copy files from a networked filesystem (like /n/groups or /home) to a temporary filesystem, process them there, and copy the output back, than to operate directly on files in your home or group directory.

/tmp is the standard UNIX temporary directory, and it lives on a different local disk on each machine. A file you place in /tmp on a login node is not available in /tmp on a compute node, or even on a different login node. If a job writes to /tmp, it writes to /tmp on the node where the job runs. Your job should copy any needed output back to a shared filesystem like /home before it finishes, because files left in /tmp may be deleted from the compute node.
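
A minimal sbatch script sketch that follows this pattern (analyze.sh, its --tmpdir/--out options, and the paths are placeholders; the partition and time limit are examples):

    #!/bin/bash
    #SBATCH -p short
    #SBATCH -t 0-02:00
    # make a job-specific directory in /tmp on the compute node this job lands on
    JOB_TMP=/tmp/ab123_${SLURM_JOB_ID}
    mkdir -p "$JOB_TMP"
    # run the analysis, keeping temporary files on the node-local disk
    ./analyze.sh --tmpdir "$JOB_TMP" --out "$JOB_TMP"/results.txt
    # copy the results back to a shared filesystem before the job ends
    cp "$JOB_TMP"/results.txt /home/ab123/project1/
    rm -rf "$JOB_TMP"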

Temporary filesystems are never backed up and are periodically automatically purged of unused data. The contents of these filesystems may also be deleted in the event of a system being rebooted or reinstalled.

[ Information below here is not important for most users ]

Synchronized Filesystems

These filesystems are housed on local disks on individual machines. We keep these filesystems synchronized using our deployment management infrastructure.

filesystem     use
/              top of UNIX filesystem
/usr           most installed software
/var           variable data such as logs and databases

Synchronized O2 filesystems are never backed up. The source system images from which compute nodes and application servers are built are backed up daily, and these can be used to reinstall a system.
