You control where your data resides on LinkSCEEM resources and selection of appropriate storage is important for efficient management of your job output. While the disk quota, backup, and purge policies of a given site may seem to be the most obvious factors in determining where you will store different kinds of data, there are other critical factors that affect the performance of your applications and those of other users. Moreover, appropriate use of different kinds of file systems is crucial to protecting the stability of LinkSCEEM systems.
In this section, you will learn:
- About the types of file systems associated with LinkSCEEM resources and how to use them effectively
- Basic commands for working with home, data and scratch directories
Deciding what type of storage to use for your data affects how efficiently you are able to manage your job output. Criteria you should consider include:
- how much data you need to store
- how long you intend to store the data
- how easily you want to be able to access data
- how quickly you need to transfer data
On LinkSCEEM resources, you generally have two types of storage to choose from.
- Home directories are permanent but have relatively small quotas. Your home directory can typically be accessed through the environment variable $HOME. This space is visible from all nodes in the cluster, including the login nodes, and data in this area is normally not purged. Your home directory is best for compiling and building programs and for storing output that consists of large numbers of small files. NOTE: you are still expected to back up your important data outside the HPC systems.
- Parallel file systems (Lustre, GPFS or similar) are fast and large, but temporary; this space is accessible from all nodes in a LinkSCEEM cluster, including the login nodes. A PFS is optimal for single, large data files that may be generated in the course of a job run. It can typically be accessed through the environment variable $WORK.
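The two environment variables above can be inspected directly from a login shell. A minimal sketch, assuming $WORK is defined by the site environment (the fallback path below is illustrative only, so the snippet runs anywhere):

```shell
# $HOME is set by the login shell; $WORK is set by the site environment.
# On a system without $WORK defined, fall back to a placeholder so the
# script still runs (the fallback path is purely illustrative).
work_dir="${WORK:-/tmp/${USER:-anon}-work}"

echo "Home directory (permanent, small quota): $HOME"
echo "Work directory (fast, temporary):        $work_dir"
```

Using the variables rather than hard-coded paths keeps scripts portable across LinkSCEEM sites.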
In addition to types of storage, file systems themselves vary from one LinkSCEEM site to another. There are two main criteria to consider in determining which file system to use:
- quota size
- backup and purge policies
In general, if you plan on moving data across sites, you should use a parallel file system (PFS) for temporary storage of intermediate-to-large sized data files. Also known as a fast file system, a PFS consists of multiple servers (as many as several hundred) on which data is stored after having been broken up or "striped". A PFS also includes a small number of metadata servers, which store information necessary for retrieving files.
The location of mass storage or the scratch directory may differ from site to site. There are common environment variables ($HOME, $WORK) that are used across all sites to provide a uniform syntax for referring to the location of each site's storage, thereby hiding the underlying differences in their paths. However, to move data from one resource to another, you must specify explicit paths rather than using environment variables.
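To make the explicit-path requirement concrete, here is a sketch of a cross-site copy. The hostname and destination path are hypothetical placeholders; the point is that $WORK expands only on the local site, so the remote side must be spelled out in full:

```shell
# Hypothetical hostnames and paths, for illustration only.
# $WORK expands locally, but the remote site will not understand that
# variable, so the destination path is written out explicitly.
src="${WORK:-/tmp}/results.dat"
dst="user@cluster2.example.org:/gpfs/work/user/results.dat"

# Print the transfer command rather than running it here:
echo scp "$src" "$dst"
```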
Home directories at each LinkSCEEM site have enforced quotas. Parallel file systems are more flexible in that their directories are shared by all users of a particular resource. Thus, the space available on parallel file systems on a specific resource at a given time depends on how much data other users are also storing on that resource at that time.
Before sending large data outputs to one of these file systems, find out how much space is available with the df command:
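For example, with -h for human-readable sizes (the fallback to $HOME below is only so the example runs on systems where $WORK is not set):

```shell
# Check free space on the file system that will hold large job output.
# $WORK is assumed to be set by the site environment; fall back to $HOME
# here only so the example runs anywhere.
work_dir="${WORK:-$HOME}"
df -h "$work_dir"
```

The relevant figure is the "Avail" column for the file system containing your target directory; remember that on a shared PFS this number changes as other users write data.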
Backup & Purge Policies
Normally, home directories at each site are backed up, while parallel file systems are not; however, the exact policy can differ from site to site. No matter what the backup and purge policies at an individual site are, you should back up your valuable data frequently.
Your home directory ($HOME) is a good choice for building software and for working with collections of small to medium sized files, where a medium sized file is less than 50MB. Examples of commands that make sense and perform well in $HOME are:
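A minimal sketch of such a workflow, using a hypothetical project directory (demo_project and the file names are illustrative, not part of any site layout):

```shell
# Small-file workflows -- editing sources, building, archiving --
# perform well in $HOME. Directory and file names are hypothetical.
cd "$HOME"
mkdir -p demo_project && cd demo_project
printf 'int main(void){return 0;}\n' > main.c

# Archiving many small files is a typical $HOME task:
tar -czf sources.tar.gz main.c
ls -lh sources.tar.gz
```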
When an operation creates or removes a large number of files, it is best to run it in $HOME, since many small metadata operations are a poor fit for a parallel file system. On the other hand, for large file I/O and heavy read/write activity, the work file systems are best. Here are some example uses of $WORK; these types of commands would typically be contained in a batch job script:
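A sketch of batch-style large-file I/O directed at $WORK (the run directory name is hypothetical, and dd stands in for an application writing a large output file; the fallback path is only so the example runs where $WORK is unset):

```shell
# Direct large job output at the parallel file system.
work_dir="${WORK:-/tmp/${USER:-anon}-work}"   # fallback is illustrative
job_dir="$work_dir/run_001"                   # hypothetical run directory
mkdir -p "$job_dir"

# Large sequential writes are what a PFS is optimized for; dd stands in
# for an application writing one big output file (10 MiB here):
dd if=/dev/zero of="$job_dir/output.dat" bs=1M count=10 2>/dev/null

ls -lh "$job_dir/output.dat"
```

In a real batch script the application itself would produce output.dat; the key point is that the path lives under $WORK, not $HOME.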