User Documentation


The project homepage of git-annex characterises the tool as follows:

"git-annex allows managing files with git, without checking the file contents into git. While that may seem paradoxical, it is useful when dealing with files larger than git can currently easily handle, whether due to limitations in memory, time, or disk space."

While this highlights one of the main features, namely managing big files (> 1 GB) the git way, it does not mention another important aspect. Because the big files are not checked into the repository, you can make clones that contain only part of the data rather than all of it. You can therefore keep one big simulation result on cluster A, another on cluster B and some post-processed data on laptop C, all under the roof of one git-annex repository. git-annex provides features to manage such use cases easily.
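The partial-clone idea can be sketched with a few basic git-annex commands. This is a minimal illustration rather than a recommended setup: the directory names cluster_a and laptop_c, the file result.dat and the demo commit identity are hypothetical, and both "machines" are simply two directories on the same disk.

```shell
#!/bin/sh
# Sketch only: repository names, file name and identity are made up.
demo_root=$(mktemp -d)

annex_demo() (
    set -e
    cd "$demo_root"
    # Commit identity, so the sketch also runs without a git configuration.
    export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
    export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

    # "Cluster A": the repository that holds the actual file content.
    git init -q cluster_a
    cd cluster_a
    git annex init "cluster A"
    printf 'pretend this is a large simulation result\n' > result.dat
    git annex add result.dat
    git commit -q -m "Add simulation result"

    # "Laptop C": a clone that knows about the file but holds no content yet.
    cd "$demo_root"
    git clone -q cluster_a laptop_c
    cd laptop_c
    git annex init "laptop C"
    git annex whereis result.dat   # lists the repositories holding the content
    git annex get result.dat       # fetch the content into this clone
    git annex drop result.dat      # free the space again; cluster_a keeps its copy
)

if command -v git-annex >/dev/null 2>&1; then
    annex_demo
else
    echo "git-annex is not installed; nothing was run"
fi
```

With real machines, the clone would be made over SSH instead of from a local path; the get, drop and whereis commands are used in the same way.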

In short, git-annex addresses two issues with HPC data such as the output (and input) of climate models: firstly, the data are big, and secondly, they are usually not all in one place.

In addition, git-annex provides a lot more functionality, covering, amongst other things, collaboration and archiving in the cloud. A good starting point for learning about git-annex is the online walkthrough.

A word of warning: git-annex is made for big files, but not for a very large number of files (more than 500,000 per repository). Use tar archives to reduce the number of files if necessary.
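As a rough sketch of the tar approach, the following bundles a directory of many small files into a single archive, so that only one file needs to be added to the annex. The directory name timesteps and the file names are hypothetical, and the final git annex add step is only shown as a comment because it has to be run inside an initialised git-annex repository.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
mkdir "$work/timesteps"

# Create 100 small files standing in for per-timestep model output.
i=1
while [ "$i" -le 100 ]; do
    printf 'step %s\n' "$i" > "$work/timesteps/step_$i.txt"
    i=$((i + 1))
done

# Bundle the whole directory into one compressed archive.
tar -C "$work" -czf "$work/timesteps.tar.gz" timesteps

# Count the archived files to check that nothing was left out.
member_count=$(tar -tzf "$work/timesteps.tar.gz" | grep -c 'step_')
echo "archive holds $member_count files"

# Inside a git-annex repository, the archive is then added as one file:
#   git annex add timesteps.tar.gz
```

The trade-off is that individual files can no longer be fetched on their own; the whole archive has to be retrieved and unpacked.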


Binaries for recent Linux systems are provided (they should run on Cy-Tera once its operating system has been upgraded to RHEL 6). Untar the archive,

and run

to be dropped into a shell with all the git-annex commands available.
