AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Git BigJobbies

2011, git, gtotw

This week's Git Tip of the Week is about git bigjobbies. You can subscribe to the feed if you want to receive new instalments automatically.

This tip covers two aspects: firstly, it shows how easily Git can be extended, and secondly, it shows that something like Mercurial's large file support can be implemented relatively easily on top of Git's object store. Note that git bigjobbies is not intended for production use; it is a learning experiment.

Recap of object stores

The git database stores a set of objects by hash. Each hashed object is one of a blob, a tree or a commit. Ultimately, a branch (or tag) in Git is just a pointer to a commit, which points to prior commits and a tree; trees point to a recursive graph of trees and blobs.
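As a small sketch of that chain, the following creates a throwaway repository and asks Git what type of object sits at each level. The file name greeting.txt and the committer identity are placeholders chosen for this example.

```shell
set -eu
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q .
echo hello > greeting.txt
git add greeting.txt
# Placeholder identity, passed inline so no global config is needed.
git -c user.name=Example -c user.email=example@example.com commit -q -m "Add greeting"

# A branch resolves to a commit, the commit to a tree, the tree to blobs.
commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')
blob_type=$(git cat-file -t HEAD:greeting.txt)
echo "$commit_type -> $tree_type -> $blob_type"
```

This prints commit -> tree -> blob, which is the whole object model in one line.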

As a result, you can stick anything you want into a Git repository, provided it's inserted into the hashed object database. In addition, when you clone/fetch/pull from a Git repository, you don't necessarily get everything that the repository contains; you instead get all the reachable commits (and thus transitively, reachable trees and blobs) for the ones you don't have yet. (In the case of a clone, the set of things you have is the empty set which makes the calculation trivial.)

However, you don't get the objects that aren't reachable when you clone. So failed experiments, suggested changes that were not accepted in a Gerrit workflow (or were reworked to provide a different implementation), or just branches or offshoots that you're not interested in, are not downloaded when you clone a repository. (Commits which are directly ancestral are of course brought down; only the divergent parts are not downloaded.)

Unreachable objects are ultimately pruned by the garbage collector. Working from a known list of roots (e.g. tags, branches), git gc can work out which objects are no longer reachable from any reference, and ultimately prune them.
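To see pruning in action, here is a rough sketch that stores a blob nothing points at and then runs git prune directly. Note that git gc normally applies a grace period before pruning, whereas a bare git prune removes unreachable loose objects straight away; the blob contents are made up for the example.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Store a blob that no commit, tree or ref points at.
hash=$(echo "orphaned content" | git hash-object -w --stdin)
if git cat-file -e "$hash" 2>/dev/null; then before=present; else before=missing; fi

# git prune deletes unreachable loose objects immediately
# (git gc calls it with an expiry, so recent objects survive a gc).
git prune
if git cat-file -e "$hash" 2>/dev/null; then after=present; else after=pruned; fi
echo "$before -> $after"
```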

We can use the object database to our advantage, to store out-of-band object data in a repository which is not reachable from the branch, but is still referenced in refs and thus resolvable from the centralised decentralised version control system. Enter:

Git Bigjobbies

Git Bigjobbies is an extension I created to demonstrate out-of-band objects being stored in a Git repository. Note that this is neither supported nor recommended. With that out of the way, what does it do and how does it work?

(master) $ touch empty
(master) $ git bigjobbies add empty
(master) $ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#	.bigjobbies
#	.gitignore
nothing added to commit but untracked files present (use "git add" to track)
(master) $ cat .bigjobbies
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 empty
(master) $ cat .gitignore

The extension writes the object into the database with git hash-object -w, then appends the resulting hash and the filename to a file called .bigjobbies. Although the object isn't in the tree or referenced by a commit, it still exists in the database. As a result, we can recreate the file in the filesystem from the hash recorded in .bigjobbies. Provided .bigjobbies is committed into the branch, we can resolve the file using the hash alone.
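The add step can be approximated by hand. This is a sketch of the idea rather than the extension's actual source: it writes the blob, records the hash and name in the same format as the .bigjobbies listing above, and then recreates the file from the hash.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

touch empty
# -w writes the blob into .git/objects as well as printing its hash.
hash=$(git hash-object -w empty)
# Record "hash filename", matching the .bigjobbies listing above.
echo "$hash empty" >> .bigjobbies
cat .bigjobbies

# Later, the file can be recreated from the hash alone.
rm empty
git cat-file blob "$hash" > empty
```

In a default (SHA-1) repository the hash of an empty file is always e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is why it matches the transcript.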

But how do we prevent the object being garbage collected when it's not reachable from any commit? Through the general refs/ directory. If an object is referenced from a refs/ file, it will be seen as in use and therefore not garbage collected.

To write a ref, we just need to echo the hash out to a file in the refs directory. It doesn't matter what it's called – so for simplicity we just write out the hash value as the name. To separate it from ordinary git tags and branches, we use refs/bigjobbies/e69de..391 as the name.
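The effect of such a ref can be checked with a quick experiment. This sketch uses git update-ref instead of echoing into the file directly (an assumption on my part, not necessarily what the extension does); the end result is the same kind of ref, and the blob then survives a prune.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .

hash=$(echo "large file contents" | git hash-object -w --stdin)
# Same effect as echoing the hash into .git/refs/bigjobbies/<hash>,
# but going through Git's own ref machinery.
git update-ref "refs/bigjobbies/$hash" "$hash"

# Without the ref, this prune would delete the blob.
git prune
if git cat-file -e "$hash" 2>/dev/null; then kept=yes; else kept=no; fi
echo "kept: $kept"
```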

Now, when we resolve the objects, we get the contents from the hash in the local store (if it exists); if not, we resolve via the origin's refs/bigjobbies/e69de..391 remote reference. As with the Mercurial largefiles extension, it doesn't download the contents of the files unless they're needed; on the downside, the files do need to be downloaded ahead of time in order to work off-line. Let's look at how it would work in a clone:

$ cd /tmp
$ git clone /tmp/example other
Cloning into other...
$ cd other
(master) $ ls -a1
(master) $ cat .git/refs/bigjobbies/e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
cat: .git/refs/bigjobbies/e69de29bb2d1d6434b8b29ae775ad8c2e48c5391: No such file or directory
(master) $ git bigjobbies resolve
(master) $ ls -a1
(master) $ cat .git/refs/bigjobbies/e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

The resolve command has dynamically brought in the reference from the remote server and resolved the contents of the file in the local repository. Furthermore, any large files in interim commits will not be resolved, unless they too are mentioned in the .bigjobbies file.
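What resolve does can be imitated with plain Git commands. The following is a sketch under a few assumptions: the file name big.bin and the committer identity are invented, and the clone uses a file:// URL because cloning a plain local path copies the whole object directory, which would hide the effect.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q example
hash=$(echo "large file contents" | git -C example hash-object -w --stdin)
git -C example update-ref "refs/bigjobbies/$hash" "$hash"
# Commit the pointer file only; big.bin is a hypothetical name.
echo "$hash big.bin" > example/.bigjobbies
git -C example add .bigjobbies
git -C example -c user.name=Example -c user.email=example@example.com \
  commit -q -m "Track big.bin"

# A normal clone fetches branches and tags only, so the blob is absent.
git clone -q "file://$work/example" other
cd other
if git cat-file -e "$hash" 2>/dev/null; then before=present; else before=missing; fi

# Fetch just the one ref by name; the blob comes along with it.
git fetch -q origin "refs/bigjobbies/$hash:refs/bigjobbies/$hash"
contents=$(git cat-file blob "$hash")
echo "$before, then: $contents"
```

The explicit refspec is what keeps the download on demand: nothing under refs/bigjobbies/ is fetched unless it is asked for by name.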


The point of this was to demonstrate how easy it is to make a git extension. All you need to do is put an executable named git-bigjobbies on your PATH, and you can run it with git bigjobbies.
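A minimal sketch of the mechanism, using a made-up subcommand called hello rather than the real git-bigjobbies script:

```shell
set -eu
bindir=$(mktemp -d)
# Hypothetical subcommand "hello"; any executable named git-<name> works.
cat > "$bindir/git-hello" <<'EOF'
#!/bin/sh
echo "hello from a git extension: $*"
EOF
chmod +x "$bindir/git-hello"

# Git falls back to git-<name> on PATH when <name> is not a built-in.
PATH="$bindir:$PATH"
output=$(git hello world)
echo "$output"
```

Arguments after the subcommand name are passed through untouched, so the script can be written in whatever language is convenient.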

In addition, it's a good exercise in understanding how the Git repository works. References are just pointers to hashes, and objects can be stored and referenced by those same hashes. The entire Git tool suite is built on this, in a combination of C and scripting languages (for example, git-svn is largely written in Perl, and GitHub operates mostly out of Ruby).

You can clone the BigJobbies.git repository from GitHub. The repository that you clone already has some BigJobbies in it; if you do a git bigjobbies resolve at any of the points which have a .bigjobbies file, you will find them downloaded. (Note that, due to an implementation bug, the remote must be called origin, so this will not work if you clone with git clone -o other.)

Come back next week for another instalment in the Git Tip of the Week series.