
AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: GC and Pruning

Gtotw 2011 Git

This week's Git Tip of the Week is about git gc. You can subscribe to the feed if you want to receive new instalments automatically.


Paul Webster recently wrote "Where the git did that go?" in relation to the incredible disappearing commits. Some said that a version control system shouldn't be able to do this; but it's actually all part of Git's functionality. Let's look at what happened.

A git repository stores commits as a transitive closure (a real closure, not a lambda): starting from the reachable items, such as branches and tags, it keeps every commit they lead to, along with every tree and blob. It's not possible to remove a commit, or the trees and blobs that make it up, while it is still reachable.

So, how is it possible to lose data with Git? Well, if you are using a standard Git repository, you can create branches with git branch; and delete them with git branch -d. When you delete a branch, you remove the pointer to the last commit – but you don't actually lose the commits.

In addition, the git reflog, which we covered previously, stores a list of the previous branch pointers. In other words, even if you delete a branch, the reflog has got your back.
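To make that concrete, here is a minimal sketch run in a throwaway repository. The repository location, the branch name topic, the demo identity and the commit helper are all assumptions made for illustration; the git commands themselves are standard ones.

```shell
#!/bin/sh
# Sketch: a commit outlives the branch that pointed at it.
set -e

repo=$(mktemp -d)                      # throwaway repository
git init -q "$repo"
cd "$repo"

# Hypothetical helper: identity is passed inline so nothing depends on global config.
commit() {
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

commit "base"
base=$(git symbolic-ref --short HEAD)  # master or main, depending on the setup
git checkout -q -b topic
commit "work on topic"
lost=$(git rev-parse HEAD)             # the commit the branch points at

git checkout -q "$base"
git branch -q -D topic                 # -D because topic was never merged

kind=$(git cat-file -t "$lost")        # the object is still in the database
recovered=$(git rev-parse 'HEAD@{1}')  # HEAD's reflog: where we were before the last checkout
git branch topic "$recovered"          # put the pointer back
echo "restored topic at $recovered ($kind)"
```

In real use you would read git reflog to find the right entry rather than assume it is HEAD@{1}.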

Generally speaking however, only repositories with working directories have reflogs; bare repositories tend not to. There is a config option, git config core.logAllRefUpdates, which can be used to force it on all repositories – or disable it completely if it's not needed.
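As a rough illustration, the following sketch switches the option on for a bare repository and then checks that a push leaves a reflog entry behind. The directory layout, the branch name topic and the demo identity are made up for the example.

```shell
#!/bin/sh
# Sketch: turn on reflogs in a bare repository and confirm a push is recorded.
set -e

top=$(mktemp -d)                                  # throwaway location
git init -q --bare "$top/remote.git"
git -C "$top/remote.git" config core.logAllRefUpdates true

git init -q "$top/work"
cd "$top/work"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
git push -q "$top/remote.git" HEAD:refs/heads/topic

# Count the reflog entries the bare repository now holds for the branch.
entries=$(git -C "$top/remote.git" reflog show refs/heads/topic | wc -l)
echo "reflog entries for topic: $entries"
```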

Even without a reflog, the commits aren't removed immediately. If you run git gc, which repacks the repository into a more efficient structure, it will eject unreferenced commits from the packfile and leave them as loose objects. (You have to ensure that there aren't any branches or tags or reflogs to see this behaviour; if there's an existing pointer then it will not evict the object from the packfile.)

Running a git fsck will check that all objects are present as expected. You can also see what is no longer referenced; running git fsck --unreachable will show you which commits are no longer reachable due to deleted branches or removed tags. Running git fsck --unreachable daily and mailing reports will give a good early warning of commits about to disappear if it's a concern.
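A report along those lines might be sketched as below. The helper name list_unreachable_commits and the demonstration repository are assumptions. The --no-reflogs flag is added here so that commits remembered only by a reflog are listed too; drop it if you only care about commits with no remaining pointer at all.

```shell
#!/bin/sh
# Sketch: list the commits that no branch or tag can reach any more.
set -e

# Hypothetical helper; takes a repository path, defaulting to the current directory.
list_unreachable_commits() {
    git -C "${1:-.}" fsck --unreachable --no-reflogs 2>/dev/null |
        awk '$1 == "unreachable" && $2 == "commit" { print $3 }'
}

# Demonstration in a throwaway repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
commit() {
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}
commit "base"
base=$(git symbolic-ref --short HEAD)
git checkout -q -b topic
commit "soon to be orphaned"
lost=$(git rev-parse HEAD)
git checkout -q "$base"
git branch -q -D topic

report=$(list_unreachable_commits "$repo")
echo "$report"
```

A daily cron job could pipe the same output to whatever mailer is available on the server.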

Objects which are no longer referenced can be evicted with git prune, though this is a low-level operation which is usually called from git gc. By default, when run from git gc, it will not remove commits less than two weeks old (the gc.pruneExpire setting), nor of course the commits that are reachable from them; so provided the deleted branch (or tag) has recent commits, they will stay around in the git repository for up to a fortnight afterwards.
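The grace period can be seen with a dry run, which reports what would go without removing anything. This is only a sketch: the throwaway repository and demo identity are invented, and the unreferenced commit is made with git commit-tree purely so that no branch, tag or reflog ever points at it.

```shell
#!/bin/sh
# Sketch: compare a prune with the two-week grace period against one without it.
set -e

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"

# commit-tree writes a commit object but updates no ref, so nothing points at it.
orphan=$(echo "never referenced" |
    git -c user.name=demo -c user.email=demo@example.com \
        commit-tree 'HEAD^{tree}' -p HEAD)

kept=$(git prune --dry-run --expire=2.weeks.ago)   # fresh objects are spared
doomed=$(git prune --dry-run)                      # no grace period at all

echo "with grace period:    ${kept:-nothing to prune}"
echo "without grace period: $doomed"
```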

Avoiding future issues

Both branches and tags can be deleted; and when invoking a remote push operation a missing branch (or tag) on the client side can invoke a delete; for example: git push github :refs/heads/master will delete the 'master' branch off the remote repository known as github. If this is in a script, such as git push github $COMIT:refs/heads/master and the variable is misspelled (therefore evaluates to the empty string) this can inadvertently delete the branch. (The same is true for tags in :refs/tags/.)
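One way a script might protect itself is to build the refspec through a small helper that refuses an empty source. The helper name make_refspec is hypothetical, and this is a sketch of the idea rather than a complete push script; running the script with set -u, so that unset variables are an error, would be another option.

```shell
#!/bin/sh
# Sketch: never let an empty source turn a push into a delete.
set -e

# Hypothetical helper: prints src:dst, or fails if src is empty.
make_refspec() {
    src=$1
    dst=$2
    if [ -z "$src" ]; then
        echo "refusing to build a deleting refspec for $dst" >&2
        return 1
    fi
    printf '%s:%s\n' "$src" "$dst"
}

COMMIT=HEAD

# Spelled correctly: a normal refspec.
good=$(make_refspec "$COMMIT" refs/heads/master)

# Misspelled: $COMIT is unset, so the helper rejects it instead of emitting ":refs/heads/master".
bad=$(make_refspec "${COMIT:-}" refs/heads/master) || bad="(rejected)"

echo "good: $good"
echo "bad:  $bad"
```

The push itself would then be git push github "$good", and only runs if the helper succeeded.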

A remote repository can disable such operations: the setting receive.denyDeletes prevents any ref deletion, and receive.denyNonFastForwards rejects non-fast-forward updates to branches. With the first set, delete requests are refused; with the second, a push cannot overwrite a branch with a commit that doesn't strictly follow it in history. (Overwriting history is occasionally a useful operation, so it may be necessary to provide a means to lift the restriction in certain situations.)
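The effect on a delete can be tried out locally. In this sketch the remote is just a bare repository in a temporary directory, and the branch name topic and the demo identity are assumptions.

```shell
#!/bin/sh
# Sketch: a remote with receive.denyDeletes set refuses a deleting push.
set -e

top=$(mktemp -d)
git init -q --bare "$top/remote.git"
git -C "$top/remote.git" config receive.denyDeletes true
git -C "$top/remote.git" config receive.denyNonFastForwards true

git init -q "$top/work"
cd "$top/work"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "base"
git push -q "$top/remote.git" HEAD:refs/heads/topic

# An empty source asks for a delete; the remote should turn it down.
if git push -q "$top/remote.git" :refs/heads/topic 2>/dev/null; then
    result=deleted
else
    result=refused
fi
echo "delete request was $result"
```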

In addition, ensuring that repositories have core.logAllRefUpdates enabled means the repository still keeps the history of its branches, at least for the periods set by gc.reflogExpire and gc.reflogExpireUnreachable (90 and 30 days by default).
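For a server-side repository, the settings might be applied together as in the sketch below. The 180 and 90 day values are arbitrary examples rather than recommendations, and the repository is a throwaway one.

```shell
#!/bin/sh
# Sketch: keep reflogs, and keep them for longer than the defaults.
set -e

repo=$(mktemp -d)
git init -q --bare "$repo"

git -C "$repo" config core.logAllRefUpdates true
git -C "$repo" config gc.reflogExpire 180.days.ago             # example value
git -C "$repo" config gc.reflogExpireUnreachable 90.days.ago   # example value

expiry=$(git -C "$repo" config --get gc.reflogExpire)
echo "reflog entries are kept until they are older than $expiry"
```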

Summary

Whilst git can be used as it comes, there are powerful options which can tweak or constrain its behaviour. In the face of scripts which have full access to the remote repository, it is advisable to have a more controlled set of options rather than the default you-can-do-anything approach. With this knowledge in mind, you should be able to set your options appropriately for your environment.


Come back next week for another instalment in the Git Tip of the Week series.