

Git Tip of the Week: Packfiles redux

2011, git, gtotw

This week's Git Tip of the Week is about packfiles. You can subscribe to the feed if you want to receive new instalments automatically.


When Git compresses its object database, it does so using pack files, which are collections of objects compressed together into a single file.
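
You can see what this looks like on disk by listing the pack directory inside the repository; the output will be pairs of pack-&lt;hash&gt;.pack and pack-&lt;hash&gt;.idx files, where the hash in the name depends on the pack's contents:

# each pack is a .pack data file with a matching .idx index file alongside it
$ ls .git/objects/pack/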

One key difference between Mercurial and Git is their use of on-disk storage. Mercurial uses a per-file storage model, where all of the history of a file is stored within that one unit. This is similar to CVS' use of ,v files to store versioned information about a specific file, although Mercurial's format is much better tuned. Git, on the other hand, has a logical object model, and whether those objects are in a compressed pack file or 'loose' on disk makes no difference to Git.
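
As a quick illustration of that last point, inspecting an object with git cat-file behaves identically whether the object happens to be loose or packed:

# print the type, and then the content, of the current commit; Git resolves the
# object from a pack file or from a loose file transparently
$ git cat-file -t HEAD
commit
$ git cat-file -p HEAD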

This enables Git to repack objects, and to generate new pack files, after they were initially created. As long as the object is available, it doesn't really matter where it came from – after all, the unique hash will always point to the same content regardless of where it is loaded from. Of course, since a pack file is immutable after creation, any changes are implemented by creating new pack files and then disposing of the old ones.
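
One way to trigger this by hand is git repack; as a sketch, the following rewrites all reachable objects into a single new pack file and removes the old, now-redundant packs:

# -a: put all reachable objects into one new pack file
# -d: delete the old pack files (and loose objects) made redundant by it
$ git repack -a -d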

Since the pack files contain many objects, it is possible to perform delta compression against the objects. In other words, if there are five versions of a file, all largely the same but with minor differences, then an initial version of the file can be stored as-is, and just the minor modifications stored afterwards. (A delta is expressed as a series of instructions, typically copying ranges of bytes from a base object and inserting ranges of new bytes; deletions and reorderings fall out of which ranges are copied.) So although logically the pack file can contain multiple full copies of a file, it only needs to store each change once.
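
You can see the deltas inside an existing pack with git verify-pack; roughly, the output lists each object's type and size, and for deltified objects, the depth of its delta chain and the base object it was computed against:

# -v: list every object in the pack, with delta depth and base where applicable
$ git verify-pack -v .git/objects/pack/pack-*.idx | head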

There are several ways of storing files in the pack file as well. For example, consider a file with the contents 'A' and a subsequent version 'AB'. This can be stored as A and then +B, or as AB and then -B. The latter is more efficient, since the change is a deletion of an existing range, so only the start and length of the deleted range need to be stored. As files grow over time, this also means that the largest version (typically the most recent) is often stored as-is, which gives faster access to it than to previous versions.

Clones and fetches

The other time pack files get created is when you clone or fetch/pull from a Git repository. The 'smart http' protocol actually uses the Git protocol over an HTTP-wrapped connection; there is a bit of back-and-forth to decide which hashes you have and which you want. Ultimately, this conversation results in the server knowing which set of objects to send to the client.
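
You can peek at parts of this conversation yourself, assuming a remote named origin. git ls-remote shows the refs the server advertises at the start of the exchange, and in more recent versions of Git, setting GIT_TRACE_PACKET dumps the raw want/have negotiation; a rough sketch:

# show the refs (and their hashes) that the remote advertises
$ git ls-remote origin
# dump the underlying packet exchange, including want/have lines, during a fetch
$ GIT_TRACE_PACKET=1 git fetch origin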

The most efficient mechanism for sending results back to the client is in the form of a pack file. The server knows what to send back, so it builds a pack file containing just those objects. This involves compressing them (delta compression, followed by deflate compression) and then sending the objects back. You can even see this in the chatter that is printed to the message channel when you clone or update the repository:


(master) $ git pull
remote: Counting objects: 1177, done.
remote: Compressing objects: 100% (352/352), done.
remote: Total 1018 (delta 609), reused 787 (delta 390)
Receiving objects: 100% (1018/1018), 378.46 KiB | 206 KiB/s, done.
Resolving deltas: 100% (609/609), completed with 110 local objects.

The first line is the server counting the number of objects (blobs, trees, commits etc.) that are needed as part of the object set. Roughly this is doing the same kind of work as a git rev-list might do; in other words, it's figuring out the set of objects necessary to send.
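
You can approximate that counting step locally; as a sketch, the following lists and counts every object reachable from HEAD (the server, of course, only counts the objects the client is missing rather than the whole history):

# list every commit, tree and blob reachable from HEAD, then count them
$ git rev-list --objects HEAD | wc -l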

The second and third lines are the server saying it's now compressing those objects into a pack file. This is done on the server side before it can start sending any data, as the pack file itself isn't written in a streaming fashion. In addition, it can write out either full objects or delta objects. (Normally a pack file is self-contained; i.e. it will only create deltas against other objects that are in the same pack file. For fetches, however, the pack file may refer to objects that the client already has, without copying those objects into the pack file itself.)
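
The plumbing command behind this step is git pack-objects, which reads a list of objects and writes out a pack; as a rough sketch of doing something similar by hand:

# build a pack containing everything reachable from HEAD and stream it to a file
# (--thin, usable with --stdout, would additionally allow deltas against objects
#  the receiver is assumed to already have, which is what happens for fetches)
$ git rev-list --objects HEAD | git pack-objects --stdout > objects.pack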

The fourth line is the client receiving the pack file from the server, once the server has created it. The pack file's header records the total number of objects it contains, which is why the client can show a progress update as the file is received.

The fifth and final line is the client resolving the deltas in the pack, including those made against objects that already exist locally (the 'local objects' in the output above). It will also generate an index file for the pack, since the index can be calculated from the SHA-1 hashes of the objects themselves.
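
That index can be regenerated from a pack at any time with git index-pack, since it is derived purely from the pack's contents; a sketch, where the pack name is a placeholder for one of the files in your own repository:

# rebuild the .idx file for an existing pack; -v shows progress
$ git index-pack -v .git/objects/pack/pack-<hash>.pack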

All of this put together means that when you clone a project for the first time, you end up with a large pack file that contains everything. Subsequent updates bring down smaller pack files, which accumulate alongside the earlier ones. At any time (or when you run git gc manually) these pack files can be reorganised into more appropriate storage units; for example, condensing a number of pack files into one or a few. Since this operation is persistent (i.e. the large pack file will retain its compression behaviour), it may be beneficial to compress a repository periodically with git gc --aggressive.
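
You can see the effect of such a consolidation with git count-objects, which reports how many loose objects and pack files the repository currently holds; a sketch:

# report loose object and pack counts before and after a garbage collection
$ git count-objects -v
$ git gc --aggressive
$ git count-objects -v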

Summary

The Git pack file is the unit which makes object storage efficient, and it also supports the efficient transfer of data between client and server during either an initial clone or a subsequent fetch/pull operation. Although pack files are immutable once created, they can be re-created with better-compressed content on an as-needed basis.


Come back next week for another instalment in the Git Tip of the Week series.