
AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Objects and Packfiles

Gtotw 2011 Git

This week's Git Tip of the Week is about objects and packs. You can subscribe to the feed if you want to receive new instalments automatically.


So far, we've talked about commits, trees and blobs. We've seen how they fit into the logical object model and how they're represented on disk in the .git/objects directory.

But storing every version of every file in separate files (albeit compressed) is going to be a huge waste of space, right? Yes, there's some sharing of identical content between commits, but Git would hardly be the efficient store that it's known for with a storage structure like that.

Pack files

Fortunately, Git has the ability to merge together multiple objects into single files, known as pack files. These are, in essence, multiple objects stored with an efficient delta compression scheme as a single compressed file. You can think of it as akin to a Zip file of multiple objects, which Git can extract efficiently when needed.
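To make the idea of delta compression a little more concrete, here is a toy sketch in Python. It is not Git's actual delta encoding or its on-disk format; the function names and the tuple-based operations are made up for illustration. It only shows the general principle of describing one version of a file as "copy this range from the base" and "insert these new bytes" steps.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copy/insert operations against base (toy format)."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base: store only where they live.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that only exist in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no operation: those base bytes are simply not copied.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base plus the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line one\nline two\nline three\n"
target = b"line one\nline 2\nline three\nline four\n"
delta = make_delta(base, target)
rebuilt = apply_delta(base, delta)
```

When two versions of a file share most of their content, the delta is mostly short copy operations, which is why storing related objects together in one pack can take much less space than storing each one whole.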

Pack files are stored in the .git/objects/pack/ directory. For new projects, this is likely to be empty; Git starts off by adding all files as non-packed objects, known as loose objects. One reason it does this is that, as you're working through changes, you're quite likely to re-write various files (blobs) and directories (trees) before you commit. In fact, each time you do a git add to stage a file, you're creating a new object in the loose objects structure.
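As a reminder of what a loose object looks like, the following Python sketch derives the name, the on-disk path and the compressed contents of a blob. It assumes a repository using SHA-1 object names (the default), and the helper name loose_object is hypothetical rather than anything Git provides.

```python
import hashlib
import zlib


def loose_object(content: bytes, kind: str = "blob"):
    """Return (name, path, compressed bytes) for a loose object."""
    # Git hashes a small header ("<type> <size>" plus a NUL byte) followed by the content.
    store = f"{kind} {len(content)}".encode() + b"\x00" + content
    name = hashlib.sha1(store).hexdigest()
    # The first two hex digits become the directory, the rest the file name.
    path = f".git/objects/{name[:2]}/{name[2:]}"
    # The header and content are zlib-compressed together on disk.
    return name, path, zlib.compress(store)


name, path, compressed = loose_object(b"")
print(name)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(path)  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Every staged version of a file gets its own file like this, which is what makes a later packing step worthwhile.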

What happens is that periodically (or on user demand), Git will run a compression on the loose objects. This is triggered either by a git gc request, or automatically after various thresholds have been met. Git will then create the pack file and remove the loose object files.
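One of those thresholds is the number of loose objects, controlled by the gc.auto setting (documented default: 6700). The Python sketch below counts the loose objects in a repository and compares the total against that number. It is only an approximation under stated assumptions: Git's own check is implemented differently, and the function names here are hypothetical.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def count_loose_objects(objects_dir) -> int:
    """Count files in the two-hex-digit directories under .git/objects."""
    total = 0
    for sub in Path(objects_dir).iterdir():
        # Skip "info" and "pack"; loose objects live in directories like "e6".
        if sub.is_dir() and len(sub.name) == 2 and set(sub.name) <= HEX_DIGITS:
            total += sum(1 for entry in sub.iterdir() if entry.is_file())
    return total


def should_auto_pack(objects_dir, threshold: int = 6700) -> bool:
    """Rough stand-in for the loose-object threshold (gc.auto)."""
    return count_loose_objects(objects_dir) > threshold


# Only inspect a repository if we happen to be run inside one.
if Path(".git/objects").is_dir():
    print(count_loose_objects(".git/objects"), should_auto_pack(".git/objects"))
```

In the small repository used in this post the count would be 3 before packing, so only an explicit git gc will create a pack.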


(master) $ touch empty
(master) $ git add empty
(master) $ git commit -m "Empty"
[master (root-commit) cab1545] Empty
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 empty
(master) $ ls .git/objects/
41	ca	e6	info	pack
(master) $ ls .git/objects/pack/

You may recognise the 'e6' directory as being the prefix of the empty file in Git, which we covered earlier and is identified by e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. However, at this stage, there's no content in the pack directory. What happens if we pack it?


(master) $ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
(master) $ ls .git/objects/
info	pack
(master) $ ls .git/objects/pack/
pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.idx
pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.pack

Where did the objects go? Well, they've been compressed into a single read-only pack file. We can still address them using their hash, even if they're not loose files any more:


(master) $ git cat-file -t e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
blob
(master) $ git cat-file -s e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
0

The pack file's contents on disk are smaller than the set of loose files on their own (though in trivial examples like this, there isn't that much difference between them). A pack is actually made up of two files: the index (.idx) and the pack (.pack). Whilst the latter stores the data, the former stores a table of contents listing the objects contained within the pack itself:


(master) $ hexdump .git/objects/pack/pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.idx
0000000 ff 74 4f 63 00 00 00 02 00 00 00 00 00 00 00 00
0000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
0000110 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 01
0000330 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 02
00003a0 00 00 00 03 00 00 00 03 00 00 00 03 00 00 00 03
0000400 00 00 00 03 00 00 00 03 41 7c 01 c8 79 5a 35 b8
0000410 e8 35 11 3a 85 a5 c0 c1 c7 7f 67 fb ca b1 54 54
0000420 f6 75 88 fb 81 27 96 75 09 38 77 09 7a 75 21 de
0000430 e6 9d e2 9b b2 d1 d6 43 4b 8b 29 ae 77 5a d8 c2
0000440 e4 8c 53 91 7d 73 67 fc 61 d0 d2 e8 6e 76 00 29
0000450 00 00 00 85 00 00 00 0c 00 00 00 b1 f5 5c c2 b5
0000460 8e 21 45 55 02 06 88 64 e9 1b 8b 52 75 c4 46 3d
0000470 57 df 0b ca 6a 8f f3 57 6c d4 97 78 df 30 1d bc
0000480 4d 24 1e a4
0000484

You'll recognise in the hex dump of the index the empty blob stored in Git (e69d…5391), along with the tree containing the empty file (417c…67fb) and the commit itself (cab1…21de).

The purpose of the index file is to tell Git which objects are in the corresponding pack file, and at what offset. In this case, we've only got one pack file, but large repositories will have many such files. The index allows Git to load the small index files, rather than the packs themselves, to determine the answer to “Where are these objects?” so that it can extract them in the most efficient manner.
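To show how the bytes in the dump answer that question, here is a Python sketch that rebuilds the same version 2 index in memory and looks an object up in it. The object names, CRC32 values and offsets are copied from the hex dump; the two trailing 20-byte checksums are replaced with zeroed placeholders, and the function names are hypothetical. It also ignores the extra table used for offsets too large for 32 bits, so treat it as a reading aid rather than a complete parser.

```python
import struct
from bisect import bisect_left

MAGIC = b"\xfftOc"   # ff 74 4f 63, the first four bytes of the dump
HEADER = 8           # magic plus a 32-bit version number
FANOUT = 256 * 4     # 256 big-endian 32-bit counters


def build_sample_idx() -> bytes:
    """Rebuild the index from the dump (checksums are zeroed placeholders)."""
    entries = [  # (object name, CRC32, offset in the .pack), sorted by name
        ("417c01c8795a35b8e835113a85a5c0c1c77f67fb", 0x7D7367FC, 0x85),
        ("cab15454f67588fb81279675093877097a7521de", 0x61D0D2E8, 0x0C),
        ("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", 0x6E760029, 0xB1),
    ]
    names = [bytes.fromhex(n) for n, _, _ in entries]
    # fanout[b] = how many objects have a first byte <= b
    fanout = [sum(1 for n in names if n[0] <= b) for b in range(256)]
    out = MAGIC + struct.pack(">I", 2) + struct.pack(">256I", *fanout)
    out += b"".join(names)
    out += b"".join(struct.pack(">I", crc) for _, crc, _ in entries)
    out += b"".join(struct.pack(">I", off) for _, _, off in entries)
    return out + bytes(40)  # placeholder pack checksum + index checksum


def parse_idx_v2(data: bytes):
    """Return (fanout, names, offsets) from a version 2 index."""
    magic, version = struct.unpack_from(">4sI", data, 0)
    if magic != MAGIC or version != 2:
        raise ValueError("not a version 2 pack index")
    fanout = struct.unpack_from(">256I", data, HEADER)
    count = fanout[255]  # the last counter is the total number of objects
    names_at = HEADER + FANOUT
    offsets_at = names_at + 20 * count + 4 * count  # skip names, then CRC32s
    names = [data[names_at + 20 * i:names_at + 20 * (i + 1)] for i in range(count)]
    offsets = struct.unpack_from(f">{count}I", data, offsets_at)
    return fanout, names, offsets


def find_offset(data: bytes, hex_name: str):
    """Return the object's offset in the pack, or None if it is not listed."""
    fanout, names, offsets = parse_idx_v2(data)
    name = bytes.fromhex(hex_name)
    # The fan-out table narrows the search to names sharing the first byte.
    lo = fanout[name[0] - 1] if name[0] else 0
    hi = fanout[name[0]]
    i = bisect_left(names, name, lo, hi)
    return offsets[i] if i < hi and names[i] == name else None


idx = build_sample_idx()
print(hex(find_offset(idx, "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")))  # 0xb1
```

The run of 00 00 00 01, 00 00 00 02 and 00 00 00 03 values in the dump is the fan-out table: the counters step up at the slots for first bytes 41, ca and e6, which is how a lookup can skip straight to the right part of the sorted name list.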

Summary

Whilst Git stores objects in loose form whilst you work on new changes, it will compress them into pack files to take greater advantage of delta compression. This happens when you run git gc or when various thresholds are met automatically. It also explains why Git's storage requirements follow a sawtooth-like pattern: each time the ramp goes up, it's because new objects are being created, and each time it goes down, it's because a pack has been run and new pack files have been created (along with the corresponding loose objects being deleted).


Come back next week for another instalment in the Git Tip of the Week series.