
AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Trees

Gtotw 2011 Git

This week's Git Tip of the Week is about git tree storage. You can subscribe to the feed if you want to receive new instalments automatically.


Last week, we looked at how Git stores objects in the local repository. This week, we're going to look into how those objects correspond to directories, or trees.

Git uses a uniform storage model for all of its objects. Each object is identified by its hash, and the type of the object is stored as metadata alongside its content. Thus, given an ID, it's possible to find out both the object's type and its content:


(master) $ # Note: objects from previous
(master) $ git cat-file -t e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
blob
(master) $ git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
(master) $ git cat-file -t 8ab686eafeb1f44702738c8b0f24f2567c36da6d
blob
(master) $ git cat-file -p 8ab686eafeb1f44702738c8b0f24f2567c36da6d
Hello, World!
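
To make the "type stored as metadata" point concrete, here is a minimal Python sketch of how an object ID can be derived, assuming the usual SHA-1 object format of a type, a space, the content length and a NUL byte in front of the content. The function name git_object_id is made up for illustration, and the second blob is assumed to end with a newline:

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 ID for an object of the given type and content."""
    # The header records the type and the content length, ended by a NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

empty_id = git_object_id("blob", b"")
hello_id = git_object_id("blob", b"Hello, World!\n")  # assumes a trailing newline

print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hello_id)
```

Because the type is part of what gets hashed, a blob and a tree with byte-identical content would still end up with different IDs.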

How do these objects get packaged up, so that you can get them in your working directory? Well, blobs are arranged in trees, which correspond to directories in a directory structure. If we have a directory containing a single file called empty, we can print out the directory's contents:


(master) $ git ls-tree master .
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty

Note that this isn't listing the contents on disk; rather, it's showing you Git's view of the folder. This allows us to list items on different branches (or tags) without needing to check them out first, and in fact, is how hosting sites like GitHub and tools like GitWeb work. The master argument simply asks Git to show us the tree at the tip of the branch with that name.

What happens if we add another file, with the same contents?


(master) $ cp empty anotherEmpty
(master) $ git add anotherEmpty
(master) $ git commit -a
[master ca5fc4f] Another empty
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 anotherEmpty
(master) $ git ls-tree master .
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	anotherEmpty
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty

We have a new entry in the tree, but it points to exactly the same blob, much like a hard link on a UNIX filesystem. Unlike a hard link, though, changing one of the files doesn't change the shared contents; rather, Git stores a new blob (since the new contents have a different hash) and a new tree is written that points to that instead.
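
The sharing behaviour can be sketched with two plain dictionaries. This is only a toy model, not how Git lays things out on disk: store stands in for the object database, tree for a single directory listing, and add_file is a hypothetical helper.

```python
import hashlib

def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

store = {}  # object ID -> content (toy stand-in for the object database)
tree = {}   # file name -> object ID (toy stand-in for one tree)

def add_file(name: str, content: bytes) -> None:
    oid = blob_id(content)
    store[oid] = content  # storing identical content again changes nothing
    tree[name] = oid

add_file("empty", b"")
add_file("anotherEmpty", b"")
shared = tree["empty"] == tree["anotherEmpty"]  # True: one blob, two names
blobs_before = len(store)                       # 1

# "Editing" anotherEmpty stores a new blob and repoints only that entry.
add_file("anotherEmpty", b"no longer empty\n")
blobs_after = len(store)                        # 2
print(shared, blobs_before, blobs_after)
```

The entry for empty still points at the original blob after the edit, which is the property that makes it safe for any number of names (and trees) to share one object.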

How is this tree stored in the repository, though? Well, it turns out that it's another object type, stored using the same mechanism as blobs. You can get at the tree of a commit (or branch) with the ^{tree} suffix:


(master) $ git cat-file -t HEAD^{tree}
tree
(master) $ git cat-file -p HEAD^{tree}
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	anotherEmpty
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty
(master) $ git rev-parse HEAD^{tree}
2b61e34a91ca9780ea2f943e72f1a4a022cdd206

The tree represents a directory, containing a mixture of blobs and trees. We can find out what it resolves to using git rev-parse, to determine that this tree is an object which hashes to 2b61e34....

How is this tree created? Well, again, it's a well-defined object identified by the SHA-1 hash of its contents. The object type is tree and, instead of holding file contents like a blob, its body is a list of entries, each pointing to an object by ID, along with a mode (typically 100644 for regular files, 100755 for executable files and 40000 for directories) and a name. However, we know the size of the SHA-1 hash, so it doesn't need to be stored as human-readable hex; we can serialize it out as 20 raw bytes. For a file entry, the length works out at 28 bytes per row (six for the mode, one for the space, one for the NUL after the name and 20 for the hash), plus however many bytes there are in the file name. In our case, we have 28 + "anotherEmpty".length() + 28 + "empty".length(), or 73 bytes in total:


(master) $ echo -en "tree 73\x00→
100644 anotherEmpty→
\x00→
\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b→
\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91→
100644 empty→
\x00→
\xe6\x9d\xe2\x9b\xb2\xd1\xd6\x43\x4b\x8b→
\x29\xae\x77\x5a\xd8\xc2\xe4\x8c\x53\x91→
" | shasum
2b61e34a91ca9780ea2f943e72f1a4a022cdd206  -
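
The same calculation can be written as a short Python sketch, which avoids typing out the escaped bytes by hand. The helper name tree_entry is made up for illustration; the layout it assumes (mode, space, name, NUL, 20 raw hash bytes) is the one described above, and if that holds the printed ID should be the 2b61e34... value from the transcript:

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

def tree_entry(mode: str, name: str, hex_id: str) -> bytes:
    # mode, space, name, NUL, then the ID as 20 raw bytes rather than 40 hex digits
    return (mode.encode("ascii") + b" " + name.encode("utf-8")
            + b"\x00" + bytes.fromhex(hex_id))

# Entries must already be in sorted order; nothing here sorts them.
body = (tree_entry("100644", "anotherEmpty", EMPTY_BLOB)
        + tree_entry("100644", "empty", EMPTY_BLOB))
header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
tree_id = hashlib.sha1(header + body).hexdigest()

print(len(body))  # 73
print(tree_id)
```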

Creating a tree by hand like this, on the other hand, is a little more tricky. To solve this problem, the git mktree command exists, which takes a stream formatted like the output of git ls-tree and generates a tree object for you. It's a little like the git hash-object command from last week, but without having to convert the references from hex strings into raw bytes yourself. In addition, it also ensures that the tree's contents are appropriately sorted, which is a mandatory pre-requisite (in order to support fast retrieval).


(master) $ git ls-tree master .
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	anotherEmpty
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty
(master) $ git ls-tree master . | git mktree
2b61e34a91ca9780ea2f943e72f1a4a022cdd206

This allows us to easily create a new tree, with a new file in it:


(master) $ echo -en→ "
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tanotherEmpty\n→
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tempty\n→
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tvoid\n" | git mktree
d2d6bbd1c25c154fcbb045d66e8a6f9b83587a68

We've now got three files in a tree (all with the same contents; all empty), and we can refer to the new tree directly by its hash. We can even list it again:


(master) $ git ls-tree d2d6bbd1c25c154fcbb045d66e8a6f9b83587a68
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	anotherEmpty
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	void
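
As a rough illustration of what git mktree has to do, the following Python sketch parses ls-tree style lines, sorts the entries and hashes the result. It is a simplified stand-in rather than Git's implementation: the function mktree here is hypothetical, it only computes the ID without writing anything to a repository, and it does not check that the referenced objects exist.

```python
import hashlib

def mktree(listing: str) -> str:
    """Compute a tree ID from lines of the form '<mode> <type> <id>\\t<name>'."""
    entries = []
    for line in listing.splitlines():
        if not line.strip():
            continue
        meta, name = line.split("\t", 1)
        mode, obj_type, hex_id = meta.split()
        # The raw form drops the leading zero that ls-tree shows on 040000.
        entries.append((mode.lstrip("0"), obj_type, hex_id, name.encode("utf-8")))
    # Entries are ordered by name bytes; subtree names compare as if they ended in "/".
    entries.sort(key=lambda e: e[3] + (b"/" if e[1] == "tree" else b""))
    body = b"".join(
        mode.encode("ascii") + b" " + name + b"\x00" + bytes.fromhex(hex_id)
        for mode, _type, hex_id, name in entries
    )
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

EMPTY = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
# Deliberately out of order, to show that sorting happens before hashing.
listing = (
    f"100644 blob {EMPTY}\tvoid\n"
    f"100644 blob {EMPTY}\tempty\n"
    f"100644 blob {EMPTY}\tanotherEmpty\n"
)
new_tree_id = mktree(listing)
print(new_tree_id)
```

If the format assumptions hold, the printed ID should be the d2d6bbd... value from the transcript, regardless of the order in which the three lines are supplied.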

Although we haven't shown it here, if you wanted to create a tree containing other trees (instead of blobs) then they work in exactly the same way; the differences are that the word 'blob' is replaced with 'tree', the mode is 040000 rather than 100644 and, of course, the hash has to be that of a tree object.
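
A nested tree can be sketched in the same way. In this example the two-file tree from earlier is placed in a directory called docs; that name is purely hypothetical, as is the helper entry. In ls-tree form the line would read 040000 tree 2b61e34... followed by a tab and docs, while the raw object stores the mode without its leading zero:

```python
import hashlib

SUBTREE_ID = "2b61e34a91ca9780ea2f943e72f1a4a022cdd206"  # the two-file tree from earlier
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"

def entry(mode: str, name: str, hex_id: str) -> bytes:
    return f"{mode} {name}".encode("utf-8") + b"\x00" + bytes.fromhex(hex_id)

# "docs" is a hypothetical directory name. Entries are written in sorted order,
# and the subtree uses mode 40000 where a regular file would use 100644.
body = entry("40000", "docs", SUBTREE_ID) + entry("100644", "empty", EMPTY_BLOB)
header = f"tree {len(body)}".encode("ascii") + b"\x00"
parent_id = hashlib.sha1(header + body).hexdigest()

print(len(body))  # 64
print(parent_id)
```

Nothing in the entry itself says how deep the hierarchy goes; the parent only records the ID of its immediate child tree, which in turn records its own entries.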

We've now seen blobs and trees; next week, we'll have a look at how they turn up on branches in the form of commits.


Come back next week for another instalment in the Git Tip of the Week series.