AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Objects

2011, git, gtotw

This week’s Git Tip of the Week is about git object storage. You can subscribe to the feed if you want to receive new instalments automatically.

This week we’ll be taking a bit of a deeper dive into the way that Git stores its objects. We’ll look at how they’re identified, how they’re related, and see why Git handles moves better than other version control systems.

By now, you’re familiar with the concept of a commit hash (or just commit) – a 40-character hexadecimal sequence that uniquely identifies a commit, such as d16085b3b913e5bc5e351c0a7461051e9973629a. But where does this come from?

A Git repository is actually just a collection of objects, each identified by its own hash. Whenever you add a file, Git generates a hash from its contents, and this hash is used to uniquely point to that version of the file. For example, if you create an empty file, it will have the hash e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. You can confirm this by adding an empty file to a repository and using git ls-tree to see the contents:

(master) $ touch empty
(master) $ git add empty
(master) $ git commit -a -m "Empty"
[master (root-commit) 4145429] Empty
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 empty
(master) $ git ls-tree master .
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	empty

What git ls-tree is saying is that the master branch contains a file called empty whose permissions are 100644 (owner read/write, group+other read), and whose hash is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391.
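To make the fields of that line concrete, here is a minimal Python sketch that splits one line of git ls-tree output into its parts and renders the mode the way ls -l would. The function name parse_ls_tree_line is made up for this illustration, and it assumes the default output format: mode, type and hash separated by spaces, followed by a tab and the file name.

```python
import stat


def parse_ls_tree_line(line):
    """Split one default-format `git ls-tree` line into its fields (illustrative helper)."""
    # The metadata and the name are separated by a tab; the name may contain spaces.
    meta, name = line.split("\t", 1)
    mode, obj_type, obj_hash = meta.split(" ")
    return {
        "mode": mode,
        "type": obj_type,
        "hash": obj_hash,
        "name": name,
        # The mode is an octal number; stat.filemode renders it like `ls -l` does.
        "permissions": stat.filemode(int(mode, 8)),
    }


entry = parse_ls_tree_line(
    "100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391\tempty"
)
print(entry["permissions"], entry["name"])  # prints: -rw-r--r-- empty
```

The leading 100 in the mode marks a regular file, and the trailing 644 is the familiar permission triple.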

Similarly, if you look in the repository’s object store, you’ll find that a file .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 has been created. The object store is split into 256 top-level directories (00 to ff), each named after the first two hexadecimal characters of a hash; the remaining 38 characters form the file name, so the directory name followed by the file name gives the full hash.
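The mapping from hash to file name is simple enough to sketch in a few lines of Python. The helper name loose_object_path and its git_dir parameter are invented for this example, and it assumes the object is stored as an individual (loose) file in the object store.

```python
def loose_object_path(object_hash, git_dir=".git"):
    """Return the path a loose object with this hash would have (illustrative helper)."""
    if len(object_hash) != 40:
        raise ValueError("expected a 40-character hexadecimal hash")
    # The first two hex characters name the directory, the remaining 38 the file.
    return f"{git_dir}/objects/{object_hash[:2]}/{object_hash[2:]}"


print(loose_object_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
# prints: .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```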

So, how does Git compute this value? Well, it uses a SHA-1 hash; but the SHA-1 of an empty input isn’t this value. In fact, Git prefixes the object with "blob ", followed by the length of the contents in bytes (as a human-readable decimal integer), followed by a NUL character, followed by the contents. So for our case, we have:

$ echo -en "blob 0\0" | shasum
$ echo -en "blob 0\0" | openssl dgst -sha1
$ printf "blob 0\0" | shasum
$ printf "blob 0\0" | openssl dgst -sha1

All of these print out the same value, e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. Note that \0 is the escape code for the NUL character; the -e flag tells echo to interpret the escape, and -n stops it appending a newline. If you get fef5d… then it is interpreting the \0 as two characters, the \ and the 0. (And if you get be21… or b825… then it’s adding a newline at the end.)
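If the shell’s handling of escapes gets in the way, the same calculation can be sketched in Python, where the NUL byte is written explicitly. This is only an illustration of the format described above; git_blob_hash is a name chosen for this example rather than anything Git provides.

```python
import hashlib


def git_blob_hash(contents: bytes) -> str:
    """Hash contents the way the text describes for a blob (illustrative helper)."""
    # Header: the word "blob", a space, the length in decimal, then a NUL byte.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()


print(git_blob_hash(b""))  # prints: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the contents are passed as bytes, there is no ambiguity about trailing newlines or how \0 is interpreted.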

Instead of calculating this format ourselves, we can use git hash-object to calculate a hash – or, with -w, insert an object in our local repository:

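(master) $ echo 'Hello, World!' | git hash-object -w --stdin
8ab686eafeb1f44702738c8b0f24f2567c36da6d
(master) $ ls .git/objects/8a
b686eafeb1f44702738c8b0f24f2567c36da6d
(master) $ echo -e 'blob 14\0Hello, World!' | shasum
8ab686eafeb1f44702738c8b0f24f2567c36da6d  -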

This has created a hash of our object (blob 14\0Hello, World!\n, where the 14 counts the 13 characters of the text plus the newline that echo appends) and written it into the objects directory under that hash. The contents are compressed with the DEFLATE algorithm; but at the moment, the object isn’t used or referred to anywhere in our tree. Although we don’t see it in the working directory, we can see it in the repository itself:

(master) $ git show 8ab686eafeb1f44702738c8b0f24f2567c36da6d
Hello, World!
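To see how the header, the hash and the compression fit together, here is a rough Python sketch that builds the bytes of a blob object in memory and then unpacks them again. It assumes the stored form is a zlib stream (DEFLATE with a zlib wrapper) over the header plus contents; the function names are invented for this example, and the exact compressed bytes Git writes may differ depending on its compression level, even though they decompress to the same data.

```python
import hashlib
import zlib


def encode_blob(contents: bytes):
    """Return (hash, compressed bytes) for a blob (illustrative helper)."""
    store = b"blob " + str(len(contents)).encode("ascii") + b"\0" + contents
    # The hash is taken over the uncompressed header + contents.
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_object(compressed: bytes):
    """Decompress an object and split it into (type, contents)."""
    store = zlib.decompress(compressed)
    # Everything up to the first NUL byte is the "type length" header.
    header, _, contents = store.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(contents):
        raise ValueError("length in header does not match contents")
    return obj_type.decode("ascii"), contents


object_hash, compressed = encode_blob(b"Hello, World!\n")
print(object_hash)                # prints: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
print(decode_object(compressed))  # prints: ('blob', b'Hello, World!\n')
```

Under the same assumption, passing the raw bytes of the file in .git/objects/8a/ to decode_object should give back the same type and contents.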

Next time, we’ll look at how Git organises objects into directories, and ultimately, commits.

Come back next week for another instalment in the Git Tip of the Week series.