
AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Understanding the Index

Gtotw 2011 Git

This week's Git Tip of the Week is about indexes. You can subscribe to the feed if you want to receive new instalments automatically.


In the previous Git Tip of the Week, we looked at interactive adding; that is, the ability to add just parts of a file instead of the whole file as part of the commit.

It's about time we focussed on the staging area of Git, which we implicitly used when we added parts of the file last time. This is one of the main differences between Git and other version control systems, and often people are confused about its purpose.

The staging area of Git allows you to freeze a state of your working tree, so that the subsequent git commit takes that frozen state and uses it as the content of the commit. Many version control systems only let you freeze at the point of commit and so don't have this intermediate stage. When you are using Git for the first time, you will often fall into the habit of following every git add immediately with a git commit, or even setting up an alias to do both in one step.

So why does Git have the concept of an index, anyway? Well, remember that Git uses content-addressable storage; in other words, a specific piece of content (like the empty file) always has the same identity – e69..391 – whatever the file is called. When you run git add, the content is added to the object database as a blob. As well as the object being added, something needs to point to it, so there's a virtual tree called the index, which records each path and the blob it points to.
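
As a minimal sketch of the content-addressing idea, the following creates two empty files with different names and asks Git for the identity of each. The temporary directory and the file names are made up for illustration; git hash-object only computes the id here and stores nothing.

```shell
# Work in a throwaway directory so nothing real is touched (illustrative path).
demo=$(mktemp -d)
cd "$demo"

# Two empty files with different, hypothetical names.
: > first.txt
: > second.txt

# git hash-object prints the object id Git would give each file's content.
first_id=$(git hash-object first.txt)
second_id=$(git hash-object second.txt)

echo "first.txt:  $first_id"
echo "second.txt: $second_id"

# Identical content means identical identity, whatever the file is called.
[ "$first_id" = "$second_id" ] && echo "same object id"
```

Both lines should print the same id, since the name of a file plays no part in the hash of its content.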

When you add files (or rm files), you really end up modifying this virtual tree, which represents what you'd like to commit next. When you run git commit, Git takes that virtual tree, builds a real tree object from it, and writes a commit pointing at that tree to the database (as well as updating the current branch, if any).
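
One way to see this for yourself is sketched below, using a throwaway repository. The file name, commit message and identity are assumptions for the example. git ls-files --stage lists what the index holds, and git write-tree builds a tree object from the index without committing; the tree of the commit made afterwards should be that same tree.

```shell
# Throwaway repository; path, file name and identity are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "hello" > greeting.txt
git add greeting.txt

# List what the index currently holds: mode, blob id, stage number, path.
git ls-files --stage

# Build a tree object from the index without committing anything.
index_tree=$(git write-tree)

# Commit, then look up the tree that the new commit points at.
git -c user.name="Example" -c user.email="example@example.com" \
    commit -q -m "Add greeting"
commit_tree=$(git rev-parse 'HEAD^{tree}')

echo "tree built from index: $index_tree"
echo "tree in the commit:    $commit_tree"
```

The two ids match because the commit simply records the tree that the index described at that moment.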

Although it may not immediately seem useful to have this feature (and some argue that this is an example of Git's complexity over other systems), it can be very beneficial for doing specific operations; for example:

  • Staging parts of a file to break it up into different commits (as last time)
  • Working with large merges, where many files may have conflicts (you can record which ones don't have conflicts, and which ones you've already dealt with, by running git add; you're then left with a decreasing number of differences to process)
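
The merge case can be sketched with a deliberately tiny conflict. Everything here (the repository, the branch name side, the files one and two, the identity) is made up for the example, and git diff --name-only --diff-filter=U is just one way of listing the paths the index still records as unmerged.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
# Illustrative identity so that commits work in a clean environment.
git config user.name "Example"
git config user.email "example@example.com"

# Base commit with two files.
echo "base" > one
echo "base" > two
git add one two
git commit -q -m "Base"
base_branch=$(git symbolic-ref --short HEAD)

# Change both files on a side branch...
git checkout -q -b side
echo "side" > one
echo "side" > two
git commit -q -a -m "Side changes"

# ...and differently on the original branch, so that merging conflicts.
git checkout -q "$base_branch"
echo "main" > one
echo "main" > two
git commit -q -a -m "Main changes"
git merge side > /dev/null || true

# Files still recorded as unmerged in the index.
before=$(git diff --name-only --diff-filter=U)

# Resolve one file and stage it; the index no longer lists it as unmerged.
echo "resolved" > one
git add one
after=$(git diff --name-only --diff-filter=U)

echo "unmerged before: $(echo $before)"
echo "unmerged after:  $(echo $after)"
```

Each git add shortens the list, so in a large merge the index doubles as a record of what is left to resolve.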

When you run git status, it tells you all you need to know about how the index corresponds to your current working tree, giving you different messages about the files that it finds. To speed up processing, Git usually uses timestamps to decide whether a file might have changed, but when it does a full sweep it calculates the SHA1 hash of the contents of the files (and thus, the directories) to determine differences against the index:


(master) $ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	deleted:    deleteme
#	renamed:    same -> renamed
#
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   changed
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	addme

The contents of the index are really a snapshot of the files as they will appear in the next commit. For example, the renamed and deleted entries above are changes which have been staged (i.e. added to the index), whilst the not staged changes have been made in the working tree but not yet added to the index. The index also allows for quick identification of files in the working tree which have yet to be added.
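
If you want to reproduce a status like the one above, the sketch below sets up the same four situations in a throwaway repository. The file names mirror the listing; the file contents and identity are assumptions. It uses git status --short, a more compact form than the output shown above, where the first column compares the index with HEAD and the second compares the working tree with the index.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Commit three files so there is something to delete, rename and modify.
echo "to be deleted" > deleteme
echo "to be renamed" > same
echo "original" > changed
git add deleteme same changed
git commit -q -m "Initial files"

git rm -q deleteme          # staged deletion
git mv same renamed         # staged rename
echo "edited" > changed     # modified, but not staged
echo "new" > addme          # untracked

# First column: index compared with HEAD; second: working tree compared with index.
short_status=$(git status --short)
echo "$short_status"
```

Staged entries have their letter in the first column, the unstaged modification has it in the second, and the untracked file is marked with ??.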

So, for a given file, there may be up to three separate copies whilst you are working on it. There's the previous version that was committed (i.e. HEAD), there's the current version on disk (the working tree) and a third copy, which is a cached version in the index. That's why, when you have a local change on top of one that has already been staged, you might see the same file twice in the status message:


(master) apple[example] $ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#	modified:   three
#
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   three
#

You can use git diff to show you the differences between these versions of the file:


(master) $ git diff
diff --git a/three b/three
index bb87574..3179466 100644
--- a/three
+++ b/three
@@ -1 +1 @@
-This is the version in the index
+This is the version in the working tree
(master) $ git diff --cached
diff --git a/three b/three
index 48d0444..bb87574 100644
--- a/three
+++ b/three
@@ -1 +1 @@
-This is the previously committed version
+This is the version in the index

In the second example, the --cached option says to compare the index with the previous commit (without it, git diff compares the working tree with the index). You can, if you want, get the full contents of each of those files as long as you know the hashes (shown above in the diffs):


(master) $ git show 48d0444
This is the previously committed version
(master) $ git show bb87574
This is the version in the index
(master) $ cat three
This is the version in the working tree

The last one, of course, doesn't have a Git object yet, because we've not added it. If we were to add it, it would replace the previous version in the index. We'll take a deeper dive into the index next time.
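
You don't need the hashes to read the three copies: HEAD:three names the committed version and :three names the version in the index. The sketch below rebuilds the example in a throwaway repository (the identity and commit message are assumptions) and then checks that the working-tree content only becomes a stored object once it has been added.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "This is the previously committed version" > three
git add three
git commit -q -m "Add three"

echo "This is the version in the index" > three
git add three
echo "This is the version in the working tree" > three

# The same path, read from three different places.
head_copy=$(git show HEAD:three)    # last commit
index_copy=$(git show :three)       # index (staging area)
tree_copy=$(cat three)              # working tree

# The working-tree content has an id, but no stored object until it is added.
work_id=$(git hash-object three)
if git cat-file -e "$work_id" 2>/dev/null; then stored_before=yes; else stored_before=no; fi

git add three
index_id=$(git rev-parse :three)
if git cat-file -e "$work_id" 2>/dev/null; then stored_after=yes; else stored_after=no; fi

echo "HEAD:  $head_copy"
echo "index: $index_copy"
echo "disk:  $tree_copy"
echo "object stored before add: $stored_before, after add: $stored_after"
```

After the final git add, the index entry for three points at the new object, and the version that was staged earlier is no longer what the next commit would record.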


Come back next week for another instalment in the Git Tip of the Week series.