AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Index Revisited

Gtotw 2011 Git

This week's Git Tip of the Week takes another look at the index. You can subscribe to the feed if you want to receive new instalments automatically.

In last week's tip we visited the purpose of the index. But what actually is it?

It's not actually a tree object, as I alluded to last time. That is, you can't iterate its contents with git ls-tree. It does point to blobs in the object database, however. So why does the index need a different kind of structure from a tree?
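To make the distinction concrete, here is a minimal sketch in a throwaway repository (the temporary directory and the file name example are just for illustration). The index is listed with git ls-files, and git write-tree has to turn it into a real tree object before git ls-tree has anything to iterate:

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "hello" > example
git add example

# The index is listed with ls-files, not ls-tree.
index_entry=$(git ls-files -s)
echo "$index_entry"

# write-tree turns the current index into a real tree object ...
tree=$(git write-tree)

# ... which ls-tree can then iterate.
tree_entry=$(git ls-tree "$tree")
echo "$tree_entry"
```

Both listings name the same blob; the index line carries a stage number where the tree line carries an object type.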

Some of the reasons are performance oriented. Whenever you do a diff (or other repository-wide operation), Git needs to quickly and efficiently compute whether the state of the working tree has changed since the index was last updated. Some tools, including the bash shell prompt, need to be able to determine quickly whether the working tree is dirty:

(master) $ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#	modified:   example
no changes added to commit (use "git add" and/or "git commit -a")
(master) $ export GIT_PS1_SHOWDIRTYSTATE=true
(master *) $
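A script that wants the same yes/no answer can ask Git directly. The following is only a sketch in a throwaway repository, and I'm not claiming it is exactly what the prompt script does internally; it relies on git diff --quiet exiting with a non-zero status when tracked files differ from the index:

```shell
# Throwaway repository with one staged file.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > example
git add example

# Nothing has changed since the add, so the working tree is clean.
if git diff --quiet; then before=clean; else before=dirty; fi

# Change the file without telling the index about it.
echo "changed" > example
if git diff --quiet; then after=clean; else after=dirty; fi

echo "before: $before, after: $after"
```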

If Git could only tell whether the tree is dirty by doing a full walk of the contents (and calculating their SHA1 hashes), this operation would be prohibitively expensive. Fortunately, Git has a number of optimisations that allow it to avoid this.

The index stores not only the file names, but also the last modification time of those files (along with other stat data, such as the file size). As a result, Git can spot likely changes by iterating through the files' metadata and comparing it with what the index recorded, without reading the contents. If a file on disk is missing from the index, it is untracked until you add it, at which point it's represented as an addition. If a file in the index is missing from disk, it's represented as a deletion. If a file's modification time is different, it's treated as a candidate modification.

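As well as storing the timestamps, the index also stores the SHA1 hash of each blob. This lets Git confirm whether a candidate really has changed: if a file has been reverted to its previous contents but carries a later timestamp, its hash still matches, so Git can refresh the recorded timestamp instead of reporting a modification.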
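You can watch the two checks work together. This sketch, again in a throwaway repository with an illustrative file name, changes only the timestamp of a file and then changes its contents; git ls-files --debug prints the stat data cached alongside each entry:

```shell
# Throwaway repository with one staged file.
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "hello" > example
git add example

# Show the cached stat data (ctime, mtime, size, ...) for the entry.
git ls-files --debug example

# Change only the timestamp: the stat data no longer matches the index,
# but the contents still hash to the same blob, so nothing is reported.
touch -t 200001010000 example
touched=$(git diff --name-only)

# Change the contents: now the hash differs as well, so the file is listed.
echo "changed" > example
edited=$(git diff --name-only)
echo "after touch: [$touched], after edit: [$edited]"
```

The touched file is quietly ignored, while the edited one is listed by name.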

Finally, the index is also used for processing merges. Each entry in the index carries a stage number. Normally, only stage 0 is used, which holds the ordinary, unconflicted entry for a path. However, if a merge conflict arises, the index records the competing versions of the file at separate stages: the stage 0 entry for that path is replaced by stage 1 for the common ancestor, stage 2 for your version (the current branch) and stage 3 for the version being merged in. You can see the stage number by running git ls-files --stage (or -s):

(master) $ git status
(master) $ git ls-files -s
100644 ce013625030ba8dba906f756967f9e9ca394464a 0	example
(master) $ git pull # with known conflict
Auto-merging example
CONFLICT (content): Merge conflict in example
Automatic merge failed; fix conflicts and then commit the result.
(master|MERGING) $ git ls-files -s
100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 1	example
100644 a895c0db0a627cb9451ae390a2a0922495dbb161 2	example
100644 13940d48c3d3693113f7543d4fc5423a916ef55d 3	example

The stage 1 entry here is the common ancestor that the other two versions derive from, so we can diff stage 1 against stage 2, and stage 1 against stage 3, to find out what each side has changed since. (The more astute of you will recognise e69…391 as the empty file.) We can show the contents of these versions of the files, or load them into a 3-way diff tool if you have such a thing:

(master) $ git show e69de #empty file
(master) $ git show a895c
Left tree
(master) $ git show 13940
Right tree
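You don't have to copy hashes around to do this: the :N:path revision syntax names the blob recorded at stage N for a path. The sketch below rebuilds a similar conflict in a throwaway repository; the branch name other, the commit messages and the identity values are placeholders of mine, not part of the session above:

```shell
# Throwaway repository with a placeholder identity for its commits.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Common ancestor: an empty file.
: > example
git add example
git commit -q -m "base"
base_branch=$(git symbolic-ref --short HEAD)

# One side of the merge.
git checkout -q -b other
echo "Right tree" > example
git commit -q -am "right"

# The other side.
git checkout -q "$base_branch"
echo "Left tree" > example
git commit -q -am "left"

# The merge conflicts and exits non-zero; carry on regardless.
git merge other >/dev/null 2>&1 || true

git ls-files -s

# :N:path names the blob at stage N of the index.
base=$(git show :1:example)    # common ancestor
ours=$(git show :2:example)    # current branch
theirs=$(git show :3:example)  # branch being merged in
printf 'base=[%s] ours=[%s] theirs=[%s]\n' "$base" "$ours" "$theirs"
```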

Of course, normally Git will handle the diffs for you and you don't need to worry about the specific changes, nor about extracting the contents from the index. But it does highlight the fact that when you have finished merging a single file, running git add on the file puts a copy in stage 0 of the index and removes the stage 1, 2 and 3 entries:

(master|MERGING) $ git add example
(master|MERGING) $ git ls-files -s
100644 319c128291474d30f48e721ca87bd10425e8e296 0	example
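If you want to check how much of a merge is left to do, git ls-files -u lists only the unmerged entries. The sketch below recreates a conflict in a throwaway repository (the branch name, commit messages and identity are placeholders) and counts those entries before and after the git add:

```shell
# Throwaway repository with a placeholder identity for its commits.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# Build a conflict: empty base, different contents on each branch.
: > example; git add example; git commit -q -m "base"
base_branch=$(git symbolic-ref --short HEAD)
git checkout -q -b other
echo "Right tree" > example; git commit -q -am "right"
git checkout -q "$base_branch"
echo "Left tree" > example; git commit -q -am "left"
git merge other >/dev/null 2>&1 || true

# ls-files -u lists only unmerged entries (stages 1 to 3).
unmerged_before=$(git ls-files -u | wc -l | tr -d ' ')

# Resolve by hand (here: keep both lines), then stage the result.
printf 'Left tree\nRight tree\n' > example
git add example

unmerged_after=$(git ls-files -u | wc -l | tr -d ' ')
stage_after=$(git ls-files -s | cut -f1 | cut -d' ' -f3)
echo "unmerged before: $unmerged_before, after: $unmerged_after, stage now: $stage_after"
```

The three unmerged entries collapse into a single stage 0 entry once the resolved file has been added.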

This is why merging large conflicting changes with Git is easy. Conflicts can be resolved on a file-by-file basis; when you have finished merging a file, you add it to the index, which both records its contents and removes the other versions of that file from the merge state. Merging many files then becomes an exercise in merging them one by one, and adding them as you go. And since they're all transiently stored in the index, you can keep adding them until you are ready to perform a git commit.

Come back next week for another instalment in the Git Tip of the Week series.