AlBlue’s Blog

Macs, Modularity and More

Mercurial and Git: a technical comparison

2011, git

Disclaimers

Some disclaimers up front: no, this isn’t supposed to be a WhyXIsBetterThanY comparison, and no, I’m not going to say which one is “best”, because the best tool is the one you know how to use effectively.

Like touch-typing (or using a Dvorak layout), learning a new skill-set takes time and will often make you slower, at least initially. Only you know whether it makes sense to learn a new tool.

What I do want to dig into is how Git and Mercurial work under the covers, and some of the differences between the two tools. I’ll also dig into some of the characteristics that these differences lead to. The intention is to provide repeatable, quantitative data so that you can perform the same tests yourself if you want.

Finally, I will try to avoid my natural bias towards Git, which is both the tool I prefer and the one I use most often. I declare this in advance so that you can be on the lookout for any places where I may have slipped up in trying to be unbiased.

Distributed Version Control Systems

All distributed version control systems work on roughly the same principles; rather than incrementing version numbers on the file (CVS) or repository (SVN) level, files are versioned by a hash of their contents, usually SHA-1. Directories are some kind of hash-of-the-hashes, up to the top level of the project, and there will be an overall commit which relates this change to others (in particular, the parent commit).
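
As a rough illustration of the hash-of-hashes idea, here is a minimal Python sketch. The serialisation used here (the listing format, the commit layout, the file names) is made up for the example; it is not the actual format of Git or Mercurial, only the general principle.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def hash_file(content: bytes) -> str:
    # A file is identified by the hash of its contents.
    return sha1_hex(content)

def hash_directory(entries: dict) -> str:
    # A directory is identified by the hash of its (name, hash) listing,
    # where each entry is a file hash or a subdirectory hash.
    listing = "".join(f"{name} {entries[name]}\n" for name in sorted(entries))
    return sha1_hex(listing.encode("utf-8"))

def hash_commit(root_hash: str, parent: str, message: str) -> str:
    # A commit ties a top-level directory hash to its parent commit.
    text = f"tree {root_hash}\nparent {parent}\n\n{message}"
    return sha1_hex(text.encode("utf-8"))

readme = hash_file(b"hello\n")
src = hash_directory({"main.c": hash_file(b"int main(void) { return 0; }\n")})
root = hash_directory({"README": readme, "src": src})
first = hash_commit(root, "", "initial import")
print(first)
```

Changing a single byte in any file changes that file’s hash, then every directory hash above it, and so the commit hash as well.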

The two most popular DVCSs are Mercurial and Git, but they’re not the only ones: there are also Bzr, Darcs, and many more you probably haven’t heard of. The adoption of these two is driven partly by the communities behind them, and partly by the availability of hosting services such as GitHub and Bitbucket.

Test data

To perform testing on equivalent versions, I am using the contents of my MacZFS project on GitHub. Thanks to the hg-git plugin (http://hg-git.github.com/), I can take the same dataset and check it out as a Mercurial or Git repository:

$ mkdir /tmp/hg-test; cd /tmp/hg-test; hg clone git://github.com/alblue/mac-zfs.git
$ mkdir /tmp/git-test; cd /tmp/git-test; git clone git://github.com/alblue/mac-zfs.git

The amount of time taken to clone the Hg repository is not indicative of normal Hg cloning speed, as it has to check out a Git repository locally, convert all the revisions into Hg format, and then serve that. In addition, the initial cloned repository size may not be optimal; we can fix that by cloning the local repository. Both operations complete within a couple of seconds:

$ du -sh /tmp/git-test/mac-zfs
37M
$ git clone /tmp/git-test/mac-zfs /tmp/git-test/testdata
$ rm -rf /tmp/git-test/mac-zfs
$ du -sh /tmp/git-test/testdata
37M

$ du -sh /tmp/hg-test/mac-zfs.git
58M
$ hg clone /tmp/hg-test/mac-zfs.git /tmp/hg-test/testdata
$ rm -rf /tmp/hg-test/mac-zfs.git
$ du -sh /tmp/hg-test/testdata
37M

I will use the testdata variant from here on, so that both tools are measured against equivalent repositories.

Repository Numbering

Although both Git and Hg allow you to refer to any version of the repository as a hash, Hg also allows you to refer to changes by a repository-wide revision number. Thus whilst MacZFS has a tag onnv_74 in both repositories, the Hg version is shown as 922:0911027531d5 whilst Git’s representation is shown as fe44926a203ae5de6d18b8563e30b0f496bbb787.

There’s no doubt that the ever-incrementing segment is easier for a human to initially understand or read; but like a Subversion repository number, this version doesn’t tell you much about the repository at all. Consider a bug fix against an older tag (say, onnv_73). If I create a branch to fix it, my Hg revision number for the fix will be 1057 (at the time of writing), which will clearly be more than the 922 of the onnv_74 tag. Even though there’s a number, it’s not necessarily useful for ordering, even within the same repository.

The other problem with the Hg revision number is that it’s specific to that particular clone of the repository. If I clone another copy, that clone has its own sequence of revision numbers. On the other hand, for both Git and Hg, the hash is portable between all clones of that repository. The revision number is a bit like training wheels on a bike; you use them to gain confidence in understanding the system, but once you understand how DVCSs work you rarely go back and use them again. Part of why Hg is seen as easier to use than Git is because those migrating from SVN use the revision number as a crutch; either way, you don’t continue to use it once you become more experienced.
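
To make the clone-specific nature of the number concrete, here is a toy Python model. The Clone class and the short hashes are hypothetical; the only assumption is that a clone numbers changesets in the order it receives them, which is what makes the number local.

```python
class Clone:
    """Toy clone: remembers changeset hashes in the order they arrived."""

    def __init__(self):
        self.order = []

    def receive(self, changeset_hash):
        if changeset_hash not in self.order:
            self.order.append(changeset_hash)

    def local_number(self, changeset_hash):
        # The local number is just the position in arrival order.
        return self.order.index(changeset_hash)

# bbb222 and ccc333 are two independent children of aaa111 (made-up hashes).
alice, bob = Clone(), Clone()
for h in ["aaa111", "bbb222", "ccc333"]:
    alice.receive(h)
for h in ["aaa111", "ccc333", "bbb222"]:
    bob.receive(h)

print(alice.local_number("ccc333"))  # 2
print(bob.local_number("ccc333"))    # 1
```

The hash ccc333 means the same thing in both clones; the number does not.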

Abbreviation

Remembering a hash is non-trivial, so both tools accept an abbreviation wherever the prefix is still unique. In the example above, you can access the commit with git log fe4492, and in hg you can execute hg log -r 091 or hg log -r 922. (The ‘local’ revision number is 922; the hash starts with 091.)
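
The lookup itself is conceptually simple. Here is a sketch in Python of resolving an abbreviation against a set of known hashes; resolve_prefix is a hypothetical helper, not how either tool implements it, and the second hash is invented to show the ambiguous case.

```python
def resolve_prefix(prefix, known_hashes):
    """Return the single hash starting with `prefix`, or raise if 0 or >1 match."""
    matches = [h for h in known_hashes if h.startswith(prefix.lower())]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"{prefix!r} is ambiguous: {len(matches)} candidates")
    return matches[0]

hashes = [
    "fe44926a203ae5de6d18b8563e30b0f496bbb787",  # onnv_74 in Git (from the text)
    "fe4a" + "0" * 36,                           # made-up neighbour for the demo
]

print(resolve_prefix("fe4492", hashes))  # the onnv_74 hash
# resolve_prefix("fe4", hashes) would raise ValueError: both hashes match
```

The longer the repository history, the longer the prefix needs to be before it is unique.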

Implementation

Git is largely implemented in C, whilst Hg is implemented almost entirely in Python. This allows Hg to be installed with easy_install, whilst Git requires compiled binaries to be downloaded and installed. It also means that a new version of Mercurial is available on all platforms at once, whereas Git may need recompiling or repackaging before it is available for your local operating system.

Git is not defined solely by its C implementation; rather, it has a well-defined on-disk metadata layout. As a result, programs written in other languages can understand the same layout. For example, the Eclipse EGit project builds on JGit, a re-implementation of the core structures in Java. Mercurial, on the other hand, tends to be implemented solely in Python, and the tools around it are built on Python as well.

Git has a number of other helper programs written in other languages, notably Perl. Since the on-disk layout is standardised, tools can be written in other languages and manipulate the on-disk structures and remain compatible with other Git tools.

The implementation in Python has led to Mercurial having better support on non-Unix platforms such as Windows in the past. Ports based on Cygwin and Msys have provided Git support for Windows, and now EGit, part of Eclipse Helios and Eclipse Indigo, has made Git available on a wider variety of systems than before.

Philosophy

There is a subtle difference between the philosophies of Git and Hg. The former allows direct mutable access to the contents of the repository and its history, whereas the latter allows access to the history but only on a read-only basis.

Re-writing history can be dangerous. If history, which has been published to external sources, is subsequently re-written, then clients which have acquired that data will be negatively affected by any history re-writing.

Typically, history re-writing does not occur. However, there are situations where history re-writing may be desirable, such as when a password is inadvertently committed to a repository. In this instance, it may be necessary to re-apply the changes to the repository but without the password change stored. Since this re-writes history, the hashes change and thus clients who have previously cloned need to resolve the conflict with their local history.
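
The reason the hashes change is that each commit’s identifier covers its parent’s identifier. A small Python sketch shows the knock-on effect; commit_id here is a deliberately simplified stand-in (parent id plus content), not the real commit format of either tool, and the commit contents are invented.

```python
import hashlib

def commit_id(parent_id, content):
    # Simplified: the id depends on the parent id and this commit's content.
    return hashlib.sha1(f"{parent_id}\n{content}".encode("utf-8")).hexdigest()

def build_history(contents):
    ids, parent = [], ""
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids

original = build_history(["add config", "password=hunter2", "fix typo", "add tests"])
rewritten = build_history(["add config", "password removed", "fix typo", "add tests"])

for old, new in zip(original, rewritten):
    print(old[:8], new[:8], "same" if old == new else "CHANGED")
```

Only the first commit keeps its identifier; the rewritten commit and everything after it get new ones, which is why existing clones no longer agree with the published history.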

There is no doubt that re-writing history when it has been shared publicly (e.g. to clients) has a non-negligible impact. However, there are also times when it becomes reluctantly necessary. In this scenario, Git provides tools to make this possible (through rebasing).

It should be noted that Mercurial has the notion of patch queues, which permit some level of re-writing. However, these re-writes are intended to be applied before pushing to a shared repository.

On-disk layout

Where the two tools differ significantly is in their on-disk representation. This, in turn, has some impact with reference to performance.

Git stores all of its metadata inside a .git folder at the top level of the repository. None of the subfolders have metadata inside them. Objects are stored by their hash, in either a loose form or a packed form. Loose objects are stored individually, zlib-compressed, under a directory and file name derived from the hash of the contents: a loose object with hash 12345... is stored in the file .git/objects/12/345.... If an object is not loose, it is stored in a pack file, which consists of a .git/objects/pack/pack-xxxx.{idx,pack} pair. The pack stores the data itself, and the idx is an index into the corresponding pack that explains how to locate an object (with a given hash) in the data file. The pack is a sequence of diff-like operations applied recursively to other members of the pack file; but unlike (say) a CVS diff, the diffs themselves can refer to shared content.
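
Because the layout is well defined, the identifier and location of a loose blob can be computed without Git at all. The following Python sketch assumes Git’s SHA-1 object format, in which a blob is hashed as the header "blob <size>" plus a NUL byte followed by the contents, and is written to disk zlib-compressed; the function names are my own.

```python
import hashlib
import zlib

def git_blob_id(data: bytes) -> str:
    # Git hashes a short header plus the raw contents, not the contents alone.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

def loose_object_path(object_id: str) -> str:
    # First two hex digits name the directory, the remaining 38 the file.
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

def loose_object_bytes(data: bytes) -> bytes:
    # What would be written to the loose object file (header + data, deflated).
    return zlib.compress(b"blob %d\x00" % len(data) + data)

empty = git_blob_id(b"")
print(empty)                     # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(loose_object_path(empty))  # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The value for the empty blob can be checked against git hash-object on an empty file.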

Mercurial has a .hg folder at the top level of a Mercurial repository. Unlike Git, instead of a fixed layout, Hg mirrors the repository’s layout under the store/data directory. File and directory names are encoded so that they survive on case-insensitive file systems (for example, an uppercase letter becomes an underscore followed by its lowercase form, and a literal underscore becomes __); each file results in a .i (or revlog) file. This represents a series of diffs for each specific file in the repository.

The net effect is that whilst both support diffs for textual files in a fairly well-defined manner, neither works well for binary files. This is a general rule for distributed version control systems, which must naturally store all versions from all previous commits; if diffs cannot be calculated easily, then storing every version will result in a large repository.

Mercurial is more predictable in file system usage than Git. Each file, and therefore its diffs, is stored in a separate file, meaning the size of the repository grows linearly with the size of the diffs. Git, on the other hand, stores files as both unpacked (loose) and packed objects, depending on the state of the repository. Git has a gc command, which can convert a number of loose objects into a packed (and thus smaller) representation; whereas Hg does not need this operation. Recent versions of Git invoke gc on a periodic basis as required; so unless you are monitoring the size of the repository on a commit-by-commit basis, you may not notice this difference.

The on-disk representation has a couple of important consequences. Firstly, sharing content between files is much easier with Git (since the packed representation can share blobs of data between files, whereas Hg stores them separately). To confirm this, run:

$ mkdir /tmp/hg-test/helloworld
$ hg init /tmp/hg-test/helloworld
$ for i in {1..100}; do echo "Hello World" > /tmp/hg-test/helloworld/$i; done
$ hg add /tmp/hg-test/helloworld/*
$ hg commit /tmp/hg-test/helloworld -m "HelloWorld"
$ du -sh /tmp/hg-test/helloworld/.hg
436K /tmp/hg-test/helloworld/.hg

$ mkdir /tmp/git-test/helloworld
$ git init /tmp/git-test/helloworld
$ for i in {1..100}; do echo "Hello World" > /tmp/git-test/helloworld/$i; done
$ cd /tmp/git-test/helloworld/
$ git add .
$ git commit -m "HelloWorld"
$ du -sh /tmp/git-test/helloworld/.git
 96K /tmp/git-test/helloworld/.git

Although a contrived example, this highlights that Git can share contents between files whereas Hg does not. Whilst this may not be immediately practical in general, some files tend to have shared contents (such as a LICENSE header) on a per-file basis. Similarly, when files are renamed (or moved) Git can track the contents across file names better than Hg.
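
The same effect can be modelled in a few lines of Python. This is only a toy comparison of a store keyed by content hash against a store keyed by path; it ignores compression, headers and everything else the real tools do.

```python
import hashlib

# One hundred files, all with the same contents, as in the test above.
files = {str(i): b"Hello World\n" for i in range(1, 101)}

# Keyed by content hash: identical contents collapse into one entry.
content_store = {hashlib.sha1(data).hexdigest(): data for data in files.values()}

# Keyed by path: every file gets its own entry, whatever it contains.
path_store = {path: data for path, data in files.items()}

print(len(content_store), len(path_store))  # 1 100
```

A store addressed by content only needs to record the hundred names; the data itself is kept once.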

Branching and tagging

Branching and tagging are handled differently in Mercurial and Git. In Mercurial, branches and tags are an intrinsic property of the repository, whereas in Git they are an extrinsic property of the repository tree.

As a result, all clones of a Mercurial repository must share the same (ancestral) tag/branch combinations, whereas for a Git repository this is not necessarily the case.

The reason for this difference is that Git stores both tags and branches outside the repository’s contents, as labels which point to a specific commit hash. Tags (or branches) can be created at will on the local system pointing to a specific commit (which stores a linked list through all its parents to the root of the repository). There is no significant difference to Git between a branch and a tag; they are stored as files in .git/refs/heads/name and .git/refs/tags/name that contain a commit hash identifier. Both branch and tag references can be pushed to (and pulled from) a remote repository, but this is orthogonal to the content which is pushed.
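
Since a reference is just a small text file holding a hash, creating one takes very little. The Python sketch below writes and reads such files in a scratch directory standing in for .git; write_ref and read_ref are hypothetical helpers, and it ignores details of a real repository such as refs that Git has packed into a single packed-refs file.

```python
import os
import tempfile

def write_ref(git_dir, ref_name, commit_hash):
    # A ref is a file whose path is the ref name and whose body is the hash.
    path = os.path.join(git_dir, *ref_name.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(commit_hash + "\n")

def read_ref(git_dir, ref_name):
    with open(os.path.join(git_dir, *ref_name.split("/"))) as f:
        return f.read().strip()

git_dir = tempfile.mkdtemp()  # scratch directory, not a real repository
commit = "fe44926a203ae5de6d18b8563e30b0f496bbb787"  # onnv_74, from the text

write_ref(git_dir, "refs/tags/onnv_74", commit)
write_ref(git_dir, "refs/heads/bugfix", commit)  # branch name is made up

print(read_ref(git_dir, "refs/heads/bugfix"))
```

The tag and the branch are the same kind of file with the same contents; only the directory they live in differs.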

By comparison, Mercurial repositories are much more sensitive to both tag and branch names. The name of the branch is recorded in the metadata of each changeset, and the tags are stored in a .hgtags file which is versioned and distributed along with the repository.

This means if you pull a Mercurial repository, you have no choice but to pull all of its tag references as well. Any conflicts (such as newly added tags) require a repository merge operation.

Revlog vs pack files

Mercurial stores its deltas in revlog formats, which are essentially a sequence of diffs for a particular file. Git stores its deltas in a pack file, which stores diffs over a set of files.

As a result, Git can be more efficient when files are renamed, contents moved, or files share similar content. In the case where all files have unique content, Git and Hg will store a similar set of data.

Hg revlog files were initially designed to be read in increasing delta order; so changes that consist of incremental changes may be efficiently read on a file-by-file basis, as discussed in the revlog paper. However, Git pack files can both subsume revlogs and provide deltas across and between versions of the same files and separate files. Whilst the revlog format is more predictable, the Git pack format may be encoded in arbitrary ways to provide the same result, so tends to vary over time.

Both Hg and Git store revlog/pack files as a tree of deltas. For example, a series of files in Hg may be stored as a1-a2-a3,b1-b2-b3. Git, however, may choose to store a1-b1-a2-b2-a3-b3 if this is a more efficient mechanism than a1-a2-a3,b1-b2-b3 (which is also supported). As a result, a Git packfile may be smaller than an Hg revlog if there are either non-linear ways of representing the change, or shared contents between files which may be shared in a single packfile. Git’s transient store of multiple loose objects may result in a larger store of a set of files than Hg.
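
To give a feel for what a per-file chain such as a1-a2-a3 involves, here is a minimal Python sketch of a delta chain. It uses difflib from the standard library to compute the deltas, which is a stand-in of my own choosing; neither the revlog nor the pack format works this way on disk, and DeltaChain is a hypothetical name.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Describe `new` as copies from `old` plus literal inserted text."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))
        # "delete" needs no entry: the old text is simply not copied
    return delta

def apply_delta(old, delta):
    parts = []
    for op in delta:
        parts.append(old[op[1]:op[2]] if op[0] == "copy" else op[1])
    return "".join(parts)

class DeltaChain:
    """Stores revision 0 in full and each later revision as a delta."""

    def __init__(self):
        self.base = None
        self.deltas = []

    def append(self, text):
        if self.base is None:
            self.base = text
        else:
            latest = self.read(len(self.deltas))
            self.deltas.append(make_delta(latest, text))

    def read(self, revision):
        # Reading revision n means replaying n deltas on top of the base.
        text = self.base
        for delta in self.deltas[:revision]:
            text = apply_delta(text, delta)
        return text

chain = DeltaChain()
revisions = ["hello world\n", "hello brave world\n", "goodbye brave world\n"]
for text in revisions:
    chain.append(text)
print(chain.read(2), end="")
```

Reading forwards through a chain is cheap because each step builds on the last; jumping to an arbitrary revision means replaying from a full copy.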

Cross pollination

It should be noted that features available in Git have influenced Hg (and potentially vice versa as well). For example, Git’s bisect, which allows a tree to be iteratively bisected to determine where a bad change was introduced, has subsequently been added to Mercurial. Similarly, Git’s general rebasing has been ported to Mercurial in the form of the patch queue. From the other direction, Mercurial’s bundle has been ported to Git.

Mutable History

Since Hg’s on-disk layout is stored per-file and as an incremental series of changes, changing history is more problematic in Hg than it is in Git.

Since a Git branch (or tag) is simply a pointer to a commit hash, creating (or updating) Git branches and tags results in the creation (or update) of a single file. This file contains the hash (as text) pointing to a new part of the tree. Changing history, whether the contents of a commit or merely the commit message, generally involves few changes to the existing repository data; mostly it changes the metadata that points to it.

Mercurial, on the other hand, tends to require more files to be updated when historical changes are made. Particularly when file renames occur, specific metadata is needed to record a rename; whereas in Git, since both files have the same hash, the only difference is the tree parent which has a new name pointing to the same contents.

Git’s rebase allows a sequence of changes to be replayed onto a different branch, or (in the case of an interactive rebase) to be re-ordered. Conflicts caused by the reordering are resolved in the same way that other conflicts are; in most cases, the commits can be reordered without causing conflicts. Furthermore, Git’s rebase allows multiple commits to be squashed into one commit; in other words, reducing the set of operations into a single delta. (Mercurial has collapse and histedit plugins that provide these as two separate operations, but they’re not distributed with Mercurial itself.)

Git’s flexible history encourages developers to commit often; prior to pushing to a remote repository (where the history becomes shared), the changes can be squashed into a single commit and uploaded as a single change instead of recording the individual atomic changes. This encourages iterative, frequently-committed changes even if the changes are just a work-in-progress.

Git also has a cherry-pick concept, which permits an out-of-order commit to be copied from one branch to another. This is useful for copying hotfixes from one branch to another without requiring a full merge node. (Mercurial has a plugin transplant which is a recent addition.)

Performance

So far we have covered the basics of the repository systems, rather than performance. Here are some quantitative measures of the testdata repository acquired above:

Measurement                          Git       Hg
time git/hg log                      0.013s    0.253s
time git/hg checkout onnv_18         1.221s    1.379s
time git/hg diff onnv_72             0.820s    0.079s
time git/hg checkout onnv_72         0.828s    2.670s
time git/hg diff onnv_72..onnv_18    0.745s    1.211s

In these examples, the repository started at master/default, was moved back to an older revision, and was then moved forward again to a later revision along a linear history. Hg clearly has an advantage when moving diffs forward but performs less well when moving backward or jumping to arbitrary versions.
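
If you want to repeat this kind of measurement from a script rather than with the shell’s time, a small harness along the following lines would do. This is a sketch rather than the method used for the numbers above; time_command is a hypothetical helper, and the stand-in command is only there so the example runs without either tool installed.

```python
import subprocess
import sys
import time

def time_command(argv, cwd=None, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, cwd=cwd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best

# Stand-in command so the sketch runs anywhere; swap in e.g.
# ["git", "log"] with cwd="/tmp/git-test/testdata" to time the real thing.
print(f"{time_command([sys.executable, '-c', 'pass']):.3f}s")
```

For commands that change state, such as a checkout, use repeats=1, since a second run would find the work already done.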

Parallel access

If doing work between multiple revisions, it is useful to be able to compare like-for-like across different revisions. Git allows this with the git show and git ls-tree commands, which allow a single file/commit or a directory to be investigated. Since a revision implies a specific file hash, comparing diffs between individual hashes will show the differences between the files.

Mercurial allows an individual file to be compared between revisions, or a repository (or folder) as a whole. Individual files themselves are not addressable, other than as a (repository,path) pair.

If one needs to compare across multiple branches at the same time, then one approach is to just clone the repository and switch branches in the clone. This permits multiple branches to be checked out at one time and works the same way for both Git and Hg. Since Hg has slightly heavier local branching than Git, using separate clones is a practical way of investigating multiple branches at once.

Since Git branches are only a hash pointer into the object/pack structure, changing to a different branch in Git is a significantly cheaper procedure than it is in Hg.

Git specifics

Things for which no similar equivalent exists in Hg.

Git index

A significant difference between Mercurial and Git is Git’s index. This permits a work-in-progress to be built up over multiple (add) operations, prior to committing the operation. Together with add -p it can be used to iteratively add changes to a work in progress (such as a large merge operation) prior to executing the change set.

The problem with the Git index is that it is rarely needed when learning Git in the first place, and tends to get in the way initially. However, when it is needed, such as in a large merge with multiple conflicts, it becomes tremendously useful. As with first aid and many other skills in life, appreciating it is not the same as being able to use it when required.

Git Reflog

Since Git effectively stores where you are as a commit hash, and that commit hash moves as you update a branch, it can also record where you have been. The git reflog command shows a list of the commits that HEAD has pointed to over the recent history of the (local) repository. HEAD indicates the position of the current branch, but since you may have switched between multiple branches, the reflog provides a history of each of those positions. Given that the revision is a commit, which points to a tree, each version shown by the reflog can be transitively resolved to a set of fixed files.
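
Conceptually the reflog is little more than an append-only list of where HEAD has been. The toy Python model below captures that idea; ToyRepo and the short commit names are invented, and Git’s own reflog records more (timestamps, per-branch logs, expiry).

```python
class ToyRepo:
    """Toy model: every movement of HEAD is appended to a log."""

    def __init__(self, commit):
        self.head = commit
        self.reflog = [(commit, "initial")]

    def move_head(self, commit, reason):
        self.head = commit
        self.reflog.append((commit, reason))

    def previous(self, steps_back):
        """Return where HEAD pointed `steps_back` moves ago (0 = now)."""
        return self.reflog[-1 - steps_back][0]

repo = ToyRepo("a1")
repo.move_head("b2", "commit: add feature")
repo.move_head("c3", "commit: fix bug")
repo.move_head("a1", "reset: moving to a1")

print(repo.head)         # a1
print(repo.previous(1))  # c3
```

Even after HEAD has been moved back to a1, the log still remembers c3, which is how an apparently lost commit can be found again; in Git the equivalent lookup is written HEAD@{1}.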

Git ls-tree/show

Files in Git are stored as commit-tree-tree-…-tree-file, and each level can be introspected with either git ls-tree or git show. For example, running git ls-tree onnv_72 returns a reference to 2a00fe6 (under the usr path); this, in turn, has 40564 (src) and so on. Using the -r flag we can recursively list the contents, in other words:

apple:MacZFS alex$ git ls-tree -r onnv_72
100644 blob 0ab3c2b8f09afeea3e5c73aa9f69a92ca2dd2374 usr/src/cmd/zdb/Makefile
100644 blob 46c948ea64a91c7fb12f6ada5a09004c9c2b4673 usr/src/cmd/zdb/Makefile.com
100644 blob c2f8b37b5d962db09a60ac03d57bb3d23f7491f2 usr/src/cmd/zdb/amd64/Makefile
100644 blob 5c93bf6ac6b6ed553e1f887f72d31de288fd05c8 usr/src/cmd/zdb/i386/Makefile
100644 blob bb65300ccae99ecd992c67c7c70374c59e9bb29d usr/src/cmd/zdb/inc.flg

Each one of these files can be introspected with git show:

apple:MacZFS alex$ git show 0ab3c
#
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License, Version 1.0 only
...

This makes it trivially easy to find any version of any file, without having to affect the working directory of the current checkout.
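
The commit-tree-tree-file chain can be pictured as a series of dictionary lookups. In the Python sketch below, the tree identifiers 2a00fe6 and 40564 and the blob hash are taken from the output above; the other identifiers (root-tree, tree-cmd, tree-zdb) are placeholders I have made up, and the nested dictionaries are only a stand-in for Git’s real tree objects.

```python
def resolve_path(objects, tree_id, path):
    """Walk tree objects from `tree_id` down to the entry named by `path`."""
    current = tree_id
    for name in path.split("/"):
        current = objects[current][name]
    return current

# Each tree maps a name to the id of a subtree or a blob.
objects = {
    "root-tree": {"usr": "2a00fe6"},
    "2a00fe6": {"src": "40564"},
    "40564": {"cmd": "tree-cmd"},
    "tree-cmd": {"zdb": "tree-zdb"},
    "tree-zdb": {"Makefile": "0ab3c2b8f09afeea3e5c73aa9f69a92ca2dd2374"},
}

print(resolve_path(objects, "root-tree", "usr/src/cmd/zdb/Makefile"))
```

Every intermediate value is itself an addressable object, which is why the same command can show a file, a directory or a commit.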

Note: it’s possible, using both Git and Hg, to show the status of a file at a given revision, as noted in the comments. For example, you can use hg cat -r onnv_72 usr/src/cmd/zdb/Makefile or git show onnv_72:usr/src/cmd/zdb/Makefile. My point wasn’t that you could (or should) do the blob approach to extract files, but rather, it is a general-purpose API that lets you view the tree of data that Git manages. The way you show a file is the same way you can show a directory listing, a commit node, an annotated tag or (in recent Git installs) git notes. It is this addressability of data, along with low-level commands like show, which permits higher-level tools to be written in other scripting languages. Git uses the terms ‘porcelain’ and ‘plumbing’ to distinguish between the high-level, user-facing commands and the low-level commands that operate directly on the underlying data.

Summary

Both Hg and Git provide a distributed way of representing files and their changes, and can be used to synchronize with other remote repositories (such as Bitbucket and GitHub). Both are more powerful than centralised version control systems such as SVN and CVS because the full history is available locally.

Git provides a view of the tree using a hash system, whereby a hash corresponds to a specific file or tree. Creating branches or tags involves adding (or changing) a pointer to an element in the tree.

Mercurial works on a file difference system. Each file has its own history log (revlog) which can be used to quickly roll forwards revisions of the same file, but suffers in the face of shared data across files or of renames of existing files.

The fact that the core Git implementation is written in C whereas Mercurial is written (mainly) in Python leads to better platform portability for Mercurial, particularly with Windows systems. However, cross-platform solutions are available and Git’s more flexible object model permits other solutions (such as bug-trackers) to be stored within the repository itself.

When moving to a DVCS repository, it can be attractive to have a monotonically increasing version number (much like Subversion has) but this lulls the user into a false sense of security which is quickly overcome in the real world. The advantage of having this feature is only typically experienced in a learning or training role.

The Git index may seem unnecessary at first, if only because other version control systems do not offer that functionality. However, it has proved useful with specific use cases that would be a lot more difficult if it were not present. This advantage is rarely seen at the initial phases of using Git but becomes more apparent as the expert use cases start to become more popular.