
AlBlue’s Blog

Macs, Modularity and More

Git Tip of the Week: Git Submodules

2011, git, gtotw

This week's Git Tip of the Week is about git submodules. You can subscribe to the feed if you want to receive new instalments automatically.


In the previous post, I wrote about a tongue-in-cheek extension called git bigjobbies, the purpose of which was to store large binaries in a repository without them being part of the main branch (and thus not appearing in its history).

The reason one might want to do that is to avoid a clone from the server taking a significant period of time, or taking up additional space once the clone has been made. Although Git provides good delta encoding between versions of a file, it only tends to work well if the binary data is relatively similar from one version to the next. Typically, large binary assets (such as audio files, movies or even images if they've been saved in a compressed format) share little of the same binary data under the covers.

Git also has a configuration variable, core.bigFileThreshold, which sets the size above which files are stored as-is without attempting any delta comparison. Files above 512MiB (by default) are stored without delta compression against previous versions (though they are still deflated at storage time).
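As a rough sketch of how that threshold might be lowered for a single repository, the helper below sets core.bigFileThreshold and reads it back. The function name and the 100m value in the usage note are my own assumptions for illustration, not a recommendation.

```shell
# Hypothetical helper: set the delta threshold for one repository and
# print the stored value. "$1" is the repository path and "$2" a size
# that git config understands, such as 100m (both chosen by the caller).
set_big_file_threshold() {
    git -C "$1" config core.bigFileThreshold "$2" &&
    git -C "$1" config --get core.bigFileThreshold
}
```

Calling set_big_file_threshold . 100m inside a repository would ask Git to skip delta compression for files over roughly 100MiB in that repository only.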

The obvious solution to this problem is to store source code (and other compressible assets) in one Git repository, and then store large media assets (sound effects, in-movie videos etc.) in another Git repository. The history of one will therefore not affect the history of the other.

Submodules

If you're storing these as separate git repositories, how do you ensure that they are kept in sync with each other? Well, you could use tags and rely on convention to ensure that you can acquire the same version of the assets. However, tags can change (although they're not supposed to) and conventions can be circumvented.

Another way to do it is to store a pointer to the assets. (This is similar to the .bigjobbies file suggested before.) Since they are referenced by hash, as long as you can acquire the hash then you will be able to restore the asset.

Git submodules combine these two concepts, by treating a submodule as a directory that is logically checked out from another repository, but referring to it by a pointer rather than storing a full copy. The submodule (sub repository) can evolve at its own pace, with its own commits, and the parent can refer to it by a fixed hash.

Working with submodules

To add a submodule to an existing project, run git submodule add to define a local directory corresponding to the remote Git project's contents. For example, if you wanted to add the BigJobbies project from earlier as a submodule, you could do:


$ git init parent
Initialized empty Git repository in parent/.git/
$ cd parent
(master) $ git submodule add http://github.com/alblue/BigJobbies/
Cloning into BigJobbies...
done.
(master) $ ls -AF
.git/		.gitmodules	BigJobbies/
(master) $ cat .gitmodules 
[submodule "BigJobbies"]
	path = BigJobbies
	url = http://github.com/alblue/BigJobbies/
(master) $ git status
# On branch master
#
# Initial commit
#
# Changes to be committed:
#   (use "git rm --cached ..." to unstage)
#
#	new file:   .gitmodules
#	new file:   BigJobbies

Note that this has set up a .gitmodules file and created a BigJobbies directory, corresponding to the BigJobbies cloned data. However, in the git status, it shows up as a file. What's up with that?

If we add the contents, commit, and then look at the tree, we'll get our answer:


(master) $ git commit -m "Added BigJobbies submodule"
[master (root-commit) f34f140] Added BigJobbies submodule
 2 files changed, 4 insertions(+), 0 deletions(-)
 create mode 100644 .gitmodules
 create mode 160000 BigJobbies
(master) $ git ls-tree HEAD
100644 blob 8041b87daf8e7ed034c669c6c5af9d63367dcd78	.gitmodules
160000 commit e9ed329101157ce9be5dc1c2639096bd82d3fa05	BigJobbies
(master) $ (cd BigJobbies; git rev-parse HEAD)
e9ed329101157ce9be5dc1c2639096bd82d3fa05

Instead of mode 100644, which is used for storing a file with rw-r--r-- permissions, the BigJobbies entry has mode 160000. This points to a commit, unlike the tree-or-blob that we've seen before. The commit is the current HEAD of the checked out submodule, which as can be seen here is e9ed329101157ce9be5dc1c2639096bd82d3fa05.
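To make the mode check concrete, here is a minimal sketch that lists every entry in a commit recorded with mode 160000, along with the commit hash it pins. The list_gitlinks name is hypothetical; it only wraps git ls-tree and awk.

```shell
# Hypothetical helper: print "<commit-hash> <path>" for each entry in a
# revision that is stored as a gitlink (mode 160000), i.e. each submodule.
# "$1" is the repository path; "$2" is an optional revision (default HEAD).
list_gitlinks() {
    # ls-tree prints "<mode> <type> <hash>TAB<path>"; split on the tab
    # first so that paths containing spaces are kept whole.
    git -C "$1" ls-tree -r "${2:-HEAD}" |
        awk -F'\t' '{ split($1, meta, " "); if (meta[1] == "160000") print meta[3], $2 }'
}
```

Run against the parent repository above, this should report the same hash that git rev-parse HEAD shows inside BigJobbies.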

The parent repository is now pretty slim; it contains the .gitmodules file, the pointer to the submodule's commit, and nothing else. However, it is also versioned in lock-step with the BigJobbies repository. Anyone who clones this repository can populate the submodule as well, albeit with a separate step:


(master) $ cd ..
$ git clone parent clone
Cloning into clone...
done.
$ cd clone
(master) $ ls BigJobbies/
(master) $ git submodule sync
Synchronizing submodule url for 'BigJobbies'
(master) $ ls BigJobbies/
(master) $ git submodule update
Cloning into BigJobbies...
done.
Submodule path 'BigJobbies': checked out 'e9ed329101157ce9be5dc1c2639096bd82d3fa05'
(master) $ ls BigJobbies/
LICENSE.txt	Movies		README.md	git-bigjobbies

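In other words, we can clone the parent without acquiring any of its children. To populate the child submodules, we need to run git submodule update, which checks out the commit recorded in the parent. (You also need to run the update again whenever the parent's pointer moves to a newer commit that you want to acquire.) Note that more recent versions of Git expect a submodule to be initialised first, with git submodule init or git submodule update --init; there, git submodule sync only affects submodules that have already been initialised.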
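The clone-then-populate sequence can be wrapped up as a small sketch like the one below. The clone_with_submodules name is my own; the --init and --recursive options to git submodule update are standard, and git clone --recurse-submodules is a one-step alternative in newer versions of Git.

```shell
# Hypothetical helper: clone a parent repository and then populate every
# submodule at the commit the parent has recorded for it.
# "$1" is the parent's URL or path, "$2" the destination directory.
clone_with_submodules() {
    git clone "$1" "$2" &&
    # --init registers each submodule from .gitmodules before updating;
    # --recursive also handles submodules nested inside submodules.
    git -C "$2" submodule update --init --recursive
}
```

This is only a convenience; running the two commands by hand has the same effect.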

Parent-child relationships

Sometimes you want to be able to couple two repositories together, such as a game development project with its media assets, or a set of binary releases with a source project. It's tempting to think of these relationships as the binaries being part of the source project (or a submodule), or the media assets as part of the game source (or a submodule).

However, it's often better to reverse the direction of the dependency between these sorts of repositories. In other words, instead of having a source repository with a child submodule of the binary assets, have a binary assets repository with a submodule of the source.

Flipping the relationship in this way allows you to treat the source repository as a standalone unit, which doesn't need references to the large binaries, but permits a full checkout of the parent repository (which does have the binaries).

For projects where the source has no need for the binaries (as with precompiled packages for open-source projects), this arrangement keeps references to upstream binary repositories out of the source repository, so the binaries can't be checked out accidentally (especially if other submodules are used).

It's also possible to put the source and the binaries in two completely independent repositories, then knit them together with a higher level git repository (with two submodules). The parent can then be used as a top-level 'release' repository, whilst still allowing the binaries and source code to be acquired independently.
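A top-level 'release' repository of that kind might be put together along these lines. This is only a sketch: the make_release_repo name, the source and assets directory names and the commit message are all assumptions, and the two URLs are whatever repositories you want to knit together.

```shell
# Hypothetical helper: build a top-level "release" repository that pins
# one commit of a source repository and one commit of an assets repository.
# Arguments: <new-directory> <source-url> <assets-url>.
make_release_repo() {
    git init "$1" &&
    git -C "$1" submodule add "$2" source &&
    git -C "$1" submodule add "$3" assets &&
    # The commit records .gitmodules plus one gitlink per submodule.
    git -C "$1" commit -m "Pin source and assets for release"
}
```

Each later release would then be a new commit in this repository that moves one or both pointers forward.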

Finally, one advantage of having the binary (larger) repository as the parent is that it will still work if you clone it with git clone --depth 1. When you use the --depth 1 flag, you're essentially saying that you don't want any of the history, just the latest commit on that branch. The latest commit will have a pointer to a commit in the source code's repository (which will have the full history), and so this permits you to check out a single (latest) version of the binaries with access to the source code's full history.
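As a sketch of that last point, the helper below takes a shallow clone of the parent and then populates its submodules with a separate update. The function name is hypothetical, and it assumes the parent's .gitmodules does not itself ask for shallow submodule clones.

```shell
# Hypothetical helper: fetch only the latest commit of the (large) parent,
# then clone its submodules with their history intact.
# "$1" is the parent's URL (use file:// for a local repository, since
# --depth is ignored for plain local paths), "$2" the destination.
shallow_parent_full_submodules() {
    git clone --depth 1 "$1" "$2" &&
    # No --depth is passed here, so submodules are cloned in full unless
    # .gitmodules recommends otherwise.
    git -C "$2" submodule update --init
}
```

Afterwards, git log in the parent should show a single commit, whilst git log inside the source submodule shows everything.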


Come back next week for another instalment in the Git Tip of the Week series.