Tuesday, December 20, 2011

Git Tip of the Week: Finale

Over the past nine months, I've been writing the Git Tip of the Week series, posting a Git-related article every week. Part of the motivation has been a desire to learn more about the way the Git internals work; part has been to provide a reference for others as well.

However, all good things must come to an end, and writing a weekly post, whilst keeping it fresh, is a non-trivial task. In addition, finding something new (or even vaguely interesting) to write about is increasingly difficult once you've covered the standard cases and a number of esoteric ones.

So, for my final post in the series, rather than writing about something new, I thought I'd link back to the ones that I've written before, in order. Thus, if you want to share the series with others, you can refer back to this index page as a means of finding out what I wrote when. The search list is all well and good, but it only shows the 20 most recent posts. In addition, I've added some of my other Git-related articles that may be of interest, even though they weren't part of the Git Tip of the Week series.

I'd like to thank you for your time and interest in reading this series, and wish you a happy Christmas and a prosperous New Year.

Tuesday, December 13, 2011

Git Tip of the Week: Forking and Pulling vs Pushing

This week's Git Tip of the Week is about the GitHub generation. You can subscribe to the feed if you want to receive new instalments automatically.


Distributed Version Control Systems have really taken off in the last few years, though they've been around for over a decade. Probably the biggest growth spurt happened because of the controversy that launched Git back in April 2005, providing a rock solid distributed version control system, modelled on a filesystem. In only a few months, Git began hosting the 2.6.12 Linux kernel source.

However, as popular as Git may have been then, it wasn't until the birth of GitHub that Git really took off. Founded in February 2008, GitHub brought Git to a much wider audience and provided a free hosting site for public Git repositories (as well as commercial plans for private repositories). It has been argued that GitHub is one of the reasons why Git has taken off faster than other systems, like Hg and Bzr.

What GitHub brought was a focus on a new model; instead of creating patches (as covered last time), GitHub encouraged universal forking of the repository. So, if you want to add a change to an existing repository, you can fork it (and create your own clone), make the changes, and then send a pull request.

There's nothing significant about pull requests in the Git workflow as compared with other DVCS tools. Pushing and pulling are two key primitives in a DVCS workflow, after all. But what was novel about GitHub's approach was the way that pull requests could be sent, as an out-of-band message to the upstream repository owner suggesting the idea.

Not only that, but the upstream owner would then get a notification and be able to view the request in situ, and with the diffs as appropriate (outside of a mail client, via the web interface). Subsequent advances, such as the ability to fork-to-fix-typos, meant that anyone could suggest changes via the web without even needing to compile the code locally.

Pushing, Pulling, Patching or Proposing

As a result, Git repositories can end up with different workflows depending on the type of project and hosting environment you are using. They are:

  • Pushing: You have access to directly write into the repository, so you just push your changes
  • Pulling: Someone has changes locally and asks you to pull the change from their repository
  • Patching: You send the diffs/patches by a transport mechanism (bugzilla, email) for consideration
  • Proposing: You use a tool like Gerrit to propose changes, which can then subsequently be merged

Each project has potentially a different style of operation, and there isn't a "right" way to use a Git repository. GitHub, for example, strongly favours the Pulling model when consuming changes from others (though of course, the repository owner can do pushing directly). The Linux Kernel, both for historic reasons and also for transparency and open discussions, chooses the patching (by e-mail) model.

The final one, proposing, is a combination of the pull, push and patch models. It is similar to GitHub's pull mechanism, in that the project's owners can see a list of all incoming changes and decide which ones to use; but the push-based upload means that the original repository doesn't have to be forked on the remote server. And finally, tools like Gerrit (which I've mentioned before) can be used to generate patches, host in-situ discussions, and even act as a Git repository for consumption via the standard git fetch protocols.
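
As a rough illustration of the pulling model, the sketch below simulates a fork with two local repositories in a throwaway directory; no hosting service is involved. The directory names, user names and e-mail addresses are made up for the example, and a real pull request would go via the hosting site rather than a file path.

```shell
#!/bin/sh
# Sketch only: "fork + pull" simulated with two local repositories.
set -e
work=$(mktemp -d)
cd "$work"

# The upstream project (stands in for the original repository).
git init -q upstream
cd upstream
git config user.name "Upstream Owner"
git config user.email "owner@example.com"
echo "hello" > README
git add README
git commit -q -m "Initial commit"
cd "$work"

# The contributor "forks" by cloning, then commits a fix in the fork.
git clone -q upstream fork
cd fork
git config user.name "Contributor"
git config user.email "contributor@example.com"
echo "hello, world" > README
git commit -q -a -m "Fix greeting"
fork_head=$(git rev-parse HEAD)

# The upstream owner pulls the change from the contributor's repository.
cd "$work/upstream"
git pull -q --ff-only ../fork HEAD
upstream_head=$(git rev-parse HEAD)
```

After the pull, the upstream repository's HEAD is the contributor's commit; the contributor never needed write access to upstream.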

GitHub's pull-based approach has certainly had a wide impact on the number of users willing to try that method. They have a note on collaborative development models on the subject:

  1. The Fork + Pull Model lets anyone fork an existing repository and push changes to their personal fork. This model reduces the amount of friction for new contributors and is popular with open source projects because it allows people to work independently without upfront coordination.
  2. The Shared Repository Model is more prevalent with small teams and organizations collaborating on private projects. Everyone is granted push access to a single shared repository and topic branches are used to isolate changes.

Certainly, if there are minor changes (like a typo in documentation) the fork-and-pull model, when combined with a web-based interface, can make things dramatically easier for contributors. Instead of having to create accounts on bug tracking systems (or tools like Gerrit), the repository can be forked, fixed, and a pull request fired off to the repository maintainers. With the merge button in GitHub, the maintainer can often merge the fix without having to check the code out at all, if it's sufficiently simple. Reducing the barrier to accepting changes helps keep an active open-source project alive and open to all.

The only problem with the Fork + Pull model is being able to attribute changes to a user. For example, some open-source foundations want to ensure that any changes are granted under an existing open-source licence (Apache or Eclipse, for example). Other projects tend not to be as strict and will happily accept contributions from anyone, with the assumption that any contributors have agreed to the licence. One additional service that patches-to-bugzilla or a Gerrit push provides is the acceptance of a contributor agreement, which normally states that the individual is entitled to grant the code under the specific licence. Creating an account often implies (explicitly or implicitly) agreement to follow that foundation's licensing rules.

So, there's no "right" way to do Git; different teams, foundations and projects will have their own preference for working with a particular strategy, and that preference may evolve over time. Instead, it's useful to understand the different flows available so that the right choice can be made for each project.


Come back next week for another instalment in the Git Tip of the Week series.

Tuesday, December 06, 2011

Git Tip of the Week: Patches by Email

This week's Git Tip of the Week is about how git handles patches by email. You can subscribe to the feed if you want to receive new instalments automatically.


One of the main benefits of a distributed version control system is that code changes can be pushed from one repository to another clone, and all dependent changes are pushed as well. However, that only works when you have write access to the remote repository, which in many cases you do not. One way of getting changes is by providing a patch, or a set of changes which can be applied to a remote repository at the other end.

Git started life as a distributed version control system for the Linux project, which actively uses mail lists both as a discussion mechanism and also as a distribution mechanism for patches (changes) for an existing codebase. (New features are just a special case of patching nothing to add the new code.)

To speed the processing of patches by mail, git developed tight integration with both (command-line) mail clients and the generic Unix mbox format. Patches can be generated in the form of mail messages, and the remote end can process them with a specific command to reconstitute the changes in the git repository.

Whilst the majority of projects don't use patches by mail as a change distribution mechanism, it is useful on occasion where either a patch needs to be generated and attached to a bug tracking system, or where changes need to be sent to a remote developer who doesn't have direct access (such as through a firewall).

The convention adopted by the git developers is to format one patch per e-mail message. The subject of the message is then the first line of the git commit message, preceded by a prefix that defaults to [PATCH x/y] (as a means of threading the messages together) but which can be overridden on the command line. (Amongst other reasons, this is why the initial line of a Git commit message is suggested to be relatively short, so that it fits with a mail client's view of the subject and suggested prefix.)

Generating and sending patches

How do we generate these patches? The git format-patch command generates one patch file per commit in the required range, formatted ready to go as mail messages in mbox format. The --to option specifies the mail address the patches should be sent to, though the sending is done separately.


(master) $ git format-patch --to cdt-dev@eclipse.org HEAD~..HEAD
0001-bug-333001-Description-Scanner-Info-doesn-t-release.patch
From 9c9c692df50e5a9eb91b41cc86f57212afd78ef9 Mon Sep 17 00:00:00 2001
From: Andrew Gvozdev …
Date: Sat, 16 Jul 2011 15:16:21 -0400
Subject: [PATCH] bug 333001: Description Scanner Info doesn't release
 ICProjectDescription
To: cdt-dev@eclipse.org

---
 .../cdt/internal/core/model/CModelManager.java     |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
⋮

If the commit message has more detail than a single line, the detail will be included below the mail's subject headers. It's possible to add additional commentary after the --- line that follows the commit message, before the patch itself; this text is left out when the patch is applied at the other end. Similarly, when git am is run with --scissors, everything in the message body above a 'scissors line' (such as -- >8 --) is discarded.
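
To see the shape of a generated patch without needing the CDT repository, the following sketch builds a two-commit repository in a temporary directory and formats the last commit. The file name, commit messages and addresses are invented for the example.

```shell
#!/bin/sh
# Sketch only: generate one patch from a scratch repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "Add file"
echo "two" >> file.txt
git commit -q -a -m "Append a second line"

# One patch file per commit in the range; the file name is printed.
# --to only fills in the To: header, it does not send anything.
patch=$(git format-patch --to=dev@example.com HEAD~..HEAD)

# The mbox-style separator line and the generated subject.
first_line=$(head -n 1 "$patch")
subject=$(grep '^Subject:' "$patch")
echo "$patch"
echo "$subject"
```

The patch file name is derived from the first line of the commit message, which is another reason to keep that line short.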

These patch files can then be transmitted via mail using the git send-email command. This connects to the given SMTP server (either the one from your global ~/.gitconfig or the project's .git/config, or the one specified on the command line) and then sends each patch file as a separate e-mail:


(master) $ git send-email --smtp-server=smtp.gmail.com *.patch
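
If you'd rather not repeat the server on every invocation, the settings can be stored in configuration instead. This is a minimal sketch using a scratch repository; the server name, port and recipient are placeholders, not recommendations for any particular mail provider.

```shell
#!/bin/sh
# Sketch only: store send-email settings in a repository's configuration.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Written to this repository's .git/config; add --global for ~/.gitconfig.
git config sendemail.smtpserver smtp.example.com
git config sendemail.smtpserverport 587
git config sendemail.smtpencryption tls
git config sendemail.to dev@example.com

# With these set, sending reduces to: git send-email *.patch
# (git send-email --dry-run *.patch shows what would be sent, without sending)
server=$(git config --get sendemail.smtpserver)
recipient=$(git config --get sendemail.to)
```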

As well as using format-patch in a separate stage, it's possible to use send-email to generate the patches and then send them immediately. (You can also configure send-email to prompt to open an editor so that you can customise the messages before they are sent.)

Applying patches

Once the patches have been created, how do you apply them into a local clone? If you have a patch file, you can apply it with git apply:


(master) $ git apply 0001-bug-333001-Description-Scanner-Info-doesn-t-release.patch

Note, however, that this approach does not recreate the state of the world as it was on the sender's repository. The patch is applied, but it only makes changes to the files in the working tree; it does not recreate the commit (and more specifically, the hash of that commit). You can specify git apply --index or git apply --cached to get the changes put into the staging area, but this does not recreate the same commit as before.
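
The difference can be seen in a scratch setup like the one below, where a receiver clone is one commit behind the sender. The repository names and file contents are invented for the example.

```shell
#!/bin/sh
# Sketch only: git apply changes files (and optionally the index), never HEAD.
set -e
work=$(mktemp -d)
cd "$work"

git init -q sender
cd sender
git config user.name "Sender"
git config user.email "sender@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "Add file"

# The receiver clones before the next change is made.
git clone -q . "$work/receiver"

echo "two" >> file.txt
git commit -q -a -m "Append a second line"
patch=$(git format-patch -o "$work" HEAD~..HEAD)

cd "$work/receiver"
before=$(git rev-parse HEAD)

# Plain apply: working tree only, nothing staged, no commit.
git apply "$patch"
unstaged=$(git status --porcelain)

# Undo, then apply again with --index: working tree and staging area.
git checkout -q -- file.txt
git apply --index "$patch"
staged=$(git status --porcelain)

after=$(git rev-parse HEAD)
```

In both cases HEAD is unchanged afterwards; you would still need to run git commit yourself, and the original author and message would be lost unless you copied them by hand.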

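To recreate the commit itself requires the use of git am, which stands for apply mailbox. This runs through a mailbox (which may have one or more patches in it) and creates a commit for each one of those patches, preserving the original author, date and message. (The committer is whoever runs git am, so the resulting hash will generally differ from the one in the sender's repository.)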

Fortunately, the output generated by git format-patch is already in mbox format; that is the purpose of the otherwise dummy From 9c9c692df50e5a9eb91b41cc86f57212afd78ef9 Mon Sep 17 00:00:00 2001 line at the top of the patch file. As a result, each patch file can be treated as an mbox containing one message, and the patches that were sent can be applied in a batch.

In fact, since mbox elements can be concatenated together, this permits patch files to be concatenated together to form a larger patch file, which can be sent as a single unit via another transfer mechanism and then applied on the remote side.
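
As a sketch of that idea, the following concatenates two patch files into a single mbox and applies it with git am in a receiver clone. The names Sender and Receiver, the file contents and the changes.mbox file name are all invented for the example.

```shell
#!/bin/sh
# Sketch only: concatenate patches into one mbox and apply with git am.
set -e
work=$(mktemp -d)
cd "$work"

git init -q sender
cd sender
git config user.name "Sender"
git config user.email "sender@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "Add file"

# The receiver starts from the first commit only.
git clone -q . "$work/receiver"

echo "two" >> file.txt
git commit -q -a -m "Append a second line"
echo "three" >> file.txt
git commit -q -a -m "Append a third line"

# One file per commit, then joined into a single mailbox.
mkdir "$work/patches"
git format-patch -q -o "$work/patches" HEAD~2..HEAD
cat "$work"/patches/*.patch > "$work/changes.mbox"

cd "$work/receiver"
git config user.name "Receiver"
git config user.email "receiver@example.com"
git am -q "$work/changes.mbox"

count=$(git rev-list --count HEAD)
author=$(git log -1 --format=%an)
committer=$(git log -1 --format=%cn)
```

The receiver ends up with three commits, with the original author recorded on the applied ones and the person who ran git am recorded as the committer.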

Bundles

Patches provide a way of reconstituting changes in a repository that isn't directly connected, but the purpose of patches is as much to enable humans to investigate the set of changes as to get the changes there. If, however, the desire is simply to move commits from one machine to another without direct connectivity, a better alternative is to use git bundle.


(master) $ git bundle create changes.bundle HEAD~..HEAD
Counting objects: 23, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (8/8), done.
Writing objects: 100% (12/12), 935 bytes, done.
Total 12 (delta 5), reused 6 (delta 0)

A bundle uses the same pack format that Git uses over the network when cloning. As a result, the references it contains are only those named when the bundle was created.

Typically, a tag will be used to mark where the last known point was for the remote source; then, the difference between HEAD and that tag is used to build up the bundle for the remote end. Alternatively, branches can be used to simulate the branch on the remote end.
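
A minimal sketch of the tag-based approach follows. The tag name lastbundle is purely an illustration (any name will do), as are the repository and file names.

```shell
#!/bin/sh
# Sketch only: use a tag to remember what the remote end already has.
set -e
work=$(mktemp -d)
cd "$work"

git init -q source
cd source
git config user.name "Example Author"
git config user.email "author@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "First"

# Hypothetical marker tag: the last commit the remote end is known to have.
git tag lastbundle

echo "two" >> file.txt
git commit -q -a -m "Second"
head_sha=$(git rev-parse HEAD)

# Bundle only what has happened since the tag.
git bundle create "$work/changes.bundle" lastbundle..HEAD

# Move the tag forward, ready for the next incremental bundle.
git tag -f lastbundle HEAD

heads=$(git bundle list-heads "$work/changes.bundle")
tag_sha=$(git rev-parse lastbundle)
```

Only move the tag once you're reasonably sure the bundle has reached the other end; otherwise the next bundle will depend on commits the remote doesn't have.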

Once the bundle file has been generated, it can be sent over any transport to the remote host for reconstitution. This might involve burning it to a CD, copying it to a USB stick or sending it via some network protocol.

On the client side, the client can run git bundle verify to determine whether all required parent commits are present in the local repository. This must be run from within the git repository that you want to fetch into.

The client views the bundle as a remote that it can pull from, much like a path to a directory can be used to pull from a local file-based repository. You can add it as a remote (e.g. git remote add changes /tmp/changes.bundle) or you can fetch from the path to the bundle itself:


(master) $ git bundle verify /tmp/changes.bundle
The bundle contains 1 ref
b707c559636bf8e6dffb3145bd44b03de18868b3 HEAD
The bundle requires these 1 ref
3580c1087c2860fbe6ca4c1a7a6d6e1eb1669aa3 Bug 333599 - [C++0x] Initializer lists & return without type
/tmp/changes.bundle is okay
(master) $ git fetch /tmp/changes.bundle
From /tmp/changes.bundle
 * branch            HEAD       -> FETCH_HEAD

Once the references have been fetched into the repository (which can be referred to as FETCH_HEAD) you can then inspect the changes, fetch/merge them into the local branches or reset your master branch to that of FETCH_HEAD.
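
Putting the bundle steps together, here is a sketch of the whole round trip using two scratch repositories on one machine, with a file copy standing in for the CD or USB stick. All names are invented for the example.

```shell
#!/bin/sh
# Sketch only: create a bundle, then verify, fetch and merge it elsewhere.
set -e
work=$(mktemp -d)
cd "$work"

git init -q source
cd source
git config user.name "Example Author"
git config user.email "author@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "First"

# The receiver has the first commit only.
git clone -q . "$work/receiver"

echo "two" >> file.txt
git commit -q -a -m "Second"
source_head=$(git rev-parse HEAD)
git bundle create "$work/changes.bundle" HEAD~..HEAD

cd "$work/receiver"
# Check that the commits the bundle depends on are present here.
git bundle verify "$work/changes.bundle"
# Fetch from the bundle as if it were a remote; the result is FETCH_HEAD.
git fetch -q "$work/changes.bundle" HEAD
fetched=$(git rev-parse FETCH_HEAD)
# Fast-forward the current branch to the fetched commit.
git merge -q --ff-only FETCH_HEAD
receiver_head=$(git rev-parse HEAD)
```

Unlike the mailed patches, the commit arrives byte-for-byte, so the hash on the receiving side matches the sender's.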

Summary

It's not always possible to have write access to the repository you want to send changes to. In those cases you can send changes out of band, either via mail (if you want human reviews) or as a bundle (if you just want to send the commits).


Come back next week for another instalment in the Git Tip of the Week series.