Wednesday, February 23, 2011

Reflections on Objective-C, Part II

References

In my last Reflections on Objective-C post, I put forward some of my thoughts about Objective-C's approach to memory management, specifically NSAutoreleasePools. This drew some flak from developers pointing out that reference counting is neither unique to Objective-C nor the only way of handling memory management, both points with which I agree. However, my point was that this is ubiquitous in Objective-C applications, which is not necessarily the case with the other approaches cited in the comments. One commenter tried to brush the issue under the rug by claiming that NSAutoreleasePools are part of Apple's Foundation framework rather than the Objective-C syntax. (Well, that's true; but technically, 'malloc' is not part of the C language syntax either, yet it's present in all C libraries.) But let's put that behind us ...

In this post, I want to talk about Objective-C's approach to dealing with messaging objects, particularly with reference to the dynamic aspects.

Although Objective-C is a super-set of the C programming language, and is thus statically typed and compiled, the dynamism of the language goes past that of traditional compiled languages, and it behaves more like Smalltalk, upon which it is based. To understand how this works, it's worth reviewing how messages in Objective-C work.

Objective-C classes are split into interface and implementation sections. The interface exports the data layout for the class (which under the covers – in the old ABI, at least – is simply a structure containing an ordered set of strongly typed variables, or ivars), as well as the public set of methods which are known to the class. The implementation contains implementations of these methods, but may also contain additional methods not present in the interface (for example, to implement private or internal functionality).

So far, this hasn't differed from how other languages may work (Java combines both the interface and implementation structures in one unit; an interface in Java corresponds to a Protocol in Objective-C). You can send a message (referred to as a 'method call' in other languages) to any object, and the compiler verifies that the message sent is known to the interface of the type of the variable. As an example, you can send the “length” message to an NSString, and get back the number of characters contained.

There are two key differences between other compiled OO languages and Objective-C, however. Firstly, it's possible to send a message to an object which it doesn't know about. That is, you can programmatically send “longth” to an NSString. If you define the target to be of type NSString, you'll get a compiler warning (though warnings are promotable to errors) saying that it “may not respond to 'longth'”.

Why would a feature which allows errors to creep in be considered a benefit? Well, as well as being declared as a specific type (or one of its superclasses), a variable can be declared with a generic type called id. If you were to send “longth” to a target typed as NSObject, you'd get a warning (much like you would with an NSString target). However, you can send anything to a target of type id without any compiler errors or warnings.

In either case, and whether or not the receiver's interface declares the message, the instance may have an implementation for the method. If the method is implemented, it will be invoked. This means you can send “count” to id objects, and if they respond to this message you'll get a result back. (In dynamic languages, this is referred to as 'duck typing'.)
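As a rough illustration of the idea, written in Python rather than Objective-C purely to keep the sketch short and runnable, the function below sends "count" to whatever it is handed. The names send_count, Basket and Rock are hypothetical; the hasattr check plays roughly the role that respondsToSelector: plays for a receiver typed as id.

```python
class Basket:
    """Hypothetical class that happens to implement count()."""
    def __init__(self, items):
        self.items = list(items)

    def count(self):
        return len(self.items)


class Rock:
    """Hypothetical class with no count() method at all."""


def send_count(receiver):
    # No declared type is consulted; only the instance's actual methods matter.
    if hasattr(receiver, "count"):
        return receiver.count()
    return None  # the receiver does not respond to this message
```

Neither class shares a base type or protocol with the other, yet both can be passed to send_count; only the one that implements the method produces a result.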

Being able to send messages to objects which may not understand them might seem a little strange. However, this happens all the time in Java where “Method.invoke()” is used to invoke methods which may not have been known about at compile time.

This is used to great effect to implement a couple of patterns in Objective-C. The first is delegates, where standard UI components call back application code in response to specific events.

The standard UI framework classes obviously know nothing specific about the classes implemented by end application developers. However, they do call back with certain messages (like touchesBegan or windowShouldClose). In these cases, the delegate is typed as id, so any arbitrary messages can be sent. This is also one of the key reasons that the Objective-C frameworks are so extensible; instead of requiring arbitrary extensions to be defined and implemented, the framework can send a message whether it is implemented or not. As new features – and therefore methods – are added, client classes don't need to be recompiled.
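The delegate pattern can be sketched in the same spirit. This is a Python analogy under stated assumptions, not how the Cocoa classes are written: Window, UnsavedDocument and window_should_close are hypothetical names, and the framework side only calls the callback if the delegate happens to implement it.

```python
class Window:
    """Hypothetical framework class; it knows nothing about application types."""
    def __init__(self, delegate=None):
        self.delegate = delegate
        self.is_open = True

    def close(self):
        # Ask the delegate only if it implements the optional callback.
        handler = getattr(self.delegate, "window_should_close", None)
        if callable(handler) and not handler(self):
            return False  # the delegate vetoed the close
        self.is_open = False
        return True


class UnsavedDocument:
    """Hypothetical application class acting as the delegate."""
    def window_should_close(self, window):
        return False
```

A delegate written before the callback existed simply lacks the method, and the window falls back to its default behaviour, which is the extensibility property described above.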

The second pattern this allows is dynamic proxying. Java has a java.lang.reflect.Proxy which can implement a Java interface and forward all methods through a single handler. However, its use is often clumsy and if multiple methods are required then the invocation handler has to provide the switching logic itself.

In Objective-C's case, the base NSObject has a method called forwardInvocation:, which acts as a default method if the instance does not respond to the message sent.

This is used to implement Remote Messaging. This allows a thin proxy to intercept messages on a local machine, forward them over a network protocol, and then invoke the object on the remote side transparently. Unlike (say) RMI or CORBA, you don't need to generate any compile-time stubs or implementation ahead of time; it can be dynamically added to a class at runtime.
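A minimal sketch of the forwarding idea, again in Python as an analogy: __getattr__ is only consulted when normal lookup fails, which is loosely comparable to forwardInvocation: being the fallback for unrecognised messages. ForwardingProxy is a hypothetical name, and here the "remote side" is just a local object; a real remote proxy would serialise the message and send it over the network instead.

```python
class ForwardingProxy:
    """Hypothetical proxy that forwards any message it does not implement."""
    def __init__(self, target):
        self._target = target
        self.log = []  # names of the messages that were forwarded

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so the proxy's
        # own attributes (_target, log) never reach this point.
        target_method = getattr(self._target, name)

        def forward(*args, **kwargs):
            self.log.append(name)
            return target_method(*args, **kwargs)

        return forward
```

No stub is written per method: wrapping a string, for instance, lets the proxy answer upper() or count("l") without the proxy class mentioning either.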

As with my previous post, this isn't meant to say that responding to dynamic messages or invoking messages can't be done in other languages; rather, the fact that it's built into the base NSObject runtime means that every object can take advantage of this runtime functionality. And even if it seems that there isn't any benefit from doing this, there's a lot of code which uses this functionality to good effect in the iOS/OSX frameworks.

Friday, February 18, 2011

Gerrit Git Review with Jenkins CI Server

Last week, I published a piece on Gerrit and Jenkins about how I saw the future of distributed development teams. It got a fair number of views and positive comments (both off- and on-line) so I followed through with my promise to record a demo of using all the systems together.

This uses Gerrit, combined with Jenkins CI and the Gerrit Trigger Plugin. I skipped explaining how to set this up in the video, preferring to show how it's used so as to give a feel for how it all fits together. If there's interest, I might do a subsequent demo on getting started with Gerrit and Jenkins, talking about the installation steps needed. But without further ado, here's the video:

Gerrit Git Review with Jenkins CI Server from Alex Blewitt on Vimeo.

Instead of the turn-of-the-century approach of attaching Git patches to Bugzilla, Eclipse needs to get off the wait-and-see bandwagon and jump onto Gerrit. Gerrit can collect contributor agreements before accepting any pushes, and will also have full traceability for where the changes came from by virtue of 'Signed-off-by' tags that can be added by committers. All you need to do is have a commit hook which verifies that either the author or the signed-off-by signatory is a valid Eclipse committer (Gerrit will ensure that in order to push, the contributor agreement has been filled out) and then we never need bother with attaching patches to Bugzilla again.

Although I didn't explicitly make mention of it in the video, the other advantage of using Gerrit to store code reviews is that anyone can pull down the code review in-situ in the repository at any time. No more patches going stale in Bugzilla; you can pull them out of the review at any time, and rebase where necessary to bring them back up.

As Eclipse starts moving down the road towards Git migration, it needs to seriously think about the tools in place to make it happen; and if there's not the support for that internally, then it needs to be made to happen externally. You wanted to know how to spend the Friends of Eclipse donations – getting Gerrit and Git in place for all is what's wanted.

Oh, and if you're wondering, the graphical countdown timer is Naughty Step, an app I wrote so that my kids would know how long they had to go when they were naughty, hence the name. It's a bargain at 99¢ if you want to buy a copy :-)

Tuesday, February 15, 2011

Reflections on Objective-C

I've been using Objective-C on and off for a couple of decades, and although it has a few warts and historical dependencies, it still remains one of my favourite languages to write in. Partially, it's because it was my first OO language (Java wasn't invented until 3 years afterwards), and partly because the language has a number of powerful features that don't occur in other languages.

Many older developers may feel the same way about Smalltalk; certainly, a number of patterns survived though Smalltalk itself never grew to a significant impact in the industry.

(One can argue that only one company seriously backs Objective-C; but then again, only one company backs VB.Net, and that's widely used too.)

What is more difficult to understand is the number of developers who code in C++ but either haven't tried, or don't want to try, Objective-C. gcc, the main compiler in the open source world, has had Objective-C support since before Java was invented. Yes, there have been a few recent changes to the language (blocks, properties) but it's still a viable toolchain.

One of the main reasons to prefer Objective-C over C++ is a sane and consistent memory management model. With any C based language, you have to know if you are responsible for freeing the result of the call, or if not, strcpy'ing the results to a new bit of memory. This results in significant churn, as each caller in the stack may spend its time bitshuffling data around just to ensure ownership of data. Quite often this results in APIs designed with pass-the-pointer semantics instead, where a block of memory is pre-allocated up the stack, then recursively passed down the layers until an API fills it.

None of this helps with dynamically generated content, and is significantly error prone. A significant chunk of code can be either error handling, or bitshuffling, masking what is happening under the covers.

Objective-C gets this right by using refcounting (or gc, but more on that later). When initially created with "new" (or "alloc/init"), an object's refcount is set to 1. Refcounts are incremented with each "retain" call, and decremented with each "release". Logic in the release call checks to see if it's the last one out the door (refcount 0) and, if so, switches off the lights (the "dealloc" call).
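The rules in the paragraph above can be modelled in a few lines. This is a toy Python model, not NSObject: RefCounted, retain_count and deallocated are hypothetical names chosen for the sketch.

```python
class RefCounted:
    """Hypothetical model of the retain/release rules; not a real NSObject."""
    def __init__(self):
        self.retain_count = 1      # a newly created object is owned by its creator
        self.deallocated = False

    def retain(self):
        self.retain_count += 1
        return self

    def release(self):
        self.retain_count -= 1
        if self.retain_count == 0:
            self.dealloc()         # last one out switches off the lights

    def dealloc(self):
        # Invoked by release(), never directly by callers.
        self.deallocated = True
```

Creating an object, retaining it once and releasing it twice leaves it deallocated; releasing it only once leaves the count at 1 and the object alive.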

This is built into NSObject, so everything has these semantics. Objective-C users never call "dealloc" themselves, whereas C++ users have to call the equivalent ("delete") all the time.

OK, so Objective-C does refcounting consistently. Fine - there are C++ libraries that support this model as well. What makes Objective-C better?

The trick up Objective-C's sleeve is to realise that there are situations where you want to return a temporary result - say, the result of a sprintf type operation - and then immediately discard the result. Clearly the API can't free the memory before it returns it, but nor do we want the callers to suffer the same fate as their C and C++ brethren where the only way to know if they should free it is to read the API docs.

Objective-C has a unique way of dealing with this problem through the use of autorelease pools. An autorelease pool is a global list of objects to which any thread has access. Any object which is created in a temporary fashion can be added to the autorelease pool, so that it doesn't go out of scope. This allows a return-and-forget method to add a transient object to the autorelease pool and then neither it nor any callers need worry about it again.

Left unabated this autorelease pool would grow in size and eventually cause a depletion of memory resources. Instead, this autorelease pool is flushed periodically ("drained") to remove any objects in its scope, and replaced with a new autorelease pool. That way, memory is recycled and caller code doesn't have to worry about doing extra work.

The autorelease pool is normally set up as part of the system when loading an app; you can see in the main.m of a generated project that there's an autorelease pool creation at the start of the code. If you don't have one, you see messages about objects being created with no autorelease and just leaking; but since you rarely change main.m this isn't usually a problem.

Most iOS and OSX apps are runloop based. This means at the top level of the app, there's a "while(true) doStuff" type loop that runs through the possible inputs (network activity, keyboard, mouse etc) and, if any is pending, runs the appropriate action. (Rather than just CPU spinning, it uses external triggers to decide when to take action though.) What this means is there is a place which can be used to regularly drain the autorelease pool, which is done at the end of each cycle. So a mouse move event can generate all sorts of trash objects in the stack, and at the end of that action, the autorelease pool is emptied.

Although I indicated above that the pool is global, there can actually be more than one pool. What happens is the autorelease pool uses thread-local storage to store its pool (and by the way, this means multiple threads and multiple runloops need to have their own autorelease pools allocated). When you invoke "autorelease" on an object, it asks the NSAutoreleasePool for the pool in use, which obtains it from the thread-local storage.

In addition, you can have multiple autorelease pools per thread. It acts as if there were a stack of autorelease pools and the current pool is the top one in the stack. (It's also this code which prints out the "just leaking" message if the stack of pools is empty.)

To push a new pool on the stack, just create a new autorelease pool. Any subsequent autoreleased objects which get created are added to this pool instead of the original one. Once that pool is released, it gets popped off the pool stack, and subsequent autoreleases will go back to the prior pool.
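The per-thread stack of pools described in the last few paragraphs can be sketched as follows. This is a toy Python model under stated assumptions: Pool, Temp and the helper _stack are hypothetical names, and draining is simplified to removing one pool from the stack and sending one release per autoreleased object.

```python
import threading

_pools = threading.local()  # each thread sees its own stack of pools


def _stack():
    if not hasattr(_pools, "stack"):
        _pools.stack = []
    return _pools.stack


class Pool:
    """Hypothetical model of an autorelease pool."""
    def __init__(self):
        self.objects = []
        _stack().append(self)      # creating a pool pushes it on the stack

    def drain(self):
        _stack().remove(self)      # pop this pool off the stack
        for obj in self.objects:
            obj.release()          # one release per earlier autorelease
        self.objects = []


class Temp:
    """Hypothetical reference-counted object."""
    def __init__(self):
        self.retain_count = 1

    def release(self):
        self.retain_count -= 1

    def autorelease(self):
        stack = _stack()
        if not stack:
            print("no pool in place - just leaking")
            return self
        stack[-1].objects.append(self)  # the top pool takes the object
        return self
```

An object autoreleased while an inner pool is on top lands in the inner pool; once that pool is drained, later autoreleases go back to the outer one.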

It's not generally necessary to create your own autorelease pools, but if you have a tight loop in a while block that is long running and generates a lot of temporary objects, it can improve performance to create/release an autorelease pool outside the while loop and then place a pool drain each loop iteration - or every other, or every third...

The autorelease mechanism is just one of the things which makes Objective-C superior to C++, in my opinion. It enables patterns such as "if you init, you ownit" whilst others like NSString's stringWithFormat will give you a pre-autoreleased string back.

Having a standard retain count enables the "you keep it, you retain it" pattern (now somewhat superseded by the retain property generator). It also allows tools like llvm to perform semantic analysis on the code to verify that the retain/release calls are balanced.

Like other reference counting mechanisms, it doesn't prevent circular references pinning memory or prevent leaks by forgetting to release or errors caused by double releasing. However as a standard language feature it gives the developer far more time to concentrate on writing the app and not debugging memory related errors. (The OSX runtime permits the use of zombie tracking, as does the "Leaks" app, to assist with such cases.)

Hopefully this post will reach my C++ counterparts to explain one of the key differences between the languages. It's such a key feature - like RAII is in C++ - that it's a core part of how to write Objective-C code. By explaining the principles, and without using any scary at signs or square brackets, I hope this conveys how memory management in Objective-C works. And for those who are new to the Objective-C platform, I hope it's instructive to know what's going on under the covers.

Saturday, February 12, 2011

Comic Relief 1993

In 1993, whilst I was a student employee at IBM, we filmed a stunt for Comic Relief: a pretend kidnapping of Nick Temple, then the MD of IBM UK. He was an incredibly good sport (we actually managed to rip his jacket at one point, if I recall correctly). I played the part – what else – of one of the heavies. Most of the other young faces in the meeting room and auditorium were other students at the time.

Sadly, the video never aired on national television as it was deemed too realistic (seriously!) at the time. Now, thanks to the work of WDGS13, it has been digitised and put online; and of course, thanks to WDGS25, who organised, procured and drove the getaway car. (They didn't use the shot of the car driving away with me hanging out of the door, for some reason ...) Enjoy – WDGS01.

Wednesday, February 09, 2011

Someday ...

Someday, all software will be built this way.

I've been a fan of Git for a while now; I've written a few Git posts in the past including the explanatory Git for Eclipse users post, which explains the key differences between DVCS and CVCS.

I've also been using Gerrit for a while via the EGit review page at Eclipse. Gerrit is a code review system, based on the features that a DVCS gives you. If going Git is the main course, then Gerrit is the dessert; with Jenkins being the cherry on top.

Code review systems generally fall into one of two categories, which have advantages and disadvantages:

  • Pre-commit reviews, in which a diff is attached to a review system
    • Advantage: avoids polluting the version control system with versions which may be inaccurate or may not make it
    • Disadvantage: committed code post review may not exactly match the proposed change
  • Post-commit reviews, in which the change is committed and then later blessed
    • Advantage: ensures that exactly that change is part of the version history
    • Disadvantage: potentially pollutes the version control system with changes which need amendment or may have to be rewritten or aborted

There are of course other approaches, such as the man-in-the-middle commit (such as used by Eclipse, where patches are uploaded as attachments to a bug review system and then committed, possibly with changes, by a separate committer). However, this approach tends to work with open-source systems; and it doesn't solve the problem of committers (with write access) having their changes reviewed. In fact, the patch-attach tends to go stale far quicker than review systems.

Branches are the way forward

So how does DVCS help here? Well, the key problem with review-after-commit is not so much that the change exists in the version control system, but rather most implementations use review-after-commit-to-HEAD(trunk). As such, one bad code commit causes subsequent code commits to be invalidated, or at least, contain code which can't be easily undone (other than by committing a reverse patch).

The solution, therefore, is to use branches. If you commit onto a branch, you don't affect any developers on HEAD(trunk). The branch can be reviewed independently, suggestions or improvements made, checked against a build system, and then finally merged onto HEAD(trunk) post-review. Of course, if you have two concurrent changes, you need two branches. And if you have a team of ten developers, you need 10(*2) branches. Go down this road for any sensible amount of time and you quickly realise that you need one branch per change, so that no changes interfere with another.

Of course branches bring about merges (and merge conflicts), so using a tool which is implicitly based around branches and merging is a no-brainer. So using Git (or Hg), you can develop changes on a local branch, push that branch to a central warehouse, ask others to review it, and then merge exactly that change onto master(HEAD, trunk). Even better, since it's a DVCS, that merge commit will have the full history of the change (including the sign-off) so someone can say "Alex approved 01b3cd" and you know exactly what change that refers to.

There are a couple of variations on this theme in the Git world (Hg users tend not to like re-writing history) which involve 'squashing' the branch (i.e. removing all the intermediary steps and replacing them with a single unified diff of the branch) as well as 'rebasing' (moving the diff forward to master (HEAD, trunk) instead of creating a merge node, which joins together two otherwise unrelated Git trees). The different configurations here don't really affect the way that the review-after-commit-and-merge works; when you bless that code, you bless that code.

Enter Gerrit

Gerrit is a review-based tool which operates on a Git repository. (There's nothing significant that would prevent a tool like Gerrit working on Hg; but like GitHub, innovations tend to happen faster with Git.) The way Gerrit works is by being a process-in-the-middle between your local Git repository and the 'blessed' central Git repository. Once you use it, it's common for Gerrit to become the de-facto owner of the Git repository that it fronts; though since DVCSs don't have an enforced centralised Git repository as such, this can be changed if desired. It is common (in organisations replacing legacy version control systems such as CVS and SVN) to have a centralised server to host source data, which may be on higher-resiliency and backed-up hardware; so the central Gerrit instance can be an advantage for those looking to make the switch.

You configure Gerrit as a remote repository, much like you would with any other. In fact, you can use it as the only remote repository, by cloning from it initially. The EGit project, for example, is available via ssh://username@egit.eclipse.org:29418/egit.git, although it's faster to clone/pull from the unauthenticated http://egit.eclipse.org/egit.git, which is the same underlying on-disk data.

To push changes to Gerrit, you configure a remote based on the authenticated access. You also don't push to refs/heads/master (which is the Git synonym for HEAD or trunk) as you might if it was a standalone Git repository; rather, you push to refs/for/master, or to refs/for/other if you want to submit to a different branch. You can configure it with a wildcard, so any local branch can be pushed:

git config remote.review.url ssh://username@egit.eclipse.org:29418/egit.git
git config remote.review.push refs/heads/*:refs/for/*

The refs/for/master acts like a PUT request; there isn't a single branch with that name, but rather, each push to refs/for/master results in the creation of its own unique branch. In the case of EGit change I1c5ec794, the temporary branch allocated was refs/changes/46/2446/1. Other changes have their own branch; EGit change Ie639e366 corresponds to branch refs/changes/47/2447/2. (The 2 at the end in this case indicates the second version of the change; though this is a Gerrit specific notation. The first two digits are merely a directory discriminator based on the last two numbers of the change, so it contains 47/2347/*, 47/2247/* and so on.)
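The naming scheme in the parenthetical above can be written down as a small function. This is a sketch of the convention as described here, not code taken from Gerrit: change_ref is a hypothetical helper, and the zero-padding of the two-digit directory for change numbers below ten is an assumption.

```python
def change_ref(change_number, patch_set):
    """Build the refs/changes/... name for a numeric change and patch set."""
    shard = change_number % 100   # last two digits of the change number
    # Assumption: the directory discriminator is always two digits wide.
    return "refs/changes/{:02d}/{}/{}".format(shard, change_number, patch_set)
```

With this helper, change 2446 at its first version maps to refs/changes/46/2446/1, matching the branch quoted above.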

Once the change is in the DVCS, it's possible to generate diffs or any other kind of processing with standard Git tools. Not only that, because it's on a publicly accessible remote DVCS server, you can even checkout that particular changeset. Gerrit even helpfully contains the command needed to do that in the web page:

git fetch http://egit.eclipse.org/r/p/egit refs/changes/47/2347/2 && git checkout FETCH_HEAD

This makes it possible to bring a proposed change down, run tests or any other kind of processing on it. In fact, to make it really easy, the above change implements a “fetch from Gerrit” action in Eclipse, which permits you to check out a change and create a local branch from it.

By this point, you may well be thinking Gerrit sounds like a handy review tool. As well as storing changes, it lets you review the diffs and comment on a file-by-file or even line-by-line basis, as well as on the review as a whole. But it also stores review flags, which can include the standard kind of +1 and -1 votes. By default, each change needs a +2 code review vote before it can be submitted, though this can be configured.

There's also a +1 and -1 “Verified” flag, which was introduced to support Android development (which uses Git and Gerrit). The purpose of Verified is to ensure that the code compiles and passes its test suite, whereas the code review flag is a general sanity check.
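How the two flags combine can be sketched as a simple predicate. This is only an illustration of the default described here, not Gerrit's actual submit logic: can_submit is a hypothetical function, and treating any -2 review or -1 Verified vote as a blocker is an assumption, since the rules are configurable.

```python
def can_submit(code_review_votes, verified_votes):
    """Return True when a change has the votes this sketch requires."""
    # Assumption: one +2 review is enough, and any -2 blocks the change.
    reviewed = 2 in code_review_votes and -2 not in code_review_votes
    # Assumption: one +1 Verified is enough, and any -1 blocks the change.
    verified = 1 in verified_votes and -1 not in verified_votes
    return reviewed and verified
```

Under these assumptions two +1 reviews do not add up to a +2, and a change that has been approved but not verified still cannot be submitted.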

Enter Jenkins

Jenkins is a continuous integration server with a short history but a tumultuous past. As a continuous integration server, it can check out and execute builds, run tests and mail results. If you're used to continuous integration servers, you may have seen the ability to check for changes via the SCM and kick off builds automatically.

Jenkins is designed to permit arbitrary triggers to kick off builds. These can include timed triggers (nightly), polling an SCM for changes, or many other means. Normally, a trigger will just kick off a build based on a specific branch (e.g. master).

The Gerrit Trigger allows you to kick off a build when a new review in Gerrit is posted. Not only that, but it will also check out exactly the proposed change, compile it, and run all the tests – automatically. When it finishes successfully, it can post a +1 Verified (or -1, if it fails). All of this can happen before the reviewer has even had time to see your change, so that if the change causes a build failure, it can be rejected automatically.

In addition, there's a Git Plugin that can be used to tag the build and push changes up to the remote repository, so there's a persistent record of the change having successfully been built.

It's interesting that Kim Moir has posted on the Eclipse build process as well today. It looks like Hudson will be a key part of that, and the Jenkins and Hudson plugins are compatible (for the time being, at least). But if Eclipse is going to move over to Git full-time, then Gerrit will effectively become mandatory, either on Eclipse hardware or managed by individuals on behalf of specific teams.

Submission

So once a change has been pushed to Gerrit, automatically built and tested, flagged as verified, and reviewed by a couple of other developers, the change is good to go. Since Gerrit has all the information, it can apply the change to the master branch on the user's behalf.

Merging the branch can take a number of different forms; cherry-pick allows you to write the change on top of master, merge will create a merge node, and fast-forward requires the patch be 'up-to-date' before being committed. Either way, the contents of the master version control branch always go through a review and test process, whilst it's possible to guarantee that the changes merged are exactly the ones approved.

Conclusion

Once you go Git, you don't go back. Once you go Gerrit, unreviewed code becomes unthinkable. And once you go Jenkins, you don't even need to compile and test the code yourself. Someday, all software will be built this way.

Friday, February 04, 2011

Conferences this year

I'll be at a few conferences this year. Unfortunately, EclipseCon won't be one of them, but I'll be at:

If you're going to any of these conferences, I'll see you there! If not, there are still places (and early bird discounts) for the JAX London and QCon London conferences.

Tuesday, February 01, 2011

Final IPv4 blocks allocated

It looks like I IPv6-enabled my blog just in time, with the final IPv4 allocations being handed out. Onward and upward to World IPv6 Day; check whether your browser can handle it by going to Test-IPv6.com.