Tuesday, September 27, 2011

Removing CVS from the Eclipse Platform

Wayne has proposed that the Eclipse Foundation makes all the CVS Repositories read-only from December 2012. It was always the intention, once Git was accepted by the Eclipse community, to stop supporting at least one of the version control systems, and CVS has certainly had its day.

This leads to an interesting question: if CVS is no longer used by the Eclipse Foundation, should the Eclipse Platform even ship CVS support by default? CVS support has been present in the Eclipse Platform build since its inception, and in the EPP packages for every release train since.

Whilst there have been Subversion projects (Subversive at Eclipse and Subclipse at CollabNet/Tigris), they've never shipped with the default platform; in some cases, due to additional (non-EPL) dependencies that are required.

That's not to say CVS wouldn't be available – it would still be available as an installable feature in Eclipse update sites, much like Mylyn (or Subversive). But perhaps it is time to remove it from the default downloads.

I argue this should be done before Juno is released. Several projects wanted to stay on CVS until after Indigo shipped, and to maintain their CVS repositories to support Indigo SR1 and SR2. But we will always have the case where, immediately after Eclipse Release XXX ships, there will be a need to support Eclipse Release XXX SR1 and Eclipse Release XXX SR2.

CVS is an archaic technology and all Eclipse projects must move from it at some point. Making the CVS feature an optional install across the board is a signal to those that still need CVS that their time is drawing near. It wouldn't prevent the build scripts from working, nor would it prevent people from using CVS if they needed to. But it would mean that a signal has been raised.

Delaying this default removal until after Juno will bring us into exactly the same discussion, twelve months from now. The only way to remove it will be early in the milestones, and leaving it until after Eclipse Juno SR2 will be too late in the Eclipse Lightyear* release schedule for that one as well.

I say, remove CVS from the default downloads by the next milestone and replace it with an optionally installable feature.

* No, I have no idea what it will be called. Lightyear would be cool though.

Git Tip of the Week: Git Archive

This week's Git Tip of the Week is about archives. You can subscribe to the feed if you want to receive new instalments automatically.


If you want to extract the contents of a Git repository, perhaps to make it available for a source download somewhere, then you can of course zip (or tar) up the contents of the repository with a command line tool.

However, there's another way of doing this with a Git repository, using the git archive command. This takes the contents of a committed tree (not the files currently on disk) and generates a zip (or tar) file.

One key advantage of using Git to perform the archive rather than a command line tool is to avoid accidentally capturing the (large) .git directory, or any work-in-progress content. For example, if you have just run a build, then zip (tar) will include the content of the build output as well.
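A minimal sketch of that first advantage, assuming a scratch repository created just for the demonstration (the file names Main.java and build/Main.class are made up): an untracked build output sits on disk next to the committed source, and the archive is listed to see which of the two it picks up.

```shell
# Scratch repository; every name below is hypothetical
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "Example"

echo "source" > "$repo/Main.java"
git -C "$repo" add Main.java
git -C "$repo" commit -q -m "Initial"

# Simulate a build: this file is on disk but was never committed
mkdir "$repo/build"
echo "binary" > "$repo/build/Main.class"

# List what git archive would ship for HEAD
archived=$(git -C "$repo" archive --format tar HEAD | tar tf -)
echo "$archived"
```

Only the committed file should appear in the listing; neither the build directory nor .git is part of the tree that HEAD points to.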

Another advantage is that you can extract the content of the repository at an arbitrary revision. Whilst HEAD is used by default, you can put in any tree or tag in the extraction, which makes it useful for being able to generate a source tarball from a given tag (even if that tree doesn't happen to be the default). For example, let's say we wanted to generate a source bundle from the EGit repository:


(master) $ git archive --format tar v1.0.0.201106090707-r | gzip -9 > /tmp/egit-v1.0.0.tgz
(master) $ tar tzf /tmp/egit-v1.0.0.tgz | head
.eclipse_iplog
.gitattributes
EGIT_INSTALL
LICENSE
README
SUBMITTING_PATCHES
org.eclipse.egit-feature/
org.eclipse.egit-feature/.gitignore
org.eclipse.egit-feature/.project
org.eclipse.egit-feature/.settings/

This feature is used when browsing the contents of a repository via cgit. It's possible to click on any link (commit or branch) and download a tgz of the repository as it was at that point. All of this is powered by git archive. In fact, you can create an archive from a remote repository without needing an explicit clone, though it's worth noting that most http repositories don't support this.


(master) $ git archive --format tar --remote ssh://server.org/path/to/git HEAD | gzip -9 > /tmp/remotearchive.tgz

Finally, it's possible to extract only a subset of files rather than the whole repository. If you wanted to generate only the docs for a project, and they were all present in the docs/ folder, then you could create an archive just containing that with:


(master) $ git archive --format tar HEAD docs | gzip -9 > /tmp/docs.tgz

It's fairly common that git describe will be used in conjunction with git archive in creating the name of the output file, and optionally, the global prefix to put in the compressed archive output as well:


(master) $ NAME=project-`git describe`
(master) $ git archive --format tar --prefix=${NAME}/ HEAD docs | gzip -9 > ${NAME}-docs.tgz

Come back next week for another instalment in the Git Tip of the Week series.

Monday, September 26, 2011

OSGi Community Event 2011 - OSGi, Past Present and Future Keynote

As I mentioned a couple of weeks ago, last week I gave a keynote at the OSGi Community Event. During the event I unveiled the Omega problem, which is the condition whereby projects are given inexplicable and unmemorable Greek letters, which unnecessarily interferes with getting started with OSGi.

However, that wasn't the only focus of the event. I also highlighted p2's inefficient repository mechanism. Using figures derived from the Eclipse 3.7.0 platform update site, I presented the following chart:

These figures are calculated from the p2 content JAR located at http://download.eclipse.org/eclipse/updates/3.7/R-3.7-201106131736/content.jar. It's worth noting that this file alone is 354818 bytes (347K), and that now 3.7.1 has been released, http://download.eclipse.org/eclipse/updates/3.7/R-3.7.1-201109091335/content.jar adds a further 361789 bytes (353K) of content that has to be downloaded each time you update Eclipse.

Of those files, the majority of the data is worthless. Approximately 3/5 of the content (or 60%) is an appendix: present but entirely useless. Only 2/5 of the data contains useful information.

The problems can be boiled down to:

  • Multiple redundant copies of the full text of the EPL license; in the 3.7.0 release, 101 copies alone (and thus, the 3.7.1 release adds another 101 copies)
  • Pretty printing of the XML file, when it's supposed to be parsed programmatically
  • Unnecessary data describing how many child nodes a tree element has, when the tree already has that property

None of these are new problems; they were first reported back in September 2010 (and again in June 2011). So when you're updating to Eclipse 3.7.1, the reason you have to download 700K is because of the way the update mechanism was designed.

Raw data

The data was calculated by stripping the file of unnecessary whitespace (c.f. license/copyright, size), and recompressing the JAR file. Differences between the compressed (JAR) file sizes were reported in bytes. For comparison, the same content files were compressed with GZip to compare against corresponding JAR file waste, not shown above.

  • Bundle data: 148666 bytes (JAR) – 143672 (GZip)
  • License/Copyright text: 127388 bytes (JAR) – 127799 (GZip)
  • Whitespace: 81303 bytes (JAR) – 59611 (GZip)
  • Size: 4432 bytes (JAR) – 4146 (GZip)
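
As a quick check on the "3/5 useless, 2/5 useful" split, the JAR figures can be added up in the shell. This is only a sketch of the arithmetic: the variable names are mine, and it treats everything except the bundle data as waste.

```shell
# JAR-compressed sizes in bytes, copied from the list above
bundle=148666
license=127388
whitespace=81303
size=4432

total=$((bundle + license + whitespace + size))
waste=$((license + whitespace + size))

# Integer percentages, rounded to the nearest whole number
useful_pct=$(( (bundle * 100 + total / 2) / total ))
waste_pct=$(( (waste * 100 + total / 2) / total ))

echo "total:  $total bytes"
echo "useful: $bundle bytes ($useful_pct%)"
echo "waste:  $waste bytes ($waste_pct%)"
```

The four figures sum to 361789 bytes, with the bundle data at roughly 41% and the rest at roughly 59%, which is where the approximate 2/5 and 3/5 come from.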

Help me OBR, you're my only hope

The main reason for mentioning these limitations in p2 is the fact that OBR is due out as part of next year's Enterprise OSGi release. It is my firm hope that these issues get fixed, and that the spec mandates decent compression (i.e. using GZip instead of JAR format) combined with mandating the generation of XML files without excessive whitespace.

Anyway, you can read all about it yourself; the slides are available on SlideShare.net, or if you want, you can watch a video of the slides and me talking over them on YouTube. All of the other presentations are available, and you can find my write-up on InfoQ if you want to read more.

Wednesday, September 21, 2011

The Omega Problem

What is the Ω problem? It's a problem caused by the use of Greek letters and astrological signs to refer to OSGi projects (or ΩSGi projects), thereby hiding the purpose of the project itself behind a name that is likely to be both unfamiliar and eminently forgettable.

There are certainly Java projects which have a good naming convention, such as the Spring project set and parts of Apache, where the name of the project says immediately what it is for and is also easy to discuss in community forums, on project mailing lists and on CVs.

  • Spring Data
  • Spring Web Services
  • Spring LDAP
  • Spring Social
  • Spring Batch
  • Apache Commons Lang
  • Apache Commons IO
  • Apache Directory Server
  • Apache RegExp

Then there are projects whose name you're not even sure how to spell, let alone remember when you're trying to advertise the benefits to muggles. And since they're all so similar, you often end up confusing one for another:

  • Eclipse Epsilon
  • Eclipse Gemini
  • Eclipse Libra
  • Eclipse Virgo
  • Eclipse Orion
  • Eclipse Orbit
  • Apache Aries
  • Apache Gogo
  • Apache Karaf
  • Apache Sigil
  • Apache Tuscany

About the only thing you can take away from this list is that both Apache and Eclipse make some weird sounding project names.

The problem is that in order to progress with OSGi in an Enterprise world, you need implementations of some of the Enterprise services. Whilst the Core services are all part of the standard release (e.g. Equinox, Felix), the Enterprise services are extras that are required in order to do anything but are not shipped by default. So, to talk to a database you need the Enterprise JDBC service, the Enterprise JNDI service and quite probably the Enterprise JPA and JTA services as well.

If you buy into Big OSGi Servers (like WebSphere) then it's quite likely that all of these will be present and Just Work™. But if you're trying to kick start a project in Felix or Equinox (or Knopflerfish or ProSyst or …) then knowing which magic set of bundles is required is the key.

In an ideal world, these bundles would already be grouped together and work in an out-of-the-box package. The fact that we have so many bundles to pull together – assuming you can remember the names – is one of the reasons why Enterprise OSGi isn't as used as it could be.

For Enterprise OSGi to be successful, it needs to be as easy as when the likes of Tomcat servers provide a JNDI mapping for your database, and can surface that to a Servlet looking up a jdbc/DataSource. Ideally, this wouldn't need code changes to work for non-OSGi code, in order to allow the migration of existing JPA-enabled persistence units and databases configured externally to the bundles that ultimately get wired to them. Anything more complex and developers are likely to shy away from using OSGi and be stuck with Hibernate and Tomcat through ease of use alone.

The Omega Problem was coined at the OSGi Community Event 2011. It uses a Greek letter for irony, but also because it fits with the ΩSGi name as well. Finally, it is a nod to the footnote in The Last Continent, Page 74, which can be seen here, and is a parody of the Mad Lib Thriller Title genre. There's even a site which helps you make your own.

Tuesday, September 20, 2011

Git Tip of the Week: Packfiles redux

This week's Git Tip of the Week is about packfiles. You can subscribe to the feed if you want to receive new instalments automatically.


When Git compresses files, it does so using pack files, which are collections of objects compressed into a single file.

One key difference between Mercurial and Git is the use of the on-disk storage. Mercurial uses a per-file storage model, where all of the history about that file is stored within that one unit. This is similar to CVS' use of ,v files to store versioned information about a specific file, although Mercurial's format is much better tuned. Git on the other hand has a logical object model, and whether those objects are in a compressed pack file or 'loose' on disk makes no difference to Git.

This enables Git to repack objects, and to generate new pack files, subsequent to initially creating them. As long as the object is available, it doesn't really matter where it came from – after all, the unique hash will always point to the same content in the object regardless of where it is loaded from. Of course, since a pack file is immutable after creation, any changes are implemented by the creation of new pack files and then disposing of the old ones.

Since the pack files contain many objects, it is possible to perform delta compression against the objects. In other words, if there are five versions of a file, all largely the same but with minor differences, then one version of the file can be stored as is, and just the minor modifications stored for the others. (In Git's pack format, a delta is expressed as instructions that either copy a range of bytes from the base object or insert new bytes; a deletion is simply a range that is not copied.) So although logically the pack file can contain multiple objects, it actually only stores one set of each change.

There are several ways of storing files in the pack file as well. For example, consider a file with the contents 'A' and a subsequent version 'AB'. This can be stored as an A and then +B, or it can be stored as AB and -B. The latter is more performant, since the set of changes involves a deletion of an existing range, so only needs to store the start and length of the deletion occurring. As files grow over time, it also means that the largest file is often stored as-is (typically, the most recent) which gives faster access for that than for previous versions.
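To see which objects ended up stored as deltas, git verify-pack -v can list the contents of a pack. The following is a sketch in a throwaway repository (the file name numbers and the commit messages are made up); whether Git chooses to deltify a particular object is up to its heuristics, so treat the listing as something to inspect rather than a guaranteed result.

```shell
# Throwaway repository with two nearly identical versions of one file
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "Example"

seq 1 2000 > "$repo/numbers"
git -C "$repo" add numbers
git -C "$repo" commit -q -m "First version"

echo "one more line" >> "$repo/numbers"
git -C "$repo" commit -q -a -m "Second version"

# Pack the loose objects, then list every object in the resulting pack
git -C "$repo" gc -q
pack_listing=$(git -C "$repo" verify-pack -v "$repo"/.git/objects/pack/pack-*.idx)
echo "$pack_listing"
```

Each object line shows the hash, type, size, size in the pack and offset; an object stored as a delta has two extra columns, the delta depth and the hash of its base object.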

Clones and fetches

The other time pack files get created is when you clone or fetch/pull from a Git repository. The 'smart http' protocol actually uses the Git protocol over an HTTP wrapped connection; what happens is there is a bit of back-and-forth in deciding what hashes you have, and what you want. Ultimately, this conversation results in the server knowing what sets of objects to send to the client.

The most efficient mechanism for sending results back to the client is in the form of a pack file. The server knows what to send back, so it builds a pack file containing just those objects. This will involve compressing (both delta compression as well as deflate compression subsequently) and then sending the objects back. You even see this in the chatter that is sent in the message channel when you clone or update the repository:


(master) $ git pull
remote: Counting objects: 1177, done.
remote: Compressing objects: 100% (352/352), done.
remote: Total 1018 (delta 609), reused 787 (delta 390)
Receiving objects: 100% (1018/1018), 378.46 KiB | 206 KiB/s, done.
Resolving deltas: 100% (609/609), completed with 110 local objects.

The first line is the server counting the number of objects (blobs, trees, commits etc.) that are needed as part of the object set. Roughly this is doing the same kind of work as a git rev-list might do; in other words, it's figuring out the set of objects necessary to send.

The second and third lines are the server saying it's now compressing those objects into a pack file. This is done on the server side before it can start sending any data, as the pack file itself isn't written to in a streaming fashion. In addition, it can write out either full objects or delta objects. (Normally a pack file is self-contained; i.e. it will only create deltas against other objects that are in the same pack file. For fetches, the pack file may store references to objects that the client already has, but without copying the object in the pack file itself.)

The fourth line is the client receiving the pack file from the server, once the server has created it. Metadata in the pack file gives an indication of how far it's gone through, which is why you can get an update occurring as the file is received.

The fifth and final line is the client assembling the data and closing any of the loops that might be present (i.e. deltas that refer to objects it already has locally). It will also generate an index file of the objects, since it can be calculated from the SHA-1 hashes of the objects themselves.

All of this put together means that when you clone a project for the first time, you end up with a large pack file that contains everything. Subsequent updates can bring down smaller pack files which can build up on the files before. At any time (or when you run git gc manually) these pack files can be reorganised into more appropriate storage units; for example, condensing a number of pack files into one or a few pack files. Since this operation is persistent (i.e. the large pack file will retain its compression behaviour) it may be beneficial to compress a repository periodically with git gc --aggressive.
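One way to watch this consolidation happen is git count-objects -v, which reports how many pack files a repository currently has. The sketch below uses a scratch repository and plain git repack to stand in for the small packs that successive fetches would bring down; the file name and commit messages are invented for the example.

```shell
# Scratch repository; names are hypothetical
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "Example"

echo "one" > "$repo/file"
git -C "$repo" add file
git -C "$repo" commit -q -m "First"
git -C "$repo" repack -q      # first small pack

echo "two" > "$repo/file"
git -C "$repo" commit -q -a -m "Second"
git -C "$repo" repack -q      # second small pack, holding only the new objects

packs_before=$(git -C "$repo" count-objects -v | awk '/^packs:/ {print $2}')

# gc condenses the existing packs into one
git -C "$repo" gc -q
packs_after=$(git -C "$repo" count-objects -v | awk '/^packs:/ {print $2}')

echo "packs before gc: $packs_before, after gc: $packs_after"
```

The same count-objects call on a long-lived clone gives a rough idea of whether a git gc is overdue.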

Summary

The Git pack file is the unit which makes object storage efficient, and also supports the efficient transfer of data between client and server during either an initial clone or a subsequent fetch/pull operation. Although pack files are immutable once created, they can be re-created with more compressed content on an as needed basis.


Come back next week for another instalment in the Git Tip of the Week series.

Tuesday, September 13, 2011

Git Tip of the Week: Objects and Packfiles

This week's Git Tip of the Week is about objects and packs. You can subscribe to the feed if you want to receive new instalments automatically.


So far, we've talked about commits, trees and blobs. We've seen how they bind to the logical object model as well as being represented on disk in the .git/objects directory.

But storing every version of every file in separate files (albeit compressed) is going to be a huge waste of space, right? Yes, there's some sharing of identical content between commits, but Git would hardly be the efficient store that it's known for with a storage structure like that.

Pack files

Fortunately, Git has the ability to merge together multiple objects into single files, known as pack files. These are, in essence, multiple objects stored with an efficient delta compression scheme as a single compressed file. You can think of it as akin to a Zip file of multiple objects, which Git can extract efficiently when needed.

Pack files are stored in the .git/objects/pack/ directory. For new projects, this is likely to be empty; what happens is that Git starts off adding all files as non-packed objects, or loose objects. One of the reasons it does this is because as you're working through changes, you're quite likely to re-write various files (blobs) and directories (trees) before you commit. In fact, each time you do a git add to stage a file, you're creating a new object in the loose objects structure.

What happens is that periodically (or on user demand), Git will run a compression on the loose objects. This is triggered either by a git gc request, or automatically after various thresholds have been met. Git will then create the pack file and remove the loose object files.


(master) $ touch empty
(master) $ git add empty
(master) $ git commit -m "Empty"
[master (root-commit) cab1545] Empty
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 empty
(master) $ ls .git/objects/
41	ca	e6	info	pack
(master) $ ls .git/objects/pack/

You may recognise the 'e6' directory as being the prefix of the empty file in Git, which we covered earlier and is identified by e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. However, at this stage, there's no content in the pack directory. What happens if we pack it?


(master) $ git gc
Counting objects: 3, done.
Writing objects: 100% (3/3), done.
Total 3 (delta 0), reused 0 (delta 0)
(master) $ ls .git/objects/
info	pack
(master) $ ls .git/objects/pack/
pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.idx
pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.pack

Where did the objects go? Well, they've been compressed into a single read-only pack file. We can still address them using their hash, even if they're not loose files any more:


(master) $ git cat-file -t e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
blob
(master) $ git cat-file -s e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
0

The pack file on disk is smaller than the set of loose files on their own (though in trivial examples like this, there isn't that much difference between them). The pack is actually made up of two files: the index (.idx) and the pack (.pack). Whilst the latter stores data, the former stores a table-of-contents list of objects contained within the pack itself:


(master) $ hexdump .git/objects/pack/pack-0c1ff4e31096ebcdb390b30ebe763ae15de650eb.idx
0000000 ff 74 4f 63 00 00 00 02 00 00 00 00 00 00 00 00
0000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 01
0000110 00 00 00 01 00 00 00 01 00 00 00 01 00 00 00 01
0000330 00 00 00 02 00 00 00 02 00 00 00 02 00 00 00 02
00003a0 00 00 00 03 00 00 00 03 00 00 00 03 00 00 00 03
0000400 00 00 00 03 00 00 00 03 41 7c 01 c8 79 5a 35 b8
0000410 e8 35 11 3a 85 a5 c0 c1 c7 7f 67 fb ca b1 54 54
0000420 f6 75 88 fb 81 27 96 75 09 38 77 09 7a 75 21 de
0000430 e6 9d e2 9b b2 d1 d6 43 4b 8b 29 ae 77 5a d8 c2
0000440 e4 8c 53 91 7d 73 67 fc 61 d0 d2 e8 6e 76 00 29
0000450 00 00 00 85 00 00 00 0c 00 00 00 b1 f5 5c c2 b5
0000460 8e 21 45 55 02 06 88 64 e9 1b 8b 52 75 c4 46 3d
0000470 57 df 0b ca 6a 8f f3 57 6c d4 97 78 df 30 1d bc
0000480 4d 24 1e a4                                    
0000484

You'll recognise in the hex dump of the index the 'empty object' stored in Git (e69d…5391), along with the tree containing the empty file (417c…67fb).

The purpose of the index file is really a marker to tell Git that the corresponding object is in this pack file. In this case, we've only got one pack file but large repositories will have many such files. The index allows Git to load many small files to determine the answer to “Where are these objects?” so that it can extract them in the most efficient manner.
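Rather than reading the hex dump by eye, the index can be decoded with git show-index, which reads an .idx file on standard input and prints one line per object. This sketch rebuilds the empty-file example in a scratch repository (the temporary directory is the only thing not taken from the text above).

```shell
# Recreate the single empty-file commit in a scratch repository
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email "you@example.com"
git -C "$repo" config user.name "Example"

touch "$repo/empty"
git -C "$repo" add empty
git -C "$repo" commit -q -m "Empty"
git -C "$repo" gc -q

# Decode every pack index: one line per object with its offset and hash
index_listing=$(for idx in "$repo"/.git/objects/pack/pack-*.idx; do
    git -C "$repo" show-index < "$idx"
done)
echo "$index_listing"
```

The listing should contain three objects (blob, tree and commit), including the empty blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391; the commit hash will differ from the one shown earlier because it depends on the author and timestamp.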

Summary

Whilst Git stores objects in loose form whilst you work on new changes, it will compress them into pack files to take greater advantage of delta compression. This happens when you run a git gc or when various thresholds are met automatically. It also explains why Git's storage requirements follow a sawtooth-like structure; each time the ramp goes up, it's because new objects are being created, and each time it goes down, it's because a pack has been run and new pack files have been created (along with the corresponding loose objects being deleted).


Come back next week for another instalment in the Git Tip of the Week series.

Monday, September 12, 2011

Speaking at OSGi Community Event

I'll be giving a keynote at the OSGi Community Event in Darmstadt, Germany next week, entitled OSGi: Past, Present and Future. In the talk, I'll do a retrospective on how we got where we are today, a view of what is happening in today's modular environment, and a peek at the kind of challenges that OSGi will face in the future.

The full agenda is available, which brings many from the OSGi community together to look at upcoming specifications, such as subsystems, presence in the cloud and even modular EJBs. There's also a lot of information on tools, such as BndTools, The BundleMaker and Eclipse Virgo – as well as best practices such as μServices and OSGi anti-patterns. There are too many to mention individually, so I encourage you to view the full agenda for more details.

The OSGi Community Event is 20th and 21st September (i.e., next week) and electronic registration is open until Friday. Hope to see you there!

Tuesday, September 06, 2011

Git Tip of the Week: Commits

This week's Git Tip of the Week is about git commit storage. You can subscribe to the feed if you want to receive new instalments automatically.


Last week we looked at the way trees are stored in Git (and the week before how objects are stored in Git). We're now going to see how those are hooked up to commits, which are the basis of branches, tags and the like. Here's an example commit:


(master) $ git cat-file -p HEAD
tree 2b61e34a91ca9780ea2f943e72f1a4a022cdd206
parent f44c95384463187acd83ff418ddd9c48659db8dd
author Alex Blewitt <alex.blewitt@gmail.com> 1314178977 +0100
committer Alex Blewitt <alex.blewitt@gmail.com> 1314178977 +0100

Another empty
(master) $ git rev-parse HEAD
ca5fc4f022595972639331adcab40d810b9882a0

It's not going to come as a surprise that a commit is a hashed object, stored using exactly the same mechanisms as blobs and trees. A commit's identifier is a hash of the commit's content (the tree, parent, author and committer lines plus the message), prefixed with an identifying type and length (as for blobs and trees). In this case, the commit content is 236 bytes long, so we write out commit 236\0 followed by the content, and show the hash:


(master) $ (echo -en "commit 236\0"; git cat-file -p HEAD) | shasum
ca5fc4f022595972639331adcab40d810b9882a0  -
(master) $ # Or, we can use this to find the size automatically:
(master) $ (echo -en "commit $((`git cat-file -p HEAD | wc -c`))\0"; →
 git cat-file -p HEAD) | shasum
ca5fc4f022595972639331adcab40d810b9882a0  -

So, given this knowledge, we can create a new commit all of our own. All we need to do is to refer to a tree (such as d2d6bbd1c25c154fcbb045d66e8a6f9b83587a68 from last time), refer to the HEAD as the parent, and add in some timestamp information.


(master) $ # TIMENOW=`date +%s`
(master) $ TIMENOW=1314385772
(master) $ echo -en "tree d2d6bbd1c25c154fcbb045d66e8a6f9b83587a68\n→
parent ca5fc4f022595972639331adcab40d810b9882a0\n→
author Alex Blewitt <alex.blewitt@gmail.com> $TIMENOW +0100\n→
committer Alex Blewitt <alex.blewitt@gmail.com> $TIMENOW +0100\n→
\n→
Manually generated commit" | git hash-object -w --stdin -t commit
195751d8f0822325eb3f234de9c0e720ae53d8ff

We've created our first (manually generated) commit, and it points to the tree from last time. Since all is now well, we should be able to check this commit out:


(master) $ git checkout 195751d8f0822325eb3f234de9c0e720ae53d8ff
Note: checking out '195751d8f0822325eb3f234de9c0e720ae53d8ff'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 195751d... Manually generated commit
((195751d...)) $ ls
anotherEmpty	empty		void
((195751d...)) $ git diff HEAD^
diff --git a/void b/void
new file mode 100644
index 0000000..e69de29

That represents the committed tree which we wrote last time. We can even diff against the previous version to find out that the new file is indeed the void that we added previously.

Now we've got the ability to create our own commits, we can take a deeper look into Git's storage structure next time.


Come back next week for another instalment in the Git Tip of the Week series.