AlBlue’s Blog

Macs, Modularity and More

XML as a top level MIME type

Rant 2005 Xml

One of the most frustrating things about XML is how poorly it is captured by either a MIME type or a standard file extension. XML is clearly the de facto standard for representing structured data that needs to be processed by both humans and computers, yet the file extensions and MIME types used for it still vary wildly.

For example, the XML-ised version of HTML, called XHTML, can be served using the MIME types text/html and application/xhtml+xml, as well as text/xml and application/xml. At least it has its own file extension, .xhtml, that can be used to denote files.

Sidenote: MIME stands for Multipurpose Internet Mail Extensions, and was originally used to describe what kinds of documents were being attached to mail messages sent between systems that may not share the same file extensions. Most webservers have pre-defined mappings between file extensions and MIME types; these are recorded in /etc/mime.types on Unix systems, and in the 'File Types' settings visible in Windows.

MIME types are officially defined in RFC 2046, which defines the initial top-level types as:

text: textual information
image: image data
audio: audio data
video: video data
application: any application-specific data that doesn't fall into the above categories
multipart: an encoding that allows multiple items, potentially of different types, to be concatenated together (this is how mail messages with attachments are sent)
message: an e-mail message, mostly used with the rfc822 subtype

Each of these top-level types has a number of subtypes, such as text/html, text/xml and text/plain, that are dependent on the top-level type.
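To make the type/subtype split concrete, here is a minimal Python sketch that breaks a MIME type string into its two halves. The names `MediaType`, `parse_media_type` and `RFC_2046_TOP_LEVEL` are made up for this illustration and do not come from any library.

```python
from typing import NamedTuple


class MediaType(NamedTuple):
    top_level: str
    subtype: str


# The seven initial top-level types listed in RFC 2046.
RFC_2046_TOP_LEVEL = frozenset(
    {"text", "image", "audio", "video", "application", "multipart", "message"}
)


def parse_media_type(value: str) -> MediaType:
    """Split 'type/subtype; params' into its top-level type and subtype."""
    # Drop any parameters such as '; charset=utf-8' and normalise case.
    essence = value.split(";", 1)[0].strip().lower()
    top_level, sep, subtype = essence.partition("/")
    if not sep or not top_level or not subtype:
        raise ValueError(f"not a valid media type: {value!r}")
    return MediaType(top_level, subtype)
```

For instance, `parse_media_type("text/html; charset=UTF-8")` yields the top-level type `text` and the subtype `html`; the top-level part is the only piece a processor can rely on without knowing the specific subtype.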

The authors note that "It should be noted that the list of media type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially."

Coming back to XML, even for XHTML documents, there's still a number of potential MIME types that can be used (mostly to do with backwards compatibility). And this has started a disturbing trend for XML documents to have either xml+ or +xml in their MIME type. As a result, you have a number of different types of XML document, such as image/svg+xml and application/xml+html. Thus, without prior knowledge of the specific type, there's no easy way to tell whether a document is an XML one (and thus should be treated as text) or a binary one.
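To illustrate how awkward that detection is today, here is a rough Python sketch of the check a processor has to make. The function name `looks_like_xml` is hypothetical, and the rules are only the two obvious ones (the generic XML types and the +xml suffix convention); it is not a complete or authoritative test.

```python
def looks_like_xml(media_type: str) -> bool:
    """Best-effort guess at whether a MIME type denotes an XML document."""
    # Ignore parameters such as '; charset=utf-8' and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    _, _, subtype = essence.partition("/")

    # Case 1: the two generic XML types.
    if essence in ("text/xml", "application/xml"):
        return True

    # Case 2: the '+xml' suffix convention, e.g. image/svg+xml.
    return subtype.endswith("+xml")
```

Note that XHTML served as text/html slips straight through: `looks_like_xml("text/html")` is false, even though the payload may be perfectly good XML. With a dedicated top-level type, the whole function would collapse to a single comparison on the part before the slash.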

Even though RFC 3023 explicitly disses the possibility, it would make far more sense to define a top-level MIME type to encapsulate XML-encoded documents. This is at least as sensible as breaking down documents into 'text', 'image', 'video', 'audio' and 'everything else', where 'everything else' seems to get used by pretty much everything. If we had an 'xml' major type, we could allow processors to know that they were about to receive XML, and do things like character set negotiation (though UTF-8 would be the default) and structural checking, even if the actual validation against the DTD/schema may not be done. Part of the argument for not adding an 'xml' type, "because it would break existing stuff", isn't tremendously valid; that is a recipe for killing innovation, and the originators of RFC 2046 explicitly intended for future top-level types to be created. Their second point, that the MIME type describes the document type and not its syntax, isn't tremendously relevant either; after all, we can have plain ASCII images (which are currently described as text/plain instead of image/ASCII), and there's even a Star Wars video in ASCII. And without the top-level type, everything is just dumped in the application/ top-level type anyway.

So here's an example where having a top-level type would help. Maybe some specific cases (like image/svg+xml) would not necessarily fit into the xml/ top-level type, but there are really very few kinds of data that couldn't. Indeed, the thousands of other types that are created daily by businesses would have an ideal fit in the newly created top-level type.

This brings me to my final rant about XML documents. Why do they all end in .xml? I mean, XML is the encoding type, much like the character set; imagine a whole bunch of documents being labelled .ascii. For example, most RSS feeds end in .xml (and even have the audacity to believe that they are the only type of XML to have their own image, the XML logo used by RSS feeds), despite the fact that the file is an RSS document. At least some of the early adopters of Atom end theirs with .atom (though at least one generator only produces feeds called atom.xml). And don't get me started on Ant files. If they were all called build.ant instead of build.xml, then it would be trivial to search for a build file (or even have many different build files) which otherwise might not be trivially distinguishable from other .xml files. The same is true of other kinds of XML file, though at least .xsl has the decency to use its own file extension, even if it does mix the XSLT and XSL-FO types.

So, I propose a new MIME type and a new approach to naming files. Specifically, I would ban the use of .xml as an extension, along with any xml+ or +xml MIME types. Instead, we would have:

Application | File extension | MIME type
XSL Formatting Objects | .xslfo | xml/xslfo

Note: no-one uses these file types. I wish they would. But if you've googled and come up with this list, it's not actually standardised (yet).
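As a rough sketch of what tooling could do under this proposal, the Python snippet below registers the (non-standard) xml/xslfo type with a private registry from the standard `mimetypes` module and then checks for XML by looking only at the top-level type. The `registry` variable, the `is_xml` helper and the file name `report.xslfo` are made up for the example, and xml/xslfo is the proposed type from the table, not a registered one.

```python
import mimetypes

# A private registry, so the proposed mapping stays out of the global table.
registry = mimetypes.MimeTypes()

# Proposed, non-standard mapping taken from the table above.
registry.add_type("xml/xslfo", ".xslfo")


def is_xml(media_type: str) -> bool:
    """Under the proposal, XML detection is a single top-level comparison."""
    return media_type.lower().startswith("xml/")


# Look up a hypothetical file name by its extension.
guessed_type, _encoding = registry.guess_type("report.xslfo")
```

Here `guessed_type` comes back as the proposed xml/xslfo, and `is_xml(guessed_type)` holds without any knowledge of what XSL Formatting Objects actually are.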

Backwards compatibility is good for most things. But it needs to be tempered; restricting everything to be backwards compatible forever stifles innovation.