One of the most frustrating things about XML is how poorly it is captured by either a MIME type or a standard file extension. XML is clearly the de facto standard for representing structured data that needs to be processed by both humans and computers, yet the file extensions and MIME types used for it still vary wildly.
For example, the XML-ised version of HTML, called XHTML, can be represented using the MIME types
application/xhtml+xml as well as
application/xml. At least they’ve got their own file extension,
.xhtml, that can be used to denote files.
Sidenote: MIME stands for Multipurpose Internet Mail Extensions, and was originally used to describe what kinds of documents were being attached to mail messages between systems that may not share the same file extension conventions. Most webservers have pre-defined mappings between file types and extensions, and the MIME type is recorded with that mapping as well; in
/etc/mime.types on Unix systems, and in the ‘File Types’ settings visible in Windows.
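To make that mapping concrete, here is a minimal sketch of reading a `mime.types`-style table, where each line names a media type followed by the extensions that map to it. The sample text is illustrative only and is not copied from any real system; the function name `parse_mime_types` is my own.

```python
def parse_mime_types(text):
    """Parse mime.types-style text into an {extension: media type} dict."""
    mapping = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace; skip empty lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # First field is the media type, the rest are extensions (possibly none).
        media_type, *extensions = line.split()
        for ext in extensions:
            # Keep the first mapping seen for each extension.
            mapping.setdefault("." + ext.lower(), media_type)
    return mapping


# Illustrative sample only; a real /etc/mime.types is far longer.
SAMPLE = """\
# media type              extensions
text/plain                txt
application/xhtml+xml     xhtml xht
application/xml           xml xsl
image/svg+xml             svg
"""

EXT_TO_TYPE = parse_mime_types(SAMPLE)
print(EXT_TO_TYPE[".xhtml"])
```

Note that in this sample both `.xml` and `.xsl` collapse onto the same generic media type, which is exactly the kind of information loss this post complains about.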
MIME types are officially defined in RFC 2046, which defines the initial top-level types as:
- text: textual information
- image: image data
- audio: audio data
- video: video data
- application: any application-specific data that doesn’t fall into the above categories
- multipart: an encoding that allows multiple items, potentially of different types, to be concatenated together (this is how mail messages with attachments are sent)
- message: an e-mail message, mostly used with the rfc822 subtype
Each of these top-level types has a number of subtypes, such as
text/plain, that are dependent on the top-level type.
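The type/subtype structure is simple enough to sketch in a few lines. The following is a rough illustration, not a full RFC-compliant parser: it ignores quoting rules in parameters, and the helper names `split_media_type` and `is_rfc2046_top_level` are made up for this example.

```python
# The seven initial top-level types from RFC 2046.
TOP_LEVEL_TYPES = {
    "text", "image", "audio", "video",
    "application", "multipart", "message",
}


def split_media_type(value):
    """Split 'type/subtype; params' into a lower-cased (type, subtype) pair."""
    # Parameters such as charset follow the first semicolon; drop them.
    essence = value.split(";", 1)[0].strip().lower()
    top, sep, sub = essence.partition("/")
    if not sep or not top or not sub:
        raise ValueError(f"not a type/subtype pair: {value!r}")
    return top, sub


def is_rfc2046_top_level(value):
    """True if the top-level type is one of the initial RFC 2046 types."""
    return split_media_type(value)[0] in TOP_LEVEL_TYPES


print(split_media_type("Text/Plain; charset=utf-8"))
```

A hypothetical `xml/rss` value would parse fine with `split_media_type`, but `is_rfc2046_top_level` would reject it, since no `xml` top-level type exists in that initial set.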
The authors note that “It should be noted that the list of media type values given here may be augmented in time, via the mechanisms described above, and that the set of subtypes is expected to grow substantially.”
Coming back to XML: even for XHTML documents, there are still a number of potential MIME types that can be used (mostly to do with backwards compatibility). And this has started a disturbing trend for XML documents to have
+xml in their MIME type. As a result, you have a number of different types of XML document, such as
application/xml+html. Thus, without prior knowledge, there’s no easy way to tell whether a document is an XML one (and thus should be treated as text) or a binary one.
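In practice, a processor that wants to know whether it is about to receive XML has to fall back on a rule of thumb along these lines. This is a sketch of one common heuristic, assuming the two generic XML types plus the `+xml` suffix convention; the function name `looks_like_xml` is hypothetical.

```python
def looks_like_xml(media_type):
    """Heuristic: is this media type one of the generic XML types or a +xml type?"""
    # Ignore parameters (e.g. '; charset=utf-8') and normalise case.
    essence = media_type.split(";", 1)[0].strip().lower()
    if essence in ("text/xml", "application/xml"):
        return True
    # The suffix convention: anything ending in '+xml' is XML underneath.
    return essence.endswith("+xml")


for candidate in ("application/xhtml+xml", "image/svg+xml", "text/html"):
    print(candidate, looks_like_xml(candidate))
```

The check works only for types that follow the suffix convention exactly. A type that puts `xml` anywhere else in its name slips straight through, and the top-level type itself tells you nothing.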
Even though RFC 3023 explicitly disses the possibility, it would make far more sense to define a top-level MIME type to encapsulate XML-encoded documents. This is at least as sensible as breaking down documents into ‘text’, ‘image’, ‘video’, ‘audio’ and ‘everything else’, where ‘everything else’ seems to get used by pretty much everything. If we had an ‘xml’ major type, we could allow processors to know that they were about to receive XML, and do things like character set negotiation (though UTF-8 would be the default) and structural checking, even if the actual validation against the DTD/schema may not be done. Part of the argument for not adding an ‘xml’ type, “because it would break existing stuff”, isn’t tremendously valid; that is a recipe for killing innovation, and the originators of RFC 2046 explicitly intended for future top-level types to be created. Their second point (that the MIME type describes the document type, not its syntax) isn’t tremendously relevant either; after all, we can have plain ASCII images (which are currently described as
text/plain instead of
image/ASCII); there’s even a Star Wars video in ASCII. And without an XML top-level type, everything is just dumped in the
application/ top-level type anyway.
So here’s an example where having a top-level type would help. Maybe some specific cases (like
image/svg+xml) would not necessarily fit into the
xml/ top-level type, but there’s really very little data that couldn’t fit in the
xml/ top-level type. Indeed, the thousands of other types that businesses create daily would be an ideal fit for the newly created top-level type.
This brings me to my final rant about XML documents. Why do they all end in
.xml? I mean, the XML is the encoding type, much like the character set; imagine a whole bunch of documents being labelled
.ascii. For example, most RSS feeds end in
.xml (and even have the audacity to believe that they are the only type of XML to have their own image), despite the fact that they are RSS documents. At least some of the early adopters of Atom end theirs with
.atom (though www.blogger.com only generates feeds called atom.xml). And don’t get me started on Ant files. If they were all called
build.ant instead of
build.xml, then it would be trivial to search for a build file (or even have many different build files) which otherwise might not be trivially distinguishable from other
.xml files. This is especially true of other files, though at least
.xsl has the decency to use its own file type, even if it does mix the
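Because the extension says nothing, tools end up opening each `.xml` file and looking at the root element to find out what it actually is. Here is a small sketch of that workaround using Python's standard `xml.etree.ElementTree`. The root element names are the real ones for these formats, but the lookup table, the labels and the function name `sniff_xml_kind` are just for illustration.

```python
import xml.etree.ElementTree as ET

# Root element (local name) -> human-readable label used in this sketch.
ROOT_TO_KIND = {
    "rss": "RSS feed",
    "feed": "Atom feed",
    "project": "Ant build file",
    "stylesheet": "XSLT stylesheet",
}


def sniff_xml_kind(xml_text):
    """Guess what an XML document is from the name of its root element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as '{namespace}local'; keep the local part.
    local_name = root.tag.rsplit("}", 1)[-1]
    return ROOT_TO_KIND.get(local_name, "unknown XML")


print(sniff_xml_kind('<project name="demo" default="build"/>'))
print(sniff_xml_kind('<rss version="2.0"><channel/></rss>'))
```

This requires parsing every candidate file, and matching on the local name alone is fragile: any other format that also uses `project` as its root element would be misreported as an Ant build file. A distinct extension would make the question unnecessary.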
So, I propose a new MIME type and a new approach to naming files. Specifically, I would ban the use of
.xml as an extension, and any
+xml MIME types. Instead, we would have:
|Application||File extension||MIME type|
|Note: no-one uses these file types. I wish they would. But if you’ve googled and come up with this list, it’s not actually standardised (yet)|
|XSL Formatting Objects|
Backwards compatibility is good for most things. But it needs to be tempered; restricting everything to be backwards compatible forever stifles innovation.