Tuesday, December 16, 2008

Does XML really matter?

There is a new burst of enthusiasm for XML amongst book publishers. Mike Shatzkin, who often has cogent things to say, has produced a little encomium for XML in Publishers Weekly.

Here's what we call the Copernican Change. We have lived all our lives in a universe where the book is “the sun” and everything else we might create or sell was a “subsidiary right” to the book, revolving around that sun.

In our new universe, the content encased in a well-formed XML file is the sun. The book, an output of a well-formed XML file, is only one of an increasing number of revenue opportunities and marketing opportunities revolving around it. It requires more discipline and attention to the rules to create a well-formed XML file than it did to create a book. But when you're done, the end result is more useful: content can be rendered many different ways and cleaved and recombined inexpensively, unlocking sales that are almost impossible to capture cost-effectively if you start with a “book.”

What the Hell Is XML? Publishers Weekly, 15 Dec 08

At the risk of being taken to be the kind of oaf who burps loudly in the presence of royalty (questioning the supreme value of XML is a bit like breathing garlic all over her majesty), I am inclined to pour cold water over this.

XML has been with us for 10 years. It certainly has its uses, especially in managing large, complex texts and integrating text databases. But XML has not been, and is not, the be-all and end-all of digital publishing. XML is a property of texts, a style of handling them for flexible representation. In the last five years (especially since Google Book Search started motoring) it has become increasingly apparent that the book-as-book is the critical output of book publishers. Indeed, PDFs are still a crucial component of the book publishing process, and for many of the most useful applications of the digital book the PDF file is the crucial starting point. Copernicus, after all, was right: the sun is the centre of the solar system. Books really do matter, and they are at the centre of the GBS system.

In one crucial respect XML has been, and is, a damagingly misleading tool for publishers (as deleterious in its effects on newspapers and magazines as on books): it has encouraged the mistaken view that text objects can only be used on the web if they are repurposed. XML was invented primarily because it was seen as a flexible way of 'marking up' the incredibly diverse world of print in ways that could be reconciled with HTML and the web. Everything printed would be repurposed for the web, and XML would facilitate this step. This now looks like it may not be an efficient way to look at things. Google Book Search and other digital representation platforms are showing us that repurposing a book or a magazine is not necessary and usually results in the loss of important information. It is certainly a mistake to suppose that XML is necessary if books are to be used effectively on the web or in databases, as Google Book Search, the largest print database, demonstrates. Above all, XML, and any particular implementation of XML, is only as good as the design for which it was crafted. XML is not future-proof, and it is highly misleading of Shatzkin to recommend:

"You'll save the most money right away if you create many books that are similar in structure and thus can be rendered from the same “style sheet.”"

Books should only be similar in structure, and their texts should only share the same style sheet, if they are similar in purpose. A rigid XML style sheet for the whole of a publisher's list is, for many publishers, a lousy idea. Designing, or selecting, your books to fit your style sheet is putting the cart before the horse.


Mark said...

Again, Adam, you challenge and confound. I think at the end of the day the only thing that is truly future-proof is text-as-txt. A scanned book needs to be OCR'd, and that introduces error. PDFs are representations of text, not text itself, so you are doing no one any favors by locking it up in a container. XML is not perfect, but at least it is extensible. I agree with you about the style sheet bit, but publishers should at least standardize on one schema.

Adrian said...

I am not sure your counter-reformation moment here is entirely convincing. Many information sectors that gave up on books (so we don't think about them when these discussions arise) moved a while back; many reference publishers are post-Copernican, and a growing number of academic publishers are finally doing it properly.

Reflowable text and hyperlinking are much easier to do properly when using an XML workflow. Lots of publishers struggle with implementation, but that is just part of the game...