[License-discuss] The underappreciated merits of HTML

Wed Sep 5 02:12:16 UTC 2012

I received two emails today, both referring to HTML, and it seemed to me
that they required only a single answer, so I'm taking the unusual step
of cross-posting to two unrelated lists.  Follow-ups will presumably
land on whichever list you are on.

On public-microxml, Uche Ogbuji wrote:

> I am not so convinced that people will suddenly start using HTML as
> their tag lingua franca in MicroXML.  If they did, they would more
> likely just skip MicroXML altogether and stick to an HTML toolchain.
> I think we can have human-readable documents in the vocab of choice in
> MicroXML and then have them transformed to or dressed up as HTML at
> the edges of the toolchain.  That's the predominant approach today.
> There is very little use of XHTML, even XHTML5.  Data people use XML
> assembled from their DBMS and fling it at XSLT.  Content people use
> richer vocabularies (e.g. DITA, Docbook, etc.), or wizards that do the
> same under the bonnet.

On license-discuss, Larry Rosen wrote:

> [C]onverting to plain text destroys information useful for human
> beings to comprehend the license. It is like removing indentation and
> line endings from source code. Please don't encourage old-fashioned
> ways of representing licenses so they can't be easily read by the
> only ones that matter: Human beings.  This is part of my existential
> battle, including within Apache, to acknowledge that HTML allows for
> a richer vocabulary of expression. Quit down-versioning our creative
> works. :-)

HTML as a format has suffered so dreadfully from its abuse that HTML as a
vocabulary has, I believe, been downgraded as well.  As Uche says, people
with a lot of documents to deal with tend to treat HTML as a pure output
format.  It has become a fundamentally binary format, as uneditable as
PDF and as opaque as Word 97 format, and I think that's really
unfortunate.

This bias is so pervasive that once when I was working on an XML document
format, I suggested the reuse of simple HTML element names like p,
blockquote, em, strong, etc. on the grounds that they would be familiar
to anyone working with the format.  This was immediately shot down by
the rest of the team, on the grounds that the users would assume the
document format was HTML and try to use it as such.

However, they were so vehement about it that I think the unexpressed
subtext was, "If it looks like HTML, the customers will treat us as
HTML monkeys instead of document type designers.  We have to make it
look different so they'll know it's Real XML."  Indeed, I take this
opportunity to praise the DITA creators for having the courage to reuse
HTML names in their document-oriented standard.

Similarly, when I was working at Reuters Health, all our HTML output
was in fact XHTML, so when people asked us for an XML format, I urged
them to get the HTML and feed it into their XML toolchain.  "No, no,
that's HTML; we want XML."  "It *is* XML, well-formed XML, all of it."
"You don't understand.  We want XML, *not* HTML." ~~ /me grinds teeth ~~
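
To make the "it *is* XML" point concrete, here is a minimal sketch in
Python that feeds a small XHTML document to an ordinary XML parser
(the standard library's xml.etree.ElementTree).  The sample document
and the helper name paragraph_texts are invented for illustration;
they are not the actual Reuters Health output.  The sketch also assumes
the input uses no HTML named entities such as &nbsp;, which a plain XML
parser would reject unless the DTD is available.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace; elements in well-formed XHTML live in it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# A made-up sample document, standing in for real XHTML output.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample story</title></head>
  <body>
    <p>First paragraph with <em>emphasis</em>.</p>
    <p>Second paragraph.</p>
  </body>
</html>"""


def paragraph_texts(xhtml_source):
    """Parse XHTML with a generic XML parser and return the text of each p."""
    root = ET.fromstring(xhtml_source)  # raises ParseError if not well-formed
    ns = {"h": XHTML_NS}
    # itertext() flattens inline markup such as em into plain text.
    return ["".join(p.itertext()) for p in root.findall(".//h:p", ns)]


paragraphs = paragraph_texts(doc)
print(paragraphs)
```

Nothing here is HTML-specific: the parser sees only namespaced XML
elements, so the same document could go equally well to XSLT or any
other XML tool.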

I think that one of the things MicroXML may be able to provide
is a revitalization of HTML the vocabulary as a reasonable choice
for the construction and maintenance of straightforward documents.
It's really not so bad for writing simple documents like
software licenses or W3C standards -- indeed, I wrote the XML Infoset
Recommendation entirely in HTML.

Of course, I'm the guy who put together the Itsy Bitsy Teeny
Weeny Simple Hypertext DTD, so you'd expect me to say that.
See http://www.ccil.org/~cowan/ibtwsh6.rnc (or .rng or .dtd).

-- 
There are three kinds of people in the world:   John Cowan
those who can count,                            cowan at ccil.org
and those who can't.