It’s been 14 years since Tim Berners-Lee first proposed the hypertext model that eventually spawned HTML and the Web, and some developers think it’s time to move away from HTML-based information. The author of Dreamweaver MX e-Learning Toolkit explains, with concrete examples, why HTML isn't good enough anymore and why XML and other markup languages need to replace HTML.
In the spring of 1989, Tim Berners-Lee, a software engineer at CERN (the world's largest particle physics lab, based in Europe), proposed a global hypertext system to allow researchers from around the world to easily share information using networked computers. Berners-Lee authored a document titled Information Management: A Proposal, in which he proposed that CERN "work toward a universal linked information system."
Berners-Lee got approval to work on developing such a system, which he initially named "Mesh." Over two years, he developed a set of rules for delivering such information via the Internet called HTTP (HyperText Transfer Protocol), and a markup language for encoding that content called HTML (HyperText Markup Language).
By the summer of 1991, "Mesh" (now delivered as software he named WorldWideWeb) was made available on the burgeoning Internet. In 1994, Berners-Lee left CERN to found the World Wide Web Consortium (www.w3.org) at MIT. It’s this consortium that’s chartered with defining specifications for the Web, including the HTML markup language and the HTTP protocol.
You can see Berners-Lee's original proposal to CERN at www.w3.org/History/1989/proposal.rtf.
Basics of HTML
When Berners-Lee initially envisioned the Web, he certainly didn't foresee the omnipresent entity it eventually became. His initial proposal for the HTML markup language consisted of a simple set of elements (tags) that would encapsulate and define a document's content by its structure.
A few examples of structural elements for content include paragraphs, headings, lists, images, and, most importantly for the Web, hyperlinks. To understand how HTML operates, let's take a look at a sample paragraph and numbered list from A History of Science by Henry Smith Williams (published in 1904):
Receptivity is the first prerequisite to progressive thinking, and that Thales reached out after and imbibed portions of Oriental wisdom argues in itself for the creative character of his genius. Whether borrower or originator, however, Thales is credited with the expression of the following geometrical truths:
1. That the circle is bisected by its diameter.
2. That the angles at the base of an isosceles triangle are equal.
3. That when two straight lines cut each other the vertical opposite angles are equal.
4. That the angle in a semicircle is a right angle.
5. That one side and one acute angle of a right-angle triangle determine the other sides of the triangle.
To identify the paragraph, HTML requires you to encapsulate that content with a tag that identifies the beginning of the paragraph (<P>), and another tag that identifies the end of the paragraph (</P>), as follows:
<P>Receptivity is the first prerequisite to progressive thinking, and that Thales reached out after and imbibed portions of Oriental wisdom argues in itself for the creative character of his genius. Whether borrower or originator, however, Thales is credited with the expression of the following geometrical truths:</P>
To identify the numbered list, HTML requires you to encapsulate that content with a tag that identifies the beginning of an ordered list (<OL>), and another tag that identifies the end of that ordered list (</OL>). Further, HTML requires you to separately identify the beginning of each list item (<LI>) and the end of each list item (</LI>), as follows:
<OL>
<LI>That the circle is bisected by its diameter.</LI>
<LI>That the angles at the base of an isosceles triangle are equal.</LI>
<LI>That when two straight lines cut each other the vertical opposite angles are equal.</LI>
<LI>That the angle in a semicircle is a right angle.</LI>
<LI>That one side and one acute angle of a right-angle triangle determine the other sides of the triangle.</LI>
</OL>
HTML typically delineates each structural element of an HTML document using container elements as shown above. That is, an element "contains" the content by encapsulating it with begin and end tags for the element.
HTML also includes a few empty elements, which typically insert external content or format instructions at their exact location in the HTML file, and as such, they don’t require an ending tag. Examples of empty elements include:
images, which are inserted using the empty element <IMG>
horizontal rules, which are inserted using the empty element <HR>
line breaks, which are inserted using the empty element <BR>.
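The difference between container elements and empty elements can be seen by asking a parser which tags it saw open and which it saw close. The following is a minimal sketch using Python's standard html.parser module; the class name TagRecorder, the function unclosed_tags, and the sample fragment are illustrative assumptions, not part of any HTML specification.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record which tags open and which tags close in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def unclosed_tags(fragment):
    """Return the tags that were opened but never closed, in document order."""
    recorder = TagRecorder()
    recorder.feed(fragment)
    recorder.close()
    return [tag for tag in recorder.opened if tag not in recorder.closed]


# Hypothetical fragment: one container element (P) and three empty elements.
sample = '<P>First line<BR>second line</P><HR><IMG src="images/earth.jpg" alt="Picture of Earth">'
print(unclosed_tags(sample))  # ['br', 'hr', 'img']
```

Only the paragraph reports both a begin and an end tag; the three empty elements appear once, at the spot where their content or formatting is inserted.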
The World Wide Web Consortium (W3C), led by Berners-Lee, took responsibility for defining the HTML markup language. As Web-based information underwent explosive growth in the mid-1990s, so too did the demand for rendering HTML-based information. To keep up with this demand, the W3C continued to develop the HTML language, adding new HTML elements such as tables and frames.
To standardize the language, the W3C published various HTML specifications that described the structural elements for the language. Of course, each iteration of the language meant that software supporting HTML-based information also needed to be updated. That meant that browsers (such as Netscape Navigator and Internet Explorer) and authoring tools (such as Macromedia Dreamweaver and Microsoft FrontPage) needed frequent development to support additional HTML elements.
In December of 1999, the consortium finalized the current version of the HTML Specification, Version 4.01.
You can read the specification at www.w3.org/TR/REC-html40.
Understanding a DTD
Long before the Web was a gleam in Tim Berners-Lee's eye, the publishing world was driven by a publishing tool standard called SGML (Standard Generalized Markup Language). Basically, the model of SGML called for content to be structured using markup elements and, more importantly, that the rules for that markup language be stored in a separate document called a DTD (document type definition).
Berners-Lee followed this model when developing the HTML markup language. In fact, according to the HTML specification, the very first line of every HTML file is supposed to be the pointer to the HTML DTD, similar to the following (you can see this pointer if you look at the HTML source code in a browser or Web authoring tool):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
The DTD defines the rules for each element for that version of HTML, as well as any attributes (parameters) the element may have. For example, below is the portion of the HTML 4.01 DTD that defines the rules for the <IMG> element, which is the element used in HTML to include images:
<!ELEMENT IMG - O EMPTY                -- Embedded image -->
<!ATTLIST IMG
  %attrs;                              -- %coreattrs, %i18n, %events --
  src         %URI;          #REQUIRED -- URI of image to embed --
  alt         %Text;         #REQUIRED -- short description --
  longdesc    %URI;          #IMPLIED  -- link to long description
                                          (complements alt) --
  name        CDATA          #IMPLIED  -- name of image for scripting --
  height      %Length;       #IMPLIED  -- override height --
  width       %Length;       #IMPLIED  -- override width --
  usemap      %URI;          #IMPLIED  -- use client-side image map --
  ismap       (ismap)        #IMPLIED  -- use server-side image map --
  >
Notice that in the DTD the element has the characteristic EMPTY; that means that the element doesn’t have a closing tag. Also notice that the attribute list shows a number of attributes. Some of those attributes are required (such as src and alt), and other attributes are implied (optional), such as width and height. An example use of the <IMG> empty element is:
<IMG src="images/earth.jpg" alt="Picture of Earth" height="300" width="300">
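To make the idea of required attributes concrete, here is a minimal sketch in Python that checks IMG tags for the two attributes the DTD marks as required. It is not a DTD validator: the names REQUIRED_IMG_ATTRIBUTES, ImgChecker, and check_img_tags are illustrative assumptions, and only src and alt are checked.

```python
from html.parser import HTMLParser

# Assumption: only the two required attributes named in the text are checked.
REQUIRED_IMG_ATTRIBUTES = ("src", "alt")


class ImgChecker(HTMLParser):
    """Collect a message for every IMG tag that lacks a required attribute."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs with lowercase names.
        present = {name for name, _value in attrs}
        for name in REQUIRED_IMG_ATTRIBUTES:
            if name not in present:
                self.problems.append(f"IMG is missing required attribute '{name}'")


def check_img_tags(fragment):
    """Return a list of problems found in the IMG tags of an HTML fragment."""
    checker = ImgChecker()
    checker.feed(fragment)
    checker.close()
    return checker.problems


print(check_img_tags('<IMG src="images/earth.jpg" alt="Picture of Earth" height="300" width="300">'))
print(check_img_tags('<IMG src="images/earth.jpg">'))
```

The first call reports no problems because height and width are optional; the second reports the missing alt attribute.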
Although there are billions of HTML pages on the Internet, those pages are defined by a mere 91 HTML elements in the HTML DTD. Because the W3C defines the HTML DTD, it decides all of the rules for the markup language. And that raises the question: what if your organization wants to speak a different or more specific markup language?
You can review the HTML DTD at www.w3.org/TR/html4/strict.dtd. You can review the specific list of 91 HTML elements in the HTML specification at www.w3.org/TR/html401/index/elements.html.
So Why Isn't HTML Enough?
To understand why HTML isn't good enough, let's look at a content example we're all familiar with: a newspaper. Assume you want to publish newspaper content on the Web. How would you encode that content today using HTML?
Newspapers have many common structural elements that are characteristic of newspaper publishing, including headlines, bylines, and photo captions (and many, many others). Further, assume you want to identify each bit of content using structural elements that are specific to that content. How would you identify those structural elements using HTML? You might arbitrarily use
a first-level heading to identify the headline
a paragraph to identify the byline
bolded text to identify a photo caption.
Example coding might resemble the following:
<H1>US Military Releases Photos of Hussein's Sons</H1>
<P><B>By Michael Doyle</B></P>
<P>July 25, 2003</P>
<P>In a highly unusual move, the US military released photographic evidence of the bodies of Saddam Hussein's sons, Uday and Qusay, to unequivocally demonstrate to the people of Iraq that the dictator's sons were in fact killed by US military forces.</P>
<IMG src="hussein-bodies.jpg" alt="Photograph of the Bodies of Uday and Qusay">
<P><B>The Bodies of Uday and Qusay</B></P>
The problem with HTML is that it’s a generic markup language employing only the most basic structural elements in publishing. Any application of that markup language to industry-specific content is necessarily arbitrary.
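One way to see how arbitrary this is: ask a program to find the byline. The sketch below, written in Python with the standard html.parser module, collects the text of every bolded element in an abridged copy of the sample article. The class name BoldTextCollector is an illustrative assumption.

```python
from html.parser import HTMLParser


class BoldTextCollector(HTMLParser):
    """Collect the text of every <B> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.inside_bold = False
        self.bold_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.inside_bold = True
            self.bold_texts.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.inside_bold = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.inside_bold:
            self.bold_texts[-1] += data


# Abridged copy of the sample article markup.
article_html = """
<H1>US Military Releases Photos of Hussein's Sons</H1>
<P><B>By Michael Doyle</B></P>
<P>July 25, 2003</P>
<IMG src="hussein-bodies.jpg" alt="Photograph of the Bodies of Uday and Qusay">
<P><B>The Bodies of Uday and Qusay</B></P>
"""

collector = BoldTextCollector()
collector.feed(article_html)
collector.close()
print(collector.bold_texts)  # ['By Michael Doyle', 'The Bodies of Uday and Qusay']
```

The byline and the photo caption come back looking identical. Nothing in the markup says which is which, so any program that needs the byline has to guess from position or wording.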
The Answer to the Problem Is XML
Although HTML wasn't completely developed until 1999, the W3C recognized the severe limitations of the language early on. In 1996, the W3C determined that those limitations were holding back Web publishing, and created a working group to define the Web's next generation of markup language.
That new markup language was called the eXtensible Markup Language, or XML. The core difference between XML and HTML is that the markup language for XML can be defined by any tool or any application by creating elements in a custom DTD (or by using XML-specific constructs called schemas). By allowing this extensibility, different industries and applications can develop a custom markup language that suits their specific needs.
With XML, the content for our sample article might now resemble the following:
<HEADLINE>US Military Releases Photos of Hussein's Sons</HEADLINE>
<BYLINE>By Michael Doyle</BYLINE>
<DATELINE>July 25, 2003</DATELINE>
<LEAD_PARAGRAPH>In a highly unusual move, the US military released photographic evidence of the bodies of Saddam Hussein's sons, Uday and Qusay, to unequivocally demonstrate to the people of Iraq that the dictator's sons were in fact killed by US military forces.</LEAD_PARAGRAPH>
<PHOTO_CAPTION>The Bodies of Uday and Qusay</PHOTO_CAPTION>
By allowing custom element definitions, the markup of content within XML documents can completely focus on the function of that content.
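With function-based element names, a program can ask for content by what it is. The following is a minimal sketch using Python's standard xml.etree.ElementTree module on an abridged copy of the sample article. The ARTICLE root element is a hypothetical addition, assumed here only because an XML document needs a single root element to be well formed.

```python
import xml.etree.ElementTree as ET

# Assumption: <ARTICLE> is a hypothetical root element wrapped around the
# sample elements so that the text parses as one well-formed XML document.
article_xml = """
<ARTICLE>
  <HEADLINE>US Military Releases Photos of Hussein's Sons</HEADLINE>
  <BYLINE>By Michael Doyle</BYLINE>
  <DATELINE>July 25, 2003</DATELINE>
  <PHOTO_CAPTION>The Bodies of Uday and Qusay</PHOTO_CAPTION>
</ARTICLE>
""".strip()

article = ET.fromstring(article_xml)

# Each piece of content is retrieved by the name of its function.
print(article.findtext("BYLINE"))         # By Michael Doyle
print(article.findtext("PHOTO_CAPTION"))  # The Bodies of Uday and Qusay
```

No guessing from position or formatting is needed: the element name alone identifies the byline and the caption.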
Furthermore, because elements focus only on function, XML is conducive to dynamic Web publishing from information stored in databases. Using HTML, you can't create a direct relationship between elements and database fields because you’re limited to using the 91 elements defined by the HTML DTD. Using XML, you can create Web publishing elements that exactly correlate to fields and records in databases.
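As a rough illustration of that correlation, the sketch below builds an XML element for each column of a database row, using Python's standard sqlite3 and xml.etree.ElementTree modules. The articles table, its column names, the ARTICLE root element, and the function row_to_xml are all hypothetical, chosen to match the sample article.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumption: a hypothetical "articles" table whose column names match
# the custom element names used in the sample article.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (HEADLINE TEXT, BYLINE TEXT, DATELINE TEXT)")
connection.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    ("US Military Releases Photos of Hussein's Sons", "By Michael Doyle", "July 25, 2003"),
)


def row_to_xml(cursor, row, root_name="ARTICLE"):
    """Build one XML element per database column, named after the column."""
    root = ET.Element(root_name)
    for column, value in zip(cursor.description, row):
        # The first item of each description entry is the column name.
        child = ET.SubElement(root, column[0])
        child.text = value
    return root


cursor = connection.execute("SELECT HEADLINE, BYLINE, DATELINE FROM articles")
article = row_to_xml(cursor, cursor.fetchone())
print(ET.tostring(article, encoding="unicode"))
```

Because the element names are free to match the column names, the mapping is a simple loop. With HTML, each field would first have to be assigned to one of the fixed elements.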
Because XML can be industry and application specific, there are many different groups defining XML specific to their needs. For example, the wireless communications industry is developing an XML-based markup language called the Wireless Markup Language; the chemical industry is developing the Chemical Markup Language; the real estate industry is developing the Real Estate Listing Markup Language. And, yes, the news industry is developing an XML-based DTD called the News Industry Text Format.
With the essential advantages that XML offers, there’s no doubt that it will dominate the Web horizon, although it may take some time for tools and applications to catch up to protocols and specifications.
You can find a free tutorial on XML created by IBM at www-106.ibm.com/developerworks/edu/x-dw-xmlintro-i.html?view_by=Introduction+to+XML. You can find a review of industry XML standards at www.oasis-open.org/cover/xml.html#applications. Finally, you can review the W3C XML efforts at www.w3.org/XML.
Published: September 8, 2003