Special Edition Using HTML 4

- 33 -
Understanding XML

by Luke Andrew Cassady-Dorion

A New Language for the Web

Since HTML made its smashing debut a few years ago, it has seen numerous revisions. Each revision brought more features to the specification, which may, in turn, be implemented in a given browser. More often than not, a browser implements not only most of the features defined by the specification, but also a series of its own proprietary tags. The Netscape layer tag is an example.

The revision problem that plagued HTML made it necessary to develop a markup language that could be implemented once and never need revising. Instead of inventing an entirely new markup language, the engineers at the W3C decided to look back to the roots of HTML.

As most of you remember, HTML is a very slimmed-down version of SGML, which is a feature-rich language for marking up data. Unfortunately, SGML is not geared for deployment in a network environment. Where SGML is extensible but not network-centric, HTML is network-centric but not extensible. As a middle point between the two, the engineers at the W3C developed the eXtensible Markup Language (XML).

XML as envisioned by the W3C will exist in only one form for as long as the Web exists. It is designed to allow developers to dynamically describe the information stored in a Web page. Self-describing Web pages will serve not only Web browsers, but also custom search tools that developers write to scour the Web for specific information.

In designing XML, the W3C has taken into account ten design goals. These goals define a plan for a markup language that is better than HTML in that it fixes the evolution and compatibility problems, and is better than SGML in that it's geared for Internet deployment and is easier to use.

The W3C defines XML's goals as follows (taken from http://www.w3.org/TR/WD-xml-lang):

XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs that process XML documents.
The number of optional features shall be kept to the absolute minimum, ideally zero.
XML documents shall be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.

This chapter examines where XML as a language is heading. It gives an overview of the technology, discusses its uses, and points out some areas where it is not useful.

XML as a Metalanguage

XML is what is referred to as a metalanguage, or a language for describing other languages. In your case, XML allows you to create documents that describe themselves to their reader. While some may attempt to say that the same is true for HTML, HTML lacks an easily expandable vocabulary. In order to expand on HTML, a proposal needs to be submitted to and approved by the W3C. However, to expand on XML, one simply has to use the new descriptors in an XML file. For example, take a look at the XML fragment in Listing 33.1:

Listing 33.1 Sample XML Describing a Listing of Books

<heading>Great San Francisco Books</heading>
<title>Tales of the City</title>
<author>Armistead Maupin</author>
<title>The Vampire Lestat</title>
<author>Anne Rice</author>
<title>Access San Francisco Restaurants</title>
<author>Graceann Walden</author>

The most obvious thing in Listing 33.1 is that these tags, while HTML-like, are not in any way approved by the W3C. This could obviously create problems for applications that you want to work with the data.

Though HTML does describe to a browser the manner in which the HTML page should be displayed on screen, that is not an XML document's purpose. An XML document simply serves to describe the data contained in the file. When an XML page is sent to a browser for on-screen display, it usually arrives with a style sheet, which tells the browser how to display the text, and often a Document Type Definition (DTD), which defines the tags the document may use. What's important is that XML does not simply make on-screen display easier; it also simplifies the job of applications such as search engines. Because the XML in Listing 33.1 marks every author explicitly, search engine applications can now easily index the document by author.
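To make the indexing idea concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is an illustration rather than how any particular search engine works. The <booklist> wrapper element and the index_by_author function are assumptions added for this example, since a standard parser expects a single root element around the fragment in Listing 33.1.

```python
import xml.etree.ElementTree as ET

# Listing 33.1 wrapped in a hypothetical <booklist> root element so that a
# standard parser accepts it (a document needs exactly one root element).
LISTING = """<booklist>
<heading>Great San Francisco Books</heading>
<title>Tales of the City</title>
<author>Armistead Maupin</author>
<title>The Vampire Lestat</title>
<author>Anne Rice</author>
<title>Access San Francisco Restaurants</title>
<author>Graceann Walden</author>
</booklist>"""


def index_by_author(xml_text):
    """Map each author to the title that immediately precedes it."""
    index = {}
    current_title = None
    for element in ET.fromstring(xml_text):
        if element.tag == "title":
            current_title = element.text
        elif element.tag == "author" and current_title is not None:
            index.setdefault(element.text, []).append(current_title)
            current_title = None
    return index


author_index = index_by_author(LISTING)
print(author_index["Anne Rice"])  # ['The Vampire Lestat']
```

Because the tags name the data, the program never has to guess which line holds an author. It only has to look for the author tag.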

Creating XML

The previous sections explained what XML is and what needs it addresses. In this section, you dive in and take a look at what is required of an XML document.

In its raw form, XML looks very similar to HTML, as you can see in Listing 33.1. The languages sport a similar look because of their common ancestor, SGML. Despite the resemblance, XML and HTML have many functional differences.

The first, most obvious difference is that XML tag structure is very rigid. In HTML, there are tags that always have an opening and closing tag pair (<CENTER></CENTER>), tags that stand alone (such as <BR>), and tags that do either (such as <P></P>, or simply <P>). To further confuse the tag situation, most browsers will attempt to display even incorrect HTML. For example, if you are missing a </TABLE> tag, Navigator will usually still display the table, while Internet Explorer will not. This browser work-around may make a lot of Web pages look better, but since incorrect HTML is still displayed, sloppy HTML coding is also encouraged.

In contrast, XML requires that all tags either exist in pairs, or announce to the reader that a closing tag is not present. For example, a stand-alone HTML tag such as <BR> appears in XML either as <BR></BR> or as <BR/>. Note that when the tag stands alone, it ends with a trailing slash, indicating the lack of a closing tag.
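The pairing rule can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module and assumes a stand-alone tag named BR purely for illustration. The helper names is_empty and parses are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_empty(element):
    """True when the element has no text and no child elements."""
    return element.text is None and len(element) == 0


def parses(text):
    """True when the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Both XML spellings of an empty element produce the same result.
paired = ET.fromstring("<BR></BR>")
slashed = ET.fromstring("<BR/>")
print(is_empty(paired), is_empty(slashed))  # True True

# The HTML habit of leaving the tag open is rejected.
print(parses("<P>one<BR/>two</P>"))  # True
print(parses("<P>one<BR>two</P>"))   # False
```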

In addition to those requirements, XML also requires that all attribute values occur in quotation marks. For example, the following tag pair is incorrect: <COLOR value=red></COLOR>. Instead, opt for the slightly different: <COLOR value="red"></COLOR>. HTML originally asked the same of authors. However, it seems that over time, authors stopped using quotation marks and browsers stopped requiring them.
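A parser makes the quoting rule easy to see. The following is a small sketch using Python's standard xml.etree.ElementTree module. The is_well_formed helper is a name introduced for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True when the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed('<COLOR value=red></COLOR>'))    # False: value is not quoted
print(is_well_formed('<COLOR value="red"></COLOR>'))  # True

# Once the tag parses, the attribute value is available without the quotes.
color = ET.fromstring('<COLOR value="red"></COLOR>')
print(color.get("value"))  # red
```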

Finally, XML allows no illegal nesting of tags. This means that for every open tag, its closing tag must appear at an unambiguous location. For example, in Listing 33.2, there may be some confusion regarding which question is being closed by the first and then the second </question> tags.

Listing 33.2 Invalid XML tag structure

<title>Conversation</title>
<question>What is the average flying speed of a swallow?
<question>What kind of swallow?</question></question>

Listing 33.3 contains a better option.

Listing 33.3 Valid XML tag structure

<title>Conversation</title>
<question>What is the average flying speed of a swallow? </question>
<question>What kind of swallow?</question>
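A parser enforces the nesting rule most visibly when tags overlap, that is, when an element is closed before an element opened inside it. The sketch below uses Python's standard xml.etree.ElementTree module. The <dialog> root element, the <emphasis> tag, and the parse_error helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def parse_error(text):
    """Return the parser's complaint, or None when the text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None


# Listing 33.3 inside a hypothetical <dialog> root element.
SEQUENTIAL = """<dialog>
<title>Conversation</title>
<question>What is the average flying speed of a swallow? </question>
<question>What kind of swallow?</question>
</dialog>"""

# Hypothetical fragment: <emphasis> opens inside <question> but closes outside it.
OVERLAPPING = (
    "<dialog><question>What kind of <emphasis>swallow?"
    "</question></emphasis></dialog>"
)

print(parse_error(SEQUENTIAL))   # None
print(parse_error(OVERLAPPING))  # an error message naming the offending position
```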

Creating Valid/Well-Formed Pages

One of the keys to creating XML pages is knowing the rules the reader applications use when evaluating a given page. Because these rules are strictly defined (and hopefully enforced), it is actually rather easy to create documents that follow all the required rules. In fact, at the end of this chapter there are URLs to four publicly available parsers that you can use to test the XML you are developing.

An XML file can be valid, well-formed, or both. Valid XML files are those that have and follow a given Document Type Definition (DTD).

NOTE: A DTD is any number of files that contain a formal definition of a given type of document. Because they have origins in SGML, there are already thousands of DTDs. However, one can easily create a new DTD if the document requires it. For examples of existing DTDs, see "More Information" later in this chapter.

When distributed, an XML document will provide a link to a DTD in its header. An example header (with a dummy URL) is contained in Listing 33.4.

Listing 33.4 XML document header

<?XML VERSION="1.0"?>
<!DOCTYPE silly SYSTEM "http://www.dtdDomain.com/file.dtd">

In contrast to the valid XML file, you can also create a well-formed XML file. A well-formed XML file is one that can be used without a DTD. While a DTD is not required, a well-formed XML file must use the header shown in Listing 33.5. A well-formed file must also follow the tag and attribute rules specified earlier in the chapter.

Listing 33.5 Header for a well-formed XML document

<?XML VERSION="1.0" RMD="NONE"?>
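The headers in Listings 33.4 and 33.5 follow the working draft this chapter cites. The XML 1.0 Recommendation as finally published settled on a lowercase declaration, <?xml version="1.0"?>, with an optional standalone attribute in place of RMD. The sketch below assumes a parser that follows the final Recommendation (here Python's standard xml.etree.ElementTree module) and a made-up <silly> document. It shows a well-formed file being read with no DTD at all.

```python
import xml.etree.ElementTree as ET

# A hypothetical well-formed document using the declaration syntax of the
# final XML 1.0 Recommendation. No DTD is referenced.
DOCUMENT = """<?xml version="1.0" standalone="yes"?>
<silly>
<joke kind="knock-knock">Who's there?</joke>
</silly>"""

root = ET.fromstring(DOCUMENT)
joke = root.find("joke")
print(root.tag)          # silly
print(joke.get("kind"))  # knock-knock
```

Note that this module only checks well-formedness. Checking validity against a DTD requires a validating parser.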

Moving from HTML to XML

Most of you have spent significant time developing HTML files, so you need to know how to convert those HTML files to XML files. Since a lot of the conversion (turning a stand-alone tag such as <BR> into <BR/>, for example) can be automated, you most likely want to incorporate the changes into some sort of script (AppleScript, sed, and the like) and perform a batch conversion of all your HTML files.

You will have little trouble converting those files to XML if you have carefully developed HTML. However, sloppy HTML needs some fixing due to XML's strict syntax rules. The first step in converting an HTML page to an XML page is making sure the page is well-formed (see this chapter's section titled "Creating Valid/Well-Formed Pages"). After this is done, you need to add a DTD reference to the XML document's header and ensure that it points to one of the available HTML DTDs. The HTML DTD tells the reader application how to deal with each of the tags that are part of the HTML specification. An example DTD reference is <!DOCTYPE HTML SYSTEM "http://www.domain.com/dtds/html.dtd">.
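As a rough sketch of what such a conversion script might look like, the Python below quotes unquoted attribute values and closes stand-alone tags. It is not a complete converter. The STANDALONE_TAGS list is an illustrative subset, the function names are hypothetical, and the regular expressions assume reasonably tidy HTML (for instance, no equals signs inside quoted attribute values).

```python
import re
import xml.etree.ElementTree as ET

# Tags assumed to stand alone in the source HTML (an illustrative subset).
STANDALONE_TAGS = ("BR", "HR", "IMG")

OPENING_TAG = re.compile(r"<[A-Za-z][^<>]*>")
UNQUOTED_ATTRIBUTE = re.compile(r"(\s[A-Za-z][\w-]*)=([^\s\"'<>]+)")
STANDALONE = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(STANDALONE_TAGS), re.IGNORECASE
)


def quote_attribute_values(html):
    """Wrap unquoted attribute values in double quotes, inside tags only."""
    def fix_tag(match):
        return UNQUOTED_ATTRIBUTE.sub(r'\1="\2"', match.group(0))
    return OPENING_TAG.sub(fix_tag, html)


def close_standalone_tags(html):
    """Rewrite stand-alone tags so that they end with a trailing slash."""
    return STANDALONE.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), html)


def html_to_xml(html):
    """Apply both fixes; quoting must run first so the slash stays outside."""
    return close_standalone_tags(quote_attribute_values(html))


source = "<P ALIGN=center>Hello<BR>world <IMG SRC=logo.gif></P>"
converted = html_to_xml(source)
print(converted)

# The result can be handed to an XML parser as a well-formedness check.
root = ET.fromstring(converted)
print(root.get("ALIGN"))
```

Running the output through a parser after each conversion is a cheap way to find the pages that still need fixing by hand.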

More Information

XML is definitely something to get excited about. In the next year--assuming all goes well--we will begin to see Web pages which actually describe themselves to their readers. To keep track of new developments regarding XML, watch the following links and mailing lists.

Parsers (or reader applications):

Norbert Mikula's NXP at http://www.edu.uni-klu.ac.at/~nmikula/NXP/.
Tim Bray's Lark at http://www.textuality.com/Lark/.
Sean Russell's kernel at http://jersey.uoregon.edu/ser/software/XML.tar.gz.
The Microsoft XML parser at http://www.microsoft.com/standards/xml/xmldl.htm.

Web sites/mailing lists:

The XML FAQ at http://www.ucc.ie/xml.
The XML specification at http://www.w3.org/pub/WWW/TR/.
Available SGML DTDs at http://www.ucc.ie/cgi-bin/PUBLIC/.
The XML mailing list archived at http://www.lists.ic.ac.uk/hypermail/xml-dev/.