Where does XHTML diverge from XML?

XHTML is similar to XML, except when it comes to character sets. Be wary when developing.

Ed Tittel

By default, XHTML follows the same behavior for character sets as did its predecessor, HTML. That is, XHTML assumes that the character set used is ISO-Latin-1 unless some other set is specifically invoked, through one of two particular mechanisms. Because of its roots in HTML -- or more properly, because of limitations on character sets present in most modern Web browsers -- XHTML presently follows its HTML ancestry when it comes to assigning (or assuming) character sets, rather than using the more capable and open-ended character sets available to more conventional XML applications.

In other words, even though XHTML is definitely an XML application, support for character sets is one small area where XHTML currently deviates from standard XML practice. I review how HTML and XML handle character set assignments, then explain which character sets are currently available to XHTML, and how to invoke them.

The HTML Specification requires HTML to use the ISO-Latin character sets in general, and the ISO-Latin-1 character set by default. Basically, the ISO-Latin character sets define 8-bit character codes where the first 128 codes (0 through 127, or 7 bits' worth) invariably map to standard 7-bit ASCII character codes. In ISO-Latin character sets, higher-order character codes (those numbered 128 through 255, with a 1 in the highest bit position) map to other characters that vary from one set to the next. ASCII itself defines nothing in that range, and the ISO-Latin assignments do not coincide with the vendor-specific "extended ASCII" code pages.
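To make the split between the shared 7-bit range and the set-specific upper range concrete, here is a minimal Python sketch. It relies on Python's built-in codecs for three of the ISO 8859 sets; the byte values chosen (0x41 and 0xE9) are only illustrative.

```python
# A sketch: decode the same two byte values under three ISO 8859 sets.
ASCII_BYTE = b"\x41"   # 0x41 lies in the shared 7-bit ASCII range
HIGH_BYTE = b"\xe9"    # 0xE9 has the high bit set, so its meaning varies

# Python codec names for Latin-1, the Cyrillic set, and the Greek set.
CHARSETS = ["iso8859-1", "iso8859-5", "iso8859-7"]

low = {name: ASCII_BYTE.decode(name) for name in CHARSETS}
high = {name: HIGH_BYTE.decode(name) for name in CHARSETS}

for name in CHARSETS:
    # Print the high character as a Unicode code point to avoid font issues.
    print(f"{name}: 0x41 -> {low[name]!r}, 0xE9 -> U+{ord(high[name]):04X}")
```

The low byte should decode to the same letter under every set, while the high byte should come out as a different character each time.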

The ISO standard that governs these character sets is numbered 8859. Table 1 lists the most common ISO-Latin character sets.

Table 1 ISO-Latin Character Sets
ISO NAME | VERSION NAME | DESCRIPTION/LANGUAGES SUPPORTED
ISO-8859-1 | Latin-1 | Albanian, Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, Flemish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish. Missing Dutch and French ligatures, German quotation marks.
ISO-8859-2 | Latin-2 | Croatian, Czech, English, German, Hungarian, Polish, Romanian, Serbian, Slovak, and Slovene.
ISO-8859-3 | Latin-3 | English, Esperanto, Galician, German, and Maltese.
ISO-8859-4 | Latin-4 | English, German, Greenlandic, Lappish, Latvian, and Lithuanian. Now largely superseded by ISO-8859-10 (Latin-6).
ISO-8859-5 | Unnamed | ASCII, with Cyrillic characters used in Byelorussian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian.
ISO-8859-6 | Unnamed | ASCII plus Arabic characters.
ISO-8859-7 | Unnamed | ASCII plus Greek characters.
ISO-8859-8 | Unnamed | ASCII plus Hebrew characters.
ISO-8859-9 | Latin-5 | Matches Latin-1, but replaces Icelandic letters with Turkish letters.
ISO-8859-10 | Latin-6 | English, Icelandic, Inuit, Lappish (Sami), Latvian, and Lithuanian.
ISO-8859-11 | Unnamed | ASCII plus Thai characters.
ISO-8859-12 | None | Never published; early drafts assigned this number, but it was abandoned.
ISO-8859-13 | Latin-7 | English plus Baltic Rim languages.
ISO-8859-14 | Latin-8 | Celtic languages and English.
ISO-8859-15 | Latin-9 | Variant of Latin-1 that includes the Euro currency symbol, plus accented French and Finnish letters.

Note that when an ISO-Latin character set has a version name, either its ISO name or its version name may be used to invoke that set in an HTTP header or inside a meta element within an HTML or XHTML document. Also note that older character sets -- especially those associated with ISO 2022 -- may also be invoked in HTML or XHTML documents (this is valuable for Japanese and other writing systems not supported in ISO 8859).
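The idea that one character set answers to more than one name can be sketched in Python, whose codec registry keeps its own alias table. Note the assumptions: these are Python's alias spellings (for example "latin1" without a hyphen), not necessarily the labels every browser or server accepts, and same_charset is a hypothetical helper written for this illustration.

```python
import codecs

def same_charset(name_a, name_b):
    """Return True when Python resolves both labels to the same codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name

# ISO name paired with the matching version name (Python alias spelling).
pairs = [("ISO-8859-1", "latin1"), ("ISO-8859-2", "latin2"), ("ISO-8859-9", "latin5")]
results = {pair: same_charset(*pair) for pair in pairs}
for (iso_name, version_name), matches in results.items():
    print(f"{iso_name} and {version_name} name the same set: {matches}")

# An ISO 2022 based encoding for Japanese is also reachable by name.
japanese = "日本語".encode("iso2022_jp")
print(japanese)
```

Each pair should resolve to a single codec, and the Japanese text should encode to a sequence of 7-bit bytes framed by escape sequences.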

There are three ways to invoke character sets in HTML:

  • By default: if no explicit character set invocation occurs, ISO-Latin-1 is used by default.
  • In an HTTP 1.0/1.1 header: Web content developers won't be directly involved with this usage, since it requires creating specific entries in various Web server configuration files or related MIME types files to establish an alternate default to ISO-Latin-1.
  • Using a specially formatted meta element: content developers invoke an alternate character set as part of the metadata in the head portion of an HTML or XHTML document, as follows (note that XHTML empty element syntax is used):
    <meta 
       http-equiv="Content-Type"
       content="text/html; charset=ISO-8859-10" />
    

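In essence, the second and third alternatives are the same, except that emitting an HTTP header is something the Web server handles, while inserting a meta element in an HTML or XHTML document is something a content developer can handle. The http-equiv attribute was designed so that a Web server could read that metadata and create an equivalent HTTP header; in practice, most browsers read the meta element themselves. Either way, the two approaches produce the same result, provided the server does not also send a conflicting charset in a real HTTP header, which takes precedence.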
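The three mechanisms can be sketched as a simple lookup order: a charset in the HTTP header first, then a meta element, then the ISO-Latin-1 default. The Python below is a simplified illustration, not how any particular browser or server works; detect_charset and META_RE are hypothetical names, and the regular expression assumes http-equiv appears before content, as in the example above.

```python
import re
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"

# Simplified pattern: finds the content value of a Content-Type meta element.
META_RE = re.compile(
    rb"""<meta[^>]+http-equiv=["']?Content-Type["']?[^>]*content=["']([^"']+)["']""",
    re.IGNORECASE,
)

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lowercased name, or None if absent

def detect_charset(http_content_type, body):
    """Pick a charset: HTTP header first, then meta element, then default."""
    if http_content_type:
        charset = charset_from_content_type(http_content_type)
        if charset:
            return charset
    match = META_RE.search(body[:1024])  # only scan the start of the document
    if match:
        charset = charset_from_content_type(match.group(1).decode("ascii", "replace"))
        if charset:
            return charset
    return DEFAULT_CHARSET

doc = b"""<html><head>
<meta
   http-equiv="Content-Type"
   content="text/html; charset=ISO-8859-10" />
</head><body></body></html>"""

print(detect_charset(None, doc))
print(detect_charset("text/html; charset=ISO-8859-2", doc))
print(detect_charset("text/html", b"<html></html>"))
```

The header wins when it names a charset; otherwise the meta element is consulted, and a document with neither falls back to ISO-Latin-1.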

XML neatly sidesteps the need to track multiple ISO-Latin character sets (though it is capable of supporting them) by building on Unicode, whose character repertoire is kept in step with ISO/IEC 10646. Conforming XML processors must at least be able to handle UTF-8 and UTF-16, the 8- and 16-bit encodings of the UCS Transformation Format associated with Unicode and ISO/IEC 10646.
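As a rough illustration of how an XML processor handles these encodings, the Python sketch below builds the same small document in UTF-8, UTF-16, and ISO-8859-1 and parses each with the standard library's ElementTree. The make_document helper and the sample text are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # sample text containing one non-ASCII character

def make_document(encoding):
    """Build a tiny XML document that declares and uses the given encoding."""
    xml_text = f'<?xml version="1.0" encoding="{encoding}"?><p>{TEXT}</p>'
    return xml_text.encode(encoding)

sizes = {}
parsed = {}
for encoding in ("UTF-8", "UTF-16", "ISO-8859-1"):
    data = make_document(encoding)
    sizes[encoding] = len(data)               # size on the wire, in bytes
    parsed[encoding] = ET.fromstring(data).text  # text after decoding

for encoding in sizes:
    print(f"{encoding}: {sizes[encoding]} bytes, parsed text {parsed[encoding]!r}")
```

The byte counts differ by encoding, but the parser should recover the same text in every case, because the encoding declaration tells it how to read the bytes.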

Because of Unicode's vast representational power -- its original 16-bit code space alone can represent 65,536 different characters, including alphabetic characters, ideographs, diacritical marks, and so forth, and later versions extend well beyond that through supplementary planes -- it readily accommodates an extremely broad range of writing systems. Of the 65,536 values in that 16-bit space, over 40,000 are occupied. Most of these are used to support nearly 20,000 Han (Chinese) ideographs and over 11,000 elements of the Korean Hangul syllabary. The remaining 9,000 characters support most of the world's other known languages.
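Two of these figures can be checked with simple arithmetic. The sketch below assumes the Hangul syllables occupy the contiguous range U+AC00 through U+D7A3, and that each syllable combines one of 19 leading consonants, 21 vowels, and 28 trailing options (including none).

```python
# Size of a 16-bit code space.
code_space_16_bit = 2 ** 16

# Hangul syllables: count the code points in the block's range.
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3
hangul_count = HANGUL_LAST - HANGUL_FIRST + 1

# The same count, derived from how syllables are composed.
hangul_by_composition = 19 * 21 * 28

print(code_space_16_bit)
print(hangul_count, hangul_by_composition)
```

Both ways of counting the Hangul syllables should agree, which is where the "over 11,000" figure comes from.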

But until Web browsers truly become XML-literate, and can deal with real XML encoding schemes, XHTML is stuck with the older HTML method for handling such encodings. How long that will take is anybody's guess, so be patient!


Have questions, comments, or feedback about this or other XML-related topics? Please e-mail me at tips@searchmiddleware.com; I'm always glad to hear from you.

Ed Tittel is a principal at LANWrights, Inc., a wholly owned subsidiary of LeapIt.com. LANWrights offers training, writing, and consulting services on Internet, networking, and Web topics (including XML and XHTML), plus various IT certifications (Microsoft, Sun/Java, and Prosoft/CIW).

This was first published in September 2001
