A matter of XML character

By Ed Tittel

Given that the XML specification says that XML is able to use the ISO 10646 Universal Multiple-Octet Coded Character Set (a.k.a. the Universal Character Set, or UCS, also known as Unicode), I thought it might be interesting to cover some of the terminology and usage issues that working with Unicode in XML documents can entail. Please bear with me; we're about to dive into a large bowl of alphabet soup full of acronyms of all kinds!

XML must deal with the following two forms of Unicode text encoding, both based on a technique called the UCS Transformation Format (UTF):

  1. UTF-16: The default way to encode Unicode characters is a 16-bit encoding. Using this technique, most characters are assigned a unique 16-bit value, called a character code. Unicode 16-bit encoding is the same as the ISO/IEC 10646 UTF-16 transformation format. When using UTF-16, characters with code values from 0 to 65,535 are encoded as single 16-bit values; characters with code values of 65,536 or greater are encoded as pairs of 16-bit values called surrogates. (Basically, surrogates exist to extend the space available to UTF-16 to a little over one million character codes, up to 1,114,111 or hexadecimal 10FFFF, which is currently believed to be sufficient for capturing all the world's known alphabets, glyphs, and ideograms.) Character codes above 65,535 can also be stored directly in a 4-byte Universal Character Set (UCS) encoding called UCS-4; I mention this only because if you want to use it, DTD extensions written for the WebSGML Adaptations to ISO Standard 8879 (the SGML standard) must be incorporated, so that the DTDs can legally contain numeric character codes big enough to represent 4-byte encodings.
  2. UTF-8: This technique provides a variable-length, byte-oriented way to encode character data, designed specifically for compatibility with ASCII-based computing systems. Essentially, UTF-8 preserves ASCII encodings for all character codes that are 7 bits in length or less.

    Integrating Unicode characters outside the ASCII range of 0 to 127 requires not only that such characters be encoded into a sequence of two to four bytes, but also that the values in those bytes be managed to properly convey the underlying data in an encoded form (for the details on the translation algorithm used, consult pg. 47 of The Unicode Standard, Version 3.0, by the Unicode Consortium, Addison-Wesley).
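
To make the two encodings a little more concrete, here is a minimal Python sketch that compares them for a handful of characters. The helper name utf16_units and the sample characters are my own choices for illustration, not anything defined by the XML or Unicode specifications; the surrogate arithmetic follows the standard UTF-16 rule (subtract 65,536, then split the remaining 20 bits into two 10-bit halves).

```python
def utf16_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one character code.

    Hypothetical helper for illustration only.
    """
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable character code")
    if code_point < 0x10000:
        return [code_point]                 # fits in a single 16-bit value
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits -> low surrogate
    return [high, low]


# One ASCII letter, one ISO-Latin-1 letter, the euro sign, and one
# character whose code is greater than 65,535 (a musical G clef).
for ch in ["A", "\u00e5", "\u20ac", "\U0001D11E"]:
    utf8_bytes = ch.encode("utf-8")
    units = utf16_units(ord(ch))
    print(
        f"U+{ord(ch):04X}",
        "UTF-8:", utf8_bytes.hex(" "),
        "UTF-16:", " ".join(f"{u:04X}" for u in units),
    )
```

The ASCII letter occupies a single byte in UTF-8, exactly as it would in an ASCII file, while the other three characters take two, three, and four bytes. Only the last character needs a surrogate pair in UTF-16.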

Most XML, XHTML, or HTML documents that invoke Unicode-based encoding schemes use the UTF-8 transformation format by default (this is the assumed encoding scheme if no explicit alternate encoding scheme is included in a document's XML declaration). It's important to note that UTF-8 is incompatible with so-called "higher-order" ASCII characters (the single-byte codes from 128 to 255 used by ISO-Latin-1 and similar character sets); in UTF-8, those characters must be written as multi-byte sequences. Fortunately, you can still use the same character entities you may have learned while using HTML and ISO-Latin-1, as long as a DTD declares them (the XHTML DTDs do). It also means you should become accustomed to using Unicode character codes for character entities, which you can write as &#229; or as &#x00E5; to produce the lowercase a with a ring above it (the ISO-Latin-1 character entities for this are &aring; and &#229;).
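
As a rough illustration of the points above, the following Python sketch uses the standard library's xml.etree.ElementTree parser. The element name sample and the helper names are made up for the example. It checks that the decimal and hexadecimal forms of the character code produce the same character, that a lone ISO-Latin-1 byte is not valid UTF-8, and that a named entity only resolves when the document declares it.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document (illustrative helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Decimal and hexadecimal character codes name the same character.
doc = b'<?xml version="1.0" encoding="UTF-8"?><sample>&#229;|&#x00E5;</sample>'
decimal_form, hex_form = ET.fromstring(doc).text.split("|")
print(decimal_form == hex_form == "\u00e5")

# The single ISO-Latin-1 byte for the same letter is not valid UTF-8;
# UTF-8 spells it as a two-byte sequence instead.
print(is_valid_utf8(b"\xe5"), is_valid_utf8(b"\xc3\xa5"))

# A named entity is only available when a DTD declares it.
undeclared = "<sample>&aring;</sample>"
declared = '<!DOCTYPE sample [<!ENTITY aring "&#229;">]><sample>&aring;</sample>'
print(parses(undeclared), parses(declared))
```

Other XML parsers may report the undeclared entity differently, so treat the last check as a description of this particular parser's behavior rather than a general guarantee.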

For more great information on this topic, please visit one or more of the following online resources:

  • The Unicode Consortium operates an extremely informative Web site at http://www.unicode.org/. You can find access to all kinds of specifications, technical information, and character set displays here.
  • Dave Johnson at Boston University has posted an incredibly dense but informative resource called the "ISO 10646 Dictionary" wherein he defines all kinds of related terms, acronyms, and specifications. Check it out at http://cns-web.bu.edu/pub/djohnson/web_files/i18n/ISO-10646.html.
  • Back in 1997, Rick Jelliffe created a DTD that defines named character entities for SGML or XML documents that use ISO-10646 character encodings. This is a useful external document to include in your work should you wish to take advantage of these definitions. You'll find the DTD at http://www.oasis-open.org/cover/xml-ISOents.txt

Although this may seem entirely anticlimactic, all this capability lies behind the simple statement in an XML declaration that might read:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Now you know what lurks behind the encoding attribute's value and can better appreciate what depths of representation it can deliver!
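
To see the encoding declaration at work, here is a small Python sketch, again using the standard library's xml.etree.ElementTree parser. The greeting element, its text, and the read_text helper are invented for the example. It feeds the parser the same document three ways: as UTF-8 with a declaration, as ISO-8859-1 with a matching declaration, and as ISO-8859-1 bytes with no declaration at all, in which case the parser falls back to the UTF-8 default and rejects the stray byte.

```python
import xml.etree.ElementTree as ET


def read_text(document: bytes):
    """Return the root element's text, or None if the bytes are not
    well-formed XML (illustrative helper)."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


greeting = "Hall\u00e5"  # ends with the lowercase a with a ring above it
body = "<greeting>" + greeting + "</greeting>"

utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' + body
).encode("utf-8")

latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>' + body
).encode("iso-8859-1")

# Same ISO-8859-1 bytes, but nothing tells the parser about the encoding,
# so it assumes UTF-8.
undeclared_latin1 = body.encode("iso-8859-1")

print(read_text(utf8_doc) == greeting)
print(read_text(latin1_doc) == greeting)
print(read_text(undeclared_latin1))
```

The first two checks succeed because the declared encoding matches the bytes actually stored; the third returns None because the byte for the final letter is not a valid UTF-8 sequence.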

Ed Tittel is a principal at LANWrights, Inc., a wholly owned subsidiary of LeapIt.com. LANWrights offers training, writing, and consulting services on Internet, networking, and Web topics (including XML and XHTML), plus various IT certifications (Microsoft, Sun/Java, and Prosoft/CIW).

This was first published in January 2001
