
XML DEVELOPER
Untangling Unicode encoding in XML
Ed Tittel 01.15.2003
My last tip dealt with including the Euro currency symbol in XML documents, using various forms of Unicode character entity references. It caused an unexpected blizzard of e-mail asking for help with the details of working with the many different forms that Unicode can take. This led me on an expedition to locate good references and tutorials on the subject, which in turn led me to the subject of this week's tip: a profound bow of gratitude toward Mike J. Brown's excellent Web resource, "The skew.org XML Tutorial." The tutorial concentrates not just on XML in general, but also on XML encoding strategies. It also covers the differences between Unicode (a term that is bandied about, as I've done here, to describe a mammoth collection and codification of character codes, alphabets, and other typographical marks) and the standard that actually governs XML character encoding, namely ISO/IEC 10646-1. Brown cuts through these matters by calling this a Universal Character Set, or UCS.
The biggest practical difference between the two standards is availability. The Unicode Standard is available online at www.unicode.org and is well and affordably documented in Addison-Wesley's various editions of the Unicode Consortium's excellent publications, of which the most current is The Unicode Standard 3.0 (Addison-Wesley, 2000). The official ISO/IEC 10646-1 documentation comes in numerous pieces (as many as six, in fact), costs hundreds of dollars or more for electronic, CD, or paper copies, and is available only from the Web site at www.iso.org. Brown also recommends Tony Graham's Unicode: A Primer (Wiley, 2000) as another valuable resource on the topic, one that explains the differences between Unicode and ISO 10646 more thoroughly than his own tutorial does.
Brown's tutorial does numerous wonderful things to help XML content and tool developers wrap their minds around the many minutiae of getting Unicode/10646 encoding right, both in XML documents and in the tools that deal with such documents.
By working your way through this excellent collection of materials, you should be much better equipped to understand and use UCS encodings in your XML documents. Even after working around the topic for nearly five years, I learned a lot about UCS encodings from this resource myself; I hope you will have the same experience.
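As a small illustration of the kind of detail these resources untangle, here is a minimal sketch (not taken from Brown's tutorial) that revisits the Euro sign from the last tip. It assumes Python 3 and its standard xml.etree.ElementTree parser, and the element name price is made up for the example. It shows that the decimal and hexadecimal numeric character references name the same UCS code point, U+20AC, while the bytes stored on disk depend on the encoding chosen.

```python
import xml.etree.ElementTree as ET

EURO = "\u20ac"  # U+20AC EURO SIGN

# The same code point written as decimal and hexadecimal
# numeric character references.
decimal_ref = "&#{};".format(ord(EURO))    # "&#8364;"
hex_ref = "&#x{:X};".format(ord(EURO))     # "&#x20AC;"

# The code point is fixed; the byte sequence depends on the encoding.
utf8_bytes = EURO.encode("utf-8")          # three bytes: E2 82 AC
utf16_bytes = EURO.encode("utf-16-be")     # two bytes:   20 AC

# Hypothetical <price> element: both reference forms parse
# to the identical character.
price_dec = ET.fromstring("<price>{}100</price>".format(decimal_ref)).text
price_hex = ET.fromstring("<price>{}100</price>".format(hex_ref)).text

print(decimal_ref, hex_ref)                   # &#8364; &#x20AC;
print(utf8_bytes.hex(), utf16_bytes.hex())    # e282ac 20ac
print(price_dec == price_hex)                 # True
```

Numeric character references always refer to the UCS code point, regardless of the encoding named in the XML declaration, which is why they are a safe fallback when a document's encoding cannot represent a character directly.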
About the Author
Ed Tittel is a principal at LANWrights, Inc., a network-oriented writing, training, and consulting firm based in Austin, Texas. He is the creator of the Exam Cram series and has worked on over 30 certification-related books on Microsoft-, Novell-, and Sun-related topics. Ed teaches in the Certified Webmaster Program at Austin Community College and consults. He is a member of the NetWorld + Interop faculty, where he specializes in Windows 2000-related courses and presentations.